[LU-4181] lnet-selftest test_smoke: lst error found Created: 29/Oct/13  Updated: 13/Oct/16  Resolved: 05/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.7.0, Lustre 2.4.3, Lustre 2.8.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Isaac Huang (Inactive)
Resolution: Fixed Votes: 0
Labels: mn4, yuc2

Issue Links:
Related
is related to LU-8705 fix lnet-selftest to be not skipped o... Resolved
Severity: 3
Rank (Obsolete): 11318

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/794b62ea-40b4-11e3-af95-52540035b04c.

The sub-test test_smoke failed with the following error:

Total 1 error nodes in c
Total 3 error nodes in s
lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found

Info required for matching: lnet-selftest smoke



 Comments   
Comment by Peter Jones [ 30/Oct/13 ]

Amir

Could you please comment on this one?

Thanks

Peter

Comment by Di Wang [ 31/Oct/13 ]

Just disable lnet_selftest now for DNE http://review.whamcloud.com/8130

Comment by Andreas Dilger [ 01/Nov/13 ]

It looks like lnet-selftest has only been failing since 2013-10-25, and only for review-dne and review-dne-zfs tests.

In https://maloo.whamcloud.com/test_sets/794b62ea-40b4-11e3-af95-52540035b04c it looks like there is some kind of communication problem between the DNE nodes, possibly because there are more nodes? Most of the nodes are having problems with .218, except for the one that reports problems with .215.

MDS 1 (wtm-21vm3):

02:51:19:LNet: 12592:0:(rpc.c:1119:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.10.16.218@tcp, timeout 64.
02:51:19:LNet: 12592:0:(rpc.c:1119:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.10.16.218@tcp, timeout 64.
02:51:19:LustreError: 12591:0:(brw_test.c:333:brw_client_done_rpc()) BRW RPC to 12345-10.10.16.218@tcp failed with -110
02:51:19:LustreError: 12591:0:(brw_test.c:333:brw_client_done_rpc()) BRW RPC to 12345-10.10.16.218@tcp failed with -110

Client 2 (wtm-21vm6):

02:24:09:LNet: 18251:0:(rpc.c:1119:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.10.16.215@tcp, timeout 64.
02:24:09:LNet: 18251:0:(rpc.c:1119:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.10.16.215@tcp, timeout 64.
02:24:09:LustreError: 18249:0:(brw_test.c:333:brw_client_done_rpc()) BRW RPC to 12345-10.10.16.215@tcp failed with -110
02:24:09:LustreError: 18249:0:(brw_test.c:333:brw_client_done_rpc()) BRW RPC to 12345-10.10.16.215@tcp failed with -110
02:24:09:LustreError: 18249:0:(ping_test.c:135:ping_client_done_rpc()) Unable to ping 12345-10.10.16.220@tcp (21): -110
02:24:09:LustreError: 18249:0:(ping_test.c:135:ping_client_done_rpc()) Unable to ping 12345-10.10.16.220@tcp (22): -110
02:51:29:Lustre: 18248:0:(ping_test.c:82:ping_client_fini()) 16 pings have failed.
02:51:29:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found 

OST 9-16 (wtm-21vm8):

02:51:15:LNet: 581:0:(rpc.c:1119:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.10.16.218@tcp, timeout 64.
02:51:15:LustreError: 580:0:(brw_test.c:333:brw_client_done_rpc()) BRW RPC to 12345-10.10.16.218@tcp failed with -110
02:51:15:LustreError: 580:0:(brw_test.c:333:brw_client_done_rpc()) BRW RPC to 12345-10.10.16.218@tcp failed with -110
02:51:15:Lustre: 579:0:(ping_test.c:82:ping_client_fini()) 8 pings have failed.
02:51:15:Lustre: 579:0:(ping_test.c:82:ping_client_fini()) 8 pings have failed.

MDS 4 (wtm-21vm1):

02:51:15:LNet: 28972:0:(rpc.c:1119:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.10.16.218@tcp, timeout 64.
02:51:15:LNet: 28972:0:(rpc.c:1119:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.10.16.218@tcp, timeout 64.
02:51:15:LustreError: 28971:0:(brw_test.c:333:brw_client_done_rpc()) BRW RPC to 12345-10.10.16.218@tcp failed with -110
02:51:15:LustreError: 28971:0:(brw_test.c:333:brw_client_done_rpc()) BRW RPC to 12345-10.10.16.218@tcp failed with -110
02:51:15:Lustre: 28970:0:(ping_test.c:82:ping_client_fini()) 5 pings have failed.
02:51:15:Lustre: 28970:0:(ping_test.c:82:ping_client_fini()) 5 pings have failed.
Comment by Andreas Dilger [ 15/Nov/13 ]

Amir, any chance to look at this failure?

Comment by Amir Shehata (Inactive) [ 16/Nov/13 ]

Is there a way to rerun the test and add some debug statements to it? The test gets the list of NIDs of servers and clients, and I want to make sure that this list is sane and matches what is expected. Could there be some assumption based on the order of the NIDs that is no longer true for DNE systems?

Comment by Andreas Dilger [ 16/Nov/13 ]

Amir, the best bet would be to submit a patch to lnet-selftest.sh that adds the debugging you want, and then run the test with:

Test-Parameters: fortestonly testlist=lnet-selftest mdscount=4 osscount=2

Or similar (please check wiki page to verify exact parameter names).

Comment by Andreas Dilger [ 18/Nov/13 ]

The patch to disable lnet-selftest for DNE shows as landed on 11-08, but I still see four failures in review-dne in the past few days. Is that because those patches have not been rebased, or is the test not being skipped for some reason?

Comment by Bob Glossman (Inactive) [ 22/Nov/13 ]

maloo says https://maloo.whamcloud.com/test_sets/2d3946f0-539d-11e3-9901-52540035b04c is this bug, but it was seen in review, not in review-dne.

Comment by Amir Shehata (Inactive) [ 22/Nov/13 ]

Discussed this issue with Isaac, and we suspect that it could be a network issue. In the original report there are many ping failures, which would indicate that one of the clients is unable to communicate with the server. Is there a way to hop onto the test system that's having the problem and try an lctl ping from the client to the server?
Besides that, I'm making a few changes to lnet-selftest to dump out more debug information, so I can tell if one of the clients is simply failing to communicate with the servers.

Comment by Jian Yu [ 19/Dec/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
MDSCOUNT=4

The same failure occurred:
https://maloo.whamcloud.com/test_sets/6af5f312-6879-11e3-a9a3-52540035b04c

Comment by Andreas Dilger [ 07/Jan/14 ]

Since Chris changed the review-dne node configuration to 2x MDS + 1x OSS on 2013-12-26 as part of TEI-1312, it seems lnet-selftest can pass again (on b2_5, where it is not excluded):

https://maloo.whamcloud.com/test_sets/b040595e-7269-11e3-ab15-52540035b04c

It is probably worthwhile to submit a patch to re-enable this test to see if it is now working for review-dne runs on master.

Comment by Sarah Liu [ 13/Jan/14 ]

I just submitted a patch to re-enable it: http://review.whamcloud.com/#/c/8823/

Comment by Sarah Liu [ 17/Jan/14 ]

The following result shows that this test can pass with the current DNE config:

https://maloo.whamcloud.com/test_sessions/4753c032-7dab-11e3-91f7-52540035b04c

Comment by Andreas Dilger [ 18/Jan/14 ]

So it seems the problem is due to the number of nodes in the old review-dne config. We have since reduced the node count in the review-dne config to just one more than review-ldiskfs.

However, since review-dne already takes much more time to run than review-ldiskfs (14h vs. 3h) it doesn't make sense to enable this on review-dne at this time, since the testing is equivalent.

Comment by Sarah Liu [ 28/Feb/14 ]

Hit this error in zfs testing

https://maloo.whamcloud.com/test_sets/9235abcc-9f18-11e3-934b-52540035b04c

Comment by Jian Yu [ 17/Mar/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/73/ (2.4.3 RC1)
Distro/Arch: RHEL6.4/x86_64
MDSCOUNT=4

https://maloo.whamcloud.com/test_sets/00d71476-abee-11e3-bcad-52540035b04c

Comment by Bob Glossman (Inactive) [ 07/Apr/14 ]

another, in b2_5
https://maloo.whamcloud.com/test_sets/0cf3118c-be8c-11e3-b5bd-52540035b04c

Comment by Amir Shehata (Inactive) [ 23/Sep/14 ]

Digging through some of the dmesg log files I see the error messages below. Some of these errors are repeated all over the log. Someone more familiar with the Lustre codebase will probably need to take a look and determine the cause of these errors.

LustreError: 11-0: lustre-MDT0000-mdc-ffff880037c15400: Communicating with 10.2.4.158@tcp, operation mds_reint failed with -19.
LustreError: 17968:0:(file.c:171:ll_close_inode_openhandle()) inode 144115205272520296 mdc close failed: rc = -4
LustreError: 166-1: MGC10.2.4.158@tcp: Connection to MGS (at 10.2.4.158@tcp) was lost; in progress operations using this service will fail
LustreError: 21700:0:(lmv_obd.c:1524:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff880037c15400), error -5
LustreError: 21700:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -5
LustreError: 167-0: lustre-OST0000-osc-ffff880037c15400: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
LustreError: 11-0: MGC10.2.4.158@tcp: Communicating with 10.2.4.158@tcp, operation obd_ping failed with -107.
LustreError: 26751:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 801 sleeping for 20000ms
LustreError: 26751:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 801 awake
LustreError: 11-0: lustre-OST0000-osc-ffff8800701a3400: Communicating with 10.2.4.159@tcp, operation ost_connect failed with -16.

LustreError: 167-0: lustre-OST0000-osc-ffff8800701a3400: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
LustreError: 29724:0:(osc_lock.c:833:osc_ldlm_completion_ast()) lock@ffff88007ac36ed0[3 3 0 1 1 00000000] W(2):[0, 18446744073709551615]@[0x100000000:0xe22:0x0] {
LustreError: 29724:0:(osc_lock.c:833:osc_ldlm_completion_ast())     lovsub@ffff88006cec7a20: [0 ffff88007bd6a470 W(2):[0, 18446744073709551615]@[0x200002b11:0x14:0x0]] 
LustreError: 29724:0:(osc_lock.c:833:osc_ldlm_completion_ast())     osc@ffff88006efaca28: ffff88007cf06d40    0x20040000001 0x27bfcba3c3cbb9cb 3 ffff8800698c9dd8 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
LustreError: 29724:0:(osc_lock.c:833:osc_ldlm_completion_ast()) } lock@ffff88007ac36ed0
LustreError: 29724:0:(osc_lock.c:833:osc_ldlm_completion_ast()) dlmlock returned -5
LustreError: 29724:0:(ldlm_resource.c:809:ldlm_resource_complain()) lustre-OST0000-osc-ffff8800701a3400: namespace resource [0xe22:0x0:0x0].0 (ffff88006af141c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 29724:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0xe22:0x0:0x0].0 (ffff88006af141c0) refcount = 2
LustreError: 29724:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
LustreError: 29724:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: lustre-OST0000-osc-ffff8800701a3400 lock: ffff88007cf06d40/0x27bfcba3c3cbb9cb lrc: 4/0,1 mode: PW/PW res: [0xe22:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x126400000000 nid: local remote: 0xbb105070585d6d1c expref: -99 pid: 29703 timeout: 0 lvb_type: 1
LustreError: 29703:0:(cl_lock.c:1422:cl_unuse_try()) result = -5, this is unlikely!
Lustre: lustre-OST0000-osc-ffff8800701a3400: Connection restored to lustre-OST0000 (at 10.2.4.159@tcp)
LustreError: 29703:0:(cl_lock.c:1437:cl_unuse_locked()) lock@ffff88007ac36df8[2 0 0 1 0 00000000] W(2):[0, 18446744073709551615]@[0x200002b11:0x14:0x0] {
LustreError: 29703:0:(cl_lock.c:1437:cl_unuse_locked())     vvp@ffff880037c86650: 
LustreError: 29703:0:(cl_lock.c:1437:cl_unuse_locked())     lov@ffff88007bd6a470: 1
LustreError: 29703:0:(cl_lock.c:1437:cl_unuse_locked())     0 0: ---
LustreError: 29703:0:(cl_lock.c:1437:cl_unuse_locked()) 
LustreError: 29703:0:(cl_lock.c:1437:cl_unuse_locked()) } lock@ffff88007ac36df8
LustreError: 29703:0:(cl_lock.c:1437:cl_unuse_locked()) unuse return -5

LustreError: 9132:0:(mdc_locks.c:918:mdc_enqueue()) ldlm_cli_enqueue: -2
Comment by Doug Oucharek (Inactive) [ 01/Oct/14 ]

I thought that when we test lnet_selftest, we only load the lnet and lnet_selftest modules. Those errors are from Lustre. Why is Lustre running during an lnet_selftest?

Comment by Amir Shehata (Inactive) [ 03/Oct/14 ]

Continuing to look at the LustreErrors in the client dmesg for the latest failures. Currently looking at:
https://maloo.whamcloud.com/test_sessions/2bf8aa7e-4ad7-11e4-8d48-5254006e85c2

LustreError: 11-0: lustre-MDT0000-mdc-ffff88007c2b8800: Communicating with 10.2.4.131@tcp, operation obd_ping failed with -107. 
## 107 ENOTCONN
LustreError: 166-1: MGC10.2.4.131@tcp: Connection to MGS (at 10.2.4.131@tcp) was lost; in progress operations using this service will fail
LustreError: 2373:0:(client.c:2804:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88007bb64c00 x1480913582033212/t4294967470(4294967470) o101->lustre-MDT0000-mdc-ffff88007c2b8800@10.2.4.131@tcp:12/10 lens 632/544 e 0 to 0 dl 1412309449 ref 2 fl Interpret:RP/4/0 rc 301/301
LustreError: 5833:0:(mgc_request.c:517:do_requeue()) failed processing log: -5
LustreError: 11-0: lustre-MDT0000-mdc-ffff88007c2b8800: Communicating with 10.2.4.131@tcp, operation ldlm_enqueue failed with -95.

LustreError: 22804:0:(fail.c:132:__cfs_fail_timeout_set()) cfs_fail_timeout id 412 sleeping for 1000ms
LustreError: 22804:0:(fail.c:136:__cfs_fail_timeout_set()) cfs_fail_timeout id 412 awake

LustreError: 29859:0:(osc_request.c:1983:osc_build_rpc()) prep_req failed: -12

LustreError: 133-1: lustre-OST0004-osc-ffff88007c2b8800: BAD READ CHECKSUM: from 10.2.4.132@tcp inode [0x0:0x0:0x0] object 0x0:1925 extent [0-1048575]

These errors are repeated throughout the log.

One observation is that all the errors are on the tcp network, not IB. I'm not familiar with the test configuration, so that could simply be because the test runs only on the tcp network.

On the server side I see the following LustreErrors:

LustreError: 11-0: lustre-MDT0000-lwp-OST0000: Communicating with 10.2.4.95@tcp, operation mds_connect failed with -16.
LustreError: 11-0: MGC10.2.4.95@tcp: Communicating with 10.2.4.95@tcp, operation obd_ping failed with -107.
LustreError: 137-5: lustre-OST0006_UUID: not available for connect from 10.2.4.98@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: 31991:0:(ldlm_lib.c:2106:target_stop_recovery_thread()) lustre-OST0000: Aborting recovery
LustreError: 168-f: BAD WRITE CHECKSUM: lustre-OST0000 from 12345-10.2.4.98@tcp inode [0x200000bd0:0xe626:0x0] object 0x0:1899 extent [0-1048575]: client csum f1e7c95b, server csum f1e7c95a
Comment by Amir Shehata (Inactive) [ 09/Oct/14 ]

We created a patch where lnet-selftest.sh unmounts the file system (which it already does) and unloads the lustre modules (new), then ran lnet-selftest.sh 22 times, and it passed. I would like to commit that change to see if it resolves the errors that are occurring.

http://review.whamcloud.com/#/c/12214/
https://testing.hpdd.intel.com/test_sessions/f9be7532-4f0b-11e4-92ea-5254006e85c2

Comment by Amir Shehata (Inactive) [ 16/Oct/14 ]

I looked at the test failure from Jian Yu on 4/Oct/14 and it looks like there are many 108 (ESHUTDOWN) errors on one of the nodes, which could indicate that the lnet/selftest modules are being unloaded before the test is completed. This would result in the lst failure.

Does the test system attempt to unload modules after the test? That could explain the errors. As mentioned before, when I ran lnet-selftest on its own 20 consecutive times, everything worked.

Comment by John Hammond [ 17/Oct/14 ]

Please see http://review.whamcloud.com/12332.

Comment by Jodi Levi (Inactive) [ 29/Oct/14 ]

http://review.whamcloud.com/#/c/12469/

Comment by nasf (Inactive) [ 31/Oct/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/034d0816-60b4-11e4-a66b-5254006e85c2

Comment by nasf (Inactive) [ 01/Nov/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/94b38298-61a0-11e4-8b49-5254006e85c2

Comment by Jodi Levi (Inactive) [ 04/Nov/14 ]

Patch landed to Master.

Comment by John Hammond [ 12/Nov/14 ]

The patch landed was a debug patch.

Comment by John Hammond [ 12/Nov/14 ]

I saw an instance of this today with the new debug patch.

https://testing.hpdd.intel.com/test_sets/8f6d3750-6a53-11e4-a7b0-5254006e85c2
https://testing.hpdd.intel.com/test_logs/9b2922d4-6a53-11e4-a7b0-5254006e85c2

00000001:02000400:1.0:1415666453.647581:0:10468:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: -----============= acceptance-small: lnet-selftest ============----- Tue Nov 11 00:40:53 UTC 2014
00000001:02000400:1.0:1415666454.125144:0:10613:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark == lnet-selftest test complete, duration -o sec == 00:40:53 \(1415666453\)
00000001:02000400:0.0:1415666454.269086:0:10672:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: == lnet-selftest test complete, duration -o sec == 00:40:53 (1415666453)
00000001:02000400:0.0:1415666454.844848:0:10856:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
00000001:02000400:0.0:1415666455.483429:0:11174:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark excepting tests:
00000001:02000400:1.0:1415666455.618475:0:11232:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: excepting tests:

00000001:02000400:0.0:1415666458.017200:0:11346:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark == lnet-selftest test smoke: lst regression test == 00:40:57 \(1415666457\)
00000001:02000400:0.0:1415666458.154280:0:11404:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: == lnet-selftest test smoke: lst regression test == 00:40:57 (1415666457)
00000001:02000400:0.0:1415666458.424825:0:11506:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
00000001:02000400:1.0:1415666459.001668:0:11759:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u

00000400:00020000:1.0:1415666764.476159:0:414:0:(rpc.c:1470:srpc_lnet_ev_handler()) ev->status = -125, ev->type = 5, errors = 1, rpcs_sent = 17427, rpcs_rcvd = 16841, rpcs_dropped = 0, rpcs_expired = 0
00000001:02000400:1.0:1415666768.744246:0:12074:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark  lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found
00000001:02000400:1.0:1415666768.878468:0:12132:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found
00000001:02000400:1.0:1415666769.107358:0:12233:0:(debug.c:345:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2014-11-10/lustre-reviews-el6-x86_64--review-ldiskfs--2_4_1__28281__-70078148216900-145241/lnet-selftest.test_smoke.debug_log.$(hostname -\
s).1415666768.log;

ev->type == 5 == LNET_EVENT_SEND

t:lustre-release# errno 125
#define ECANCELED       125     /* Operation Canceled */

Since the status is -ECANCELED and is reported about 300 seconds after the start of test_smoke, this supports the idea that the error is due to the test being shut down.
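
For reference, the decode above can be double-checked from user space; this one-liner (assuming python3 is available on the node) prints the Linux value and message for ECANCELED:

```shell
# ECANCELED is errno 125 on Linux, so ev->status = -125 means the RPC
# was cancelled rather than failing on the wire.
python3 -c 'import errno, os; print(errno.ECANCELED, os.strerror(errno.ECANCELED))'
```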

Comment by Jian Yu [ 20/Nov/14 ]

More instance on master: https://testing.hpdd.intel.com/test_sets/c9b21b00-70bc-11e4-95d2-5254006e85c2

Comment by Amir Shehata (Inactive) [ 24/Nov/14 ]

I looked at the other instances of this issue over the past few days, and it appears that the failure is exactly the same across the different tests:

ev->status = -125 and ev->type = 5

Also, the failure rate hasn't really changed since the patch landed: approximately 1 failure/day.

ECANCELED is set when the operation is canceled, either due to a test being shut down or the selftest module being unloaded. I don't think there are any issues with selftest, as far as I can tell.

Another possibility is that it could be a test system issue.

All occurrences of the added debug message are towards the end of the log file. Could it be that the test system is shutting down the test after it thinks it is complete, but the traffic hasn't been stopped yet, causing this error to occur?

Comment by nasf (Inactive) [ 30/Nov/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/8f027262-76ce-11e4-b1ab-5254006e85c2

Comment by Jian Yu [ 01/Dec/14 ]

"Could it be that the test system is shutting down the test after it thinks it is complete, but the traffic hasn't been stopped yet, causing this error to occur?"

In lnet-selftest.sh test_smoke():

    run_lst $runlst | tee $log
    rc=${PIPESTATUS[0]}
    [ $rc = 0 ] || error "$runlst failed: $rc"
    
    lst_end_session --verbose | tee -a $log

We can add some code before running lst_end_session to make sure the sessions have really finished. It looks like the changes need to be made in the following code in test_smoke_sub():

    echo $LST run b
    echo sleep 1
    echo "$LST stat --delay 10 --timeout 10 c s &"
    echo 'pid=$!'
    echo 'trap "cleanup $pid" INT TERM'
    echo sleep $smoke_DURATION
    echo 'cleanup $pid'
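
One way to implement that check is a small polling helper in lnet-selftest.sh. The sketch below is an assumption, not a fix that actually landed: the wait_for_idle name and the idea of polling an lst status command are hypothetical, with "true" standing in for the real check.

```shell
# Hypothetical helper (not part of lnet-selftest.sh): retry a status
# command until it succeeds, so lst_end_session is not called while
# test RPCs are still in flight.  In the real script the polled
# command could be something like: $LST query b | grep -q -i idle
wait_for_idle() {
    retries=$1; shift
    i=0
    while [ "$i" -lt "$retries" ]; do
        if "$@"; then
            return 0            # batch has gone idle
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1                    # gave up waiting; end the session anyway
}

# Demonstration with "true" standing in for the real status check:
wait_for_idle 3 true && echo "batch idle"
```

This keeps smoke_DURATION unchanged while giving in-flight RPCs a bounded grace period before the session is torn down.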
Comment by Jodi Levi (Inactive) [ 12/Dec/14 ]

Isaac,
Can you please take this one?
Thank you

Comment by Isaac Huang (Inactive) [ 19/Dec/14 ]

I believe the root cause was that there's no way for a script to know whether tests are still running when end_session is called. In other words, a script can't figure out when to call end_session. lst was mainly designed for interactive use: when the r/w stats dwindle down to near zero, the tests have completed and it's time for further actions (e.g. running more tests, or end_session and stop).

The fix here is not to count it as an error if lst tests are stopped by administrative actions (e.g. end_session from the remote test console, or a local root user unloading the lnet_selftest kernel module). These aren't network errors, and shouldn't be counted anyway.

Increasing smoke_DURATION would reduce the chances of hitting such bogus errors, but not eliminate them. It also comes at the cost of unnecessarily increasing test run time, so I wouldn't recommend it.
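
The counting rule Isaac describes can be illustrated with a small user-space sketch. This is only an illustration of the idea, not the actual kernel change from http://review.whamcloud.com/13279; count_errors is a hypothetical stand-in, taking RPC completion statuses as seen in the logs.

```shell
# Hypothetical stand-in for the kernel-side error counting: status
# -125 (-ECANCELED) means an administrative stop (end_session or
# module unload) and is not counted; other non-zero statuses are.
count_errors() {
    errors=0
    for status in "$@"; do
        case $status in
            0)    ;;                         # success
            -125) ;;                         # ECANCELED: administrative stop
            *)    errors=$((errors + 1)) ;;  # genuine failure, e.g. -110 (ETIMEDOUT)
        esac
    done
    echo "$errors"
}

count_errors 0 -125 -110    # prints 1: only the -110 timeout counts
```

With the statuses from the original report, only the -110 timeouts would count; a run stopped by end_session contributes nothing.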

Comment by Andreas Dilger [ 05/Jan/15 ]

Since test_smoke is just to verify that lnet selftest is working, it would probably be better to reduce the amount of work that is being done, instead of increasing the test duration further (it is already taking 2000s to complete).

It might be possible to make the test size conditional on whether it is running in a VM or not.

Comment by Sarah Liu [ 05/Jan/15 ]

Hit this in tag-2.6.92 zfs testing:

https://testing.hpdd.intel.com/test_sets/d45d80fc-9363-11e4-b7aa-5254006e85c2

Comment by Gerrit Updater [ 07/Jan/15 ]

Isaac Huang (he.huang@intel.com) uploaded a new patch: http://review.whamcloud.com/13279
Subject: LU-4181 lnet_selftest: bogus lst errors
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 827440ea6db076aceb2be3a65fd17cbcdcee0c2e

Comment by Gerrit Updater [ 16/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13279/
Subject: LU-4181 lnet_selftest: bogus lst errors
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8cfa7f80d3975298d5738091da7d9c28d6f5c9f5

Comment by Sarah Liu [ 19/Jan/15 ]

Didn't see this issue in build #2821 (which includes the fix) or #2822; verified.

Comment by Peter Jones [ 19/Jan/15 ]

Thanks Sarah.

Comment by Sarah Liu [ 13/Apr/15 ]

Hit this master branch build #2983 with EL7 client
https://testing.hpdd.intel.com/test_sets/f7370aee-df8a-11e4-bf2e-5254006e85c2

[LNet Rates of c]
[R] Avg: 272      RPC/s Min: 272      RPC/s Max: 272      RPC/s
[W] Avg: 265      RPC/s Min: 265      RPC/s Max: 265      RPC/s
[LNet Bandwidth of c]
[R] Avg: 23.06    MB/s  Min: 23.06    MB/s  Max: 23.06    MB/s
[W] Avg: 20.04    MB/s  Min: 20.04    MB/s  Max: 20.04    MB/s
killing 5477 ...
RPC failure, can't show error on 12345-10.2.4.205@tcp
12345-10.2.4.206@tcp: [Session 0 brw errors, 0 ping errors] [RPC: 2 errors, 0 dropped, 0 expired]
c:
Total 2 error nodes in c
12345-10.2.4.203@tcp: [Session 120 brw errors, 15 ping errors] [RPC: 62 errors, 58 dropped, 131 expired]
12345-10.2.4.204@tcp: [Session 120 brw errors, 15 ping errors] [RPC: 1 errors, 0 dropped, 135 expired]
s:
Total 2 error nodes in s
stop batch RPC failed on 12345-10.2.4.205@tcp: Unknown error -110
session is ended
Total 2 error nodes in c
Total 2 error nodes in s
 lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found 
Comment by Isaac Huang (Inactive) [ 13/Apr/15 ]

This appeared to be a different issue: while previously the errors were bogus, here there were more than one hundred expired RPCs.

Comment by Andreas Dilger [ 05/Jan/16 ]

Closing this old ticket again. New test failures are being tracked under LU-6622.

Generated at Sat Feb 10 01:40:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.