Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.10.4, Lustre 2.10.5, Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2, Lustre 2.12.3, Lustre 2.12.4
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
In several test sessions, racer completes with no test failures, but the test suite fails.
One recent example of this failure is at
https://testing.whamcloud.com/test_sets/b7b5a9a4-9911-11e8-b0aa-52540065bddc
The test_log shows that the failure is raised in test-framework.sh, after racer itself has finished with rc=0:
We survived /usr/lib64/lustre/tests/racer/racer.sh for 900 seconds.
pid=27203 rc=0
/usr/lib64/lustre/tests/racer.sh: line 51: 5239 Terminated $LUSTRE/tests/racer/lss_create.sh
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
Trace dump:
= /usr/lib64/lustre/tests/racer/lss_destroy.sh:1:main()
racer: FAIL: test-framework exiting on error
/usr/lib64/lustre/tests/racer.sh: line 116: 5241 Terminated $LUSTRE/tests/racer/lss_destroy.sh
Cleaning test environment ...
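The `kill: usage:` line in the log above is what bash prints when `kill` is called without any PID or jobspec. One possible reading, which the log does not confirm, is that a cleanup step ran `kill` on an empty PID list and the resulting non-zero status was treated as a test error. The sketch below only illustrates that shell behaviour; `safe_kill` and `pids` are hypothetical names and are not taken from racer.sh or test-framework.sh.

```shell
#!/bin/bash
# Hypothetical helper (not from the Lustre test scripts): send a signal
# only when at least one PID was actually passed in.
safe_kill() {
    local sig=$1
    shift
    # With no PIDs left, the kill builtin prints its usage text and
    # returns non-zero, so skip the call entirely.
    if [ $# -eq 0 ]; then
        return 0
    fi
    # Ignore "no such process" for helpers that have already exited.
    kill "-$sig" "$@" 2>/dev/null || true
}

# Assumed scenario: the PID list is empty by the time cleanup runs.
pids=""

# Deliberately unquoted so the empty list expands to no arguments.
if ! kill -TERM $pids 2>/dev/null; then
    echo "bare kill failed on an empty PID list"
fi

safe_kill TERM $pids && echo "safe_kill tolerated the empty PID list"
```

If this reading is right, a guard of this kind would turn the empty-list case into a no-op instead of an error; whether that is the appropriate fix depends on why the list was empty in the first place.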
The client console log for vm2 shows several errors:
/usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer
[41111.677988] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000404:0x16:0x0], use llapi_layout_get_by_path()
[41116.755137] LustreError: 24455:0:(lcommon_cl.c:181:cl_file_inode_init()) Failure to initialize cl object [0x200000403:0x59:0x0]: -16
[41119.312286] 0[28282]: segfault at 8 ip 00007f770a3c1958 sp 00007ffe220c1600 error 4 in ld-2.17.so[7f770a3b6000+22000]
[41126.240557] 16[15401]: segfault at 8 ip 00007f45e23a5958 sp 00007ffc9a44b300 error 4 in ld-2.17.so[7f45e239a000+22000]
[41137.883169] Lustre: 24030:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1533454793/real 1533454793] req@ffff9760df0ac600 x1607938745861680/t0(0) o36->lustre-MDT0000-mdc-ffff9760da4b0800@10.9.4.214@tcp:12/10 lens 608/33520 e 0 to 1 dl 1533454800 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[41137.886206] Lustre: lustre-MDT0000-mdc-ffff9760da4b0800: Connection to lustre-MDT0000 (at 10.9.4.214@tcp) was lost; in progress operations using this service will wait for recovery to complete
[41137.893377] Lustre: lustre-MDT0000-mdc-ffff9760da4b0800: Connection restored to 10.9.4.214@tcp (at 10.9.4.214@tcp)
[41137.895228] Lustre: Skipped 1 previous similar message
[41153.890232] 1[2862]: segfault at 8 ip 00007f8e5ca81958 sp 00007ffef3527a80 error 4 in ld-2.17.so[7f8e5ca76000+22000]
[41382.588596] 13[20095]: segfault at 8 ip 00007f615b332958 sp 00007ffe28a69740 error 4 in ld-2.17.so[7f615b327000+22000]
[41461.276914] LustreError: 10972:0:(lcommon_cl.c:181:cl_file_inode_init()) Failure to initialize cl object [0x200000403:0x831:0x0]: -16
[41790.723927] Lustre: 23148:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1533455409/real 1533455409] req@ffff9760e84a3c00 x1607938748994256/t0(0) o36->lustre-MDT0000-mdc-ffff9760ebc0d800@10.9.4.214@tcp:12/10 lens 608/33520 e 0 to 1 dl 1533455453 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[41790.726799] Lustre: lustre-MDT0000-mdc-ffff9760ebc0d800: Connection to lustre-MDT0000 (at 10.9.4.214@tcp) was lost; in progress operations using this service will wait for recovery to complete
[41790.733483] Lustre: lustre-MDT0000-mdc-ffff9760ebc0d800: Connection restored to 10.9.4.214@tcp (at 10.9.4.214@tcp)
[41886.854201] 10[15730]: segfault at 8 ip 00007fb1ccb98958 sp 00007ffe20b24c50 error 4
[41886.854226] 10[15008]: segfault at 8 ip 00007f91e0c82958 sp 00007ffc925a0670 error 4 in ld-2.17.so[7f91e0c77000+22000]
[41886.856167] in ld-2.17.so[7fb1ccb8d000+22000]
[41890.860922] LustreError: 20803:0:(lcommon_cl.c:181:cl_file_inode_init()) Failure to initialize cl object [0x200000403:0xf89:0x0]: -16
[41907.391169] 16[18590]: segfault at 8 ip 00007fb0cd822958 sp 00007ffe5c9d9a60 error 4 in ld-2.17.so[7fb0cd817000+22000]
The client console log for vm1 shows similar errors:
[41113.825182] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000402:0x50:0x0], use llapi_layout_get_by_path()
[41119.485080] 13[7719]: segfault at 8 ip 00007fb755e1b958 sp 00007fffe348efb0 error 4 in ld-2.17.so[7fb755e10000+22000]
[41222.166818] 15[9481]: segfault at 8 ip 00007fedc19f0958 sp 00007fff92668070 error 4 in ld-2.17.so[7fedc19e5000+22000]
[41386.590121] 7[9668]: segfault at 8 ip 00007ff3a2be0958 sp 00007ffe4c639400 error 4 in ld-2.17.so[7ff3a2bd5000+22000]
[41567.653415] 0[4503]: segfault at 8 ip 00007f18555e2958 sp 00007ffe555ae4a0 error 4 in ld-2.17.so[7f18555d7000+22000]
[41623.843544] 4[29403]: segfault at 8 ip 00007fe31436a958 sp 00007fff42768840 error 4 in ld-2.17.so[7fe31435f000+22000]
[41994.067094] 5[20666]: segfault at 0 ip 0000000000403e5f sp 00007ffc69d4a4e0 error 6 in 5[400000+6000]
[42082.264466] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
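When comparing console logs across the affected sessions, it can help to tally the segfault entries by faulting address. The following is a minimal sketch, assuming the console log has been saved to a local file; the function name `count_segfaults_by_addr` is hypothetical. It uses `grep -o`, so it also works when several log entries have been run together on one line.

```shell
#!/bin/bash
# Hypothetical triage helper: print "<fault address> <count>" for every
# distinct "segfault at <addr>" found in a saved console log.
count_segfaults_by_addr() {
    # grep -o emits one line per match, even if a log line holds many entries.
    grep -o 'segfault at [0-9a-f]*' "$1" |
        sort |
        uniq -c |
        awk '{ print $4, $1 }'   # $1 = count, $4 = fault address
}

# Run against the file given on the command line, if any.
if [ $# -gt 0 ]; then
    count_segfaults_by_addr "$1"
fi
```

Applied to the two logs above, this would separate the faults at address 8 inside ld-2.17.so from the single fault at address 0, which is reported with a different instruction pointer and error code.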
So far, this has only been seen in ZFS testing.
I think all these error messages have been reported in other racer test failure tickets.
Other sessions with the same failure:
https://testing.whamcloud.com/test_sets/d3254156-9743-11e8-b0aa-52540065bddc
https://testing.whamcloud.com/test_sets/eace6cec-961c-11e8-8ee3-52540065bddc
https://testing.whamcloud.com/test_sets/688ba02e-90f9-11e8-87f3-52540065bddc
https://testing.whamcloud.com/test_sets/3a37c110-860b-11e8-808e-52540065bddc
https://testing.whamcloud.com/test_sets/df9509ec-6b99-11e8-a522-52540065bddc