Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Fix Version: Lustre 2.8.0
- Labels: None
- Environment: review-dne-part-2 in autotest
- Severity: 3
Description
After all the tests in sanity-lfsck run, the suite hangs at umount for /mnt/mds4. All subtests are marked as PASS. Logs are at https://testing.hpdd.intel.com/test_sets/fb298f72-6a05-11e5-9d0a-5254006e85c2
In the suite_stdout log, the last thing we see is
13:48:04:CMD: shadow-23vm8 grep -c /mnt/mds4' ' /proc/mounts
13:48:04:Stopping /mnt/mds4 (opts:-f) on shadow-23vm8
13:48:04:CMD: shadow-23vm8 umount -d -f /mnt/mds4
14:47:53:********** Timeout by autotest system **********
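For anyone poking at the hang by hand, here is a minimal sketch of the same stop sequence, assuming shadow-23vm8 is still reachable over ssh and /mnt/mds4 is the MDT mount point shown above; it only mirrors the check-and-umount commands visible in suite_stdout, not the full test-framework stop() logic:

# check whether the MDT is still listed in /proc/mounts on the MDS node
ssh shadow-23vm8 "grep -c '/mnt/mds4 ' /proc/mounts"
# force-unmount it; this is the command that never returns in the hung run
ssh shadow-23vm8 "umount -d -f /mnt/mds4"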
Looking at the logs for the last test run, sanity-lfsck test_31h, something is clearly not functioning correctly, though exactly what went wrong is not obvious to me. The MDS2, MDS3, and MDS4 console logs show far more inter-MDT communication problems than are normal for this test:
14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
14:23:44:Lustre: Skipped 226 previous similar messages
14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
14:23:44:LustreError: Skipped 585 previous similar messages
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443880351/real 1443880351] req@ffff880058aa2080 x1514012107003420/t0(0) o250->MGC10.1.5.33@tcp@10.1.5.33@tcp:26/25 lens 520/544 e 0 to 1 dl 1443880376 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) Skipped 50 previous similar messages
14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
14:23:44:Lustre: Skipped 475 previous similar messages
14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
14:23:44:LustreError: Skipped 1191 previous similar messages
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443880961/real 1443880961] req@ffff880058a93cc0 x1514012107006512/t0(0) o38->lustre-MDT0000-osp-MDT0003@10.1.5.33@tcp:24/4 lens 520/544 e 0 to 1 dl 1443880986 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) Skipped 40 previous similar messages
14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
14:23:44:Lustre: Skipped 475 previous similar messages
14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
14:23:44:LustreError: Skipped 1191 previous similar messages
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443881581/real 1443881581] req@ffff88007cb1fc80 x1514012107009652/t0(0) o250->MGC10.1.5.33@tcp@10.1.5.33@tcp:26/25 lens 520/544 e 0 to 1 dl 1443881606 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) Skipped 40 previous similar messages
14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
14:23:44:Lustre: Skipped 475 previous similar messages
14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
14:23:44:LustreError: Skipped 1189 previous similar messages
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443882191/real 1443882191] req@ffff88006705b980 x1514012107012736/t0(0) o38->lustre-MDT0000-osp-MDT0003@10.1.5.33@tcp:24/4 lens 520/544 e 0 to 1 dl 1443882216 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
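The [sent .../real ...] values in the ptlrpc_expire_one_request() lines are Unix epoch timestamps, so they can be converted to wall-clock time to see how long the failed reconnect attempts went on; a quick sketch (any host with GNU date):

# convert the first and last 'sent' timestamps from the excerpt above to UTC
date -u -d @1443880351
date -u -d @1443882191
# 1443882191 - 1443880351 = 1840 s, i.e. the timed-out requests span roughly 31 minutes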
There are also stack traces in the console logs for all nodes for test 31h, but there are no Lustre function calls anywhere close to the top of the stack traces.
There have been three recent occurrences of this issue, with logs at:
https://testing.hpdd.intel.com/test_sets/42501ed6-69be-11e5-9fbf-5254006e85c2
https://testing.hpdd.intel.com/test_sets/60742a32-6a19-11e5-9fbf-5254006e85c2
https://testing.hpdd.intel.com/test_sets/fb298f72-6a05-11e5-9d0a-5254006e85c2
all on 2015-10-03 in review-dne-part-2.
Issue Links
- is duplicated by
  - LU-7793 sanity-lfsck test_23a fails with '(10) unexpected status' (Closed)
- is related to
  - LU-7221 replay-ost-single test_3: ASSERTION( __v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a) ) failed: value: 0 (Resolved)
  - LU-7648 split lctl lfsck sub-commands to new man pages (Resolved)