[LU-7256] sanity-lfsck TIMEOUT on umount /mnt/mds4 Created: 05/Oct/15  Updated: 28/Feb/17  Resolved: 14/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

review-dne-part-2 in autotest


Issue Links:
Duplicate
is duplicated by LU-7793 sanity-lfsck test_23a fails with ‘(10... Closed
Related
is related to LU-7221 replay-ost-single test_3: ASSERTION( ... Resolved
is related to LU-7648 split lctl lfsck sub-commands to new ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After all the tests in sanity-lfsck run, the suite hangs at umount for /mnt/mds4. All subtests are marked as PASS. Logs are at https://testing.hpdd.intel.com/test_sets/fb298f72-6a05-11e5-9d0a-5254006e85c2

In the suite_stdout log, the last thing we see is:

13:48:04:CMD: shadow-23vm8 grep -c /mnt/mds4' ' /proc/mounts
13:48:04:Stopping /mnt/mds4 (opts:-f) on shadow-23vm8
13:48:04:CMD: shadow-23vm8 umount -d -f /mnt/mds4
14:47:53:********** Timeout by autotest system **********
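For anyone trying to triage this by hand on the MDS node, here is a minimal sketch of reproducing the mount check and the stuck umount (the sysrq step is generic kernel debugging, not something autotest does; the path matches the log above):

# Is the target still mounted? (same check the framework runs)
grep -c '/mnt/mds4 ' /proc/mounts

# Same forced unmount the framework issues; background it so we can watch it
umount -d -f /mnt/mds4 &
UMOUNT_PID=$!
sleep 60
if kill -0 "$UMOUNT_PID" 2>/dev/null; then
    # Still running after a minute: dump blocked tasks to the console
    # so the hung stack (hopefully with Lustre frames) is captured.
    echo w > /proc/sysrq-trigger
fi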

Looking at the logs for the last test run, sanity-lfsck test_31h, it's clear that something is not functioning correctly, though exactly what went wrong isn't clear to me. The MDS2, MDS3, and MDS4 consoles show more communication problems between the MDTs than is normal for this test:

14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
14:23:44:Lustre: Skipped 226 previous similar messages
14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
14:23:44:LustreError: Skipped 585 previous similar messages
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443880351/real 1443880351]  req@ffff880058aa2080 x1514012107003420/t0(0) o250->MGC10.1.5.33@tcp@10.1.5.33@tcp:26/25 lens 520/544 e 0 to 1 dl 1443880376 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) Skipped 50 previous similar messages
14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
14:23:44:Lustre: Skipped 475 previous similar messages
14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
14:23:44:LustreError: Skipped 1191 previous similar messages
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443880961/real 1443880961]  req@ffff880058a93cc0 x1514012107006512/t0(0) o38->lustre-MDT0000-osp-MDT0003@10.1.5.33@tcp:24/4 lens 520/544 e 0 to 1 dl 1443880986 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) Skipped 40 previous similar messages
14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
14:23:44:Lustre: Skipped 475 previous similar messages
14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
14:23:44:LustreError: Skipped 1191 previous similar messages
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443881581/real 1443881581]  req@ffff88007cb1fc80 x1514012107009652/t0(0) o250->MGC10.1.5.33@tcp@10.1.5.33@tcp:26/25 lens 520/544 e 0 to 1 dl 1443881606 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) Skipped 40 previous similar messages
14:23:44:Lustre: lustre-MDT0003: Not available for connect from 10.1.5.32@tcp (stopping)
14:23:44:Lustre: Skipped 475 previous similar messages
14:23:44:LustreError: 137-5: lustre-MDT0001_UUID: not available for connect from 10.1.5.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
14:23:44:LustreError: Skipped 1189 previous similar messages
14:23:44:Lustre: 4063:0:(client.c:2092:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1443882191/real 1443882191]  req@ffff88006705b980 x1514012107012736/t0(0) o38->lustre-MDT0000-osp-MDT0003@10.1.5.33@tcp:24/4 lens 520/544 e 0 to 1 dl 1443882216 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

There are also stack traces in the console logs for all of the nodes for test_31h, but no Lustre function calls appear anywhere near the top of those traces.
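For reference, one way to confirm that from the downloaded console logs is to pull the frames following each 'Call Trace' marker and filter for Lustre symbols (the file name here is illustrative):

# Print 15 lines after each kernel stack-trace marker, then look for Lustre frames
grep -A 15 'Call Trace' shadow-23vm8.console.log | grep -Ei 'lustre|ptlrpc|lfsck|mdt_|osp_'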

There have been three recent occurrences of this issue, with logs at:
https://testing.hpdd.intel.com/test_sets/42501ed6-69be-11e5-9fbf-5254006e85c2
https://testing.hpdd.intel.com/test_sets/60742a32-6a19-11e5-9fbf-5254006e85c2
https://testing.hpdd.intel.com/test_sets/fb298f72-6a05-11e5-9d0a-5254006e85c2
all on 2015-10-03 in review-dne-part-2.



 Comments   
Comment by Saurabh Tandan (Inactive) [ 13/Oct/15 ]

Another instance, but in this case the timeout was on umount of /mnt/ost7:
https://testing.hpdd.intel.com/test_sets/9b283adc-6dee-11e5-b960-5254006e85c2

Comment by James Nunez (Inactive) [ 19/Oct/15 ]

sanity-lfsck is frequently timing out on the unmount of mds4. Here are some recent logs:
2015-10-12 06:00:57 - https://testing.hpdd.intel.com/test_sets/e61e890c-70db-11e5-95b7-5254006e85c2
2015-10-13 21:47:32 - https://testing.hpdd.intel.com/test_sets/efa762d4-7228-11e5-b344-5254006e85c2
2015-10-16 18:41:32 - https://testing.hpdd.intel.com/test_sets/3876f938-7487-11e5-b47d-5254006e85c2
2015-10-16 20:01:49 - https://testing.hpdd.intel.com/test_sets/d52a8c14-7477-11e5-8f32-5254006e85c2
2015-10-17 22:08:38 - https://testing.hpdd.intel.com/test_sets/a70cb296-7551-11e5-b12f-5254006e85c2
2015-10-26 12:08:40 - https://testing.hpdd.intel.com/test_sets/5a92f46c-7c12-11e5-88cf-5254006e85c2
2015-10-28 16:28:57 - https://testing.hpdd.intel.com/test_sets/6070d91c-7dca-11e5-bca9-5254006e85c2
2015-11-02 22:50:30 - https://testing.hpdd.intel.com/test_sets/52c3408c-81eb-11e5-846d-5254006e85c2
2015-11-10 11:55:14 - https://testing.hpdd.intel.com/test_sets/d6087414-87e8-11e5-b19e-5254006e85c2
2015-11-11 08:48:35 - https://testing.hpdd.intel.com/test_sets/2ecb581e-8889-11e5-9053-5254006e85c2
2015-11-15 22:30:59 - https://testing.hpdd.intel.com/test_sets/20febbe6-8c20-11e5-ae56-5254006e85c2
2015-11-15 23:51:07 - https://testing.hpdd.intel.com/test_sets/aa5507a0-8c2b-11e5-ae56-5254006e85c2
2015-12-06 22:52:07 - https://testing.hpdd.intel.com/test_sets/7b00656a-9ca4-11e5-8e88-5254006e85c2

Comment by Gerrit Updater [ 01/Dec/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17406
Subject: LU-7256 tests: wait LFSCK to exit before next test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8df8ab7ce104f873ed74bbf59355b7225b42d07f
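The idea, per the subject line, is that a test's LFSCK instance can still be running when the next test (or the final unmount) starts, so the script should poll each MDT's LFSCK status until it leaves the scanning phases. A rough sketch of such a wait, assuming the usual test-framework.sh helpers (do_facet, $LCTL, error); the helper name and 90-second timeout are illustrative, not the patch's actual code:

# Wait up to 90s for an LFSCK component on a facet to stop scanning.
# The status line comes from 'lctl get_param -n mdd.*.<component>',
# e.g. "status: scanning-phase1" while running, "status: completed" when done.
wait_lfsck_idle() {
    local facet=$1
    local comp=$2          # lfsck_layout or lfsck_namespace
    local i
    for i in $(seq 90); do
        do_facet $facet "$LCTL get_param -n mdd.*.$comp" |
            grep -q '^status: scanning' || return 0
        sleep 1
    done
    return 1
}

# e.g., before stopping the server:
# wait_lfsck_idle mds4 lfsck_namespace || error "LFSCK still running on mds4"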

Comment by James Nunez (Inactive) [ 30/Dec/15 ]

More failures on master:
2015-12-29 04:52:47 - https://testing.hpdd.intel.com/test_sets/12276386-ae1a-11e5-aa1f-5254006e85c2
2016-01-06 02:25:20 - https://testing.hpdd.intel.com/test_sets/da54d49c-b464-11e5-aa1f-5254006e85c2
2016-01-11 12:36:38 - https://testing.hpdd.intel.com/test_sets/9fc31ea2-b896-11e5-8c15-5254006e85c2
2016-01-20 12:27:31 - https://testing.hpdd.intel.com/test_sets/f0ef1be0-bf9f-11e5-8f04-5254006e85c2

Comment by Richard Henwood (Inactive) [ 20/Jan/16 ]

another failure on review-dne-part-2:

https://testing.hpdd.intel.com/test_sessions/30aec778-bf4e-11e5-8f04-5254006e85c2

Comment by Richard Henwood (Inactive) [ 20/Jan/16 ]

Also, review-dne-part-2+sanity-lfsck has caused 2 of the last 10 canaries to fail:

https://wiki.hpdd.intel.com/display/~rhenwood/Canary+patch

Comment by nasf (Inactive) [ 21/Jan/16 ]

The patch http://review.whamcloud.com/#/c/17406/ has been refreshed. Please try with that.

Comment by Gerrit Updater [ 13/Mar/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17406/
Subject: LU-7256 tests: wait current LFSCK to exit before next test
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6871453b053d5756aba2321122db4564df5c1c57

Comment by nasf (Inactive) [ 14/Mar/16 ]

The patch has been landed to master.
