Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Affects Version/s: Lustre 2.11.0
    • Labels: None
    • Environment: onyx, full
      servers: el7, zfs, branch master, v2.10.53.1, b3642
      clients: el7, branch master, v2.10.53.1, b3642
    • Severity: 3

    Description

      https://testing.hpdd.intel.com/test_sessions/ecf02eea-e0bb-48ce-a644-fed2036c2478

      From suite_log:

      Starting ost1:   lustre-ost1/ost1 /mnt/lustre-ost1
      CMD: onyx-35vm12 mkdir -p /mnt/lustre-ost1; mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1
      onyx-35vm12: mount.lustre: mount lustre-ost1/ost1 at /mnt/lustre-ost1 failed: Cannot send after transport endpoint shutdown
      

      The subsequent three test sets (sanityn, sanity-hsm, and sanity-lsnapshot) also ran no subtests, and their suite logs end with this message:

      according to /etc/mtab lustre-mdt1/mdt1 is already mounted on /mnt/lustre-mds1
      

          Activity

            [LU-10045] sanity-lfsck no sub tests failed

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31301/
            Subject: LU-10045 obdclass: multiple try when register target
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: aa99a7bb77cce480ff5753238d857a0eb797e5fe

            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31301
            Subject: LU-10045 obdclass: multiple try when register target
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 44bb3ee8961d13618e6d670d6e3005c2729a723e

            mdiep Minh Diep added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30761/
            Subject: LU-10045 obdclass: multiple try when register target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 79bfc74869e3f7b052874f4585399c5ba7f599e9

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30761
            Subject: LU-10045 mgc: multiple try when register target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d62d4132162c15cacc260e5d27abc2522f59d72d

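            The patch subject above describes the fix as retrying target registration rather than failing the mount on the first transient MGS error. Below is a minimal sketch of that retry idea in C; the helper name and retry count are hypothetical, and the real change lives in the obdclass/mgc code of the patches linked above.

            #include <errno.h>
            #include <stdio.h>
            #include <unistd.h>

            #define REGISTER_RETRY_MAX 3   /* illustrative value, not taken from the patch */

            /* Hypothetical stand-in for the target-registration RPC to the MGS.
             * It fails twice with -ESHUTDOWN ("Cannot send after transport
             * endpoint shutdown") and then succeeds, to exercise the retry path. */
            static int hypothetical_register_target(void)
            {
                    static int calls;

                    return ++calls < 3 ? -ESHUTDOWN : 0;
            }

            /* The "multiple try" idea: treat a dropped MGS connection as transient
             * and retry registration a few times before giving up on the mount. */
            static int register_target_with_retry(void)
            {
                    int rc = 0;
                    int i;

                    for (i = 0; i < REGISTER_RETRY_MAX; i++) {
                            rc = hypothetical_register_target();
                            if (rc == 0 || (rc != -ESHUTDOWN && rc != -EAGAIN))
                                    break;  /* success, or a non-transient error */
                            sleep(1);       /* back off briefly before retrying */
                    }
                    return rc;
            }

            int main(void)
            {
                    printf("register_target_with_retry() = %d\n", register_target_with_retry());
                    return 0;
            }
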
            yong.fan nasf (Inactive) added a comment -

            The reason is described in this comment: https://jira.hpdd.intel.com/browse/LU-10406?focusedCommentId=217655&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-217655

            yong.fan nasf (Inactive) added a comment -

            The OST hit trouble during umount in the former tests:
            https://testing.hpdd.intel.com/test_logs/26247ed8-d4fa-11e7-a066-52540065bddc/show_text

            [11519.992843] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
            [11521.139399] LustreError: 22266:0:(ldlm_resource.c:1094:ldlm_resource_complain()) lustre-MDT0000-lwp-OST0000: namespace resource [0x200000006:0x20000:0x0].0x0 (ffff88005598f6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
            [11521.146796] LustreError: 22266:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000006:0x20000:0x0].0x0 (ffff88005598f6c0) refcount = 2
            [11521.153245] LustreError: 22266:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order):
            [11521.157006] LustreError: 22266:0:(ldlm_resource.c:1682:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-OST0000 lock: ffff880056b8ca00/0xd8984bc587bddb59 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x20000:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0x2a5bf4985ef188db expref: -99 pid: 21127 timeout: 0 lvb_type: 2
            [11546.656082] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948075/real 1511948075]  req@ffff880048344000 x1585380626416000/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.9.4.127@tcp:12/10 lens 520/544 e 0 to 1 dl 1511948100 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [11546.663986] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
            [11560.656077] LustreError: 166-1: MGC10.9.4.127@tcp: Connection to MGS (at 10.9.4.127@tcp) was lost; in progress operations using this service will fail
            [11591.445188] LustreError: 22271:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880056642a00 x1585380626416192/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.9.4.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
            [11591.451032] LustreError: 22271:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1020000:0x0], rc:-5
            [11591.661039] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948129/real 1511948129]  req@ffff880056640600 x1585380626416160/t0(0) o250->MGC10.9.4.127@tcp@10.9.4.127@tcp:26/25 lens 520/544 e 0 to 1 dl 1511948145 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [11591.669399] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
            [11665.661060] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948194/real 1511948194]  req@ffff880056642700 x1585380626416256/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.9.4.127@tcp:12/10 lens 520/544 e 0 to 1 dl 1511948219 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [11665.669739] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            [11711.461177] LustreError: 22280:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880056642a00 x1585380626416352/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.9.4.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
            ...
            
            yong.fan nasf (Inactive) added a comment -

            +1 on master: https://testing.hpdd.intel.com/test_sets/25e467c6-d4fa-11e7-a066-52540065bddc

            yong.fan nasf (Inactive) added a comment -

            All the available information is shown in the bug description; there are no more detailed logs. Generally, at the beginning of sanity-lfsck the test scripts reformat and remount the whole system to clean up the test environment, but according to the logs some trouble happened during that process. One possible case is that after the MDT was reformatted, the scripts tried to mount it, but /etc/mtab wrongly recorded that the MDT was already mounted. So the MDT (and MGS) was not really mounted, and the subsequent mount failures happened on the OSTs. As for why /etc/mtab recorded wrong information, it is difficult to know; it may be a side effect of some earlier test cases (in sanity or earlier).

            So unless we can reproduce the trouble with more detailed logs, it is difficult to locate the root cause.
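            As an illustration of the stale /etc/mtab theory, the sketch below uses glibc's getmntent() to show how a leftover entry would make a naive "already mounted" check report the MDT as mounted and skip the real mount, matching the suite-log message above. This is not the Lustre test framework's actual check; the helper is a generic assumption-laden example.

            #include <mntent.h>
            #include <stdio.h>
            #include <string.h>

            /* Return 1 if `device` is listed as a mounted filesystem source in /etc/mtab. */
            static int mtab_says_mounted(const char *device)
            {
                    struct mntent *m;
                    FILE *fp = setmntent("/etc/mtab", "r");
                    int found = 0;

                    if (fp == NULL)
                            return 0;
                    while ((m = getmntent(fp)) != NULL) {
                            if (strcmp(m->mnt_fsname, device) == 0) {
                                    found = 1;
                                    break;
                            }
                    }
                    endmntent(fp);
                    return found;
            }

            int main(void)
            {
                    /* If a previous umount never completed (or mtab was left stale),
                     * this reports "already mounted" even though nothing is mounted,
                     * so a setup step that trusts it would skip the real mount. */
                    if (mtab_says_mounted("lustre-mdt1/mdt1"))
                            printf("according to /etc/mtab lustre-mdt1/mdt1 is already mounted\n");
                    else
                            printf("not mounted; proceeding with mount\n");
                    return 0;
            }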

            pjones Peter Jones added a comment -

            Fan Yong

            Could you please advise on this one?

            Thanks

            Peter


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jcasper James Casper (Inactive)
              Votes: 0
              Watchers: 5
