[LU-10045] sanity-lfsck no sub tests failed Created: 28/Sep/17  Updated: 19/Mar/18  Resolved: 31/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Critical
Reporter: James Casper Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

onyx, full
servers: el7, zfs, branch master, v2.10.53.1, b3642
clients: el7, branch master, v2.10.53.1, b3642


Issue Links:
Duplicate
is duplicated by LU-10242 parallel-scale no sub tests failed: t... Closed
Related
is related to LU-7690 sanity-lfsck: couldn't mount ost Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/ecf02eea-e0bb-48ce-a644-fed2036c2478

From suite_log:

Starting ost1:   lustre-ost1/ost1 /mnt/lustre-ost1
CMD: onyx-35vm12 mkdir -p /mnt/lustre-ost1; mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1
onyx-35vm12: mount.lustre: mount lustre-ost1/ost1 at /mnt/lustre-ost1 failed: 
  Cannot send after transport endpoint shutdown

The three subsequent test sets (sanityn, sanity-hsm, and sanity-lsnapshot) also run no subtests, and their suite logs end with this message:

according to /etc/mtab lustre-mdt1/mdt1 is already mounted on /mnt/lustre-mds1


 Comments   
Comment by Peter Jones [ 11/Oct/17 ]

Fan Yong

Could you please advise on this one?

Thanks

Peter

Comment by nasf (Inactive) [ 24/Nov/17 ]

All of the available information is in the bug description; there are no more detailed logs. Generally, at the beginning of sanity-lfsck the test scripts reformat and remount the whole system to clean up the test environment, but according to the logs some trouble happened during that process. One possible scenario: after reformatting the MDT, the scripts tried to mount it, but /etc/mtab wrongly recorded the MDT as already mounted. So the MDT (and MGS) was never actually mounted, and the subsequent mount failure happened on the OSTs. As for why /etc/mtab held the wrong information, it is difficult to know; it may be a side effect of some earlier test cases (in sanity or before).

So unless we can reproduce the trouble with more detailed logs, it is difficult to locate the root cause.
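
To illustrate the stale /etc/mtab scenario above: /etc/mtab is userspace bookkeeping and can fall out of sync after a failed or forced unmount, while /proc/mounts reflects what the kernel actually has mounted. A minimal C sketch (hypothetical, not part of the Lustre test scripts) that checks both tables for a device:

/* Compare /etc/mtab (userspace bookkeeping, can go stale) against
 * /proc/mounts (kernel truth) for a given device.  The device name
 * below is taken from the suite log; the check itself is illustrative. */
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* return 1 if 'device' appears in the mount table 'table', else 0 */
static int device_listed(const char *table, const char *device)
{
        FILE *fp = setmntent(table, "r");
        struct mntent *ent;
        int found = 0;

        if (fp == NULL)
                return 0;
        while ((ent = getmntent(fp)) != NULL) {
                if (strcmp(ent->mnt_fsname, device) == 0) {
                        found = 1;
                        break;
                }
        }
        endmntent(fp);
        return found;
}

int main(void)
{
        const char *dev = "lustre-mdt1/mdt1";

        printf("/etc/mtab lists %s: %d\n", dev,
               device_listed("/etc/mtab", dev));
        printf("/proc/mounts lists %s: %d\n", dev,
               device_listed("/proc/mounts", dev));
        return 0;
}

A mismatch (listed in /etc/mtab but absent from /proc/mounts) is exactly the state the comment above suspects.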

Comment by nasf (Inactive) [ 29/Nov/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/25e467c6-d4fa-11e7-a066-52540065bddc

Comment by nasf (Inactive) [ 08/Dec/17 ]

The OST hit trouble during umount at the end of the preceding tests:
https://testing.hpdd.intel.com/test_logs/26247ed8-d4fa-11e7-a066-52540065bddc/show_text

[11519.992843] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
[11521.139399] LustreError: 22266:0:(ldlm_resource.c:1094:ldlm_resource_complain()) lustre-MDT0000-lwp-OST0000: namespace resource [0x200000006:0x20000:0x0].0x0 (ffff88005598f6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[11521.146796] LustreError: 22266:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000006:0x20000:0x0].0x0 (ffff88005598f6c0) refcount = 2
[11521.153245] LustreError: 22266:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order):
[11521.157006] LustreError: 22266:0:(ldlm_resource.c:1682:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-OST0000 lock: ffff880056b8ca00/0xd8984bc587bddb59 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x20000:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0x2a5bf4985ef188db expref: -99 pid: 21127 timeout: 0 lvb_type: 2
[11546.656082] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948075/real 1511948075]  req@ffff880048344000 x1585380626416000/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.9.4.127@tcp:12/10 lens 520/544 e 0 to 1 dl 1511948100 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[11546.663986] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[11560.656077] LustreError: 166-1: MGC10.9.4.127@tcp: Connection to MGS (at 10.9.4.127@tcp) was lost; in progress operations using this service will fail
[11591.445188] LustreError: 22271:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880056642a00 x1585380626416192/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.9.4.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[11591.451032] LustreError: 22271:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1020000:0x0], rc:-5
[11591.661039] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948129/real 1511948129]  req@ffff880056640600 x1585380626416160/t0(0) o250->MGC10.9.4.127@tcp@10.9.4.127@tcp:26/25 lens 520/544 e 0 to 1 dl 1511948145 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[11591.669399] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[11665.661060] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948194/real 1511948194]  req@ffff880056642700 x1585380626416256/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.9.4.127@tcp:12/10 lens 520/544 e 0 to 1 dl 1511948219 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[11665.669739] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[11711.461177] LustreError: 22280:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880056642a00 x1585380626416352/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.9.4.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
...
Comment by nasf (Inactive) [ 06/Jan/18 ]

The root cause is described in this comment:
https://jira.hpdd.intel.com/browse/LU-10406?focusedCommentId=217655&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-217655

Comment by Gerrit Updater [ 06/Jan/18 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30761
Subject: LU-10045 mgc: multiple try when register target
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d62d4132162c15cacc260e5d27abc2522f59d72d
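
For reference, the approach named in the patch subject ("multiple try when register target") is to retry the target-registration request to the MGS when it fails with a transient transport error, instead of aborting the mount on the first failure. A minimal userspace sketch of that retry idea (all names hypothetical; this is not the patch code):

/* Retry a registration attempt a bounded number of times when the
 * failure looks transient (the MGS connection may recover); otherwise
 * give up immediately.  register_target_once() is a stand-in for the
 * real registration request to the MGS. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define REGISTER_RETRIES 5

/* simulate the MGS connection recovering on the third attempt */
static int register_target_once(int attempt)
{
        return attempt < 2 ? -ESHUTDOWN : 0;
}

/* errors that may clear once the connection to the MGS comes back */
static bool is_transient(int rc)
{
        return rc == -ESHUTDOWN || rc == -EIO || rc == -EAGAIN;
}

int main(void)
{
        int rc = 0;

        for (int i = 0; i < REGISTER_RETRIES; i++) {
                rc = register_target_once(i);
                if (rc == 0 || !is_transient(rc))
                        break;
                fprintf(stderr, "register attempt %d failed: %d; retrying\n",
                        i, rc);
                sleep(1); /* a real implementation would wait for the
                           * connection to recover, not a fixed sleep */
        }
        printf("registration rc = %d\n", rc);
        return rc ? 1 : 0;
}

Note that "Cannot send after transport endpoint shutdown" in the description is the strerror text for ESHUTDOWN surfacing to mount.lustre, which is why treating such errors as retryable addresses this failure.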

Comment by Gerrit Updater [ 31/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30761/
Subject: LU-10045 obdclass: multiple try when register target
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 79bfc74869e3f7b052874f4585399c5ba7f599e9

Comment by Minh Diep [ 31/Jan/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 14/Feb/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31301
Subject: LU-10045 obdclass: multiple try when register target
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 44bb3ee8961d13618e6d670d6e3005c2729a723e

Comment by Gerrit Updater [ 19/Mar/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31301/
Subject: LU-10045 obdclass: multiple try when register target
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: aa99a7bb77cce480ff5753238d857a0eb797e5fe
