[LU-10045] sanity-lfsck no sub tests failed Created: 28/Sep/17 Updated: 19/Mar/18 Resolved: 31/Jan/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Casper | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | onyx, full |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
https://testing.hpdd.intel.com/test_sessions/ecf02eea-e0bb-48ce-a644-fed2036c2478

From suite_log:

Starting ost1: lustre-ost1/ost1 /mnt/lustre-ost1
CMD: onyx-35vm12 mkdir -p /mnt/lustre-ost1; mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1
onyx-35vm12: mount.lustre: mount lustre-ost1/ost1 at /mnt/lustre-ost1 failed: Cannot send after transport endpoint shutdown

The subsequent three test sets (sanityn, sanity-hsm, and sanity-lsnapshot) also run no subtests, and their suite logs end with this message:

according to /etc/mtab lustre-mdt1/mdt1 is already mounted on /mnt/lustre-mds1 |
| Comments |
| Comment by Peter Jones [ 11/Oct/17 ] |
|
Fan Yong, could you please advise on this one? Thanks, Peter |
| Comment by nasf (Inactive) [ 24/Nov/17 ] |
|
All the available information is shown in the bug description; there are no more detailed logs. Generally, at the beginning of sanity-lfsck the test scripts reformat and remount the whole system to clean up the test environment, but according to the logs, something went wrong during that process. One possible scenario is that after the MDT was reformatted, the scripts tried to mount it again, but /etc/mtab wrongly recorded that the MDT was already mounted. So the MDT (and MGS) was never really mounted, and the subsequent mount failures then happened on the OSTs. As for why /etc/mtab recorded the wrong information, it is difficult to know; it may be a side effect of some earlier test cases (in sanity or earlier). So unless we can reproduce the trouble with more detailed logs, it is difficult to locate the root cause. |
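A minimal diagnostic sketch for the scenario described above (the mount point is taken from the suite log; this assumes shell access to the MDS node and a system where /etc/mtab is a regular file rather than a symlink to /proc/self/mounts):

# Compare the kernel's view of mounts with the userspace record that
# mount(8) quotes in the "already mounted" error message.
grep -w /mnt/lustre-mds1 /proc/mounts || echo "kernel: MDT is not actually mounted"
grep -w /mnt/lustre-mds1 /etc/mtab   && echo "mtab: a mount record is still present"

If the second grep matches while the first does not, the /etc/mtab record is stale, and the mount command can refuse to mount the reformatted MDT, which would explain the "already mounted" message in the suite log.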
| Comment by nasf (Inactive) [ 29/Nov/17 ] |
|
+1 on master: |
| Comment by nasf (Inactive) [ 08/Dec/17 ] |
|
The OST hit trouble when unmounting after the former tests:

[11519.992843] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
[11521.139399] LustreError: 22266:0:(ldlm_resource.c:1094:ldlm_resource_complain()) lustre-MDT0000-lwp-OST0000: namespace resource [0x200000006:0x20000:0x0].0x0 (ffff88005598f6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[11521.146796] LustreError: 22266:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000006:0x20000:0x0].0x0 (ffff88005598f6c0) refcount = 2
[11521.153245] LustreError: 22266:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order):
[11521.157006] LustreError: 22266:0:(ldlm_resource.c:1682:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-OST0000 lock: ffff880056b8ca00/0xd8984bc587bddb59 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x20000:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0x2a5bf4985ef188db expref: -99 pid: 21127 timeout: 0 lvb_type: 2
[11546.656082] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948075/real 1511948075] req@ffff880048344000 x1585380626416000/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.9.4.127@tcp:12/10 lens 520/544 e 0 to 1 dl 1511948100 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[11546.663986] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[11560.656077] LustreError: 166-1: MGC10.9.4.127@tcp: Connection to MGS (at 10.9.4.127@tcp) was lost; in progress operations using this service will fail
[11591.445188] LustreError: 22271:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880056642a00 x1585380626416192/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.9.4.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[11591.451032] LustreError: 22271:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1020000:0x0], rc:-5
[11591.661039] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948129/real 1511948129] req@ffff880056640600 x1585380626416160/t0(0) o250->MGC10.9.4.127@tcp@10.9.4.127@tcp:26/25 lens 520/544 e 0 to 1 dl 1511948145 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[11591.669399] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[11665.661060] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948194/real 1511948194] req@ffff880056642700 x1585380626416256/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.9.4.127@tcp:12/10 lens 520/544 e 0 to 1 dl 1511948219 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[11665.669739] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[11711.461177] LustreError: 22280:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880056642a00 x1585380626416352/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.9.4.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
... |
| Comment by nasf (Inactive) [ 06/Jan/18 ] |
|
The reason is described in the comment: |
| Comment by Gerrit Updater [ 06/Jan/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30761 |
| Comment by Gerrit Updater [ 31/Jan/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30761/ |
| Comment by Minh Diep [ 31/Jan/18 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 14/Feb/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31301 |
| Comment by Gerrit Updater [ 19/Mar/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31301/ |