[LU-7302] sanity-lfsck test_23b hangs on START_NAMESPACE Created: 15/Oct/15  Updated: 25/Jan/17  Resolved: 14/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: parinay v kondekar (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Configuration : 4 node setup : 1 MDS/ 1 OSS/ 2 clients
Release
Server 2.7.60
Client 2.7.60


Attachments: File 23b.lctl.tgz     File 26th.oct.23b.lctl.tgz     Text File MRP_2979_ptldebug5.txt    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
stdout.log
Total allocated inode limit: 0, total allocated block limit: 0

== sanity-lfsck test 23b: LFSCK can repair dangling name entry (2) == 01:58:28 (1443319108)
#####
The objectA has multiple hard links, one of them corresponding
to the name entry_B. But there is something wrong for the name
entry_B and cause entry_B to references non-exist object_C.
In the first-stage scanning, the LFSCK will think the entry_B
as dangling, and re-create the lost object_C. When the LFSCK
comes to the second-stage scanning, it will find that the
former re-creating object_C is not proper, and will try to
replace the object_C with the real object_A.
Inject failure stub on MDT0 to simulate dangling name entry
fail_loc=0x1621
fail_loc=0
'ls' should fail because of dangling name entry
Trigger namespace LFSCK to find out dangling name entry
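
For reference, the scenario above can be driven by hand roughly as follows (a sketch only; the fail_loc value 0x1621 comes from the log above, while the MDT name lustre-MDT0000 and the status polling are assumptions about the local setup):

    # On MDT0: inject the dangling name entry, as test_23b does
    lctl set_param fail_loc=0x1621
    # ... create the test file so the bad name entry is recorded ...
    lctl set_param fail_loc=0
    # Trigger the namespace LFSCK and watch its status until it completes
    lctl lfsck_start -M lustre-MDT0000 -t namespace
    lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace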


stderr.log
pdsh@fre0203: fre0201: ssh exited with exit code 1
pdsh@fre0203: fre0202: ssh exited with exit code 1
Using TIMEOUT=20
excepting tests: 



In ticket LU-7217, a similar timeout is seen during test_23c due to:

#define OBD_FAIL_LFSCK_DELAY3           0x1602
        do_facet $SINGLEMDS $LCTL set_param fail_val=10 fail_loc=0x1602

In test_23b, I see the following:

4178 00000080:00200000:1.0:1443429106.393336:0:14812:0:(dcache.c:424:ll_revalidate_nd()) VFS Op:name=d23b.sanity-lfsck, flags=0
4179 00000080:00200000:1.0:1443429106.393338:0:14812:0:(file.c:3217:__ll_inode_revalidate()) VFS Op:inode=[0x200000401:0x1:0x0](ffff88011bc48140),name=d23b.sanity-lfsck
4180 00000002:00010000:1.0:1443429106.393339:0:14812:0:(mdc_locks.c:1114:mdc_intent_lock()) (name: ,[0x200000401:0x1:0x0]) in obj [0x200000401:0x1:0x0], intent: getattr flags 00
4181 00010000:00010000:1.0:1443429106.393341:0:14812:0:(ldlm_lock.c:1493:ldlm_lock_match()) ### not matched ns ffff88011a7faa00 type 13 mode 30 res 8589935617/1 (0 0)
4182 00010000:00010000:1.0:1443429106.393350:0:14812:0:(ldlm_lock.c:746:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 3/1,0 mode: --/CR res: [0x200000401:0x1:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4183 00010000:00010000:1.0:1443429106.393352:0:14812:0:(ldlm_request.c:925:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x1000
4184  ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 3/1,0 mode: --/CR res: [0x200000401:0x1:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4185 00010000:00010000:1.0:1443429106.393355:0:14812:0:(ldlm_request.c:980:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 3/1,0 mode: --/CR res: [0x200000401:0x1:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4186 00000100:00100000:1.0:1443429106.393360:0:14812:0:(client.c:1575:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc rm:d5ce60dd-b58f-953f-4b1f-b15d8912a206:14812:1513543112333784:192.168.101.33@tcp:101
4187 00000100:00100000:1.0:1443429106.393384:0:14812:0:(client.c:2255:ptlrpc_set_wait()) set ffff88011ae58680 going to sleep for 6 seconds
4188 00000100:00100000:1.0:1443429106.396347:0:14812:0:(client.c:1970:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc rm:d5ce60dd-b58f-953f-4b1f-b15d8912a206:14812:1513543112333784:192.168.101.33@tcp:101
4189 00010000:00010000:1.0:1443429106.396351:0:14812:0:(ldlm_request.c:639:ldlm_cli_enqueue_fini()) ### server returned different mode PR ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 4/1,0 mode: --/CR res: [0x200000401:0x1:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4190 00010000:00010000:1.0:1443429106.396354:0:14812:0:(ldlm_request.c:665:ldlm_cli_enqueue_fini()) ### client-side enqueue, new policy data ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 4/1,0 mode: --/PR res: [0x200000401:0x1:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4191 00010000:00010000:1.0:1443429106.396356:0:14812:0:(ldlm_lock.c:1047:ldlm_granted_list_add_lock()) ### About to add lock: ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 4/1,0 mode: PR/PR res: [0x200000401:0x1:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4192 00010000:00010000:1.0:1443429106.396359:0:14812:0:(ldlm_request.c:710:ldlm_cli_enqueue_fini()) ### client-side enqueue END ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 4/1,0 mode: PR/PR res: [0x200000401:0x1:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4193 00010000:00010000:1.0:1443429106.396362:0:14812:0:(ldlm_lock.c:746:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 5/2,0 mode: PR/PR res: [0x200000401:0x1:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4194 00010000:00010000:1.0:1443429106.396364:0:14812:0:(ldlm_lock.c:800:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 5/2,0 mode: PR/PR res: [0x200000401:0x1:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4195 00010000:00010000:1.0:1443429106.396366:0:14812:0:(ldlm_lock.c:891:ldlm_lock_decref_internal()) ### do not add lock into lru list ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 4/1,0 mode: PR/PR res: [0x200000401:0x1:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4196 00000002:00100000:1.0:1443429106.396369:0:14812:0:(mdc_locks.c:609:mdc_finish_enqueue()) @@@ op: 8 disposition: b, status: 0  req@ffff88011bf403c0 x1513543112333784/t0(0) o101->lustre-MDT0000-mdc-ffff8801198a9400@192.168.101.33@tcp:12/10 lens 696/536 e 0 to 0 dl 1443429113 ref 1 fl Complete:R/0/0 rc 0/0
4197 00000002:00010000:1.0:1443429106.396373:0:14812:0:(mdc_locks.c:979:mdc_finish_intent_lock()) ### matching against this ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 3/1,0 mode: PR/PR res: [0x200000401:0x1:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4198 00010000:00010000:1.0:1443429106.396383:0:14812:0:(ldlm_lock.c:1493:ldlm_lock_match()) ### not matched ns ffff88011a7faa00 type 13 mode 4 res 8589935617/1 (0 0)
4199 00000080:00200000:1.0:1443429106.396390:0:14812:0:(llite_lib.c:1952:ll_update_inode()) inode=[0x200000401:0x1:0x0], updating i_size 4096
4200 00000080:00010000:1.0:1443429106.396392:0:14812:0:(dcache.c:346:ll_lookup_finish_locks()) setting l_data to inode [0x200000401:0x1:0x0](ffff88011bc48140)
4201 00000080:00010000:1.0:1443429106.396393:0:14812:0:(llite_internal.h:1394:ll_set_lock_data()) setting l_data to inode [0x200000401:0x1:0x0](ffff88011bc48140) for lock 0x2d58c2a0fb9fa48c
4202 00000080:00010000:1.0:1443429106.396395:0:14812:0:(dcache.c:249:ll_intent_drop_lock()) releasing lock with cookie 0x2d58c2a0fb9fa48c from it ffff88011bf77d28
4203 00010000:00010000:1.0:1443429106.396395:0:14812:0:(ldlm_lock.c:800:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 3/1,0 mode: PR/PR res: [0x200000401:0x1:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4204 00010000:00010000:1.0:1443429106.396397:0:14812:0:(ldlm_lock.c:874:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd64ac0/0x2d58c2a0fb9fa48c lrc: 2/0,0 mode: PR/PR res: [0x200000401:0x1:0x0].0 bits 0x13 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x1c3b041d4129b042 expref: -99 pid: 14812 timeout: 0 lvb_type: 0
4205 00000080:00200000:1.0:1443429106.396417:0:14812:0:(file.c:3217:__ll_inode_revalidate()) VFS Op:inode=[0x200000007:0x1:0x0](ffff88011bc49c40),name=/
4206 00000002:00010000:1.0:1443429106.396419:0:14812:0:(mdc_locks.c:1114:mdc_intent_lock()) (name: ,[0x200000007:0x1:0x0]) in obj [0x200000007:0x1:0x0], intent: lookup flags 00
4207 00010000:00010000:1.0:1443429106.396421:0:14812:0:(ldlm_lock.c:746:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd644c0/0x2d58c2a0fb9fa415 lrc: 2/1,0 mode: PR/PR res: [0x200000007:0x1:0x0].0 bits 0x11 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x1c3b041d4129ae97 expref: -99 pid: 11235 timeout: 0 lvb_type: 0
4208 00010000:00010000:1.0:1443429106.396423:0:14812:0:(ldlm_lock.c:1472:ldlm_lock_match()) ### matched (0 0) ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd644c0/0x2d58c2a0fb9fa415 lrc: 2/1,0 mode: PR/PR res: [0x200000007:0x1:0x0].0 bits 0x11 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x1c3b041d4129ae97 expref: -99 pid: 11235 timeout: 0 lvb_type: 0
4209 00000080:00010000:1.0:1443429106.396426:0:14812:0:(dcache.c:346:ll_lookup_finish_locks()) setting l_data to inode [0x200000007:0x1:0x0](ffff88011bc49c40)
4210 00000080:00010000:1.0:1443429106.396427:0:14812:0:(llite_internal.h:1394:ll_set_lock_data()) setting l_data to inode [0x200000007:0x1:0x0](ffff88011bc49c40) for lock 0x2d58c2a0fb9fa415
4211 00000080:00010000:1.0:1443429106.396428:0:14812:0:(dcache.c:249:ll_intent_drop_lock()) releasing lock with cookie 0x2d58c2a0fb9fa415 from it ffff88011bf77b68
4212 00010000:00010000:1.0:1443429106.396428:0:14812:0:(ldlm_lock.c:800:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd644c0/0x2d58c2a0fb9fa415 lrc: 3/1,0 mode: PR/PR res: [0x200000007:0x1:0x0].0 bits 0x11 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x1c3b041d4129ae97 expref: -99 pid: 11235 timeout: 0 lvb_type: 0
4213 00010000:00010000:1.0:1443429106.396431:0:14812:0:(ldlm_lock.c:874:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-MDT0000-mdc-ffff8801198a9400 lock: ffff88011bd644c0/0x2d58c2a0fb9fa415 lrc: 2/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0 bits 0x11 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x1c3b041d4129ae97 expref: -99 pid: 11235 timeout: 0 lvb_type: 0
4214 00000080:00200000:1.0:1443429106.396435:0:14812:0:(file.c:3510:ll_inode_permission()) VFS Op:inode=[0x200000007:0x1:0x0](ffff88011bc49c40), inode mode 41ed mask 1


 Comments   
Comment by Joseph Gmitter (Inactive) [ 15/Oct/15 ]

Hi Fan Yong,
Can you have a look to see if this is a failure?
Thanks.
Joe

Comment by nasf (Inactive) [ 23/Oct/15 ]

The debug log does not contain valid LFSCK information; the LFSCK entry function lfsck_start() has not been run, so the system must have been blocked before that point. If the issue can be reproduced, please enable "-1" level debug on the MDS/OSS to collect more information. The "dmesg" output is also valuable sometimes.
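
One way to do that on the MDS and capture the log for a single run might be, for example (a sketch only; the output file names are arbitrary):

    # On the MDS, before re-running the test
    lctl set_param debug=-1
    lctl clear                          # drop the old debug buffer
    # ... re-run sanity-lfsck test_23b ...
    lctl dk /tmp/lu-7302-mds.lctl.dk    # dump the kernel debug log
    dmesg > /tmp/lu-7302-mds.dmesg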

Comment by parinay v kondekar (Inactive) [ 26/Oct/15 ]

Latest lctl.dk

Comment by nasf (Inactive) [ 02/Nov/15 ]

The latest lctl.dk shows that test_23b was NOT hung as originally reported. But the test still failed: no dangling name entry was found; instead, the dangling name entry was removed as an invalid one.

00100000:10000000:0.0:1444529448.836945:0:31689:0:(lfsck_namespace.c:2155:lfsck_namespace_repair_dirent()) lustre-MDT0000-osd: namespace LFSCK assistant found bad name entry for: parent [0x200000401:0x2:0x0], child [0x200000402:0x1:0x0], name foo, type in name entry 100000, type claimed by child 0. repair it by removing with new name2 foo: rc = 1

That is unexpected. Unfortunately, the -1 level debug log was not enabled (on the MDS) during the test, so we cannot tell what caused such an unexpected result.

Parinay, would you please reproduce the failure with -1 level debug enabled on the MDS? Thanks!
For example:

PTLDEBUG=-1 sh sanity-lfsck.sh
Comment by parinay v kondekar (Inactive) [ 13/Jan/16 ]

Fan Yong,
Kirtan tried to reproduce this with PTLDEBUG=-1 on a single-node setup. I am attaching the resulting log here. We are trying to reproduce it on a 4-node setup.

Thanks.

Comment by nasf (Inactive) [ 27/Jan/16 ]

There are two reasons for the failure:

1) The OI scrub was triggered unexpectedly when looking up a removed inode. This issue has been fixed via the patch http://review.whamcloud.com/16439

00100000:10000000:0.0:1452490115.979072:0:4842:0:(osd_scrub.c:653:osd_scrub_prep()) lustre-MDT0000: OI scrub prep, flags = 0x52
00100000:10000000:0.0:1452490115.979078:0:4842:0:(osd_scrub.c:1488:osd_scrub_main()) lustre-MDT0000: OI scrub start, flags = 0x52, pos = 12
00080000:12000000:0.0:1452490115.979079:0:4137:0:(osd_handler.c:3948:osd_consistency_check()) lustre-MDT0000: trigger OI scrub by RPC for [0x200000401:0x4:0x0], rc = 0 [2]

2) There is some improper logic inside the OI scrub that causes osd_otable_it_next() and the OI scrub thread to wait for each other.

00100000:00000001:0.0:1452490116.019213:0:4853:0:(osd_scrub.c:2784:osd_otable_it_load()) Process entered
00100000:00000001:0.0:1452490116.019214:0:4853:0:(osd_scrub.c:2686:osd_otable_it_next()) Process entered

The hang point is:

static int osd_otable_it_next(const struct lu_env *env, struct dt_it *di)
{
...
        /* The iterator has consumed all pre-loaded entries, so it waits for
         * the OI scrub thread to advance os_pos_current; if the scrub thread
         * is itself waiting on the iterator, the two block each other. */
        if (it->ooi_cache.ooc_pos_preload >= scrub->os_pos_current)
                l_wait_event(thread->t_ctl_waitq,
                             osd_otable_it_wakeup(scrub, it),
                             &lwi);
...
}

With the 1st issue resolved, the test_23b failure will not be triggered again, but the 2nd issue may still cause other problems. I will make a patch to fix that.
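
If this needs to be checked on a live system, the OI scrub state on the MDT and any blocked threads can be inspected while the namespace LFSCK is running, for example (a sketch; the osd-ldiskfs target name below is an assumption about the setup):

    # On the MDS: was the OI scrub triggered, and is it still scanning?
    lctl get_param osd-ldiskfs.lustre-MDT0000.oi_scrub
    # Dump all kernel task stacks, then look for threads blocked in
    # osd_otable_it_next() waiting on the scrub (and vice versa)
    echo t > /proc/sysrq-trigger
    dmesg | grep -B 2 -A 20 osd_otable_it_next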

Comment by Gerrit Updater [ 27/Jan/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18175
Subject: LU-7302 scrub: join the running OI scrub properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 654582bb0821abc53f85c4f755295baec6fb7a28

Comment by Gerrit Updater [ 14/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18175/
Subject: LU-7302 scrub: join the running OI scrub properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 93205e724bd7a07ef56636d1042a430777b89405
