[LU-524] Test failure on test suite sanityn, subtest test_40c Created: 21/Jul/11  Updated: 25/Jul/11  Resolved: 25/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Won't Fix Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6139

 Description   

This issue was created by maloo for Chris Gearing <chris@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/69d07264-b345-11e0-b33f-52540025f9af.

The sub-test test_40c failed with the following error:

link is blocked

This error is occurring regularly on many tests and preventing successful completion.



 Comments   
Comment by nasf (Inactive) [ 21/Jul/11 ]

This is a very strange issue. The failure happened as follows:

on MDS side:
===================
00000100:00100000:0.0:1311197986.353819:0:3846:0:(service.c:1705:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_00:77dcbafc-6af0-ecc4-18e6-bd80fa9cb83b+11:23254:x1374890243444654:12345-10.10.4.100@tcp:36
00010000:00010000:0.0:1311197986.353833:0:3846:0:(ldlm_lockd.c:1412:ldlm_request_cancel()) ### server-side cancel handler START: 1 locks, starting at 0
00010000:00010000:0.0:1311197986.353836:0:3846:0:(ldlm_lockd.c:511:ldlm_del_waiting_lock()) ### wasn't waiting ns: mdt-ffff81000af3b800 lock: ffff810079199b40/0x5897b4e9bb6dc7d6 lrc: 3/0,0 mode: PR/PR res: 8589941617/10041 bits 0x3 rrc: 2 type: IBT flags: 0x4000000 remote: 0x21fb2945a70a2a86 expref: 11 pid: 3846 timeout: 0
00010000:00010000:0.0:1311197986.353855:0:3846:0:(ldlm_lockd.c:511:ldlm_del_waiting_lock()) ### wasn't waiting ns: mdt-ffff81000af3b800 lock: ffff810079199b40/0x5897b4e9bb6dc7d6 lrc: 3/0,0 mode: PR/PR res: 8589941617/10041 bits 0x3 rrc: 2 type: IBT flags: 0x24000080 remote: 0x21fb2945a70a2a86 expref: 11 pid: 3846 timeout: 0
00010000:00010000:0.0:1311197986.353862:0:3846:0:(ldlm_lock.c:195:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-ffff81000af3b800 lock: ffff810079199b40/0x5897b4e9bb6dc7d6 lrc: 0/0,0 mode: --/PR res: 8589941617/10041 bits 0x3 rrc: 2 type: IBT flags: 0x24000080 remote: 0x21fb2945a70a2a86 expref: 11 pid: 3846 timeout: 0
00010000:00010000:0.0:1311197986.353868:0:3846:0:(ldlm_lockd.c:1447:ldlm_request_cancel()) ### server-side cancel handler END
00010000:00010000:0.0:1311197986.353874:0:3846:0:(ldlm_lock.c:668:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CW) ns: mdt-ffff81000af3b800 lock: ffff81004fb9fd80/0x5897b4e9bb6dc7dd lrc: 3/0,1 mode: --/CW res: 3660161/2491890387 bits 0x0 rrc: 6 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 3846 timeout: 0
00010000:00010000:0.0:1311197986.353880:0:3846:0:(ldlm_lock.c:607:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-ffff81000af3b800 lock: ffff810055893900/0x5897b4e9bb6dc7ba lrc: 3/1,0 mode: --/PR res: 3660161/2491890387 bits 0x3 rrc: 6 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 4096 timeout: 0
00010000:00010000:0.0:1311197986.353886:0:3846:0:(ldlm_request.c:309:ldlm_blocking_ast_nocheck()) ### Lock still has references, will be cancelled later ns: mdt-ffff81000af3b800 lock: ffff810055893900/0x5897b4e9bb6dc7ba lrc: 4/1,0 mode: --/PR res: 3660161/2491890387 bits 0x3 rrc: 6 type: IBT flags: 0x4004030 remote: 0x0 expref: -99 pid: 4096 timeout: 0
00010000:00010000:0.0:1311197986.353892:0:3846:0:(ldlm_request.c:224:ldlm_completion_ast()) ### client-side enqueue returned a blocked lock, sleeping ns: mdt-ffff81000af3b800 lock: ffff81004fb9fd80/0x5897b4e9bb6dc7dd lrc: 3/0,1 mode: --/CW res: 3660161/2491890387 bits 0x2 rrc: 6 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 3846 timeout: 0
===================

So the second link operation in sanityn test_40c (handled by mdt_00) was blocked by the lock "ffff810055893900/0x5897b4e9bb6dc7ba". That lock was created by another MDT thread, mdt_02 (pid 4096):

===================
00000100:00100000:0.0:1311197985.450701:0:4096:0:(service.c:1705:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_02:4ba32714-4664-4650-51ba-87e8d8be7d5f+11:23245:x1374890243444652:12345-10.10.4.100@tcp:101
00010000:00010000:0.0:1311197985.450711:0:4096:0:(ldlm_lockd.c:1048:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:00010000:0.0:1311197985.450722:0:4096:0:(ldlm_lockd.c:1133:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-ffff81000af3b800 lock: ffff81004fb9fb40/0x5897b4e9bb6dc7b3 lrc: 2/0,0 mode: --/CR res: 3660161/2491890387 bits 0x0 rrc: 4 type: IBT flags: 0x0 remote: 0x21fb2945a70a2a7f expref: -99 pid: 4096 timeout: 0
00010000:00010000:0.0:1311197985.450765:0:4096:0:(ldlm_lock.c:668:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: mdt-ffff81000af3b800 lock: ffff810055893900/0x5897b4e9bb6dc7ba lrc: 3/1,0 mode: --/PR res: 3660161/2491890387 bits 0x0 rrc: 5 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 4096 timeout: 0
00010000:00010000:0.0:1311197985.450772:0:4096:0:(ldlm_lock.c:607:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-ffff81000af3b800 lock: ffff8100502da900/0x5897b4e9bb6dc774 lrc: 2/0,1 mode: CW/CW res: 3660161/2491890387 bits 0x2 rrc: 5 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 3847 timeout: 0
00010000:00010000:0.0:1311197985.450780:0:4096:0:(ldlm_request.c:309:ldlm_blocking_ast_nocheck()) ### Lock still has references, will be cancelled later ns: mdt-ffff81000af3b800 lock: ffff8100502da900/0x5897b4e9bb6dc774 lrc: 3/0,1 mode: CW/CW res: 3660161/2491890387 bits 0x2 rrc: 5 type: IBT flags: 0x4004030 remote: 0x0 expref: -99 pid: 3847 timeout: 0
00010000:00010000:0.0:1311197985.450800:0:4096:0:(ldlm_request.c:224:ldlm_completion_ast()) ### client-side enqueue returned a blocked lock, sleeping ns: mdt-ffff81000af3b800 lock: ffff810055893900/0x5897b4e9bb6dc7ba lrc: 3/1,0 mode: --/PR res: 3660161/2491890387 bits 0x3 rrc: 5 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 4096 timeout: 0
===================

This means mdt_02 created the lock "ffff810055893900/0x5897b4e9bb6dc7ba", but it was itself blocked by another lock, "ffff8100502da900/0x5897b4e9bb6dc774". The latter lock was held by mdt_01 (pid 3847), which was exactly the thread processing the first link operation and sleeping on "OBD_FAIL_MDS_PDO_LOCK".
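To make the blocking chain easier to trace, here is a small hypothetical helper (not part of Lustre; it only assumes the ldlm debug-log format pasted above, where an enqueueing thread first logs "ldlm_lock_addref" for its own lock and then "sending blocking AST" for the incompatible lock blocking it):

```python
import re

# Hypothetical log-scraping sketch, not part of Lustre. Lock handles in the
# debug log look like "ffff810055893900/0x5897b4e9bb6dc7ba".
HANDLE = re.compile(r"lock: ([0-9a-f]+/0x[0-9a-f]+)")

def blocked_by_pairs(log_lines):
    """Return (waiting_lock, blocking_lock) handle pairs, matched per thread."""
    waiting = {}  # pid (field 6 of the colon-separated prefix) -> lock handle
    pairs = []
    for line in log_lines:
        match = HANDLE.search(line)
        if not match:
            continue
        pid = line.split(":")[5]
        if "ldlm_lock_addref" in line:
            # This thread is enqueueing its own lock.
            waiting[pid] = match.group(1)
        elif "sending blocking AST" in line and pid in waiting:
            # Same thread then reports the lock that blocks it.
            pairs.append((waiting.pop(pid), match.group(1)))
    return pairs
```

Run over the mdt_00 lines above, this pairs the second link's CW lock (0x5897b4e9bb6dc7dd) with mdt_02's PR lock (0x5897b4e9bb6dc7ba), which is exactly the dependency described.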

Going back to the client-side log, the lock "ffff810055893900/0x5897b4e9bb6dc7ba" was triggered by the RPC "x1374890243444652":

===================
00000080:00200000:0.0:1311197985.451290:0:23245:0:(file.c:2190:__ll_inode_revalidate_it()) VFS Op:inode=3660161/2491890387(ffff81006ae87b50),name=/
00000002:00010000:0.0:1311197985.451306:0:23245:0:(mdc_locks.c:917:mdc_intent_lock()) (name: ,[0x37d981:0x94873ad3:0x0]) in obj [0x37d981:0x94873ad3:0x0], intent: getattr flags 00
00010000:00010000:0.0:1311197985.451316:0:23245:0:(ldlm_lock.c:1208:ldlm_lock_match()) ### not matched ns ffff81006a88e000 type 13 mode 30 res 3660161/2491890387 (0 0)
00010000:00010000:0.0:1311197985.451360:0:23245:0:(ldlm_lock.c:668:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-MDT0000-mdc-ffff81006b0dcc00 lock: ffff81006ad28000/0x21fb2945a70a2a7f lrc: 3/1,0 mode: --/CR res: 3660161/2491890387 bits 0x0 rrc: 2 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 23245 timeout: 0
00010000:00010000:0.0:1311197985.451366:0:23245:0:(ldlm_request.c:830:ldlm_cli_enqueue()) ### client-side enqueue START ns: lustre-MDT0000-mdc-ffff81006b0dcc00 lock: ffff81006ad28000/0x21fb2945a70a2a7f lrc: 3/1,0 mode: --/CR res: 3660161/2491890387 bits 0x2 rrc: 2 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 23245 timeout: 0
00010000:00010000:0.0:1311197985.451371:0:23245:0:(ldlm_request.c:892:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff81006b0dcc00 lock: ffff81006ad28000/0x21fb2945a70a2a7f lrc: 3/1,0 mode: --/CR res: 3660161/2491890387 bits 0x2 rrc: 2 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 23245 timeout: 0
00000100:00100000:0.0:1311197985.451382:0:23245:0:(client.c:1392:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc updatedb:4ba32714-4664-4650-51ba-87e8d8be7d5f:23245:1374890243444652:10.10.4.102@tcp:101
===================

As shown above, it was thread 23245 that triggered the RPC "x1374890243444652". Unfortunately, we do not know what thread 23245 wanted to do; it was not one of the threads related to sanityn test_40c. It looks like a system daemon scanning the filesystem.

Comment by nasf (Inactive) [ 21/Jul/11 ]

The conclusion:

The second link was blocked by some unknown operation, and that unknown operation was in turn blocked by the first link. Without the unknown operation, the second link would not have been blocked by the first link. So we should find out what the unknown operation was for and how to remove it.

Comment by Peter Jones [ 21/Jul/11 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 22/Jul/11 ]

The unknown process (23245) is not part of the test processes; it was reading directories (ll_readdir), including the .lustre directory.

Comment by Peter Jones [ 25/Jul/11 ]

Thanks Bobi. Let's close this ticket then

Comment by Oleg Drokin [ 25/Jul/11 ]

I bet the unknown process is "updatedb", and the test failure is invalid as a result.
We need to update /etc/updatedb.conf so that it does not include Lustre mountpoints.
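A sketch of the suggested /etc/updatedb.conf change. The exact defaults vary by distribution, and the mountpoint /mnt/lustre and the existing list entries shown here are illustrative placeholders, not values taken from this ticket:

```shell
# /etc/updatedb.conf — keep updatedb/locate from scanning Lustre mounts,
# so its readdir traffic no longer takes locks that collide with tests.
# (Append "lustre" and the actual mountpoint to your distribution's
# existing PRUNEFS/PRUNEPATHS lists; entries here are placeholders.)
PRUNEFS="nfs lustre"
PRUNEPATHS="/tmp /mnt/lustre"
```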

Generated at Sat Feb 10 01:07:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.