|
It is very strange issue, the failure reason is that:
on MDS side:
===================
00000100:00100000:0.0:1311197986.353819:0:3846:0:(service.c:1705:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_00:77dcbafc-6af0-ecc4-18e6-bd80fa9cb83b+11:23254:x1374890243444654:12345-10.10.4.100@tcp:36
00010000:00010000:0.0:1311197986.353833:0:3846:0:(ldlm_lockd.c:1412:ldlm_request_cancel()) ### server-side cancel handler START: 1 locks, starting at 0
00010000:00010000:0.0:1311197986.353836:0:3846:0:(ldlm_lockd.c:511:ldlm_del_waiting_lock()) ### wasn't waiting ns: mdt-ffff81000af3b800 lock: ffff810079199b40/0x5897b4e9bb6dc7d6 lrc: 3/0,0 mode: PR/PR res: 8589941617/10041 bits 0x3 rrc: 2 type: IBT flags: 0x4000000 remote: 0x21fb2945a70a2a86 expref: 11 pid: 3846 timeout: 0
00010000:00010000:0.0:1311197986.353855:0:3846:0:(ldlm_lockd.c:511:ldlm_del_waiting_lock()) ### wasn't waiting ns: mdt-ffff81000af3b800 lock: ffff810079199b40/0x5897b4e9bb6dc7d6 lrc: 3/0,0 mode: PR/PR res: 8589941617/10041 bits 0x3 rrc: 2 type: IBT flags: 0x24000080 remote: 0x21fb2945a70a2a86 expref: 11 pid: 3846 timeout: 0
00010000:00010000:0.0:1311197986.353862:0:3846:0:(ldlm_lock.c:195:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-ffff81000af3b800 lock: ffff810079199b40/0x5897b4e9bb6dc7d6 lrc: 0/0,0 mode: --/PR res: 8589941617/10041 bits 0x3 rrc: 2 type: IBT flags: 0x24000080 remote: 0x21fb2945a70a2a86 expref: 11 pid: 3846 timeout: 0
00010000:00010000:0.0:1311197986.353868:0:3846:0:(ldlm_lockd.c:1447:ldlm_request_cancel()) ### server-side cancel handler END
00010000:00010000:0.0:1311197986.353874:0:3846:0:(ldlm_lock.c:668:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CW) ns: mdt-ffff81000af3b800 lock: ffff81004fb9fd80/0x5897b4e9bb6dc7dd lrc: 3/0,1 mode: --/CW res: 3660161/2491890387 bits 0x0 rrc: 6 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 3846 timeout: 0
00010000:00010000:0.0:1311197986.353880:0:3846:0:(ldlm_lock.c:607:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-ffff81000af3b800 lock: ffff810055893900/0x5897b4e9bb6dc7ba lrc: 3/1,0 mode: --/PR res: 3660161/2491890387 bits 0x3 rrc: 6 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 4096 timeout: 0
00010000:00010000:0.0:1311197986.353886:0:3846:0:(ldlm_request.c:309:ldlm_blocking_ast_nocheck()) ### Lock still has references, will be cancelled later ns: mdt-ffff81000af3b800 lock: ffff810055893900/0x5897b4e9bb6dc7ba lrc: 4/1,0 mode: --/PR res: 3660161/2491890387 bits 0x3 rrc: 6 type: IBT flags: 0x4004030 remote: 0x0 expref: -99 pid: 4096 timeout: 0
00010000:00010000:0.0:1311197986.353892:0:3846:0:(ldlm_request.c:224:ldlm_completion_ast()) ### client-side enqueue returned a blocked lock, sleeping ns: mdt-ffff81000af3b800 lock: ffff81004fb9fd80/0x5897b4e9bb6dc7dd lrc: 3/0,1 mode: --/CW res: 3660161/2491890387 bits 0x2 rrc: 6 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 3846 timeout: 0
===================
So the second link operation in sanityn test_40c (mdt_00) was blocked by the lock "ffff810055893900/0x5897b4e9bb6dc7ba". Such lock was created by another mdt thread: mdt_02 (4096)
===================
00000100:00100000:0.0:1311197985.450701:0:4096:0:(service.c:1705:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_02:4ba32714-4664-4650-51ba-87e8d8be7d5f+11:23245:x1374890243444652:12345-10.10.4.100@tcp:101
00010000:00010000:0.0:1311197985.450711:0:4096:0:(ldlm_lockd.c:1048:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:00010000:0.0:1311197985.450722:0:4096:0:(ldlm_lockd.c:1133:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-ffff81000af3b800 lock: ffff81004fb9fb40/0x5897b4e9bb6dc7b3 lrc: 2/0,0 mode: --/CR res: 3660161/2491890387 bits 0x0 rrc: 4 type: IBT flags: 0x0 remote: 0x21fb2945a70a2a7f expref: -99 pid: 4096 timeout: 0
00010000:00010000:0.0:1311197985.450765:0:4096:0:(ldlm_lock.c:668:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: mdt-ffff81000af3b800 lock: ffff810055893900/0x5897b4e9bb6dc7ba lrc: 3/1,0 mode: --/PR res: 3660161/2491890387 bits 0x0 rrc: 5 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 4096 timeout: 0
00010000:00010000:0.0:1311197985.450772:0:4096:0:(ldlm_lock.c:607:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-ffff81000af3b800 lock: ffff8100502da900/0x5897b4e9bb6dc774 lrc: 2/0,1 mode: CW/CW res: 3660161/2491890387 bits 0x2 rrc: 5 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 3847 timeout: 0
00010000:00010000:0.0:1311197985.450780:0:4096:0:(ldlm_request.c:309:ldlm_blocking_ast_nocheck()) ### Lock still has references, will be cancelled later ns: mdt-ffff81000af3b800 lock: ffff8100502da900/0x5897b4e9bb6dc774 lrc: 3/0,1 mode: CW/CW res: 3660161/2491890387 bits 0x2 rrc: 5 type: IBT flags: 0x4004030 remote: 0x0 expref: -99 pid: 3847 timeout: 0
00010000:00010000:0.0:1311197985.450800:0:4096:0:(ldlm_request.c:224:ldlm_completion_ast()) ### client-side enqueue returned a blocked lock, sleeping ns: mdt-ffff81000af3b800 lock: ffff810055893900/0x5897b4e9bb6dc7ba lrc: 3/1,0 mode: --/PR res: 3660161/2491890387 bits 0x3 rrc: 5 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 4096 timeout: 0
===================
Means mdt_02 created the lock "ffff810055893900/0x5897b4e9bb6dc7ba", but it was blocked by another lock "ffff8100502da900/0x5897b4e9bb6dc774", the later lock was held by mdt_01 (3847), which was just the thread was processing the first link operation and slept for "OBD_FAIL_MDS_PDO_LOCK".
Then back to study the client-side log, the lock "ffff810055893900/0x5897b4e9bb6dc7ba" was triggered by the RPC of "x1374890243444652"
===================
00000080:00200000:0.0:1311197985.451290:0:23245:0:(file.c:2190:__ll_inode_revalidate_it()) VFS Op:inode=3660161/2491890387(ffff81006ae87b50),name=/
00000002:00010000:0.0:1311197985.451306:0:23245:0:(mdc_locks.c:917:mdc_intent_lock()) (name: ,[0x37d981:0x94873ad3:0x0]) in obj [0x37d981:0x94873ad3:0x0], intent: getattr flags 00
00010000:00010000:0.0:1311197985.451316:0:23245:0:(ldlm_lock.c:1208:ldlm_lock_match()) ### not matched ns ffff81006a88e000 type 13 mode 30 res 3660161/2491890387 (0 0)
00010000:00010000:0.0:1311197985.451360:0:23245:0:(ldlm_lock.c:668:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-MDT0000-mdc-ffff81006b0dcc00 lock: ffff81006ad28000/0x21fb2945a70a2a7f lrc: 3/1,0 mode: --/CR res: 3660161/2491890387 bits 0x0 rrc: 2 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 23245 timeout: 0
00010000:00010000:0.0:1311197985.451366:0:23245:0:(ldlm_request.c:830:ldlm_cli_enqueue()) ### client-side enqueue START ns: lustre-MDT0000-mdc-ffff81006b0dcc00 lock: ffff81006ad28000/0x21fb2945a70a2a7f lrc: 3/1,0 mode: --/CR res: 3660161/2491890387 bits 0x2 rrc: 2 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 23245 timeout: 0
00010000:00010000:0.0:1311197985.451371:0:23245:0:(ldlm_request.c:892:ldlm_cli_enqueue()) ### sending request ns: lustre-MDT0000-mdc-ffff81006b0dcc00 lock: ffff81006ad28000/0x21fb2945a70a2a7f lrc: 3/1,0 mode: --/CR res: 3660161/2491890387 bits 0x2 rrc: 2 type: IBT flags: 0x0 remote: 0x0 expref: -99 pid: 23245 timeout: 0
00000100:00100000:0.0:1311197985.451382:0:23245:0:(client.c:1392:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc updatedb:4ba32714-4664-4650-51ba-87e8d8be7d5f:23245:1374890243444652:10.10.4.102@tcp:101
===================
As shown above, it was the thread 23245 that triggered the RPC "x1374890243444652". Unfortunately, we do not know what the thread 23245 want to do, it was not any threads related with sanityn test_40c, it looks like a system daemon to scan the filesystem.
|