[LU-2700] mdt deadlock in rename(?) Created: 29/Jan/13 Updated: 31/Jan/13 Resolved: 31/Jan/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | NFBlocker | ||
| Severity: | 3 |
| Rank (Obsolete): | 6293 |
| Description |
|
It seems a deadlock in rename was introduced this Friday 25th. [77051.148056] LNet: Service thread pid 6227 was inactive for 62.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [77051.148872] Pid: 6227, comm: mdt01_008 [77051.149108] [77051.149108] Call Trace: [77051.149529] [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30 [77051.149810] [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0 [77051.150071] [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10 [77051.150820] [<ffffffffa0bbf771>] cfs_waitq_timedwait+0x11/0x20 [libcfs] [77051.151116] [<ffffffffa0f7ed8d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc] [77051.151398] [<ffffffffa0f7a6a0>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] [77051.152764] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 [77051.153306] [<ffffffffa0f7e4d8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] [77051.153618] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77051.153917] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77051.154214] [<ffffffffa080984a>] mdt_reint_rename+0x7ba/0x1a70 [mdt] [77051.154580] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77051.154895] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77051.155220] [<ffffffffa0e29700>] ? lu_ucred+0x20/0x30 [obdclass] [77051.156752] [<ffffffffa07e7ff5>] ? mdt_ucred+0x15/0x20 [mdt] [77051.157068] [<ffffffffa080229c>] ? mdt_root_squash+0x2c/0x410 [mdt] [77051.157515] [<ffffffffa0fd16f6>] ? __req_capsule_get+0x166/0x700 [ptlrpc] [77051.157839] [<ffffffffa0e29700>] ? lu_ucred+0x20/0x30 [obdclass] [77051.158785] [<ffffffffa0806a51>] mdt_reint_rec+0x41/0xe0 [mdt] [77051.159066] [<ffffffffa08000f3>] mdt_reint_internal+0x4f3/0x7e0 [mdt] [77051.159358] [<ffffffffa0800424>] mdt_reint+0x44/0xe0 [mdt] [77051.159609] [<ffffffffa07f0732>] mdt_handle_common+0x8e2/0x1690 [mdt] [77051.163421] [<ffffffffa0825e95>] mds_regular_handle+0x15/0x20 [mdt] [77051.163763] [<ffffffffa0fb5af3>] ptlrpc_server_handle_request+0x463/0xe60 [ptlrpc] [77051.164304] [<ffffffffa0bbf65e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [77051.164635] [<ffffffffa0fae871>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [77051.164923] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [77051.165209] [<ffffffffa0fb85fd>] ptlrpc_main+0xb3d/0x18c0 [ptlrpc] [77051.165478] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.166098] [<ffffffff8100c14a>] child_rip+0xa/0x20 [77051.166279] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.166470] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.166643] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [77051.166802] [77051.166922] LustreError: dumping log to /tmp/lustre-log.1359282032.6227 [77051.270779] LNet: Service thread pid 30749 was inactive for 62.12s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [77051.271595] Pid: 30749, comm: mdt00_004 [77051.271845] [77051.271846] Call Trace: [77051.272326] [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30 [77051.272606] [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0 [77051.272926] [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10 [77051.273228] [<ffffffffa0bbf771>] cfs_waitq_timedwait+0x11/0x20 [libcfs] [77051.273570] [<ffffffffa0f7ed8d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc] [77051.273900] [<ffffffffa0f7a6a0>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] [77051.274404] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 [77051.274723] [<ffffffffa0f7e4d8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] [77051.275057] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77051.275395] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77051.275709] [<ffffffffa080984a>] mdt_reint_rename+0x7ba/0x1a70 [mdt] [77051.276044] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77051.276376] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77051.276723] [<ffffffffa0e29700>] ? lu_ucred+0x20/0x30 [obdclass] [77051.277061] [<ffffffffa07e7ff5>] ? mdt_ucred+0x15/0x20 [mdt] [77051.277363] [<ffffffffa080229c>] ? mdt_root_squash+0x2c/0x410 [mdt] [77051.277695] [<ffffffffa0fd16f6>] ? __req_capsule_get+0x166/0x700 [ptlrpc] [77051.278010] [<ffffffffa0e29700>] ? lu_ucred+0x20/0x30 [obdclass] [77051.278299] [<ffffffffa0806a51>] mdt_reint_rec+0x41/0xe0 [mdt] [77051.278606] [<ffffffffa08000f3>] mdt_reint_internal+0x4f3/0x7e0 [mdt] [77051.278916] [<ffffffffa0800424>] mdt_reint+0x44/0xe0 [mdt] [77051.279202] [<ffffffffa07f0732>] mdt_handle_common+0x8e2/0x1690 [mdt] [77051.279526] [<ffffffffa0825e95>] mds_regular_handle+0x15/0x20 [mdt] [77051.279864] [<ffffffffa0fb5af3>] ptlrpc_server_handle_request+0x463/0xe60 [ptlrpc] [77051.280418] [<ffffffffa0bbf65e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [77051.280734] [<ffffffffa0fae871>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [77051.281920] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 [77051.282248] [<ffffffffa0fb85fd>] ptlrpc_main+0xb3d/0x18c0 [ptlrpc] [77051.282606] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.282906] [<ffffffff8100c14a>] child_rip+0xa/0x20 [77051.283203] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.283529] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.283830] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [77051.284145] [77051.304078] Pid: 819, comm: mdt01_005 [77051.304351] [77051.304352] Call Trace: [77051.304821] [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30 [77051.305120] [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0 [77051.305424] [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10 [77051.305728] [<ffffffffa0bbf771>] cfs_waitq_timedwait+0x11/0x20 [libcfs] [77051.306062] [<ffffffffa0f7ed8d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc] [77051.306400] [<ffffffffa0f7a6a0>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] [77051.306905] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 [77051.307224] [<ffffffffa0f7e4d8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] [77051.307562] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77051.307892] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77051.308233] [<ffffffffa07eadc0>] mdt_object_lock0+0x360/0xb10 [mdt] [77051.308525] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77051.309085] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77051.309409] [<ffffffffa07eb614>] mdt_object_lock+0x14/0x20 [mdt] [77051.309714] [<ffffffffa07fd9c5>] mdt_getattr_name_lock+0xee5/0x1980 [mdt] [77051.310049] [<ffffffffa0fa6275>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] [77051.310377] [<ffffffffa0fd16f6>] ? __req_capsule_get+0x166/0x700 [ptlrpc] [77051.310710] [<ffffffffa0fa8474>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] [77051.311017] [<ffffffffa07fe6fd>] mdt_intent_getattr+0x29d/0x490 [mdt] [77051.311280] [<ffffffffa07fc26e>] mdt_intent_policy+0x3ae/0x750 [mdt] [77051.311568] [<ffffffffa0f5f70a>] ldlm_lock_enqueue+0x2ea/0x870 [ptlrpc] [77051.311843] [<ffffffffa0f85f07>] ldlm_handle_enqueue0+0x4f7/0x1090 [ptlrpc] [77051.312143] [<ffffffffa07fc746>] mdt_enqueue+0x46/0x110 [mdt] [77051.312420] [<ffffffffa07f0732>] mdt_handle_common+0x8e2/0x1690 [mdt] [77051.312690] [<ffffffffa0825e95>] mds_regular_handle+0x15/0x20 [mdt] [77051.312986] [<ffffffffa0fb5af3>] ptlrpc_server_handle_request+0x463/0xe60 [ptlrpc] [77051.313421] [<ffffffffa0bbf65e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [77051.313690] [<ffffffffa0fae871>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [77051.313952] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [77051.314203] [<ffffffffa0fb85fd>] ptlrpc_main+0xb3d/0x18c0 [ptlrpc] [77051.314477] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.314730] [<ffffffff8100c14a>] child_rip+0xa/0x20 [77051.314977] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.315274] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.315605] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [77051.315836] [77051.316029] LustreError: dumping log to /tmp/lustre-log.1359282032.819 [77051.372055] Pid: 29786, comm: mdt00_001 [77051.372306] [77051.372307] Call Trace: [77051.372715] [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30 [77051.372988] [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0 [77051.373257] [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10 [77051.373537] [<ffffffffa0bbf771>] cfs_waitq_timedwait+0x11/0x20 [libcfs] [77051.373850] [<ffffffffa0f7ed8d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc] [77051.374153] [<ffffffffa0f7a6a0>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] [77051.374620] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 [77051.374910] [<ffffffffa0f7e4d8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] [77051.375217] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77051.375532] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77051.375816] [<ffffffffa07eadc0>] mdt_object_lock0+0x360/0xb10 [mdt] [77051.376135] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77051.376447] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77051.376746] [<ffffffffa07eb614>] mdt_object_lock+0x14/0x20 [mdt] [77051.377025] [<ffffffffa07fd9c5>] mdt_getattr_name_lock+0xee5/0x1980 [mdt] [77051.377339] [<ffffffffa0fa6275>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] [77051.377641] [<ffffffffa0fd16f6>] ? __req_capsule_get+0x166/0x700 [ptlrpc] [77051.377977] [<ffffffffa0fa8474>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] [77051.378274] [<ffffffffa07fe6fd>] mdt_intent_getattr+0x29d/0x490 [mdt] [77051.378564] [<ffffffffa07fc26e>] mdt_intent_policy+0x3ae/0x750 [mdt] [77051.378858] [<ffffffffa0f5f70a>] ldlm_lock_enqueue+0x2ea/0x870 [ptlrpc] [77051.379163] [<ffffffffa0f85f07>] ldlm_handle_enqueue0+0x4f7/0x1090 [ptlrpc] [77051.379460] [<ffffffffa07fc746>] mdt_enqueue+0x46/0x110 [mdt] [77051.379735] [<ffffffffa07f0732>] mdt_handle_common+0x8e2/0x1690 [mdt] [77051.380053] [<ffffffffa0825e95>] mds_regular_handle+0x15/0x20 [mdt] [77051.380358] [<ffffffffa0fb5af3>] ptlrpc_server_handle_request+0x463/0xe60 [ptlrpc] [77051.380835] [<ffffffffa0bbf65e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [77051.381128] [<ffffffffa0fae871>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [77051.381421] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [77051.381696] [<ffffffffa0fb85fd>] ptlrpc_main+0xb3d/0x18c0 [ptlrpc] [77051.381992] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.382269] [<ffffffff8100c14a>] child_rip+0xa/0x20 [77051.382545] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.382839] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77051.383114] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [77051.383373] [77051.383565] LustreError: dumping log to /tmp/lustre-log.1359282032.29786 [77052.220055] LNet: Service thread pid 10930 was inactive for 62.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [77052.220826] LNet: Skipped 2 previous similar messages [77052.221069] Pid: 10930, comm: mdt00_010 [77052.221281] [77052.221281] Call Trace: [77052.221656] [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30 [77052.221907] [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0 [77052.222159] [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10 [77052.222416] [<ffffffffa0bbf771>] cfs_waitq_timedwait+0x11/0x20 [libcfs] [77052.222703] [<ffffffffa0f7ed8d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc] [77052.222984] [<ffffffffa0f7a6a0>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] [77052.223414] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 [77052.223690] [<ffffffffa0f7e4d8>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] [77052.223974] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77052.224247] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77052.224511] [<ffffffffa07eadc0>] mdt_object_lock0+0x360/0xb10 [mdt] [77052.224799] [<ffffffffa07e8830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] [77052.225083] [<ffffffffa0f7e8b0>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [77052.225502] [<ffffffffa07eb614>] mdt_object_lock+0x14/0x20 [mdt] [77052.225781] [<ffffffffa07fd9c5>] mdt_getattr_name_lock+0xee5/0x1980 [mdt] [77052.226094] [<ffffffffa0fa6275>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] [77052.226413] [<ffffffffa0fd16f6>] ? __req_capsule_get+0x166/0x700 [ptlrpc] [77052.226712] [<ffffffffa0fa8474>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] [77052.227012] [<ffffffffa07fe6fd>] mdt_intent_getattr+0x29d/0x490 [mdt] [77052.227298] [<ffffffffa07fc26e>] mdt_intent_policy+0x3ae/0x750 [mdt] [77052.227577] [<ffffffffa0f5f70a>] ldlm_lock_enqueue+0x2ea/0x870 [ptlrpc] [77052.227855] [<ffffffffa0f85f07>] ldlm_handle_enqueue0+0x4f7/0x1090 [ptlrpc] [77052.228229] [<ffffffffa07fc746>] mdt_enqueue+0x46/0x110 [mdt] [77052.228503] [<ffffffffa07f0732>] mdt_handle_common+0x8e2/0x1690 [mdt] [77052.228830] [<ffffffffa0825e95>] mds_regular_handle+0x15/0x20 [mdt] [77052.229154] [<ffffffffa0fb5af3>] ptlrpc_server_handle_request+0x463/0xe60 [ptlrpc] [77052.229758] [<ffffffffa0bbf65e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [77052.230033] [<ffffffffa0fae871>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [77052.230322] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [77052.230615] [<ffffffffa0fb85fd>] ptlrpc_main+0xb3d/0x18c0 [ptlrpc] [77052.230880] [<ffffffffa01979c0>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc] [77052.231181] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77052.231489] [<ffffffff8100c14a>] child_rip+0xa/0x20 [77052.231737] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77052.232006] [<ffffffffa0fb7ac0>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] [77052.232260] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [77052.232498] [77052.232673] LustreError: dumping log to /tmp/lustre-log.1359282033.10930 [77052.235465] LNet: Service thread pid 29790 was inactive for 62.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [77052.252077] LNet: Service thread pid 29788 was inactive for 62.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [77052.252623] LNet: Skipped 2 previous similar messages [77052.252883] LustreError: dumping log to /tmp/lustre-log.1359282033.29788 [77052.268049] LustreError: dumping log to /tmp/lustre-log.1359282033.29785 ... [77189.148054] Lustre: lock timed out (enqueued at 1359281970, 200s ago) [77189.148058] Lustre: lock timed out (enqueued at 1359281970, 200s ago) [77189.168469] LustreError: dumping log to /tmp/lustre-log.1359282170.30749 [77189.304162] Lustre: lock timed out (enqueued at 1359281970, 200s ago) [77190.220049] Lustre: lock timed out (enqueued at 1359281971, 200s ago) [77190.220048] Lustre: lock timed out (enqueued at 1359281971, 200s ago) [77190.220048] Lustre: lock timed out (enqueued at 1359281971, 200s ago) [77190.220048] Lustre: Skipped 1 previous similar message [77190.220051] Lustre: Skipped 1 previous similar message |
| Comments |
| Comment by Oleg Drokin [ 29/Jan/13 ] |
|
I have dumped logs if they are needed |
| Comment by Di Wang [ 29/Jan/13 ] |
|
This problem might be related with the recent landing about rename lock mdt_reint_rename() if (!lu_fid_eq(rr->rr_fid1, rr->rr_fid2)) { } We probably should remove this eq check, because even though the target and source are in the same directory, we still need serialize the locking order of these two children to avoid deadlock. In 2.x, there are no other way to do that, we have to rely on this BFL lock. |
| Comment by Jodi Levi (Inactive) [ 29/Jan/13 ] |
|
Di is looking into this one. |
| Comment by Di Wang [ 29/Jan/13 ] |
| Comment by Oleg Drokin [ 31/Jan/13 ] |
|
Patch landed |