[LU-948] Client recovery hang Created: 20/Dec/11 Updated: 16/Aug/16 Resolved: 16/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.3 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4759 |
| Description |
|
When I rebooted two OSS to put a patch for another bug in place, a client hung in recovery. Here is the backtrace of its ptlrpcd-rcv thread:

crash> bt 5077
PID: 5077  TASK: ffff88082da834c0  CPU: 8  COMMAND: "ptlrpcd-rcv"
 #0 [ffff88082da85430] schedule at ffffffff814ee3b2
 #1 [ffff88082da854f8] io_schedule at ffffffff814eeba3
 #2 [ffff88082da85518] sync_page at ffffffff81110fbd
 #3 [ffff88082da85528] __wait_on_bit_lock at ffffffff814ef40a
 #4 [ffff88082da85578] __lock_page at ffffffff81110f57
 #5 [ffff88082da855d8] vvp_page_own at ffffffffa093bf6a [lustre]
 #6 [ffff88082da855f8] cl_page_own0 at ffffffffa0601d3b [obdclass]
 #7 [ffff88082da85678] cl_page_own at ffffffffa0601fa0 [obdclass]
 #8 [ffff88082da85688] cl_page_gang_lookup at ffffffffa0603bb7 [obdclass]
 #9 [ffff88082da85758] cl_lock_page_out at ffffffffa06096fc [obdclass]
#10 [ffff88082da85808] osc_lock_flush at ffffffffa0858e8f [osc]
#11 [ffff88082da85858] osc_lock_cancel at ffffffffa0858f2a [osc]
#12 [ffff88082da858d8] cl_lock_cancel0 at ffffffffa0604665 [obdclass]
#13 [ffff88082da85928] cl_lock_cancel at ffffffffa06051ab [obdclass]
#14 [ffff88082da85968] osc_ldlm_blocking_ast at ffffffffa0859cf8 [osc]
#15 [ffff88082da859f8] ldlm_cancel_callback at ffffffffa06a1ba3 [ptlrpc]
#16 [ffff88082da85a18] ldlm_lock_cancel at ffffffffa06a1c89 [ptlrpc]
#17 [ffff88082da85a58] ldlm_cli_cancel_list_local at ffffffffa06bede8 [ptlrpc]
#18 [ffff88082da85ae8] ldlm_cancel_lru_local at ffffffffa06bf255 [ptlrpc]
#19 [ffff88082da85b08] ldlm_replay_locks at ffffffffa06bf385 [ptlrpc]
#20 [ffff88082da85bb8] ptlrpc_import_recovery_state_machine at ffffffffa070ceea [ptlrpc]
#21 [ffff88082da85c38] ptlrpc_connect_interpret at ffffffffa070db38 [ptlrpc]
#22 [ffff88082da85d08] ptlrpc_check_set at ffffffffa06dd870 [ptlrpc]
#23 [ffff88082da85de8] ptlrpcd_check at ffffffffa07113b8 [ptlrpc]
#24 [ffff88082da85e48] ptlrpcd at ffffffffa071175b [ptlrpc]
#25 [ffff88082da85f48] kernel_thread at ffffffff8100c14a

I will need to do more investigation, but that's a start. |
| Comments |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
Backtrace of all tasks on a hung client, and a dump_pages. |
| Comment by Peter Jones [ 20/Dec/11 ] |
|
Jinshan is looking into this |
| Comment by Jinshan Xiong (Inactive) [ 20/Dec/11 ] |
|
Did you get a chance to dump the Lustre log on the clients? |
| Comment by Christopher Morrone [ 20/Dec/11 ] |
|
No, unfortunately someone rebooted the cluster out from under me. That was all that I could grab before the nodes went dead. |
| Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ] |
|
I've found the root cause of this problem. If there is a BRW RPC on the resend list, cancelling the LDLM lock will deadlock, because the cancel path tries to take page locks that are still held by the pages in that pending RPC. A workaround is to disable ldlm_cancel_unused_locks_before_replay via procfs. I'm working on a fix. |
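For reference, the workaround above amounts to setting that tunable to 0 on the clients before they reconnect. A minimal sketch, assuming the tunable is exposed as cancel_unused_locks_before_replay under /proc/fs/lustre/ldlm/ (the exact proc path and lctl parameter name may differ between Lustre versions, so verify on the running system):

  # Disable cancellation of unused DLM locks before lock replay (assumed path)
  echo 0 > /proc/fs/lustre/ldlm/cancel_unused_locks_before_replay
  # Or, equivalently, via lctl (assumed parameter name)
  lctl set_param ldlm.cancel_unused_locks_before_replay=0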
| Comment by Christopher Morrone [ 10/Jan/12 ] |
|
With |
| Comment by Jinshan Xiong (Inactive) [ 10/Jan/12 ] |
|
I think this issue can be fixed by reverting commit 6fd5e00ff03d41b427eec5d70efaef4bbdd8d59c, which was added in bug 16774 to stop clients from replaying lots of unused locks during recovery. Since that issue was filed, and the fix implemented, by you guys, can you please tell me what the side effects are if clients replay unused locks during recovery? |
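Purely as a sketch of what that revert would involve in a checkout of the Lustre tree (whether it applies cleanly against a given branch is not guaranteed here):

  # Revert the bug 16774 change that cancels unused locks instead of replaying them
  git revert 6fd5e00ff03d41b427eec5d70efaef4bbdd8d59c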
| Comment by Christopher Morrone [ 10/Jan/12 ] |
|
We only implemented some of the 1.8 version. The CLIO version for 2.X doesn't look at all familiar to me. In production, clients would need to replay tens of thousands of locks, which completely overwhelms the servers. Since most of those locks are completely unused, it is better to drop them rather than replay them. If they are needed again in the future, the load of recreating them on demand is easier to deal with than the flood of lock replays at recovery time. At the time we could see the problem with just one or a few clients. If you do something like a Linux kernel compilation out of Lustre, you wind up with tens of thousands of locks on just that one node. I don't think we really want to abandon this ability. |
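As a rough way to see the scale involved, the number of DLM locks a client currently holds can be inspected per lock namespace; a sketch, assuming the standard ldlm proc entries are available on the client:

  # Number of locks currently held in each lock namespace on the client
  lctl get_param ldlm.namespaces.*.lock_count
  # LRU size limit for unused locks in each namespace
  lctl get_param ldlm.namespaces.*.lru_size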
| Comment by Jinshan Xiong (Inactive) [ 11/Jan/12 ] |
|
patch is at: http://review.whamcloud.com/1955 |
| Comment by Build Master (Inactive) [ 20/Jan/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Peter Jones [ 26/Jan/12 ] |
|
Landed for 2.2 |
| Comment by Bob Glossman (Inactive) [ 08/May/12 ] |
|
http://review.whamcloud.com/#change,2690 |
| Comment by Ned Bass [ 30/Aug/12 ] |
|
We hit this assertion during testing after cherry-picking http://review.whamcloud.com/#change,1955 into our 2.1.2 branch.

LustreError: 3846:0:(vvp_page.c:167:vvp_page_unmap()) ASSERTION(PageLocked(vmpage)) failed
LustreError: 3846:0:(vvp_page.c:167:vvp_page_unmap()) LBUG

PID: 3846  TASK: ffff88054af26aa0  CPU: 3  COMMAND: "ldlm_bl_10"
 #0 [ffff880567895948] machine_kexec at ffffffff8103216b
 #1 [ffff8805678959a8] crash_kexec at ffffffff810b8d12
 #2 [ffff880567895a78] panic at ffffffff814ee999
 #3 [ffff880567895af8] lbug_with_loc at ffffffffa0515e1b [libcfs]
 #4 [ffff880567895b18] libcfs_assertion_failed at ffffffffa051f42d [libcfs]
 #5 [ffff880567895b38] vvp_page_unmap at ffffffffa0aebc8c [lustre]
 #6 [ffff880567895b68] cl_page_invoke at ffffffffa06982f8 [obdclass]
 #7 [ffff880567895ba8] cl_page_unmap at ffffffffa0698383 [obdclass]
 #8 [ffff880567895bb8] check_and_discard_cb at ffffffffa069f6be [obdclass]
 #9 [ffff880567895c08] cl_page_gang_lookup at ffffffffa069b763 [obdclass]
#10 [ffff880567895cb8] cl_lock_page_out at ffffffffa069ce3b [obdclass]
#11 [ffff880567895d28] osc_lock_flush at ffffffffa09c197f [osc]
#12 [ffff880567895d78] osc_lock_cancel at ffffffffa09c1a19 [osc]
#13 [ffff880567895dc8] cl_lock_cancel0 at ffffffffa069c085 [obdclass]
#14 [ffff880567895df8] cl_lock_cancel at ffffffffa069c8b3 [obdclass]
#15 [ffff880567895e18] osc_ldlm_blocking_ast at ffffffffa09c2673 [osc]
#16 [ffff880567895e88] ldlm_handle_bl_callback at ffffffffa07a7db4 [ptlrpc]
#17 [ffff880567895eb8] ldlm_bl_thread_main at ffffffffa07a8139 [ptlrpc]
#18 [ffff880567895f48] kernel_thread at ffffffff8100c14a |
| Comment by Jinshan Xiong (Inactive) [ 30/Aug/12 ] |
|
please apply patch at |
| Comment by Ned Bass [ 31/Aug/12 ] |
|
Thanks, that patch cleared up the vvp_page_unmap() assertion. We're now testing recovery with this tag https://github.com/chaos/lustre/commits/2.1.2-3chaos plus the following patches.
Unfortunately I just found another client deadlocked in recovery with a stack trace similar to the one in the description. So I'm afraid http://review.whamcloud.com/#change,1955 does not fix the problem.

crash> bt 6257
PID: 6257  TASK: ffff8808317ccaa0  CPU: 4  COMMAND: "ptlrpcd-rcv"
 #0 [ffff88083084b680] schedule at ffffffff814ef052
 #1 [ffff88083084b748] io_schedule at ffffffff814ef833
 #2 [ffff88083084b768] sync_page at ffffffff81110d5d
 #3 [ffff88083084b778] __wait_on_bit_lock at ffffffff814f009a
 #4 [ffff88083084b7c8] __lock_page at ffffffff81110cf7
 #5 [ffff88083084b828] vvp_page_own at ffffffffa09ce18a [lustre]
 #6 [ffff88083084b858] cl_page_own0 at ffffffffa05ace9b [obdclass]
 #7 [ffff88083084b8a8] cl_page_own at ffffffffa05ad070 [obdclass]
 #8 [ffff88083084b8b8] check_and_discard_cb at ffffffffa05b26af [obdclass]
 #9 [ffff88083084b908] cl_page_gang_lookup at ffffffffa05ae763 [obdclass]
#10 [ffff88083084b9b8] cl_lock_page_out at ffffffffa05afe3b [obdclass]
#11 [ffff88083084ba28] osc_lock_flush at ffffffffa08b697f [osc]
#12 [ffff88083084ba78] osc_lock_cancel at ffffffffa08b6a19 [osc]
#13 [ffff88083084bac8] cl_lock_cancel0 at ffffffffa05af085 [obdclass]
#14 [ffff88083084baf8] cl_lock_cancel at ffffffffa05af8b3 [obdclass]
#15 [ffff88083084bb18] osc_ldlm_blocking_ast at ffffffffa08b7673 [osc]
#16 [ffff88083084bb88] ldlm_cancel_callback at ffffffffa06a0083 [ptlrpc]
#17 [ffff88083084bba8] ldlm_lock_cancel at ffffffffa06a0165 [ptlrpc]
#18 [ffff88083084bbc8] ldlm_cli_cancel_list_local at ffffffffa06b7698 [ptlrpc]
#19 [ffff88083084bc28] ldlm_cancel_lru_local at ffffffffa06b7a95 [ptlrpc]
#20 [ffff88083084bc48] ldlm_replay_locks at ffffffffa06b7b9f [ptlrpc]
#21 [ffff88083084bcc8] ptlrpc_import_recovery_state_machine at ffffffffa06f5c26 [ptlrpc]
#22 [ffff88083084bd08] ptlrpc_connect_interpret at ffffffffa06f65bd [ptlrpc]
#23 [ffff88083084bda8] ptlrpc_check_set at ffffffffa06d0955 [ptlrpc]
#24 [ffff88083084be48] ptlrpcd_check at ffffffffa06f8ee0 [ptlrpc]
#25 [ffff88083084be78] ptlrpcd at ffffffffa06f91fb [ptlrpc]
#26 [ffff88083084bf48] kernel_thread at ffffffff8100c14a |
| Comment by Jinshan Xiong (Inactive) [ 05/Sep/12 ] |
|
Hi Ned, can you please show me the backtraces of the other processes from when this issue happened? |
| Comment by Christopher Morrone [ 16/Oct/12 ] |
|
Jinshan, here is a console log showing all backtraces on the system. I'm not entirely certain if this will be useful, because this node was a PPC64 box running the older 2.1.0-13chaos code. 2.1.0-13chaos did NOT have the http://review.whamcloud.com/1955 fix. But the backtrace is the same for ptlrpcd-rcv, and you can see some other processes with backtraces under lustre ("cp" looks like an interesting one). I will attach seqlac2_console.txt. |
| Comment by Christopher Morrone [ 16/Oct/12 ] |
|
Attached seqlac2_console.txt showing backtraces of all processes on a PPC64 client. |
| Comment by Jinshan Xiong (Inactive) [ 21/Oct/12 ] |
|
Sorry for the delayed response. This must be a race between read-ahead pages and DLM lock cancellation; I'm looking into it. |
| Comment by James A Simmons [ 16/Aug/16 ] |
|
Old ticket for unsupported version |