[LU-948] Client recovery hang Created: 20/Dec/11  Updated: 16/Aug/16  Resolved: 16/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.2.0, Lustre 2.1.3

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: Jinshan Xiong (Inactive)
Resolution: Won't Fix Votes: 0
Labels: llnl

Attachments: File hype320_bt_pages.txt.bz2     Text File seqlac2_console.txt    
Issue Links:
Duplicate
is duplicated by LU-1066 Test failure on test suite replay-sin... Resolved
Related
is related to LU-1059 vvp_page_unmap()) ASSERTION(PageLocke... Resolved
Severity: 3
Rank (Obsolete): 4759

 Description   

When I rebooted two OSSes to put a patch for bug LU-874 on the servers, quite a few of the clients appear to have become deadlocked in recovery. Here's a backtrace of ptlrpcd-rcv on one client:

crash> bt 5077
PID: 5077   TASK: ffff88082da834c0  CPU: 8   COMMAND: "ptlrpcd-rcv"
 #0 [ffff88082da85430] schedule at ffffffff814ee3b2
 #1 [ffff88082da854f8] io_schedule at ffffffff814eeba3
 #2 [ffff88082da85518] sync_page at ffffffff81110fbd
 #3 [ffff88082da85528] __wait_on_bit_lock at ffffffff814ef40a
 #4 [ffff88082da85578] __lock_page at ffffffff81110f57
 #5 [ffff88082da855d8] vvp_page_own at ffffffffa093bf6a [lustre]
 #6 [ffff88082da855f8] cl_page_own0 at ffffffffa0601d3b [obdclass]
 #7 [ffff88082da85678] cl_page_own at ffffffffa0601fa0 [obdclass]
 #8 [ffff88082da85688] cl_page_gang_lookup at ffffffffa0603bb7 [obdclass]
 #9 [ffff88082da85758] cl_lock_page_out at ffffffffa06096fc [obdclass]
#10 [ffff88082da85808] osc_lock_flush at ffffffffa0858e8f [osc]
#11 [ffff88082da85858] osc_lock_cancel at ffffffffa0858f2a [osc]
#12 [ffff88082da858d8] cl_lock_cancel0 at ffffffffa0604665 [obdclass]
#13 [ffff88082da85928] cl_lock_cancel at ffffffffa06051ab [obdclass]
#14 [ffff88082da85968] osc_ldlm_blocking_ast at ffffffffa0859cf8 [osc]
#15 [ffff88082da859f8] ldlm_cancel_callback at ffffffffa06a1ba3 [ptlrpc]
#16 [ffff88082da85a18] ldlm_lock_cancel at ffffffffa06a1c89 [ptlrpc]
#17 [ffff88082da85a58] ldlm_cli_cancel_list_local at ffffffffa06bede8 [ptlrpc]
#18 [ffff88082da85ae8] ldlm_cancel_lru_local at ffffffffa06bf255 [ptlrpc]
#19 [ffff88082da85b08] ldlm_replay_locks at ffffffffa06bf385 [ptlrpc]
#20 [ffff88082da85bb8] ptlrpc_import_recovery_state_machine at ffffffffa070ceea [ptlrpc]
#21 [ffff88082da85c38] ptlrpc_connect_interpret at ffffffffa070db38 [ptlrpc]
#22 [ffff88082da85d08] ptlrpc_check_set at ffffffffa06dd870 [ptlrpc]
#23 [ffff88082da85de8] ptlrpcd_check at ffffffffa07113b8 [ptlrpc]
#24 [ffff88082da85e48] ptlrpcd at ffffffffa071175b [ptlrpc]
#25 [ffff88082da85f48] kernel_thread at ffffffff8100c14a

I will need to do more investigation, but that's a start.



 Comments   
Comment by Christopher Morrone [ 20/Dec/11 ]

Backtrace of all tasks on a hung client, and a dump_pages.

Comment by Peter Jones [ 20/Dec/11 ]

Jinshan is looking into this

Comment by Jinshan Xiong (Inactive) [ 20/Dec/11 ]

Did you have a chance to dump the Lustre log on the clients?

Comment by Christopher Morrone [ 20/Dec/11 ]

No, unfortunately someone rebooted the cluster out from under me. That was all that I could grab before the nodes went dead.

Comment by Jinshan Xiong (Inactive) [ 21/Dec/11 ]

I've found the root cause of this problem. If there is a BRW RPC in the resend list, canceling the LDLM lock will deadlock, since the cancel path tries to take a page lock that the queued RPC already holds. A workaround is to disable ldlm_cancel_unused_locks_before_replay via procfs.

I'm working on a fix.
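
To make the circular wait concrete, here is a minimal userspace C sketch of the lock-order inversion — not Lustre code; the mutex names and the trylock-based detection are my own scaffolding. One thread stands in for a BRW RPC that owns its pages and waits for recovery to complete; the main thread stands in for the replay path, which cancels locks (and so needs the page lock) while recovery is still in progress.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for the two resources involved in the hang:
 * "page_lock" ~ the vmpage lock held by a BRW RPC queued for resend,
 * "recovery"  ~ completion of import recovery, which the resend waits on. */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t recovery  = PTHREAD_MUTEX_INITIALIZER;

static void *resend_thread(void *arg)
{
        pthread_mutex_lock(&page_lock);   /* BRW RPC owns its pages... */
        sleep(1);
        pthread_mutex_lock(&recovery);    /* ...and waits for recovery */
        pthread_mutex_unlock(&recovery);
        pthread_mutex_unlock(&page_lock);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_mutex_lock(&recovery);    /* recovery is in progress */
        pthread_create(&t, NULL, resend_thread, NULL);
        sleep(2);
        /* The replay path cancelling unused locks must flush pages, which
         * needs the page lock -- trylock here instead of lock so the demo
         * reports the inversion rather than actually hanging. */
        if (pthread_mutex_trylock(&page_lock) != 0)
                printf("deadlock: cancel path needs a page lock held by "
                       "a resend that is waiting for recovery\n");
        else
                pthread_mutex_unlock(&page_lock);
        pthread_mutex_unlock(&recovery);
        pthread_join(t, NULL);
        return 0;
}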

Comment by Christopher Morrone [ 10/Jan/12 ]

With LU-874 much improved, this is our next concern for 2.1. Have you had any time to look at the fix for this?

Comment by Jinshan Xiong (Inactive) [ 10/Jan/12 ]

I think this issue can be fixed by reverting commit 6fd5e00ff03d41b427eec5d70efaef4bbdd8d59c, which was added in bug 16774 to stop clients from replaying lots of unused locks during recovery.

Since that issue was filed, and the fix implemented, by you folks, can you please tell me what the side effect is of clients replaying unused locks during recovery?

Comment by Christopher Morrone [ 10/Jan/12 ]

We only implemented some of the 1.8 version. The CLIO version for 2.X doesn't look at all familiar to me.

In production, clients will need to replay tens of thousands of locks, which completely overwhelms the servers. Since most of those locks are completely unused, it is better to drop the unused locks rather than replay them. If they are needed again in the future, the load to recreate them on demand is easier to deal with than the flood of lock replays at recovery time.

At the time we could see the problem with even just one or a few clients. If you did something like a Linux kernel compilation out of Lustre, you would wind up with tens of thousands of locks on just that one node.

I don't think we really want to abandon this ability.
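
For illustration, a toy C sketch of the policy being argued for here — cancel unused locks locally before replay and replay only the ones still in use; the struct and function names are invented stand-ins, not the real ldlm data structures.

#include <stdio.h>

/* Toy model -- fields and functions are invented for illustration. */
struct toy_lock {
        int id;
        int readers;    /* active users; 0/0 means the lock is "unused" */
        int writers;
};

static void cancel_local(struct toy_lock *lk) { printf("cancel lock %d locally\n", lk->id); }
static void queue_replay(struct toy_lock *lk) { printf("replay lock %d to server\n", lk->id); }

int main(void)
{
        struct toy_lock locks[] = { {1, 0, 0}, {2, 1, 0}, {3, 0, 1}, {4, 0, 0} };
        int i, n = sizeof(locks) / sizeof(locks[0]);

        /* Before replay: drop unused locks instead of flooding the
         * recovering server with tens of thousands of replay RPCs;
         * they can be re-acquired on demand later. */
        for (i = 0; i < n; i++) {
                if (locks[i].readers == 0 && locks[i].writers == 0)
                        cancel_local(&locks[i]);
                else
                        queue_replay(&locks[i]);
        }
        return 0;
}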

Comment by Jinshan Xiong (Inactive) [ 11/Jan/12 ]

patch is at: http://review.whamcloud.com/1955
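
Judging from the patch title, the idea is to let callers of cl_page_gang_lookup() pass a per-page callback so the scan does not have to block to own every page. A rough userspace C sketch of that shape follows — all names are invented (loosely echoing the check_and_discard_cb seen in later backtraces), and as the follow-up comments show, the real fix turned out not to be sufficient on its own.

#include <stdio.h>
#include <pthread.h>

/* Invented types -- a rough shape of a gang lookup that takes a
 * per-page callback instead of unconditionally blocking to own
 * each page. */
struct toy_page {
        int index;
        pthread_mutex_t lock;
};

typedef int (*page_cb_t)(struct toy_page *pg);

static void gang_lookup(struct toy_page *pages, int n, page_cb_t cb)
{
        int i;
        for (i = 0; i < n; i++)
                if (cb(&pages[i]) != 0)
                        break;          /* callback may abort the scan */
}

/* Discard a page only if it can be owned without sleeping; a page
 * pinned by an in-flight (or queued-for-resend) RPC is skipped so
 * the cancel path cannot deadlock behind it. */
static int check_and_discard(struct toy_page *pg)
{
        if (pthread_mutex_trylock(&pg->lock) != 0) {
                printf("page %d busy, skipping\n", pg->index);
                return 0;
        }
        printf("page %d discarded\n", pg->index);
        pthread_mutex_unlock(&pg->lock);
        return 0;
}

int main(void)
{
        struct toy_page pages[3] = {
                {0, PTHREAD_MUTEX_INITIALIZER},
                {1, PTHREAD_MUTEX_INITIALIZER},
                {2, PTHREAD_MUTEX_INITIALIZER},
        };
        pthread_mutex_lock(&pages[1].lock);  /* simulate a pinned page */
        gang_lookup(pages, 3, check_and_discard);
        pthread_mutex_unlock(&pages[1].lock);
        return 0;
}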

Comment by Build Master (Inactive) [ 20/Jan/12 ]

Integrated in lustre-master #440 on all 14 tested build configurations (x86_64 and i686; client and server; el5, el6, sles11, and ubuntu1004; inkernel and ofa).
LU-948 clio: add a callback to cl_page_gang_lookup() (Revision 7076eff5cd415472061a26c897469dd5b8174861)

Result = SUCCESS
Oleg Drokin : 7076eff5cd415472061a26c897469dd5b8174861
Files :

  • lustre/include/cl_object.h
  • lustre/obdclass/cl_internal.h
  • lustre/obdclass/cl_lock.c
  • lustre/obdclass/cl_page.c
  • lustre/osc/osc_lock.c
Comment by Peter Jones [ 26/Jan/12 ]

Landed for 2.2

Comment by Bob Glossman (Inactive) [ 08/May/12 ]

Back port to b2_1: http://review.whamcloud.com/#change,2690

Comment by Ned Bass [ 30/Aug/12 ]

We hit this assertion during testing after cherry-picking http://review.whamcloud.com/#change,1955 into our 2.1.2 branch.

LustreError: 3846:0:(vvp_page.c:167:vvp_page_unmap()) ASSERTION(PageLocked(vmpage)) failed
LustreError: 3846:0:(vvp_page.c:167:vvp_page_unmap()) LBUG

PID: 3846   TASK: ffff88054af26aa0  CPU: 3   COMMAND: "ldlm_bl_10"
 #0 [ffff880567895948] machine_kexec at ffffffff8103216b
 #1 [ffff8805678959a8] crash_kexec at ffffffff810b8d12
 #2 [ffff880567895a78] panic at ffffffff814ee999
 #3 [ffff880567895af8] lbug_with_loc at ffffffffa0515e1b [libcfs]
 #4 [ffff880567895b18] libcfs_assertion_failed at ffffffffa051f42d [libcfs]
 #5 [ffff880567895b38] vvp_page_unmap at ffffffffa0aebc8c [lustre]
 #6 [ffff880567895b68] cl_page_invoke at ffffffffa06982f8 [obdclass]
 #7 [ffff880567895ba8] cl_page_unmap at ffffffffa0698383 [obdclass]
 #8 [ffff880567895bb8] check_and_discard_cb at ffffffffa069f6be [obdclass]
 #9 [ffff880567895c08] cl_page_gang_lookup at ffffffffa069b763 [obdclass]
#10 [ffff880567895cb8] cl_lock_page_out at ffffffffa069ce3b [obdclass]
#11 [ffff880567895d28] osc_lock_flush at ffffffffa09c197f [osc]
#12 [ffff880567895d78] osc_lock_cancel at ffffffffa09c1a19 [osc]
#13 [ffff880567895dc8] cl_lock_cancel0 at ffffffffa069c085 [obdclass]
#14 [ffff880567895df8] cl_lock_cancel at ffffffffa069c8b3 [obdclass]
#15 [ffff880567895e18] osc_ldlm_blocking_ast at ffffffffa09c2673 [osc]
#16 [ffff880567895e88] ldlm_handle_bl_callback at ffffffffa07a7db4 [ptlrpc]
#17 [ffff880567895eb8] ldlm_bl_thread_main at ffffffffa07a8139 [ptlrpc]
#18 [ffff880567895f48] kernel_thread at ffffffff8100c14a
Comment by Jinshan Xiong (Inactive) [ 30/Aug/12 ]

please apply patch at LU-1059

Comment by Ned Bass [ 31/Aug/12 ]

Thanks, that patch cleared up the vvp_page_unmap() assertion. We're now testing recovery with this tag https://github.com/chaos/lustre/commits/2.1.2-3chaos plus the following patches.

  • c882f3a LU-1059 clio: to not try to discard freeing pages
  • 9286c90 LU-948 clio: add a callback to cl_page_gang_lookup()
  • 82a8948 LU-1716 ptlrpc: Race in updating of connection flags on client.
  • 8812d32 LU-1299 clio: set the lock to CLS_NEW if unuse returns error

Unfortunately, I just found another client deadlocked in recovery with a stack trace similar to the one in the description. So I'm afraid http://review.whamcloud.com/#change,1955 does not fix the problem.

crash> bt 6257
PID: 6257   TASK: ffff8808317ccaa0  CPU: 4   COMMAND: "ptlrpcd-rcv"
 #0 [ffff88083084b680] schedule at ffffffff814ef052
 #1 [ffff88083084b748] io_schedule at ffffffff814ef833
 #2 [ffff88083084b768] sync_page at ffffffff81110d5d
 #3 [ffff88083084b778] __wait_on_bit_lock at ffffffff814f009a
 #4 [ffff88083084b7c8] __lock_page at ffffffff81110cf7
 #5 [ffff88083084b828] vvp_page_own at ffffffffa09ce18a [lustre]
 #6 [ffff88083084b858] cl_page_own0 at ffffffffa05ace9b [obdclass]
 #7 [ffff88083084b8a8] cl_page_own at ffffffffa05ad070 [obdclass]
 #8 [ffff88083084b8b8] check_and_discard_cb at ffffffffa05b26af [obdclass]
 #9 [ffff88083084b908] cl_page_gang_lookup at ffffffffa05ae763 [obdclass]
#10 [ffff88083084b9b8] cl_lock_page_out at ffffffffa05afe3b [obdclass]
#11 [ffff88083084ba28] osc_lock_flush at ffffffffa08b697f [osc]
#12 [ffff88083084ba78] osc_lock_cancel at ffffffffa08b6a19 [osc]
#13 [ffff88083084bac8] cl_lock_cancel0 at ffffffffa05af085 [obdclass]
#14 [ffff88083084baf8] cl_lock_cancel at ffffffffa05af8b3 [obdclass]
#15 [ffff88083084bb18] osc_ldlm_blocking_ast at ffffffffa08b7673 [osc]
#16 [ffff88083084bb88] ldlm_cancel_callback at ffffffffa06a0083 [ptlrpc]
#17 [ffff88083084bba8] ldlm_lock_cancel at ffffffffa06a0165 [ptlrpc]
#18 [ffff88083084bbc8] ldlm_cli_cancel_list_local at ffffffffa06b7698 [ptlrpc]
#19 [ffff88083084bc28] ldlm_cancel_lru_local at ffffffffa06b7a95 [ptlrpc]
#20 [ffff88083084bc48] ldlm_replay_locks at ffffffffa06b7b9f [ptlrpc]
#21 [ffff88083084bcc8] ptlrpc_import_recovery_state_machine at ffffffffa06f5c26 [ptlrpc]
#22 [ffff88083084bd08] ptlrpc_connect_interpret at ffffffffa06f65bd [ptlrpc]
#23 [ffff88083084bda8] ptlrpc_check_set at ffffffffa06d0955 [ptlrpc]
#24 [ffff88083084be48] ptlrpcd_check at ffffffffa06f8ee0 [ptlrpc]
#25 [ffff88083084be78] ptlrpcd at ffffffffa06f91fb [ptlrpc]
#26 [ffff88083084bf48] kernel_thread at ffffffff8100c14a
Comment by Jinshan Xiong (Inactive) [ 05/Sep/12 ]

Hi Ned, can you please show me the backtraces of the other processes from when this issue happened?

Comment by Christopher Morrone [ 16/Oct/12 ]

Jinshan, here is a console log showing all backtraces on the system. I'm not entirely certain this will be useful, because this node was a PPC64 box running the older 2.1.0-13chaos code, which did NOT have the http://review.whamcloud.com/1955 fix.

But the backtrace is the same for ptlrpcd-rcv, and you can see some other processes with backtraces under Lustre ("cp" looks like an interesting one). I will attach seqlac2_console.txt.

Comment by Christopher Morrone [ 16/Oct/12 ]

Attached seqlac2_console.txt showing backtraces of all processes on a PPC64 client.

Comment by Jinshan Xiong (Inactive) [ 21/Oct/12 ]

Sorry for the delayed response. This must be a race between readahead pages and DLM lock cancellation; I'm looking into it.

Comment by James A Simmons [ 16/Aug/16 ]

Closing this old ticket; it applies to an unsupported version.
