[LU-488] LustreError: 4564:0:(connection.c:100:ptlrpc_connection_put()) ASSERTION(!cfs_hlist_unhashed(&conn->c_hash)) failed Created: 05/Jul/11 Updated: 28/Mar/12 Resolved: 25/Oct/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6/CHAOS5, Lustre 2.0.63-1chaos |
||
| Severity: | 3 |
| Rank (Obsolete): | 4876 |
| Description |
|
This LBUG was hit on an OSS node running Lustre version 2.0.63-1chaos with an additional patch adding some printk statements. The command that triggered the assertion: $ /etc/init.d/lustre stop Unmounting /mnt/lustre/local/lc2-OST0000 Unmounting /mnt/lustre/local/lc2-OST0004 Unmounting /mnt/lustre/local/lc2-OST0008 Unmounting /mnt/lustre/local/lc2-OST000c Unmounting /mnt/lustre/local/lc2-OST0010 Unmounting /mnt/lustre/local/lc2-OST0014 Unmounting /mnt/lustre/local/lc2-OST0018 Unmounting /mnt/lustre/local/lc2-OST001c Unmounting /mnt/lustre/local/lc2-OST0020 Unmounting /mnt/lustre/local/lc2-OST0024 Unmounting /mnt/lustre/local/lc2-OST0028 Console output: 2011-07-01 15:37:41 LustreError: 4564:0:(connection.c:100:ptlrpc_connection_put()) ASSERTION(!cfs_hlist_unhashed(&conn->c_hash)) failed 2011-07-01 15:37:41 LustreError: 4564:0:(connection.c:100:ptlrpc_connection_put()) LBUG 2011-07-01 15:37:41 Pid: 4564, comm: ll_ost_110 2011-07-01 15:37:41 Jul 1 15:37:40 2011-07-01 15:37:41 zwicky47 kernel:Call Trace: 2011-07-01 15:37:41 LustreError: 45 [<ffffffffa03218b5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 2011-07-01 15:37:41 64:0:(connection [<ffffffffa0321ef5>] lbug_with_loc+0x75/0xe0 [libcfs] 2011-07-01 15:37:41 .c:100:ptlrpc_co [<ffffffffa032ccf6>] libcfs_assertion_failed+0x66/0x70 [libcfs] 2011-07-01 15:37:41 nnection_put()) [<ffffffffa04c3f3e>] ptlrpc_connection_put+0x22e/0x240 [ptlrpc] 2011-07-01 15:37:41 ASSERTION(!cfs_h [<ffffffffa04cd87c>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc] 2011-07-01 15:37:41 list_unhashed(&c [<ffffffffa04c4cd6>] ptlrpc_send_reply+0x286/0x6f0 [ptlrpc] 2011-07-01 15:37:41 onn->c_hash)) fa [<ffffffff812627f9>] ? cpumask_next_and+0x29/0x50 2011-07-01 15:37:41 iled 2011-07-01 15:37:41 Jul 1 15:3 [<ffffffffa049640a>] target_send_reply_msg+0x4a/0x160 [ptlrpc] 2011-07-01 15:37:41 7:40 zwicky47 ke [<ffffffffa04cca5c>] ? lustre_msg_get_version+0x7c/0xe0 [ptlrpc] 2011-07-01 15:37:41 rnel: LustreErro [<ffffffffa04965de>] target_send_reply+0xbe/0x600 [ptlrpc] 2011-07-01 15:37:41 r: 4564:0:(conne [<ffffffffa04caabc>] ? lustre_msg_set_versions+0x9c/0x130 [ptlrpc] 2011-07-01 15:37:41 ction.c:100:ptlr [<ffffffffa089c625>] ? oti_to_request+0x75/0xc0 [ost] 2011-07-01 15:37:41 pc_connection_pu [<ffffffffa08a5639>] ost_handle+0x279/0x4b90 [ost] 2011-07-01 15:37:41 t()) LBUG 2011-07-01 15:37:41 [<ffffffff8105dc32>] ? default_wake_function+0x12/0x20 2011-07-01 15:37:41 [<ffffffff8104aee9>] ? __wake_up_common+0x59/0x90 2011-07-01 15:37:41 [<ffffffffa04ca7f4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc] 2011-07-01 15:37:41 [<ffffffffa04db3ce>] ptlrpc_main+0xb8e/0x1900 [ptlrpc] 2011-07-01 15:37:41 [<ffffffff8100c1ca>] child_rip+0xa/0x20 2011-07-01 15:37:41 [<ffffffffa04da840>] ? ptlrpc_main+0x0/0x1900 [ptlrpc] 2011-07-01 15:37:41 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20 2011-07-01 15:37:41 2011-07-01 15:37:41 Kernel panic - not syncing: LBUG 2011-07-01 15:37:41 Pid: 4564, comm: ll_ost_110 Tainted: P ---------------- T 2.6.32-131.2.1.1chaos.ch5.x86_64 #1 2011-07-01 15:37:41 Jul 1 15:37:40 Call Trace: 2011-07-01 15:37:41 zwicky47 kernel: [<ffffffff814da4fe>] ? panic+0x78/0x143 2011-07-01 15:37:41 Kernel panic - [<ffffffffa0321f4b>] ? lbug_with_loc+0xcb/0xe0 [libcfs] 2011-07-01 15:37:41 not syncing: LBU [<ffffffffa032ccf6>] ? libcfs_assertion_failed+0x66/0x70 [libcfs] 2011-07-01 15:37:41 G 2011-07-01 15:37:41 [<ffffffffa04c3f3e>] ? ptlrpc_connection_put+0x22e/0x240 [ptlrpc] 2011-07-01 15:37:41 [<ffffffffa04cd87c>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc] 2011-07-01 15:37:41 [<ffffffffa04c4cd6>] ? ptlrpc_send_reply+0x286/0x6f0 [ptlrpc] 2011-07-01 15:37:41 [<ffffffff812627f9>] ? cpumask_next_and+0x29/0x50 2011-07-01 15:37:41 [<ffffffffa049640a>] ? target_send_reply_msg+0x4a/0x160 [ptlrpc] 2011-07-01 15:37:41 [<ffffffffa04cca5c>] ? lustre_msg_get_version+0x7c/0xe0 [ptlrpc] 2011-07-01 15:37:41 [<ffffffffa04965de>] ? target_send_reply+0xbe/0x600 [ptlrpc] 2011-07-01 15:37:41 [<ffffffffa04caabc>] ? lustre_msg_set_versions+0x9c/0x130 [ptlrpc] 2011-07-01 15:37:41 [<ffffffffa089c625>] ? oti_to_request+0x75/0xc0 [ost] 2011-07-01 15:37:41 [<ffffffffa08a5639>] ? ost_handle+0x279/0x4b90 [ost] 2011-07-01 15:37:41 [<ffffffff8105dc32>] ? default_wake_function+0x12/0x20 2011-07-01 15:37:41 [<ffffffff8104aee9>] ? __wake_up_common+0x59/0x90 2011-07-01 15:37:41 [<ffffffffa04ca7f4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc] 2011-07-01 15:37:41 [<ffffffffa04db3ce>] ? ptlrpc_main+0xb8e/0x1900 [ptlrpc] 2011-07-01 15:37:41 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20 2011-07-01 15:37:41 [<ffffffffa04da840>] ? ptlrpc_main+0x0/0x1900 [ptlrpc] 2011-07-01 15:37:41 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20 |
| Comments |
| Comment by Peter Jones [ 06/Jul/11 ] |
|
Lai Could you please look into this one? Thanks Peter |
| Comment by Lai Siyao [ 06/Jul/11 ] |
|
Hi Prakash, could you get the kdump for such LBUG? Besides, before this LBUG, do you see any related error messages on the console? |
| Comment by Prakash Surya (Inactive) [ 07/Jul/11 ] |
|
Hi Lai, Here is some more info from the console directly before the LBUG was hit and reported. I'm not sure exactly what might be related and what might not so I'll just paste it all. The next message prior to these, happened at 15:30:12 and doesn't appear related. And, unfortunately we do not have a kdump for this LBUG. Jul 1 15:36:39 zwicky47 kernel: Lustre: Failing over lc2-OST0000 Jul 1 15:36:42 zwicky47 kernel: Lustre: lc2-OST0000: shutting down for failover; client state will be preserved. Jul 1 15:36:42 zwicky47 kernel: Lustre: OST lc2-OST0000 has stopped. Jul 1 15:36:45 zwicky47 kernel: Lustre: server umount lc2-OST0000 complete Jul 1 15:36:45 zwicky47 kernel: Lustre: Failing over lc2-OST0004 Jul 1 15:36:47 zwicky47 kernel: LustreError: 4553:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (-107) req@ffff88040f835050 x1372811638229630/t0(0) o-1-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1309559830 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1 Jul 1 15:36:47 zwicky47 kernel: LustreError: 4553:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 1 previous similar message Jul 1 15:36:47 zwicky47 kernel: LustreError: 137-5: UUID 'lc2-OST0004_UUID' is not available for connect (stopping) Jul 1 15:36:47 zwicky47 kernel: LustreError: 137-5: UUID 'lc2-OST0000_UUID' is not available for connect (no target) Jul 1 15:36:47 zwicky47 kernel: LustreError: 4560:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (-19) req@ffff8802035d0800 x1372811638229679/t0(0) o-1-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1309559907 ref 1 fl Interpret:/ffffffff/ffffffff rc -19/-1 Jul 1 15:36:48 zwicky47 kernel: Lustre: lc2-OST0004: shutting down for failover; client state will be preserved. Jul 1 15:36:48 zwicky47 kernel: Lustre: OST lc2-OST0004 has stopped. Jul 1 15:36:50 zwicky47 kernel: LustreError: 4562:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (-107) req@ffff8801faf86000 x1373166731831251/t0(0) o-1-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1309559833 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1 Jul 1 15:36:50 zwicky47 kernel: LustreError: 4562:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 4 previous similar messages Jul 1 15:36:50 zwicky47 kernel: LustreError: 137-5: UUID 'lc2-OST0004_UUID' is not available for connect (no target) Jul 1 15:36:50 zwicky47 kernel: LustreError: Skipped 1 previous similar message Jul 1 15:36:51 zwicky47 kernel: LustreError: 4560:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (-19) req@ffff8801f9c6e000 x1373166734966880/t0(0) o-1-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1309559911 ref 1 fl Interpret:/ffffffff/ffffffff rc -19/-1 Jul 1 15:36:51 zwicky47 kernel: LustreError: 4560:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 184 previous similar messages Jul 1 15:36:51 zwicky47 kernel: LustreError: 137-5: UUID 'lc2-OST0004_UUID' is not available for connect (no target) Jul 1 15:36:51 zwicky47 kernel: LustreError: Skipped 103 previous similar messages Jul 1 15:36:51 zwicky47 kernel: Lustre: server umount lc2-OST0004 complete Jul 1 15:36:51 zwicky47 kernel: Lustre: Failing over lc2-OST0008 Jul 1 15:36:54 zwicky47 kernel: Lustre: lc2-OST0008: shutting down for failover; client state will be preserved. Jul 1 15:36:54 zwicky47 kernel: Lustre: OST lc2-OST0008 has stopped. Jul 1 15:36:57 zwicky47 kernel: Lustre: server umount lc2-OST0008 complete Jul 1 15:36:57 zwicky47 kernel: Lustre: Failing over lc2-OST000c Jul 1 15:37:00 zwicky47 kernel: Lustre: lc2-OST000c: shutting down for failover; client state will be preserved. Jul 1 15:37:00 zwicky47 kernel: Lustre: OST lc2-OST000c has stopped. Jul 1 15:37:03 zwicky47 kernel: Lustre: server umount lc2-OST000c complete Jul 1 15:37:03 zwicky47 kernel: Lustre: Failing over lc2-OST0010 Jul 1 15:37:06 zwicky47 kernel: Lustre: lc2-OST0010: shutting down for failover; client state will be preserved. Jul 1 15:37:06 zwicky47 kernel: Lustre: OST lc2-OST0010 has stopped. Jul 1 15:37:09 zwicky47 kernel: Lustre: server umount lc2-OST0010 complete Jul 1 15:37:09 zwicky47 kernel: Lustre: Failing over lc2-OST0014 Jul 1 15:37:12 zwicky47 kernel: Lustre: lc2-OST0014: shutting down for failover; client state will be preserved. Jul 1 15:37:12 zwicky47 kernel: Lustre: OST lc2-OST0014 has stopped. Jul 1 15:37:12 zwicky47 kernel: LustreError: 4574:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (-107) req@ffff88021ea1a400 x1372811638229685/t0(0) o-1-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1309559855 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1 Jul 1 15:37:12 zwicky47 kernel: LustreError: 137-5: UUID 'lc2-OST0000_UUID' is not available for connect (no target) Jul 1 15:37:12 zwicky47 kernel: LustreError: Skipped 12 previous similar messages Jul 1 15:37:12 zwicky47 kernel: LustreError: 4574:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 48 previous similar messages Jul 1 15:37:15 zwicky47 kernel: Lustre: server umount lc2-OST0014 complete Jul 1 15:37:21 zwicky47 kernel: Lustre: Failing over lc2-OST001c Jul 1 15:37:21 zwicky47 kernel: Lustre: Skipped 1 previous similar message Jul 1 15:37:24 zwicky47 kernel: Lustre: lc2-OST001c: shutting down for failover; client state will be preserved. Jul 1 15:37:24 zwicky47 kernel: Lustre: Skipped 1 previous similar message Jul 1 15:37:24 zwicky47 kernel: Lustre: OST lc2-OST001c has stopped. Jul 1 15:37:24 zwicky47 kernel: Lustre: Skipped 1 previous similar message Jul 1 15:37:27 zwicky47 kernel: Lustre: server umount lc2-OST001c complete Jul 1 15:37:27 zwicky47 kernel: Lustre: Skipped 1 previous similar message Jul 1 15:37:37 zwicky47 kernel: LustreError: 4566:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (-107) req@ffff8802053c9c00 x1372811638229810/t0(0) o-1-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1309559880 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1 Jul 1 15:37:37 zwicky47 kernel: LustreError: 137-5: UUID 'lc2-OST0000_UUID' is not available for connect (no target) Jul 1 15:37:37 zwicky47 kernel: LustreError: Skipped 364 previous similar messages Jul 1 15:37:37 zwicky47 kernel: LustreError: 4566:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 612 previous similar messages Jul 1 15:37:39 zwicky47 kernel: Lustre: Failing over lc2-OST0028 Jul 1 15:37:39 zwicky47 kernel: Lustre: Skipped 2 previous similar messages Jul 1 15:37:40 zwicky47 kernel: LustreError: 4564:0:(connection.c:100:ptlrpc_connection_put()) ASSERTION(!cfs_hlist_unhashed(&conn->c_hash)) failed Jul 1 15:37:40 zwicky47 kernel: LustreError: 4564:0:(connection.c:100:ptlrpc_connection_put()) ASSERTION(!cfs_hlist_unhashed(&conn->c_hash)) failed Jul 1 15:37:40 zwicky47 kernel: LustreError: 4564:0:(connection.c:100:ptlrpc_connection_put()) LBUG Jul 1 15:37:40 zwicky47 kernel: LustreError: 4564:0:(connection.c:100:ptlrpc_connection_put()) LBUG |
| Comment by Lai Siyao [ 08/Jul/11 ] |
|
Liang, is it possible that connection hash is being rehashed while ptlrpc_connection_put() is called? |
| Comment by Liang Zhen (Inactive) [ 08/Jul/11 ] |
|
Lai, yes I think it's because of rehashing, because we will never remove element from conn_hash before destroying the hash-table. We probably should assert cfs_atomic_read(&conn->c_refcount) > 1 instead of this. Liang |
| Comment by Lai Siyao [ 08/Jul/11 ] |
|
Review is on http://review.whamcloud.com/#change,1074 |
| Comment by Prakash Surya (Inactive) [ 11/Jul/11 ] |
|
Hi Lai, Is that patch all that is needed to fix this issue? |
| Comment by Lai Siyao [ 11/Jul/11 ] |
|
Yes, the log shows that at the time of ASSERT, there are clients which try to connect to OSTs, and connection hash may be rehashed at client connection, while connection will be removed from and added to hash again during rehashing, if ptlrpc_connection_put() assert on (!cfs_hlist_unhashed(&conn->c_hash)), it will fail (during this assert hash lock is not held). You may verify with this patch on your system. |
| Comment by Prakash Surya (Inactive) [ 12/Jul/11 ] |
|
Ok Thanks Lai! This has only come up once so far. Would repeatedly mounting and unmounting the OSTs be a good way to stress test this and see if it comes up? Do there also need to be clients trying to simultaneously read/write to the OST? |
| Comment by Lai Siyao [ 12/Jul/11 ] |
|
Yes, umount and mount OST repeatedly is necessary (and before umount recovery should have finished and clients issued some IO), and at the moment there should be new clients mounting lustre filesystem, because connection hash will only be rehashed when new connection is added. |
| Comment by Prakash Surya (Inactive) [ 25/Oct/11 ] |
|
Patch http://review.whamcloud.com/1074 has landed. I believe this can closed now. |
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Lai Siyao [ 25/Oct/11 ] |
|
No more work needed. |
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|
| Comment by Build Master (Inactive) [ 25/Oct/11 ] |
|
Integrated in Oleg Drokin : 89836033e22f37d021fae878144da479b741caad
|