[LU-3430] SWL failure: (hash.c:546:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed: Created: 31/May/13  Updated: 06/Nov/13  Resolved: 24/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug
Priority: Critical
Reporter: Cliff White (Inactive)
Assignee: Emoly Liu
Resolution: Fixed
Votes: 0
Labels: patch
Environment: LLNL/Hyperion

Severity: 3
Rank (Obsolete): 8505

 Description   

While running the SWL test with the NRS policy 'orr', the OSS hit an LBUG after 25 hours. There were multiple assertions during the initial stack dump:

2013-05-30 13:12:46 LustreError: 5770:0:(hash.c:546:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed:
2013-05-30 13:12:46 LustreError: 5770:0:(hash.c:546:cfs_hash_bd_del_locked()) LBUG
2013-05-30 13:12:46 Pid: 5770, comm: ll_ost_io00_077
2013-05-30 13:12:46
2013-05-30 13:12:46 Call Trace:
2013-05-30 13:12:46  [<ffffffffa04d1895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-05-30 13:12:46 May 30 13:12:46 hyperion-dit33 kernel: LustreError: 5770:0:(hash.c:546:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed:
2013-05-30 13:12:46  [<ffffffffa04d1e97>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-05-30 13:12:46  [<ffffffffa04e785a>] cfs_hash_bd_del_locked+0xda/0x140 [libcfs]
2013-05-30 13:12:46  [<ffffffffa0a467e8>] nrs_orr_hop_put_free+0x218/0x290 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a456d8>] nrs_orr_res_put+0x28/0x60 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a3eb80>] nrs_resource_put_safe+0x60/0xf0 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a3ec30>] ptlrpc_nrs_req_finalize+0x20/0x30 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a05a32>] ptlrpc_server_finish_active_request+0x62/0x150 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a0c1a2>] ptlrpc_server_handle_request+0x1b2/0xc60 [ptlrpc]
2013-05-30 13:12:46 May 30 13:12:46 hyperion-dit33 kernel: LustreError: 5770:0:(hash.c:546:cfs_hash_bd_del_locked()) LBUG
2013-05-30 13:12:46  [<ffffffffa04d25de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-05-30 13:12:46  [<ffffffffa04e3d8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2013-05-30 13:12:46  [<ffffffffa0a036e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a0d71e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:46  [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-05-30 13:12:46  [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:46  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-05-30 13:12:46
2013-05-30 13:12:46 Kernel panic - not syncing: LBUG
2013-05-30 13:12:46 Pid: 5770, comm: ll_ost_io00_077 Tainted: P           ---------------    2.6.32-358.6.2.el6_lustre.g230b174.x86_64 #1
2013-05-30 13:12:46 Call Trace:
2013-05-30 13:12:46  [<ffffffff8150d878>] ? panic+0xa7/0x16f
2013-05-30 13:12:46 May 30 13:12:46 hyperion-dit33 kernel: Kernel panic - not syncing: LBUG
2013-05-30 13:12:46  [<ffffffffa04d1eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2013-05-30 13:12:46  [<ffffffffa04e785a>] ? cfs_hash_bd_del_locked+0xda/0x140 [libcfs]
2013-05-30 13:12:46  [<ffffffffa0a467e8>] ? nrs_orr_hop_put_free+0x218/0x290 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a456d8>] ? nrs_orr_res_put+0x28/0x60 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a3eb80>] ? nrs_resource_put_safe+0x60/0xf0 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a3ec30>] ? ptlrpc_nrs_req_finalize+0x20/0x30 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a05a32>] ? ptlrpc_server_finish_active_request+0x62/0x150 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a0c1a2>] ? ptlrpc_server_handle_request+0x1b2/0xc60 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa04d25de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-05-30 13:12:46  [<ffffffffa04e3d8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2013-05-30 13:12:46  [<ffffffffa0a036e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-05-30 13:12:46  [<ffffffffa0a0d71e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
2013-05-30 13:12:47  [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:47  [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
2013-05-30 13:12:47  [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:47  [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:47  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-05-30 13:12:47 Initializing cgroup subsys cpuset


 Comments   
Comment by Nikitas Angelinas [ 03/Jun/13 ]

I think a check for oo_ref is missing in nrs_orr_hop_put_free(); I will try to look at this issue today or tomorrow.
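
As an illustration of the kind of check described above, here is a minimal, single-threaded sketch of a reference-counted object living in a counted hash bucket. It is not the Lustre code and not the eventual fix: `toy_bucket`, `toy_object`, and every function name are hypothetical stand-ins, with `count` playing the role of `hsb_count` and `refs` the role of `oo_ref`. The point is only that the "put" path should remove the object from the bucket when the last reference is dropped, so the bucket count cannot be decremented twice for the same object.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT the libcfs or ptlrpc structures. */
struct toy_bucket {
    int count;      /* plays the role of hsb_count */
};

struct toy_object {
    int  refs;      /* plays the role of the oo_ref mentioned above */
    bool hashed;    /* true while the object is linked into the bucket */
};

/* Insert the object with one initial reference. */
static void toy_bucket_add(struct toy_bucket *b, struct toy_object *o)
{
    o->refs = 1;
    o->hashed = true;
    b->count++;
}

/* Take an additional reference on an already hashed object. */
static void toy_object_get(struct toy_object *o)
{
    o->refs++;
}

/* Mirrors the failed assertion: deleting from an empty bucket is a bug. */
static void toy_bucket_del(struct toy_bucket *b, struct toy_object *o)
{
    assert(b->count > 0);
    o->hashed = false;
    b->count--;
}

/* Drop one reference. Returns true if this call unhashed the object. */
static bool toy_object_put(struct toy_bucket *b, struct toy_object *o)
{
    assert(o->refs > 0);
    if (--o->refs > 0)
        return false;   /* other holders remain; leave it in the bucket */
    toy_bucket_del(b, o);
    return true;
}
```

Without the reference check in `toy_object_put()`, two holders releasing the same object would both call `toy_bucket_del()`, and the second call would trip the `count > 0` assertion. A real implementation would also have to do the check and the delete under the bucket lock, which this sketch omits.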

Comment by Andreas Dilger [ 03/Sep/13 ]

Nikitas, any update on this problem?

Comment by Nikitas Angelinas [ 03/Sep/13 ]

Hi Andreas,

I have just been able to start working on this and other pending NRS issues, so I aim to have patches uploaded to Gerrit asap.

Comment by Nikitas Angelinas [ 11/Sep/13 ]

Patch for master is at http://review.whamcloud.com/#/c/7623

Comment by Peter Jones [ 12/Sep/13 ]

Emoly

Can you please take care of this patch?

Thanks

Peter

Comment by Andreas Dilger [ 23/Sep/13 ]

Cliff, the patch has landed on master along with http://review.whamcloud.com/7708 (LU-3978). Can you please confirm that this fixes the problems with the NRS ORR policy? Hopefully we will also get some performance improvement now, maybe more noticeably for ZFS.

Comment by Andreas Dilger [ 24/Sep/13 ]

Patch is landed, closing bug.

Comment by Bob Glossman (Inactive) [ 04/Nov/13 ]

Backport to b2_4: http://review.whamcloud.com/8162
