[LU-3430] SWL failure: (hash.c:546:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed: Created: 31/May/13 Updated: 06/Nov/13 Resolved: 24/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.5.0, Lustre 2.4.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | Emoly Liu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
LLNL/Hyperion |
||
| Severity: | 3 |
| Rank (Obsolete): | 8505 |
| Description |
|
The SWL test was running with the NRS policy 'orr'; after 25 hours the OSS hit an LBUG. The assertion was logged multiple times during the initial stack dump (syslog copies of the messages were interleaved with the console trace):
2013-05-30 13:12:46 LustreError: 5770:0:(hash.c:546:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed:
2013-05-30 13:12:46 LustreError: 5770:0:(hash.c:546:cfs_hash_bd_del_locked()) LBUG
2013-05-30 13:12:46 Pid: 5770, comm: ll_ost_io00_077
2013-05-30 13:12:46
2013-05-30 13:12:46 Call Trace:
2013-05-30 13:12:46 [<ffffffffa04d1895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-05-30 13:12:46 [<ffffffffa04d1e97>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-05-30 13:12:46 [<ffffffffa04e785a>] cfs_hash_bd_del_locked+0xda/0x140 [libcfs]
2013-05-30 13:12:46 [<ffffffffa0a467e8>] nrs_orr_hop_put_free+0x218/0x290 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a456d8>] nrs_orr_res_put+0x28/0x60 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a3eb80>] nrs_resource_put_safe+0x60/0xf0 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a3ec30>] ptlrpc_nrs_req_finalize+0x20/0x30 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a05a32>] ptlrpc_server_finish_active_request+0x62/0x150 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a0c1a2>] ptlrpc_server_handle_request+0x1b2/0xc60 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa04d25de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-05-30 13:12:46 [<ffffffffa04e3d8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2013-05-30 13:12:46 [<ffffffffa0a036e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a0d71e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:46 [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-05-30 13:12:46 [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:46 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
May 30 13:12:46 hyperion-dit33 kernel: LustreError: 5770:0:(hash.c:546:cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed:
May 30 13:12:46 hyperion-dit33 kernel: LustreError: 5770:0:(hash.c:546:cfs_hash_bd_del_locked()) LBUG
2013-05-30 13:12:46
2013-05-30 13:12:46 Kernel panic - not syncing: LBUG
2013-05-30 13:12:46 Pid: 5770, comm: ll_ost_io00_077 Tainted: P --------------- 2.6.32-358.6.2.el6_lustre.g230b174.x86_64 #1
2013-05-30 13:12:46 Call Trace:
2013-05-30 13:12:46 [<ffffffff8150d878>] ? panic+0xa7/0x16f
2013-05-30 13:12:46 [<ffffffffa04d1eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2013-05-30 13:12:46 [<ffffffffa04e785a>] ? cfs_hash_bd_del_locked+0xda/0x140 [libcfs]
2013-05-30 13:12:46 [<ffffffffa0a467e8>] ? nrs_orr_hop_put_free+0x218/0x290 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a456d8>] ? nrs_orr_res_put+0x28/0x60 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a3eb80>] ? nrs_resource_put_safe+0x60/0xf0 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a3ec30>] ? ptlrpc_nrs_req_finalize+0x20/0x30 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a05a32>] ? ptlrpc_server_finish_active_request+0x62/0x150 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a0c1a2>] ? ptlrpc_server_handle_request+0x1b2/0xc60 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa04d25de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-05-30 13:12:46 [<ffffffffa04e3d8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2013-05-30 13:12:46 [<ffffffffa0a036e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-05-30 13:12:46 [<ffffffffa0a0d71e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
2013-05-30 13:12:47 [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:47 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
2013-05-30 13:12:47 [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:47 [<ffffffffa0a0cc50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2013-05-30 13:12:47 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
May 30 13:12:46 hyperion-dit33 kernel: Kernel panic - not syncing: LBUG
2013-05-30 13:12:47 Initializing cgroup subsys cpuset |
| Comments |
| Comment by Nikitas Angelinas [ 03/Jun/13 ] |
|
I think a check for oo_ref is missing in nrs_orr_hop_put_free(); I will try to look at this issue today or tomorrow. |
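To illustrate the kind of check being discussed, here is a minimal user-space sketch, not the Lustre code and not the patch that eventually landed. All `demo_*` names are hypothetical stand-ins: `refs` plays the role of a reference count such as oo_ref, and `count` plays the role of hsb_count. The sketch assumes the caller already holds the bucket lock, and it removes an object from its bucket only when the last reference is dropped and the object is still hashed, so the "count > 0" invariant cannot be violated by a second delete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the libcfs/ptlrpc structures. */
struct demo_bucket {
	int count;                  /* objects currently hashed in this bucket */
};

struct demo_object {
	int refs;                   /* references held by in-flight requests */
	bool hashed;                /* true while the object is in the bucket */
	struct demo_bucket *bucket;
};

/* Insert an object with no references into a bucket. */
static void demo_bucket_add(struct demo_bucket *b, struct demo_object *o)
{
	o->bucket = b;
	o->refs = 0;
	o->hashed = true;
	b->count++;
}

/* Take one reference on the object. */
static void demo_object_get(struct demo_object *o)
{
	o->refs++;
}

/* Same invariant as the failed assertion: never delete from an empty bucket. */
static void demo_bucket_del_locked(struct demo_bucket *b, struct demo_object *o)
{
	assert(b->count > 0);
	b->count--;
	o->hashed = false;
}

/*
 * Drop one reference. Assumes the bucket lock is held by the caller.
 * Returns true only if this call removed the object from its bucket.
 */
static bool demo_object_put_locked(struct demo_object *o)
{
	assert(o->refs > 0);
	if (--o->refs > 0)
		return false;       /* other users remain; keep it hashed */
	if (!o->hashed)
		return false;       /* already removed; avoid a double delete */
	demo_bucket_del_locked(o->bucket, o);
	return true;
}
```

Without the two early returns, a put path that deletes unconditionally could decrement the bucket count twice for the same object, which is one way an assertion of this form could fire. Whether that is the actual failure mode here is an assumption.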
| Comment by Andreas Dilger [ 03/Sep/13 ] |
|
Nikitas, any update on this problem? |
| Comment by Nikitas Angelinas [ 03/Sep/13 ] |
|
Hi Andreas, I have just been able to start working on this and other pending NRS issues, so I aim to have patches uploaded to Gerrit asap. |
| Comment by Nikitas Angelinas [ 11/Sep/13 ] |
|
The patch for master is at http://review.whamcloud.com/#/c/7623 |
| Comment by Peter Jones [ 12/Sep/13 ] |
|
Emoly, can you please take care of this patch? Thanks, Peter |
| Comment by Andreas Dilger [ 23/Sep/13 ] |
|
Cliff, patch is landed to master along with http://review.whamcloud.com/7708 ( |
| Comment by Andreas Dilger [ 24/Sep/13 ] |
|
Patch is landed, closing bug. |
| Comment by Bob Glossman (Inactive) [ 04/Nov/13 ] |
|
Backport to b2_4: http://review.whamcloud.com/8162