[LU-2726] lprocfs_counter_sub()) ASSERTION( (!((stats->ls_flags & LPROCFS_STATS_FLAG_IRQ_SAFE) == 0) || (!(((current_thread_info()->preempt_count) &... ) failed Created: 31/Jan/13 Updated: 08/Feb/13 Resolved: 08/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB | ||
| Severity: | 3 |
| Rank (Obsolete): | 6626 |
| Description |
|
Probably caused by landing of http://review.whamcloud.com/3246 I now get this crash: [ 5006.818563] LustreError: 138-a: lustre-OST0001: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107 [ 5006.820054] LustreError: Skipped 1 previous similar message [ 5006.820979] LustreError: 3302:0:(lvfs_lib.c:131:lprocfs_counter_sub()) ASSERTION( (!((stats->ls_flags & LPROCFS_STATS_FLAG_IRQ_SAFE) == 0) || (!(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))))) ) failed: [ 5006.822543] LustreError: 3302:0:(lvfs_lib.c:131:lprocfs_counter_sub()) LBUG [ 5006.822805] Kernel panic - not syncing: LBUG in interrupt. [ 5006.822806] [ 5006.823205] Pid: 3302, comm: ldlm_elt Not tainted 2.6.32-debug #6 [ 5006.823447] Call Trace: [ 5006.823635] [<ffffffff814f75e4>] ? panic+0xa0/0x168 [ 5006.823872] [<ffffffffa0449f5d>] ? lbug_with_loc+0x8d/0xb0 [libcfs] [ 5006.824123] [<ffffffffa05a3027>] ? lprocfs_counter_sub+0x177/0x190 [lvfs] [ 5006.824403] [<ffffffffa07414ff>] ? ldlm_lock_put+0xcf/0x540 [ptlrpc] [ 5006.824698] [<ffffffffa0767fe0>] ? expired_lock_main+0x660/0x830 [ptlrpc] [ 5006.825038] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 [ 5006.825301] [<ffffffffa0767980>] ? expired_lock_main+0x0/0x830 [ptlrpc] [ 5006.825555] [<ffffffff8100c14a>] ? child_rip+0xa/0x20 [ 5006.825802] [<ffffffffa0767980>] ? expired_lock_main+0x0/0x830 [ptlrpc] [ 5006.826071] [<ffffffffa0767980>] ? expired_lock_main+0x0/0x830 [ptlrpc] [ 5006.826325] [<ffffffff8100c140>] ? child_rip+0x0/0x20 I have a crashdump in /exports/crashdumps/192.168.10.223-2013-01-31-14\:11\:49/vmcore |
| Comments |
| Comment by John Hammond [ 31/Jan/13 ] |
|
Please see http://review.whamcloud.com/5236. |
| Comment by Zhenyu Xu [ 31/Jan/13 ] |
|
patch tracking http://review.whamcloud.com/5237 commit message LU-2726 procfs: Invalid assertion for counter_{add,sub}
lprocfs_stats::lc_sum_irq is only used for obd_memory for accurate
memory usage counting, other procfs stats counters do not need it,
even though they can be used in RCU callbacks.
|
| Comment by Jodi Levi (Inactive) [ 06/Feb/13 ] |
|
Oleg is checking to see if Change, 5237 corrects the problem and can be landed. |
| Comment by Mikhail Pershin [ 07/Feb/13 ] |
|
I've got another LBUG with this patch: https://maloo.whamcloud.com/test_sessions/6c96e10e-716c-11e2-9241-52540035b04c MDS console: 09:33:10:LNetError: 23499:0:(lib-lnet.h:489:lnet_me_alloc()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))) || (((sizeof(*me))) <= (2 << 12) && (((CFS_ALLOC_IO)) & CFS_ALLOC_ATOMIC)) != 0 ) failed: 09:33:10:LNetError: 23499:0:(lib-lnet.h:489:lnet_me_alloc()) LBUG 09:33:10:Kernel panic - not syncing: LBUG in interrupt. 09:33:10: 09:33:10:Pid: 23499, comm: mdt00_003 Not tainted 2.6.32-279.19.1.el6_lustre.gc7121e9.x86_64 #1 09:33:10:Call Trace: 09:33:10: [<ffffffff814e9811>] ? panic+0xa0/0x168 09:33:10: [<ffffffffa0448edd>] ? lbug_with_loc+0x8d/0xb0 [libcfs] 09:33:10: [<ffffffffa04cdffe>] ? lnet_me_alloc+0x1ce/0x1d0 [lnet] 09:33:10: [<ffffffffa04ce3a8>] ? LNetMEAttach+0x98/0x220 [lnet] 09:33:10: [<ffffffffa070d1e2>] ? ptlrpc_register_rqbd+0x82/0x390 [ptlrpc] 09:33:10: [<ffffffffa04495de>] ? cfs_timer_arm+0xe/0x10 [libcfs] 09:33:10: [<ffffffffa045adcf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 09:33:10: [<ffffffffa071a129>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 09:33:10: [<ffffffffa04592e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 09:33:10: [<ffffffffa071adc5>] ? ptlrpc_server_post_idle_rqbds+0x75/0xe0 [ptlrpc] 09:33:10: [<ffffffffa07240a2>] ? ptlrpc_main+0xb92/0x19e0 [ptlrpc] 09:33:10: [<ffffffffa0723510>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] 09:33:10: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 09:33:10: [<ffffffffa0723510>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] 09:33:10: [<ffffffffa0723510>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] 09:33:10: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
| Comment by Zhenyu Xu [ 08/Feb/13 ] |
|
this is another issue, deserves a new jira ticket. lnet_me_alloc->LIBCFS_ALLOC->LIBCFS_ALLOC_GFP->LIBCFS_ALLOC_PRE #define LIBCFS_ALLOC_PRE(size, mask) \ do { \ LASSERT(!cfs_in_interrupt() || \ ((size) <= LIBCFS_VMALLOC_SIZE && \ ((mask) & CFS_ALLOC_ATOMIC)) != 0); \ } while (0) I think it's alloc GFP mask issue.
/**
* default allocator
*/
#define LIBCFS_ALLOC(ptr, size) \
LIBCFS_ALLOC_GFP(ptr, size, CFS_ALLOC_IO)
/**
* non-sleeping allocator
*/
#define LIBCFS_ALLOC_ATOMIC(ptr, size) \
LIBCFS_ALLOC_GFP(ptr, size, CFS_ALLOC_ATOMIC)
Liang, could you take it a look? |
| Comment by Liang Zhen (Inactive) [ 08/Feb/13 ] |
|
Hi Mike, I think the issue reported by you is because a defect in your patch, |
| Comment by Mikhail Pershin [ 08/Feb/13 ] |
|
Liang, thanks! |
| Comment by Oleg Drokin [ 08/Feb/13 ] |
|
Patch landed, issue no longer happens in my testing |