[LU-2726] lprocfs_counter_sub()) ASSERTION( (!((stats->ls_flags & LPROCFS_STATS_FLAG_IRQ_SAFE) == 0) || (!(((current_thread_info()->preempt_count) &... ) failed Created: 31/Jan/13  Updated: 08/Feb/13  Resolved: 08/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: HB

Severity: 3
Rank (Obsolete): 6626

 Description   

Probably caused by landing of http://review.whamcloud.com/3246 I now get this crash:

[ 5006.818563] LustreError: 138-a: lustre-OST0001: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[ 5006.820054] LustreError: Skipped 1 previous similar message
[ 5006.820979] LustreError: 3302:0:(lvfs_lib.c:131:lprocfs_counter_sub()) ASSERTION( (!((stats->ls_flags & LPROCFS_STATS_FLAG_IRQ_SAFE) == 0) || (!(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))))) ) failed: 
[ 5006.822543] LustreError: 3302:0:(lvfs_lib.c:131:lprocfs_counter_sub()) LBUG
[ 5006.822805] Kernel panic - not syncing: LBUG in interrupt.
[ 5006.822806] 
[ 5006.823205] Pid: 3302, comm: ldlm_elt Not tainted 2.6.32-debug #6
[ 5006.823447] Call Trace:
[ 5006.823635]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
[ 5006.823872]  [<ffffffffa0449f5d>] ? lbug_with_loc+0x8d/0xb0 [libcfs]
[ 5006.824123]  [<ffffffffa05a3027>] ? lprocfs_counter_sub+0x177/0x190 [lvfs]
[ 5006.824403]  [<ffffffffa07414ff>] ? ldlm_lock_put+0xcf/0x540 [ptlrpc]
[ 5006.824698]  [<ffffffffa0767fe0>] ? expired_lock_main+0x660/0x830 [ptlrpc]
[ 5006.825038]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[ 5006.825301]  [<ffffffffa0767980>] ? expired_lock_main+0x0/0x830 [ptlrpc]
[ 5006.825555]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 5006.825802]  [<ffffffffa0767980>] ? expired_lock_main+0x0/0x830 [ptlrpc]
[ 5006.826071]  [<ffffffffa0767980>] ? expired_lock_main+0x0/0x830 [ptlrpc]
[ 5006.826325]  [<ffffffff8100c140>] ? child_rip+0x0/0x20

I have a crashdump in /exports/crashdumps/192.168.10.223-2013-01-31-14\:11\:49/vmcore



 Comments   
Comment by John Hammond [ 31/Jan/13 ]

Please see http://review.whamcloud.com/5236.

Comment by Zhenyu Xu [ 31/Jan/13 ]

patch tracking http://review.whamcloud.com/5237

commit message
LU-2726 procfs: Invalid assertion for counter_{add,sub}

lprocfs_stats::lc_sum_irq is only used for obd_memory for accurate
memory usage counting, other procfs stats counters do not need it,
even though they can be used in RCU callbacks.
Comment by Jodi Levi (Inactive) [ 06/Feb/13 ]

Oleg is checking to see if Change, 5237 corrects the problem and can be landed.

Comment by Mikhail Pershin [ 07/Feb/13 ]

I've got another LBUG with this patch:

https://maloo.whamcloud.com/test_sessions/6c96e10e-716c-11e2-9241-52540035b04c

MDS console:

09:33:10:LNetError: 23499:0:(lib-lnet.h:489:lnet_me_alloc()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))) || (((sizeof(*me))) <= (2 << 12) && (((CFS_ALLOC_IO)) & CFS_ALLOC_ATOMIC)) != 0 ) failed: 
09:33:10:LNetError: 23499:0:(lib-lnet.h:489:lnet_me_alloc()) LBUG
09:33:10:Kernel panic - not syncing: LBUG in interrupt.
09:33:10:
09:33:10:Pid: 23499, comm: mdt00_003 Not tainted 2.6.32-279.19.1.el6_lustre.gc7121e9.x86_64 #1
09:33:10:Call Trace:
09:33:10: [<ffffffff814e9811>] ? panic+0xa0/0x168
09:33:10: [<ffffffffa0448edd>] ? lbug_with_loc+0x8d/0xb0 [libcfs]
09:33:10: [<ffffffffa04cdffe>] ? lnet_me_alloc+0x1ce/0x1d0 [lnet]
09:33:10: [<ffffffffa04ce3a8>] ? LNetMEAttach+0x98/0x220 [lnet]
09:33:10: [<ffffffffa070d1e2>] ? ptlrpc_register_rqbd+0x82/0x390 [ptlrpc]
09:33:10: [<ffffffffa04495de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
09:33:10: [<ffffffffa045adcf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
09:33:10: [<ffffffffa071a129>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
09:33:10: [<ffffffffa04592e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
09:33:10: [<ffffffffa071adc5>] ? ptlrpc_server_post_idle_rqbds+0x75/0xe0 [ptlrpc]
09:33:10: [<ffffffffa07240a2>] ? ptlrpc_main+0xb92/0x19e0 [ptlrpc]
09:33:10: [<ffffffffa0723510>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
09:33:10: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
09:33:10: [<ffffffffa0723510>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
09:33:10: [<ffffffffa0723510>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
09:33:10: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Comment by Zhenyu Xu [ 08/Feb/13 ]

this is another issue, deserves a new jira ticket.

lnet_me_alloc->LIBCFS_ALLOC->LIBCFS_ALLOC_GFP->LIBCFS_ALLOC_PRE

#define LIBCFS_ALLOC_PRE(size, mask)                                        \
do {                                                                        \
        LASSERT(!cfs_in_interrupt() ||                                      \
                ((size) <= LIBCFS_VMALLOC_SIZE &&                           \
                 ((mask) & CFS_ALLOC_ATOMIC)) != 0);                        \
} while (0)

I think it's alloc GFP mask issue.

/**
 * default allocator
 */
#define LIBCFS_ALLOC(ptr, size) \
        LIBCFS_ALLOC_GFP(ptr, size, CFS_ALLOC_IO)

/**
 * non-sleeping allocator
 */
#define LIBCFS_ALLOC_ATOMIC(ptr, size) \
        LIBCFS_ALLOC_GFP(ptr, size, CFS_ALLOC_ATOMIC)

Liang, could you take it a look?

Comment by Liang Zhen (Inactive) [ 08/Feb/13 ]

Hi Mike, I think the issue reported by you is because a defect in your patch, , please check my comment in your patch.

Comment by Mikhail Pershin [ 08/Feb/13 ]

Liang, thanks!

Comment by Oleg Drokin [ 08/Feb/13 ]

Patch landed, issue no longer happens in my testing

Generated at Sat Feb 10 01:27:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.