[LU-3936] ldlm_cancel_stale_locks()) ASSERTION( count > 0 ) failed Created: 12/Sep/13  Updated: 20/Nov/13  Resolved: 20/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Critical
Reporter: Andriy Skulysh Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 10409

 Description   

Aug 17 18:18:49 snx11003n003 kernel: [873893.844231] LustreError: 80225:0:(ldlm_lock.c:1792:ldlm_cancel_stale_locks()) ASSERTION( count > 0 ) failed:
Aug 17 18:18:49 snx11003n003 kernel: [873893.855652] LustreError: 80225:0:(ldlm_lock.c:1792:ldlm_cancel_stale_locks()) LBUG
Aug 17 18:18:49 snx11003n003 kernel: [873893.864399] Pid: 80225, comm: mdt_rdpg_84
Aug 17 18:18:49 snx11003n003 kernel: [873893.869077]
Aug 17 18:18:49 snx11003n003 kernel: [873893.869078] Call Trace:
Aug 17 18:18:49 snx11003n003 kernel: [873893.873894] [<ffffffffa0467865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 17 18:18:49 snx11003n003 kernel: [873893.881895] [<ffffffffa0467e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 17 18:18:49 snx11003n003 kernel: [873893.889058] [<ffffffffa06e8ecd>] ldlm_cancel_stale_locks+0x17d/0x180 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873893.897439] [<ffffffffa0477884>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
Aug 17 18:18:49 snx11003n003 kernel: [873893.905854] [<ffffffffa07195ec>] ldlm_srv_pool_shrink+0x7c/0x100 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873893.913795] [<ffffffffa0718b97>] ldlm_pool_shrink+0x37/0xf0 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873893.921254] [<ffffffffa07198f3>] ldlm_pools_shrink+0x283/0x330 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873893.928995] [<ffffffffa07199d3>] ldlm_pools_srv_shrink+0x13/0x20 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873893.936901] [<ffffffff81125d5a>] shrink_slab+0x13a/0x1a0
Aug 17 18:18:49 snx11003n003 kernel: [873893.943140] [<ffffffff81128934>] zone_reclaim+0x284/0x410
Aug 17 18:18:49 snx11003n003 kernel: [873893.949469] [<ffffffff81129640>] ? isolate_pages_global+0x0/0x380
Aug 17 18:18:49 snx11003n003 kernel: [873893.956582] [<ffffffff8111ef74>] get_page_from_freelist+0x694/0x820
Aug 17 18:18:49 snx11003n003 kernel: [873893.963887] [<ffffffff8111fe81>] __alloc_pages_nodemask+0x111/0x8b0
Aug 17 18:18:49 snx11003n003 kernel: [873893.971187] [<ffffffff8126c5a9>] ? pointer+0xa9/0x900
Aug 17 18:18:49 snx11003n003 kernel: [873893.977130] [<ffffffffa08cfaea>] ? kiblnd_queue_tx+0x4a/0x60 [ko2iblnd]
Aug 17 18:18:49 snx11003n003 kernel: [873893.984824] [<ffffffff81159d72>] kmem_getpages+0x62/0x170
Aug 17 18:18:49 snx11003n003 kernel: [873893.991152] [<ffffffff8115a3df>] cache_grow+0x2cf/0x320
Aug 17 18:18:49 snx11003n003 kernel: [873893.997287] [<ffffffff8115a632>] cache_alloc_refill+0x202/0x240
Aug 17 18:18:49 snx11003n003 kernel: [873894.004215] [<ffffffffa0468a03>] ? cfs_alloc+0x63/0x90 [libcfs]
Aug 17 18:18:49 snx11003n003 kernel: [873894.011138] [<ffffffff8115b339>] __kmalloc+0x1b9/0x230
Aug 17 18:18:49 snx11003n003 kernel: [873894.017188] [<ffffffffa0468a03>] cfs_alloc+0x63/0x90 [libcfs]
Aug 17 18:18:49 snx11003n003 kernel: [873894.023969] [<ffffffffa0772e3f>] null_alloc_rs+0x16f/0x3b0 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.031314] [<ffffffffa0760544>] sptlrpc_svc_alloc_rs+0x74/0x2d0 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.039237] [<ffffffffa0732393>] lustre_pack_reply_v2+0x93/0x280 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.047158] [<ffffffffa0734f20>] ? lustre_swab_mdt_rec_reint+0x0/0xb0 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.055667] [<ffffffffa0732636>] lustre_pack_reply_flags+0xb6/0x210 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.063978] [<ffffffffa07327a1>] lustre_pack_reply+0x11/0x20 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.071527] [<ffffffffa075df93>] req_capsule_server_pack+0x53/0x120 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.079807] [<ffffffffa0c6fecb>] mdt_close+0x10b/0x850 [mdt]
Aug 17 18:18:49 snx11003n003 kernel: [873894.086494] [<ffffffffa07330ec>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.094879] [<ffffffffa0c48a02>] mdt_handle_common+0x932/0x1770 [mdt]
Aug 17 18:18:49 snx11003n003 kernel: [873894.102391] [<ffffffffa0c498f5>] mdt_readpage_handle+0x15/0x20 [mdt]
Aug 17 18:18:49 snx11003n003 kernel: [873894.109838] [<ffffffffa0742b83>] ptlrpc_main+0xf13/0x19e0 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.117077] [<ffffffffa0741c70>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.124286] [<ffffffff8100c1ca>] child_rip+0xa/0x20
Aug 17 18:18:49 snx11003n003 kernel: [873894.130075] [<ffffffffa0741c70>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.137310] [<ffffffffa0741c70>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
Aug 17 18:18:49 snx11003n003 kernel: [873894.144517] [<ffffffff8100c1c0>] ? child_rip+0x0/0x20



 Comments   
Comment by Andriy Skulysh [ 12/Sep/13 ]

patch: http://review.whamcloud.com/#/c/7626/

Comment by Dmitry Eremin (Inactive) [ 24/Oct/13 ]

I'm not sure this issue is related to the 2.5 code. There is no ldlm_cancel_stale_locks() function there at all. Could you please specify the actual version of Lustre that hit this assertion?

Comment by Andriy Skulysh [ 24/Oct/13 ]

It was caught on Lustre 2.1, but that doesn't matter, because ldlm_pool_shrink() and others are called with a negative number of locks to cancel.

Comment by Dmitry Eremin (Inactive) [ 24/Oct/13 ]

Hmm. Could you provide a reproducer, please? I agree that the expression "1 + nr_locks * nr / total" can potentially overflow int32, but I am trying to understand why this causes the crash you are referring to.
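
For illustration, a minimal sketch of how the expression "1 + nr_locks * nr / total" can go negative in 32-bit arithmetic. The helper name shrink_share_32() and the sample values are hypothetical and are not taken from the Lustre source. The multiplication is done in uint32_t so that the wrap-around is well defined in C; the conversion back to int32_t is implementation-defined, but it wraps on the usual two's-complement compilers.

```c
#include <stdint.h>

/* Hypothetical helper: evaluates 1 + nr_locks * nr / total with a
 * 32-bit intermediate product, emulating what int32 arithmetic yields
 * once the product no longer fits. */
static int32_t shrink_share_32(int32_t nr_locks, int32_t nr, int32_t total)
{
    /* Unsigned multiply wraps modulo 2^32 instead of being undefined. */
    uint32_t product = (uint32_t)nr_locks * (uint32_t)nr;

    /* Reinterpret as signed: a product above INT32_MAX becomes negative. */
    return 1 + (int32_t)product / total;
}
```

With nr_locks = 100000, nr = 30000 and total = 200000, the true product is 3,000,000,000, which exceeds INT32_MAX (2,147,483,647), so the helper returns a negative count. That would be consistent with a "negative number of locks to cancel" reaching a check such as ASSERTION( count > 0 ), although the ticket does not confirm the exact path.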

Comment by Andreas Dilger [ 25/Oct/13 ]

Andriy,
can you please explain further how it is possible for there to be more than 2^31 locks on any node? This would require 2TB of RAM to keep that many locks in memory, and other parts of the code would completely explode trying to deal with that many locks. I think the approach of using __u64 for counting locks is completely bogus. This is clearly a case of some underflow that is caused by a negative number, not an overflow. If the root of the problem is that "nr" is negative in some kernels, then changing to __u64 just means that the node will try to cancel 2^63 locks or something else bad, and it does not fix the root of the problem.
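
A small sketch of the concern about widening a negative value to an unsigned 64-bit counter. The helper widen_count() is hypothetical; uint64_t stands in for the kernel's __u64 type.

```c
#include <stdint.h>

/* Hypothetical stand-in for storing a signed lock count in an
 * unsigned 64-bit variable. */
static uint64_t widen_count(int nr)
{
    /* Signed-to-unsigned conversion is defined modulo 2^64, so a
     * negative nr turns into a very large positive value. */
    return (uint64_t)nr;
}
```

Under this conversion, widen_count(-1) equals UINT64_MAX, and any negative int maps to a value above 2^63, so a negative input would still produce a nonsensical number of locks to cancel.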

Secondly, there is no ldlm_cancel_stale_locks() that I can find in either master or in 2.1, nor could I find the above LASSERT(count > 0) in some other function. Could you please tell me which specific version of Lustre this is in, or is this in some patch in Gerrit that is not landed yet?

I think I was incorrect in approving the original patch for this problem, because I didn't actually look closely enough at this bug when inspecting the code. I can't see how that patch actually fixes any problem.

Comment by Andreas Dilger [ 25/Oct/13 ]

My bad. I see that there is an integer overflow if "nr" is large, so the original patch is not useless.
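
One way to address both points raised in this thread (a negative "nr" and an overflowing product) can be sketched as follows. This is only an illustration under stated assumptions: shrink_share_guarded() is a hypothetical helper and does not reproduce the content of either Gerrit change mentioned in this ticket.

```c
#include <stdint.h>

/* Hypothetical helper: computes 1 + nr_locks * nr / total with input
 * checks and a 64-bit intermediate, then clamps the result. */
static int32_t shrink_share_guarded(int32_t nr_locks, int32_t nr, int32_t total)
{
    /* Reject non-positive inputs instead of propagating them. */
    if (nr_locks <= 0 || nr <= 0 || total <= 0)
        return 0;

    /* The product of two int32 values always fits in int64. */
    int64_t share = 1 + (int64_t)nr_locks * nr / total;

    /* Never ask to cancel more locks than the pool holds. */
    return share > nr_locks ? nr_locks : (int32_t)share;
}
```

For nr_locks = 100000, nr = 30000 and total = 200000 this yields 15001, and a negative nr yields 0 rather than a negative or huge count.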

I'm not yet sure what Dmitry's patch http://review.whamcloud.com/8075 is doing, but we shouldn't close this bug while that patch is still open.

Comment by Dmitry Eremin (Inactive) [ 20/Nov/13 ]

There are no more concerns, so I am closing the ticket.
