[LU-8391] ASSERTION( list_empty(&lock->l_res_link) ) Created: 13/Jul/16  Updated: 01/Aug/19  Resolved: 29/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Major
Reporter: Li Dongyang (Inactive) Assignee: Dongyang Li
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
is duplicated by LU-11223 Changed resource in completion ast Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lustre client crashed after eviction:

<3>LustreError: 20358:0:(ldlm_lockd.c:1797:ldlm_handle_cp_callback()) change resource!
<3>LustreError: 3259:0:(ldlm_lockd.c:1797:ldlm_handle_cp_callback()) change resource!
<0>LustreError: 3259:0:(ldlm_lock.c:1096:ldlm_granted_list_add_lock()) ASSERTION( list_empty(&lock->l_res_link) ) failed:
<0>LustreError: 3259:0:(ldlm_lock.c:1096:ldlm_granted_list_add_lock()) LBUG
<4>Pid: 3259, comm: ldlm_cb03_002
<4>
<4>Call Trace:
<4> [<ffffffffa04b9895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa04b9e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0898e87>] ldlm_grant_lock+0x9b7/0xa20 [ptlrpc]
<4> [<ffffffffa08b106f>] ldlm_handle_cp_callback+0x1cf/0x1150 [ptlrpc]
<4> [<ffffffffa08b3168>] ldlm_callback_handler+0x1178/0x1810 [ptlrpc]
<4> [<ffffffffa06db9af>] ? keys_fill+0x6f/0x190 [obdclass]
<4> [<ffffffffa08d177c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa08d7e2b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
<4> [<ffffffffa08e02c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4> [<ffffffffa04cb4ea>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
<4> [<ffffffffa08d7929>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa08e162d>] ptlrpc_main+0xaed/0x1780 [ptlrpc]
<4> [<ffffffffa08e0b40>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 3259, comm: ldlm_cb03_002 Not tainted 2.6.32-431.29.2.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8152873c>] ? panic+0xa7/0x16f
<4> [<ffffffffa04b9eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0898e87>] ? ldlm_grant_lock+0x9b7/0xa20 [ptlrpc]
<4> [<ffffffffa08b106f>] ? ldlm_handle_cp_callback+0x1cf/0x1150 [ptlrpc]
<4> [<ffffffffa08b3168>] ? ldlm_callback_handler+0x1178/0x1810 [ptlrpc]
<4> [<ffffffffa06db9af>] ? keys_fill+0x6f/0x190 [obdclass]
<4> [<ffffffffa08d177c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa08d7e2b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
<4> [<ffffffffa08e02c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4> [<ffffffffa04cb4ea>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
<4> [<ffffffffa08d7929>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4> [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4> [<ffffffffa08e162d>] ? ptlrpc_main+0xaed/0x1780 [ptlrpc]
<4> [<ffffffffa08e0b40>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

The LDLM lock was granted twice, once by thread 20358 and once by thread 3259.



 Comments   
Comment by Gerrit Updater [ 13/Jul/16 ]

Li Dongyang (dongyang.li@anu.edu.au) uploaded a new patch: http://review.whamcloud.com/21275
Subject: LU-8391 ldlm: check double grant race after resource change
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: df9f229615e9c8dbb7d004306a7310e62bd4a4f5

Comment by Jinshan Xiong (Inactive) [ 16/Jul/16 ]

Can you please describe the problem in a bit more detail? I don't understand how the patch fixes the problem by moving the code in question up.

Comment by Li Dongyang (Inactive) [ 21/Jul/16 ]

In ldlm_handle_cp_callback(), thread 20358 takes the spinlocks on the resource and the lock, then checks whether the lock has already been granted:

	if (ldlm_is_destroyed(lock) ||
	    lock->l_granted_mode == lock->l_req_mode) {
		/* bug 11300: the lock has already been granted */
		unlock_res_and_lock(lock);
		LDLM_DEBUG(lock, "Double grant race happened");
		GOTO(out, rc = 0);
	}

Thread 20358 finds that the resource has changed, so it drops the spinlocks on the resource and the lock and goes ahead to allocate the new resource.
Then thread 3259 comes in, takes the spinlocks, and performs the same check above; the lock has not been granted yet, so it also proceeds to drop the spinlocks and change the resource.

Thread 20358 then finishes changing the resource, takes the spinlocks again, calls ldlm_grant_lock(), and calls unlock_res_and_lock(). After this, thread 3259 takes the spinlocks again and also calls ldlm_grant_lock(),
but the lock has already been granted by 20358, which triggers the LASSERT.

Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ]

close old tickets

Comment by Gerrit Updater [ 29/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/21275/
Subject: LU-8391 ldlm: check double grant race after resource change
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fef1020406a0fb99ebb7871d7648bfad7a9d91a6

Comment by Peter Jones [ 29/Oct/18 ]

Landed for 2.12

Comment by Oleg Drokin [ 31/Oct/18 ]

Hm... I wonder if this was completely fixed?

I just got this in my testing of master-next:

[32847.180693] LustreError: 25301:0:(ldlm_lockd.c:1828:ldlm_handle_cp_callback()) change resource!
[32872.102225] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ldlm_cb00_002:25891]
[32872.102342] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod brd ext4 loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic pata_acpi drm_kms_helper ata_piix i2c_piix4 drm libata floppy pcspkr virtio_balloon serio_raw i2c_core virtio_console virtio_blk ip_tables [last unloaded: libcfs]
[32872.102342] CPU: 0 PID: 25891 Comm: ldlm_cb00_002 Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-7.5-debug2 #1
[32872.102342] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[32872.102342] task: ffff8802c4b64380 ti: ffff8803048dc000 task.ti: ffff8803048dc000
[32872.102342] RIP: 0010:[<ffffffff8177d736>]  [<ffffffff8177d736>] native_safe_halt+0x6/0x10
[32872.102342] RSP: 0018:ffff8803048dfbf0  EFLAGS: 00000202
[32872.102342] RAX: 0000000000000001 RBX: 000000000000008a RCX: 00000000b6dd2ad7
[32872.102342] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88033da19cc4
[32872.102342] RBP: ffff8803048dfbf0 R08: 00000000b6dd2ad7 R09: 00000000000000c0
[32872.102342] R10: 0000000000000000 R11: ffff8803048dfa9e R12: ffff8803048dfb80
[32872.102342] R13: ffffffff8177dc8e R14: ffff8803048dfb60 R15: ffff88022e3df800
[32872.102342] FS:  0000000000000000(0000) GS:ffff88033da00000(0000) knlGS:0000000000000000
[32872.102342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32872.160091] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [touch:5822]
[32872.160091] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod brd ext4 loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic pata_acpi drm_kms_helper ata_piix i2c_piix4 drm libata floppy pcspkr virtio_balloon serio_raw i2c_core virtio_console virtio_blk ip_tables [last unloaded: libcfs]
[32872.160091] CPU: 6 PID: 5822 Comm: touch Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-7.5-debug2 #1
[32872.160091] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[32872.160091] task: ffff8801b897ac00 ti: ffff8802a41f0000 task.ti: ffff8802a41f0000
[32872.160091] RIP: 0010:[<ffffffff8177d736>]  [<ffffffff8177d736>] native_safe_halt+0x6/0x10
[32872.160091] RSP: 0018:ffff8802a41f3600  EFLAGS: 00000202
[32872.160091] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
[32872.160091] RDX: 0000000000000005 RSI: 0000000000000005 RDI: ffff88028ae74da0
[32872.160091] RBP: ffff8802a41f3600 R08: 0000000000000001 R09: 0000000000000098
[32872.160091] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[32872.160091] R13: 0000000000000001 R14: 0000000000000003 R15: 0000000000000000
[32872.160091] FS:  00007fa5a8256740(0000) GS:ffff88033db80000(0000) knlGS:0000000000000000
[32872.160091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32872.160091] CR2: 0000000000e46dc0 CR3: 000000027b3ec000 CR4: 00000000000006e0
[32872.160091] Call Trace:
[32872.160091]  [<ffffffff810638a5>] kvm_wait+0x65/0x70
[32872.160091]  [<ffffffff810fd067>] __pv_queued_spin_lock_slowpath+0x357/0x3d0
[32872.160091]  [<ffffffff813d06bd>] do_raw_spin_lock+0x6d/0xa0
[32872.160091]  [<ffffffff8177dc4e>] _raw_spin_lock+0x1e/0x20
[32872.160091]  [<ffffffffa07e1991>] ldlm_lock_change_resource+0x101/0x3a0 [ptlrpc]
[32872.160091]  [<ffffffffa08015e8>] ldlm_cli_enqueue_fini+0x3d8/0xdf0 [ptlrpc]
[32872.160091]  [<ffffffffa080442e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
[32872.160091]  [<ffffffffa03e88fa>] mdc_enqueue_base+0x2fa/0x19e0 [mdc]
[32872.160091]  [<ffffffffa03ea72f>] mdc_intent_lock+0x12f/0x550 [mdc]
[32872.160091]  [<ffffffffa035b46c>] lmv_intent_lock+0x3fc/0xa10 [lmv]
[32872.160091]  [<ffffffffa145b920>] ll_lookup_it+0x280/0xbe0 [lustre]
[32872.160091]  [<ffffffffa145d242>] ll_lookup_nd+0x82/0x190 [lustre]
[32872.160091]  [<ffffffff8121bebd>] lookup_real+0x1d/0x50
[32872.160091]  [<ffffffff8121c992>] __lookup_hash+0x42/0x60
[32872.160091]  [<ffffffff817722e6>] lookup_slow+0x42/0xa7
[32872.160091]  [<ffffffff8121f53f>] link_path_walk+0x81f/0x8c0
[32872.160091]  [<ffffffff8122237e>] path_openat+0xae/0x6a0
[32872.160091]  [<ffffffff81223d9d>] do_filp_open+0x4d/0xb0
[32872.160091]  [<ffffffff8120fcf7>] do_sys_open+0x137/0x240
[32872.160091]  [<ffffffff8120fe1e>] SyS_open+0x1e/0x20
[32872.160091]  [<ffffffff81788855>] system_call_fastpath+0x1c/0x21
[32872.160091] Code: 65 48 8b 05 05 f9 88 7e 48 89 c2 b9 48 00 00 00 48 c1 ea 20 0f 30 e9 7a ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 90 90 90 90 
[32872.160091] Kernel panic - not syncing: softlockup: hung tasks
[32872.160091] CPU: 6 PID: 5822 Comm: touch Kdump: loaded Tainted: P        W  OEL ------------   3.10.0-7.5-debug2 #1
[32872.160091] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[32872.160091] Call Trace:
[32872.160091]  <IRQ>  [<ffffffff817744da>] dump_stack+0x19/0x1b
[32872.160091]  [<ffffffff8176eab1>] panic+0xe8/0x20d
[32872.160091]  [<ffffffff81134ad1>] watchdog_timer_fn+0x221/0x230
[32872.160091]  [<ffffffff810b4917>] __hrtimer_run_queues+0x117/0x2e0
[32872.160091]  [<ffffffff810b4faf>] hrtimer_interrupt+0xaf/0x1e0
[32872.160091]  [<ffffffff810542a5>] local_apic_timer_interrupt+0x35/0x60
[32872.160091]  [<ffffffff8178cf9d>] smp_apic_timer_interrupt+0x3d/0x50
[32872.160091]  [<ffffffff81789872>] apic_timer_interrupt+0x162/0x170
[32872.160091]  <EOI>  [<ffffffff8177d736>] ? native_safe_halt+0x6/0x10
[32872.160091]  [<ffffffff810638a5>] kvm_wait+0x65/0x70
[32872.160091]  [<ffffffff810fd067>] __pv_queued_spin_lock_slowpath+0x357/0x3d0
[32872.160091]  [<ffffffff813d06bd>] do_raw_spin_lock+0x6d/0xa0
[32872.160091]  [<ffffffff8177dc4e>] _raw_spin_lock+0x1e/0x20
[32872.160091]  [<ffffffffa07e1991>] ldlm_lock_change_resource+0x101/0x3a0 [ptlrpc]
[32872.160091]  [<ffffffffa08015e8>] ldlm_cli_enqueue_fini+0x3d8/0xdf0 [ptlrpc]
[32872.160091]  [<ffffffffa080442e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
[32872.160091]  [<ffffffffa03e88fa>] mdc_enqueue_base+0x2fa/0x19e0 [mdc]
[32872.160091]  [<ffffffffa03ea72f>] mdc_intent_lock+0x12f/0x550 [mdc]
[32872.160091]  [<ffffffffa035b46c>] lmv_intent_lock+0x3fc/0xa10 [lmv]
[32872.160091]  [<ffffffffa145b920>] ll_lookup_it+0x280/0xbe0 [lustre]
[32872.160091]  [<ffffffffa145d242>] ll_lookup_nd+0x82/0x190 [lustre]
[32872.160091]  [<ffffffff8121bebd>] lookup_real+0x1d/0x50
[32872.160091]  [<ffffffff8121c992>] __lookup_hash+0x42/0x60
[32872.160091]  [<ffffffff817722e6>] lookup_slow+0x42/0xa7
[32872.160091]  [<ffffffff8121f53f>] link_path_walk+0x81f/0x8c0
[32872.160091]  [<ffffffff8122237e>] path_openat+0xae/0x6a0
[32872.160091]  [<ffffffff81223d9d>] do_filp_open+0x4d/0xb0
[32872.160091]  [<ffffffff8120fcf7>] do_sys_open+0x137/0x240
[32872.160091]  [<ffffffff8120fe1e>] SyS_open+0x1e/0x20
[32872.160091]  [<ffffffff81788855>] system_call_fastpath+0x1c/0x21

just some other interaction?

Generated at Sat Feb 10 02:17:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.