[LU-8391] ASSERTION( list_empty(&lock->l_res_link) ) Created: 13/Jul/16 Updated: 01/Aug/19 Resolved: 29/Oct/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Li Dongyang (Inactive) | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Lustre client crashed after eviction:

<3>LustreError: 20358:0:(ldlm_lockd.c:1797:ldlm_handle_cp_callback()) change resource!

The ldlm lock was granted twice, by threads 20358 and 3259. |
| Comments |
| Comment by Gerrit Updater [ 13/Jul/16 ] |
|
Li Dongyang (dongyang.li@anu.edu.au) uploaded a new patch: http://review.whamcloud.com/21275 |
| Comment by Jinshan Xiong (Inactive) [ 16/Jul/16 ] |
|
Can you please describe the problem in a little more detail? I don't understand why moving the code in question above fixes the problem. |
| Comment by Li Dongyang (Inactive) [ 21/Jul/16 ] |
|
In ldlm_handle_cp_callback(), thread 20358 takes the spin lock on the resource and the lock, then checks whether the lock has already been granted:

    if (ldlm_is_destroyed(lock) || lock->l_granted_mode == lock->l_req_mode) {
            /* bug 11300: the lock has already been granted */
            unlock_res_and_lock(lock);
            LDLM_DEBUG(lock, "Double grant race happened");
            GOTO(out, rc = 0);
    }

It then finds that the resource has changed, so it drops the spin lock on the resource and the lock and goes ahead to allocate the new resource. Meanwhile thread 3259, running the same callback, passes the same check because the lock has not been granted yet. Thread 20358 finishes changing the resource, takes the spin lock again, calls ldlm_grant_lock() and then unlock_res_and_lock(). After this, 3259 takes the spin lock again and calls ldlm_grant_lock() a second time, granting the lock twice and triggering the assertion. |
| Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ] |
|
close old tickets |
| Comment by Gerrit Updater [ 29/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/21275/ |
| Comment by Peter Jones [ 29/Oct/18 ] |
|
Landed for 2.12 |
| Comment by Oleg Drokin [ 31/Oct/18 ] |
|
Hm... I wonder if this was completely fixed? I just got this in my testing of master-next:

[32847.180693] LustreError: 25301:0:(ldlm_lockd.c:1828:ldlm_handle_cp_callback()) change resource!
[32872.102225] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ldlm_cb00_002:25891]
[32872.102342] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod brd ext4 loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic pata_acpi drm_kms_helper ata_piix i2c_piix4 drm libata floppy pcspkr virtio_balloon serio_raw i2c_core virtio_console virtio_blk ip_tables [last unloaded: libcfs]
[32872.102342] CPU: 0 PID: 25891 Comm: ldlm_cb00_002 Kdump: loaded Tainted: P W OE ------------ 3.10.0-7.5-debug2 #1
[32872.102342] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[32872.102342] task: ffff8802c4b64380 ti: ffff8803048dc000 task.ti: ffff8803048dc000
[32872.102342] RIP: 0010:[<ffffffff8177d736>] [<ffffffff8177d736>] native_safe_halt+0x6/0x10
[32872.102342] RSP: 0018:ffff8803048dfbf0 EFLAGS: 00000202
[32872.102342] RAX: 0000000000000001 RBX: 000000000000008a RCX: 00000000b6dd2ad7
[32872.102342] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88033da19cc4
[32872.102342] RBP: ffff8803048dfbf0 R08: 00000000b6dd2ad7 R09: 00000000000000c0
[32872.102342] R10: 0000000000000000 R11: ffff8803048dfa9e R12: ffff8803048dfb80
[32872.102342] R13: ffffffff8177dc8e R14: ffff8803048dfb60 R15: ffff88022e3df800
[32872.102342] FS: 0000000000000000(0000) GS:ffff88033da00000(0000) knlGS:0000000000000000
[32872.102342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32872.160091] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [touch:5822]
[32872.160091] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod brd ext4 loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic pata_acpi drm_kms_helper ata_piix i2c_piix4 drm libata floppy pcspkr virtio_balloon serio_raw i2c_core virtio_console virtio_blk ip_tables [last unloaded: libcfs]
[32872.160091] CPU: 6 PID: 5822 Comm: touch Kdump: loaded Tainted: P W OE ------------ 3.10.0-7.5-debug2 #1
[32872.160091] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[32872.160091] task: ffff8801b897ac00 ti: ffff8802a41f0000 task.ti: ffff8802a41f0000
[32872.160091] RIP: 0010:[<ffffffff8177d736>] [<ffffffff8177d736>] native_safe_halt+0x6/0x10
[32872.160091] RSP: 0018:ffff8802a41f3600 EFLAGS: 00000202
[32872.160091] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
[32872.160091] RDX: 0000000000000005 RSI: 0000000000000005 RDI: ffff88028ae74da0
[32872.160091] RBP: ffff8802a41f3600 R08: 0000000000000001 R09: 0000000000000098
[32872.160091] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[32872.160091] R13: 0000000000000001 R14: 0000000000000003 R15: 0000000000000000
[32872.160091] FS: 00007fa5a8256740(0000) GS:ffff88033db80000(0000) knlGS:0000000000000000
[32872.160091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32872.160091] CR2: 0000000000e46dc0 CR3: 000000027b3ec000 CR4: 00000000000006e0
[32872.160091] Call Trace:
[32872.160091] [<ffffffff810638a5>] kvm_wait+0x65/0x70
[32872.160091] [<ffffffff810fd067>] __pv_queued_spin_lock_slowpath+0x357/0x3d0
[32872.160091] [<ffffffff813d06bd>] do_raw_spin_lock+0x6d/0xa0
[32872.160091] [<ffffffff8177dc4e>] _raw_spin_lock+0x1e/0x20
[32872.160091] [<ffffffffa07e1991>] ldlm_lock_change_resource+0x101/0x3a0 [ptlrpc]
[32872.160091] [<ffffffffa08015e8>] ldlm_cli_enqueue_fini+0x3d8/0xdf0 [ptlrpc]
[32872.160091] [<ffffffffa080442e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
[32872.160091] [<ffffffffa03e88fa>] mdc_enqueue_base+0x2fa/0x19e0 [mdc]
[32872.160091] [<ffffffffa03ea72f>] mdc_intent_lock+0x12f/0x550 [mdc]
[32872.160091] [<ffffffffa035b46c>] lmv_intent_lock+0x3fc/0xa10 [lmv]
[32872.160091] [<ffffffffa145b920>] ll_lookup_it+0x280/0xbe0 [lustre]
[32872.160091] [<ffffffffa145d242>] ll_lookup_nd+0x82/0x190 [lustre]
[32872.160091] [<ffffffff8121bebd>] lookup_real+0x1d/0x50
[32872.160091] [<ffffffff8121c992>] __lookup_hash+0x42/0x60
[32872.160091] [<ffffffff817722e6>] lookup_slow+0x42/0xa7
[32872.160091] [<ffffffff8121f53f>] link_path_walk+0x81f/0x8c0
[32872.160091] [<ffffffff8122237e>] path_openat+0xae/0x6a0
[32872.160091] [<ffffffff81223d9d>] do_filp_open+0x4d/0xb0
[32872.160091] [<ffffffff8120fcf7>] do_sys_open+0x137/0x240
[32872.160091] [<ffffffff8120fe1e>] SyS_open+0x1e/0x20
[32872.160091] [<ffffffff81788855>] system_call_fastpath+0x1c/0x21
[32872.160091] Code: 65 48 8b 05 05 f9 88 7e 48 89 c2 b9 48 00 00 00 48 c1 ea 20 0f 30 e9 7a ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 90 90 90 90
[32872.160091] Kernel panic - not syncing: softlockup: hung tasks
[32872.160091] CPU: 6 PID: 5822 Comm: touch Kdump: loaded Tainted: P W OEL ------------ 3.10.0-7.5-debug2 #1
[32872.160091] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[32872.160091] Call Trace:
[32872.160091] <IRQ> [<ffffffff817744da>] dump_stack+0x19/0x1b
[32872.160091] [<ffffffff8176eab1>] panic+0xe8/0x20d
[32872.160091] [<ffffffff81134ad1>] watchdog_timer_fn+0x221/0x230
[32872.160091] [<ffffffff810b4917>] __hrtimer_run_queues+0x117/0x2e0
[32872.160091] [<ffffffff810b4faf>] hrtimer_interrupt+0xaf/0x1e0
[32872.160091] [<ffffffff810542a5>] local_apic_timer_interrupt+0x35/0x60
[32872.160091] [<ffffffff8178cf9d>] smp_apic_timer_interrupt+0x3d/0x50
[32872.160091] [<ffffffff81789872>] apic_timer_interrupt+0x162/0x170
[32872.160091] <EOI> [<ffffffff8177d736>] ? native_safe_halt+0x6/0x10
[32872.160091] [<ffffffff810638a5>] kvm_wait+0x65/0x70
[32872.160091] [<ffffffff810fd067>] __pv_queued_spin_lock_slowpath+0x357/0x3d0
[32872.160091] [<ffffffff813d06bd>] do_raw_spin_lock+0x6d/0xa0
[32872.160091] [<ffffffff8177dc4e>] _raw_spin_lock+0x1e/0x20
[32872.160091] [<ffffffffa07e1991>] ldlm_lock_change_resource+0x101/0x3a0 [ptlrpc]
[32872.160091] [<ffffffffa08015e8>] ldlm_cli_enqueue_fini+0x3d8/0xdf0 [ptlrpc]
[32872.160091] [<ffffffffa080442e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
[32872.160091] [<ffffffffa03e88fa>] mdc_enqueue_base+0x2fa/0x19e0 [mdc]
[32872.160091] [<ffffffffa03ea72f>] mdc_intent_lock+0x12f/0x550 [mdc]
[32872.160091] [<ffffffffa035b46c>] lmv_intent_lock+0x3fc/0xa10 [lmv]
[32872.160091] [<ffffffffa145b920>] ll_lookup_it+0x280/0xbe0 [lustre]
[32872.160091] [<ffffffffa145d242>] ll_lookup_nd+0x82/0x190 [lustre]
[32872.160091] [<ffffffff8121bebd>] lookup_real+0x1d/0x50
[32872.160091] [<ffffffff8121c992>] __lookup_hash+0x42/0x60
[32872.160091] [<ffffffff817722e6>] lookup_slow+0x42/0xa7
[32872.160091] [<ffffffff8121f53f>] link_path_walk+0x81f/0x8c0
[32872.160091] [<ffffffff8122237e>] path_openat+0xae/0x6a0
[32872.160091] [<ffffffff81223d9d>] do_filp_open+0x4d/0xb0
[32872.160091] [<ffffffff8120fcf7>] do_sys_open+0x137/0x240
[32872.160091] [<ffffffff8120fe1e>] SyS_open+0x1e/0x20
[32872.160091] [<ffffffff81788855>] system_call_fastpath+0x1c/0x21

just some other interaction? |