[LU-9839] lov_object.c:879:lov_layout_change()) ASSERTION( atomic_read(&lov->lo_active_ios) == 0 ) failed Created: 06/Aug/17 Updated: 14/Aug/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | James A Simmons |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ornl |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Just hit this assertion on master-next; it was immediately followed by a NULL pointer dereference:
[60342.200382] Lustre: DEBUG MARKER: == sanity test 405: Various layout swap lock tests =================================================== 16:21:48 (1501964508)
[60385.832096] LustreError: 10934:0:(lov_object.c:879:lov_layout_change()) ASSERTION( atomic_read(&lov->lo_active_ios) == 0 ) failed:
[60385.832103] BUG: unable to handle kernel NULL pointer dereference at (null)
[60385.832111] IP: [<ffffffffa078decc>] lov_sub_get+0x1ec/0x760 [lov]
[60385.832113] PGD 0
[60385.832114] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[60385.832130] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) brd ext4 loop mbcache jbd2 rpcsec_gss_krb5 syscopyarea sysfillrect ata_generic sysimgblt pata_acpi ttm virtio_balloon drm_kms_helper pcspkr ata_piix serio_raw virtio_console virtio_blk floppy drm i2c_piix4 libata i2c_core nfsd ip_tables [last unloaded: libcfs]
[60385.832132] CPU: 2 PID: 5577 Comm: kworker/u16:0 Tainted: G W OE ------------ 3.10.0-debug #1
[60385.832133] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[60385.832138] Workqueue: writeback bdi_writeback_workfn (flush-lustre-3)
[60385.832138] task: ffff88009d1c88c0 ti: ffff880058194000 task.ti: ffff880058194000
[60385.832143] RIP: 0010:[<ffffffffa078decc>] [<ffffffffa078decc>] lov_sub_get+0x1ec/0x760 [lov]
[60385.832145] RSP: 0018:ffff8800581978c0 EFLAGS: 00010297
[60385.832146] RAX: ffff880014330f68 RBX: ffff880078ddedf0 RCX: 0000000000000000
[60385.832146] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880014330f68
[60385.832147] RBP: ffff880058197900 R08: 0000000000000001 R09: 0000000000000000
[60385.832147] R10: 0000000000000000 R11: ffffffffffffffff R12: 0000000000000000
[60385.832148] R13: ffff880078ddefd0 R14: ffff880078ddeda0 R15: ffff88000c9b7e10
[60385.832149] FS: 0000000000000000(0000) GS:ffff8800bc680000(0000) knlGS:0000000000000000
[60385.832149] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[60385.832150] CR2: 0000000000000000 CR3: 00000000ae6b8000 CR4: 00000000000006e0
[60385.832152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[60385.832152] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[60385.832153] Stack:
[60385.832155] 0000000000000000 0000000000000000 ffff880021d0fed0 0000000000000000
[60385.832156] 0000000000000000 ffff88002e8451d0 0000000000000000 ffff880078ddeda0
[60385.832157] ffff8800581979b0 ffffffffa078ecda ffff88000c9b7e10 ffff880021dabfa0
[60385.832157] Call Trace:
[60385.832162] [<ffffffffa078ecda>] lov_io_iter_init+0x1ca/0x8b0 [lov]
[60385.832185] [<ffffffffa037794c>] cl_io_iter_init+0x5c/0x120 [obdclass]
[60385.832201] [<ffffffffa0379c5c>] cl_io_loop+0x19c/0xb30 [obdclass]
[60385.832213] [<ffffffffa0dcae9b>] cl_sync_file_range+0x2db/0x380 [lustre]
[60385.832222] [<ffffffffa0dec5ba>] ll_writepages+0x7a/0x200 [lustre]
[60385.832225] [<ffffffff8117e8b1>] do_writepages+0x21/0x50
[60385.832227] [<ffffffff81218f50>] __writeback_single_inode+0x40/0x2b0
[60385.832228] [<ffffffff81219c11>] writeback_sb_inodes+0x2b1/0x4d0
[60385.832230] [<ffffffff81219ecf>] __writeback_inodes_wb+0x9f/0xd0
[60385.832232] [<ffffffff8121a76b>] wb_writeback+0x28b/0x340
[60385.832233] [<ffffffff8121c9cc>] bdi_writeback_workfn+0x20c/0x4e0
[60385.832235] [<ffffffff8109add6>] process_one_work+0x206/0x5b0
[60385.832236] [<ffffffff8109ad6b>] ? process_one_work+0x19b/0x5b0
[60385.832238] [<ffffffff8109b29b>] worker_thread+0x11b/0x3a0
[60385.832239] [<ffffffff8109b180>] ? process_one_work+0x5b0/0x5b0
[60385.832240] [<ffffffff810a2eda>] kthread+0xea/0xf0
[60385.832242] [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140
[60385.832245] [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
[60385.832246] [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140
[60385.832257] Code: 85 bc 04 00 00 44 8b 82 08 01 00 00 8b 4d cc 48 8b 75 c0 44 39 c1 0f 83 aa 04 00 00 48 8b 92 10 01 00 00 48 89 c7 4a 8b 54 22 18 <48> 8b 0c f2 c7 83 b0 01 00 00 00 00 00 00 4c 89 7b 38 48 81 c1
[60385.832261] RIP [<ffffffffa078decc>] lov_sub_get+0x1ec/0x760 [lov]
[60385.832262] RSP <ffff8800581978c0>
Modules and crashdump are in /exports/crashdumps/192.168.10.224-2017-08-05-16\:22\:38 on my node.
|
| Comments |
| Comment by Oleg Drokin [ 06/Aug/17 ] |
|
Also, the backtrace on the asserting thread is:
crash> bt 10934
PID: 10934 TASK: ffff88003811e540 CPU: 3 COMMAND: "swap_lock_test"
#0 [ffff8800bc6c5e68] crash_nmi_callback at ffffffff81047152
#1 [ffff8800bc6c5e78] nmi_handle at ffffffff81708369
#2 [ffff8800bc6c5ec8] do_nmi at ffffffff817084a8
#3 [ffff8800bc6c5ef0] end_repeat_nmi at ffffffff817077e3
[exception RIP: vgacon_scroll+877]
RIP: ffffffff813d405d RSP: ffff880014dc3680 RFLAGS: 00000083
RAX: 0000000000000004 RBX: ffff8800b8de4800 RCX: 00000000000000a0
RDX: ffff8800000bd6e0 RSI: ffff8800b8df8520 RDI: 00000000000000a0
RBP: ffff880014dc36b8 R8: 0000000000008520 R9: 0000000000000730
R10: 0000000000000000 R11: 0000000000000050 R12: ffff8800000bd6e0
R13: 0000000000000198 R14: 0000000000000198 R15: 000000000000ffa0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff880014dc3680] vgacon_scroll at ffffffff813d405d
#5 [ffff880014dc36c0] scrup at ffffffff8144eacc
#6 [ffff880014dc36f0] lf at ffffffff8144eb80
#7 [ffff880014dc3720] vt_console_print at ffffffff8144ee62
#8 [ffff880014dc3788] call_console_drivers.constprop.17 at ffffffff81077e41
#9 [ffff880014dc37b0] console_unlock at ffffffff8107862c
#10 [ffff880014dc37e8] vprintk_emit at ffffffff81078926
#11 [ffff880014dc3858] printk at ffffffff816f8e55
#12 [ffff880014dc38b8] cfs_print_to_console at ffffffffa01bb59a [libcfs]
#13 [ffff880014dc38e8] libcfs_debug_vmsg2 at ffffffffa01c65ee [libcfs]
#14 [ffff880014dc3a30] libcfs_debug_msg at ffffffffa01c6cb7 [libcfs]
#15 [ffff880014dc3a90] lov_conf_set at ffffffffa07a19e8 [lov]
#16 [ffff880014dc3b00] cl_conf_set at ffffffffa0370240 [obdclass]
#17 [ffff880014dc3b30] ll_layout_conf at ffffffffa0dceea1 [lustre]
#18 [ffff880014dc3bb8] ll_layout_refresh at ffffffffa0dcf7dd [lustre]
#19 [ffff880014dc3c68] vvp_io_init at ffffffffa0e14c67 [lustre]
#20 [ffff880014dc3cb8] cl_io_init0 at ffffffffa0377d48 [obdclass]
#21 [ffff880014dc3cf0] cl_io_init at ffffffffa0377eda [obdclass]
#22 [ffff880014dc3d20] cl_get_grouplock at ffffffffa0e0d02a [lustre]
#23 [ffff880014dc3d70] ll_get_grouplock at ffffffffa0dd03ee [lustre]
#24 [ffff880014dc3e08] ll_file_ioctl at ffffffffa0dd47cf [lustre]
#25 [ffff880014dc3eb8] do_vfs_ioctl at ffffffff81201985
#26 [ffff880014dc3f30] sys_ioctl at ffffffff81201c41
#27 [ffff880014dc3f80] system_call_fastpath at ffffffff8170fc89
|
| Comment by Oleg Drokin [ 15/Oct/17 ] |
|
Just hit this assertion again on master-next in sanity test 405:
[ 6650.701477] Lustre: DEBUG MARKER: == sanity test 405: Various layout swap lock tests =================================================== 21:10:20 (1507857020)
[ 6718.387820] LustreError: 7955:0:(lov_object.c:879:lov_layout_change()) ASSERTION( atomic_read(&lov->lo_active_ios) == 0 ) failed:
[ 6718.387821] LustreError: 7955:0:(lov_object.c:879:lov_layout_change()) LBUG
[ 6718.387822] Pid: 7955, comm: swap_lock_test
[ 6718.387822] Call Trace:
[ 6718.387836] [<ffffffffa01d57fe>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 6718.387843] [<ffffffffa01d588c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 6718.387853] [<ffffffffa08de974>] lov_conf_set+0xab4/0xac0 [lov]
[ 6718.387869] [<ffffffffa0f70001>] ? ll_file_flock+0x841/0xd80 [lustre]
[ 6718.387898] [<ffffffffa03931b0>] cl_conf_set+0x60/0x120 [obdclass]
[ 6718.387908] [<ffffffffa0f7d301>] ll_layout_conf+0x81/0x400 [lustre]
[ 6718.387918] [<ffffffffa0f7dc3d>] ll_layout_refresh+0x5bd/0xb10 [lustre]
[ 6718.387933] [<ffffffffa0fc3797>] vvp_io_init+0x347/0x440 [lustre]
[ 6718.387957] [<ffffffffa039acb8>] cl_io_init0.isra.17+0x88/0x160 [obdclass]
[ 6718.387974] [<ffffffffa039ae4a>] cl_io_init+0x3a/0x80 [obdclass]
[ 6718.387986] [<ffffffffa0fbbc2a>] cl_get_grouplock+0xca/0x2f0 [lustre]
[ 6718.387995] [<ffffffffa0f7e84e>] ll_get_grouplock+0x22e/0x6d0 [lustre]
[ 6718.388004] [<ffffffffa0f834b2>] ll_file_ioctl+0x47c2/0x48a0 [lustre]
[ 6718.388008] [<ffffffff81201985>] do_vfs_ioctl+0x305/0x520
[ 6718.388011] [<ffffffff81706487>] ? _raw_spin_unlock_irq+0x27/0x50
[ 6718.388013] [<ffffffff81201c41>] SyS_ioctl+0xa1/0xc0
[ 6718.388016] [<ffffffff8170fc89>] system_call_fastpath+0x16/0x1b
[ 6718.388017]
[ 6718.388018] Kernel panic - not syncing: LBUG
Crashdump is in 192.168.10.224-2017-10-12-21\:11\:34 on my system
|
| Comment by Oleg Drokin [ 10/Nov/17 ] |
|
Just hit it once more on current master-next |
| Comment by James A Simmons [ 08/Apr/23 ] |
|
Just ran into this on 2.15 LTS production system. |
| Comment by James A Simmons [ 27/Apr/23 ] |
|
This bug shouldn't happen. In lov_layout_change() we call llo_delete(), which in turn calls lov_layout_wait(); that function sleeps on lo_waitq until lo_active_ios drops to zero. The active IO count should therefore be zero when the LASSERT is checked a few steps later. How could new IO be started in that window? |
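For context, here is a minimal sketch of the sequence described above, assuming only that lo_active_ios counts in-flight IO and lo_waitq is woken when that count drops; the *_sketch struct and functions are illustrative stand-ins, not the actual Lustre source:

```c
#include <linux/atomic.h>
#include <linux/wait.h>
#include <linux/bug.h>

/* Hypothetical, stripped-down stand-in for the relevant lov_object fields. */
struct lov_object_sketch {
	atomic_t		lo_active_ios;	/* IOs currently in flight */
	wait_queue_head_t	lo_waitq;	/* woken when the count drops */
};

/* Sleep until every active IO against the current layout has drained. */
static void lov_layout_wait_sketch(struct lov_object_sketch *lov)
{
	wait_event(lov->lo_waitq, atomic_read(&lov->lo_active_ios) == 0);
}

/* Stand-in for the llo_delete() step: drain IO, then tear down the layout. */
static void llo_delete_sketch(struct lov_object_sketch *lov)
{
	lov_layout_wait_sketch(lov);
	/* ... free the per-stripe state of the old layout ... */
}

static void lov_layout_change_sketch(struct lov_object_sketch *lov)
{
	llo_delete_sketch(lov);

	/*
	 * The window the comment asks about: if a new IO manages to start
	 * between the wait above and this check, lo_active_ios is no longer
	 * zero and the assertion fires.
	 */
	BUG_ON(atomic_read(&lov->lo_active_ios) != 0);

	/* ... install the new layout ... */
}
```

The open question, in these terms, is what can increment lo_active_ios between the wait_event() inside llo_delete_sketch() and the final check.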
| Comment by Gerrit Updater [ 28/Apr/23 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50800 |
| Comment by Gerrit Updater [ 10/Jun/23 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51269 |
| Comment by Alexander Zarochentsev [ 10/Jun/23 ] |
|
The system crashed and gave additional information about a parallel I/O:
[2577729.757681] LustreError: 11039:0:(lov_object.c:1286:lov_layout_change()) ASSERTION( atomic_read(&lov->lo_active_ios) == 0 ) failed: ios: 1, last io: osc_lru_shrink
[2577729.776963] LustreError: 11039:0:(lov_object.c:1286:lov_layout_change()) LBUG
|
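The extra "ios: 1, last io: osc_lru_shrink" detail shows the assertion here also reports the active IO count and the kind of IO still running. Below is a hedged sketch, using the real libcfs LASSERTF() macro, of how such a message could be produced; the lov_object_sketch2 struct and its lo_last_io_name field are purely illustrative and may not match what the actual patch records:

```c
#include <libcfs/libcfs.h>	/* LASSERTF(), from the Lustre tree */
#include <linux/atomic.h>

/* Illustrative only: remember the name of the last IO type started. */
struct lov_object_sketch2 {
	atomic_t	 lo_active_ios;
	const char	*lo_last_io_name;	/* e.g. "osc_lru_shrink" */
};

/* Assert no IO is active, and name the offender in the failure message. */
static void lov_layout_change_check_sketch(struct lov_object_sketch2 *lov)
{
	LASSERTF(atomic_read(&lov->lo_active_ios) == 0,
		 "ios: %d, last io: %s\n",
		 atomic_read(&lov->lo_active_ios),
		 lov->lo_last_io_name ? lov->lo_last_io_name : "none");
}
```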
| Comment by Gerrit Updater [ 12/Jul/23 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51638 |