Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Lustre 2.13.0
-
None
-
3
-
9223372036854775807
Description
I have this crash sporadically happening since Nov 2018 in racer:
It's pretty clear that by the time we return from that llog_write_rec our log handle is no longer valid for some reason even though it was supposedly valid just as we got it.
I cannot see any obvious reasons as of why.
[83062.246113] Lustre: server umount lustre-MDT0000 complete [83063.937570] LustreError: 14362:0:(llog_cat.c:597:llog_cat_add_rec()) llog_write_rec -5: lh=ffff88009fc5ce00 [83063.956038] BUG: unable to handle kernel paging request at ffff88009fc5ce30 [83063.956917] IP: [<ffffffff810ba28b>] up_write+0xb/0x30 [83063.956917] PGD 241b067 PUD 33fa01067 PMD 33f902067 PTE 800000009fc5c060 [83063.956917] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC [83063.956917] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_mod loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic sb_edac edac_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon virtio_console pcspkr i2c_piix4 ip_tables rpcsec_gss_krb5 ata_generic drm_kms_helper crct10dif_pclmul ttm pata_acpi crct10dif_common drm ata_piix crc32c_intel drm_panel_orientation_quirks virtio_blk i2c_core serio_raw libata floppy [last unloaded: dm_flakey] [83063.956917] CPU: 4 PID: 14362 Comm: mdt02_004 Kdump: loaded Tainted: P OE ------------ 3.10.0-7.6-debug #1 [83063.956917] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [83063.956917] task: ffff880083b8ea40 ti: ffff8802d1618000 task.ti: ffff8802d1618000 [83063.956917] RIP: 0010:[<ffffffff810ba28b>] [<ffffffff810ba28b>] up_write+0xb/0x30 [83063.956917] RSP: 0018:ffff8802d161b8e0 EFLAGS: 00010246 [83063.956917] RAX: 0000000000000000 RBX: ffff88009fc5ce00 RCX: 0000000000000000 [83063.956917] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffff88009fc5ce00 [83063.956917] RBP: ffff8802d161b930 R08: 0000000000000000 R09: 0000000000001cf1 [83063.956917] R10: 0000000000000198 R11: 0000000000000198 R12: ffff88008d8ade00 [83063.956917] R13: 0000000000000000 R14: 00000000fffffffb R15: ffff8802be9f4e40 [83063.956917] FS: 0000000000000000(0000) GS:ffff88033db00000(0000) knlGS:0000000000000000 [83063.956917] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [83063.956917] CR2: ffff88009fc5ce30 CR3: 00000000b5452000 CR4: 00000000001607e0 [83063.956917] Call Trace: [83063.956917] [<ffffffffa04087aa>] ? llog_cat_add_rec+0x35a/0x8a0 [obdclass] [83063.956917] [<ffffffffa04004aa>] llog_add+0x7a/0x1a0 [obdclass] [83063.956917] [<ffffffffa042ed09>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [83063.956917] [<ffffffffa073cae3>] sub_updates_write+0x2b9/0xcb6 [ptlrpc] [83063.956917] [<ffffffffa07207ef>] top_trans_stop+0x65f/0xa10 [ptlrpc] [83063.956917] [<ffffffffa0df0b3c>] lod_trans_stop+0x25c/0x340 [lod] [83063.956917] [<ffffffffa0e1d7bc>] ? lod_xattr_set+0x11c/0xc60 [lod] [83063.956917] [<ffffffffa0ce6108>] mdd_trans_stop+0x28/0x16e [mdd] [83063.956917] [<ffffffffa0cde701>] mdd_xattr_set+0x371/0x1790 [mdd] [83063.956917] [<ffffffffa0cde390>] ? mdd_xattr_split+0xbb0/0xbb0 [mdd] [83063.956917] [<ffffffffa0d4f365>] mdt_reint_setxattr+0x445/0xfc0 [mdt] [83063.956917] [<ffffffffa0d4cc70>] mdt_reint_rec+0x80/0x210 [mdt] [83063.956917] [<ffffffffa0d29860>] mdt_reint_internal+0x770/0xb40 [mdt] [83063.956917] [<ffffffffa0d319e7>] ? mdt_thread_info_init+0xa7/0x1e0 [mdt] [83063.956917] [<ffffffffa0d349b7>] mdt_reint+0x67/0x140 [mdt] [83063.956917] [<ffffffffa070bac5>] tgt_request_handle+0x915/0x15c0 [ptlrpc] [83063.956917] [<ffffffffa02befa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [83063.956917] [<ffffffffa06b2249>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc] [83063.956917] [<ffffffff810bfbd8>] ? __wake_up_common+0x58/0x90 [83063.956917] [<ffffffff813fb7bb>] ? do_raw_spin_unlock+0x4b/0x90 [83063.956917] [<ffffffffa06b623c>] ptlrpc_main+0xb5c/0x2040 [ptlrpc] [83063.956917] [<ffffffff810c33ce>] ? finish_task_switch+0x13e/0x1b0 [83063.956917] [<ffffffffa06b56e0>] ? ptlrpc_register_service+0xfe0/0xfe0 [ptlrpc] [83063.956917] [<ffffffff810b4ed4>] kthread+0xe4/0xf0 [83063.956917] [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140 [83063.956917] [<ffffffff817c4c5d>] ret_from_fork_nospec_begin+0x7/0x21 [83063.956917] [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
Important note, in all occurrences we can see server just finished unmounting.
In fact in one case we can see some other unhappiness:
[51785.157945] Lustre: server umount lustre-MDT0001 complete [51786.124008] LustreError: 24709:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -5: lh=ffff8801bd26ae00 [51786.134062] BUG: sleeping function called from invalid context at kernel/rwsem.c:51 [51786.134064] in_atomic(): 1, irqs_disabled(): 0, pid: 23874, name: umount [51786.134088] CPU: 3 PID: 23874 Comm: umount Kdump: loaded Tainted: P OE ------------ 3.10.0-7.6-debug #1 [51786.134090] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [51786.134092] Call Trace: [51786.134183] [<ffffffff817afbf2>] dump_stack+0x19/0x1b [51786.134221] [<ffffffff810c3bc9>] __might_sleep+0xd9/0x100 [51786.134226] [<ffffffff817b6460>] down_write+0x20/0x50 [51786.134363] [<ffffffffa11b1037>] osp_invalidate+0x177/0x210 [osp] [51786.134391] [<ffffffffa11c68a3>] osp_trans_stop_cb+0x133/0x180 [osp] [51786.134399] [<ffffffffa11c9647>] osp_trans_callback+0xa7/0xc0 [osp] [51786.134403] [<ffffffffa11a72c8>] osp_update_fini+0xc8/0x280 [osp] [51786.134414] [<ffffffff810b6050>] ? wake_up_atomic_t+0x30/0x30 [51786.134422] [<ffffffffa11a76e2>] osp_process_config+0x262/0x560 [osp] [51786.134559] [<ffffffffa1124248>] lod_sub_process_config+0xe8/0x1e0 [lod] [51786.134591] [<ffffffffa112b650>] lod_process_config+0x4c0/0x1420 [lod] [51786.134624] [<ffffffff8120f458>] ? cache_alloc_debugcheck_after.isra.45+0x68/0x2d0 [51786.134626] [<ffffffff812127f4>] ? __kmalloc+0x634/0x660 [51786.134732] [<ffffffffa0ff2bb7>] ? mdd_iocontrol+0x327/0x1430 [mdd] [51786.134739] [<ffffffffa0fedf68>] mdd_process_config+0x88/0x5d0 [mdd] [51786.134937] [<ffffffffa105831f>] mdt_device_fini+0x2df/0xfc0 [mdt] [51786.135010] [<ffffffffa036393c>] class_cleanup+0x55c/0xbb0 [obdclass] [51786.135807] [<ffffffffa0364c0c>] class_process_config+0x65c/0x2800 [obdclass] [51786.136295] [<ffffffff8120f458>] ? cache_alloc_debugcheck_after.isra.45+0x68/0x2d0 [51786.136315] [<ffffffff812127f4>] ? __kmalloc+0x634/0x660 [51786.136434] [<ffffffffa020cfa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [51786.136491] [<ffffffffa0366f76>] class_manual_cleanup+0x1c6/0x6d0 [obdclass] [51786.136660] [<ffffffffa03966ee>] server_put_super+0x8ae/0xca0 [obdclass] [51786.136673] [<ffffffff81238f5a>] generic_shutdown_super+0x6a/0xf0 [51786.136676] [<ffffffff81239342>] kill_anon_super+0x12/0x20 [51786.136695] [<ffffffffa03699b2>] lustre_kill_super+0x32/0x50 [obdclass] [51786.136697] [<ffffffff81239739>] deactivate_locked_super+0x49/0x60 [51786.136699] [<ffffffff81239e86>] deactivate_super+0x46/0x60 [51786.136726] [<ffffffff81258f6f>] cleanup_mnt+0x3f/0x80 [51786.136729] [<ffffffff81259002>] __cleanup_mnt+0x12/0x20 [51786.137097] [<ffffffff810b1885>] task_work_run+0xb5/0xf0 [51786.137168] [<ffffffff8102bc22>] do_notify_resume+0x92/0xb0 [51786.137185] [<ffffffff817c5158>] int_signal+0x12/0x17 [51786.202373] BUG: unable to handle kernel paging request at ffff8801bd26ae30 [51786.203262] IP: [<ffffffff810ba28b>] up_write+0xb/0x30 [51786.203262] PGD 241b067 PUD 33f3fe067 PMD 33f214067 PTE 80000001bd26a060 [51786.203262] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC [51786.203262] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common pcspkr virtio_balloon i2c_piix4 virtio_console ip_tables rpcsec_gss_krb5 ata_generic pata_acpi drm_kms_helper ttm drm drm_panel_orientation_quirks ata_piix i2c_core virtio_blk serio_raw libata floppy [last unloaded: libcfs] [51786.203262] CPU: 9 PID: 24709 Comm: mdt04_010 Kdump: loaded Tainted: P OE ------------ 3.10.0-7.6-debug #1 [51786.203262] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [51786.203262] task: ffff8800b49cc3c0 ti: ffff88024e038000 task.ti: ffff88024e038000 [51786.203262] RIP: 0010:[<ffffffff810ba28b>] [<ffffffff810ba28b>] up_write+0xb/0x30 [51786.203262] RSP: 0018:ffff88024e03b8e0 EFLAGS: 00010246 [51786.203262] RAX: 0000000000000000 RBX: ffff8801bd26ae00 RCX: 0000000000000000 [51786.203262] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffff8801bd26ae00 [51786.203262] RBP: ffff88024e03b930 R08: 0000000000000000 R09: 0000000000001933 [51786.203262] R10: 0000000000000198 R11: 0000000000000198 R12: ffff8800ac585e00 [51786.203262] R13: 0000000000000000 R14: 00000000fffffffb R15: ffff880237459540 [51786.203262] FS: 0000000000000000(0000) GS:ffff88033dc40000(0000) knlGS:0000000000000000 [51786.203262] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [51786.203262] CR2: ffff8801bd26ae30 CR3: 0000000001c10000 CR4: 00000000000006e0 [51786.203262] Call Trace: [51786.203262] [<ffffffffa0330fba>] ? llog_cat_add_rec+0x35a/0x8a0 [obdclass] [51786.203262] [<ffffffffa03294aa>] llog_add+0x7a/0x1a0 [obdclass] [51786.203262] [<ffffffffa0357119>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [51786.203262] [<ffffffffa06407c3>] sub_updates_write+0x2b9/0xcb6 [ptlrpc] [51786.203262] [<ffffffffa06244cf>] top_trans_stop+0x65f/0xa10 [ptlrpc] [51786.203262] [<ffffffffa1126b3c>] lod_trans_stop+0x25c/0x340 [lod] [51786.203262] [<ffffffffa101d108>] mdd_trans_stop+0x28/0x16e [mdd] [51786.203262] [<ffffffffa1015701>] mdd_xattr_set+0x371/0x1790 [mdd] [51786.203262] [<ffffffffa1076bb7>] ? mdt_version_save+0x67/0x120 [mdt] [51786.203262] [<ffffffffa1086215>] mdt_reint_setxattr+0x445/0xf90 [mdt] [51786.203262] [<ffffffffa1083ba0>] mdt_reint_rec+0x80/0x210 [mdt] [51786.203262] [<ffffffffa10607e3>] mdt_reint_internal+0x6e3/0xab0 [mdt] [51786.203262] [<ffffffffa1068967>] ? mdt_thread_info_init+0xa7/0x1e0 [mdt] [51786.203262] [<ffffffffa106b927>] mdt_reint+0x67/0x140 [mdt] [51786.203262] [<ffffffffa0613525>] tgt_request_handle+0xaf5/0x1590 [ptlrpc] [51786.203262] [<ffffffffa020cfa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [51786.203262] [<ffffffffa05b6369>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc] [51786.311938] [<ffffffff810bfbd8>] ? __wake_up_common+0x58/0x90 [51786.311938] [<ffffffff813fb7bb>] ? do_raw_spin_unlock+0x4b/0x90 [51786.311938] [<ffffffffa05ba32c>] ptlrpc_main+0xb5c/0x2040 [ptlrpc] [51786.311938] [<ffffffffa05b97d0>] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc] [51786.311938] [<ffffffff810b4ed4>] kthread+0xe4/0xf0 [51786.311938] [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140 [51786.311938] [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21 [51786.311938] [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
so it seems like the freeing is happening from another thread.