[LU-12008] Accessing freed semaphore in llog_cat_add_rec after error Created: 25/Feb/19  Updated: 25/Feb/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I have this crash sporadically happening since Nov 2018 in racer:

It's pretty clear that by the time we return from that llog_write_rec our log handle is no longer valid for some reason even though it was supposedly valid just as we got it.

I cannot see any obvious reasons as of why.

[83062.246113] Lustre: server umount lustre-MDT0000 complete
[83063.937570] LustreError: 14362:0:(llog_cat.c:597:llog_cat_add_rec()) llog_write_rec -5: lh=ffff88009fc5ce00
[83063.956038] BUG: unable to handle kernel paging request at ffff88009fc5ce30
[83063.956917] IP: [<ffffffff810ba28b>] up_write+0xb/0x30
[83063.956917] PGD 241b067 PUD 33fa01067 PMD 33f902067 PTE 800000009fc5c060
[83063.956917] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[83063.956917] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_mod loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic sb_edac edac_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon virtio_console pcspkr i2c_piix4 ip_tables rpcsec_gss_krb5 ata_generic drm_kms_helper crct10dif_pclmul ttm pata_acpi crct10dif_common drm ata_piix crc32c_intel drm_panel_orientation_quirks virtio_blk i2c_core serio_raw libata floppy [last unloaded: dm_flakey]
[83063.956917] CPU: 4 PID: 14362 Comm: mdt02_004 Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #1
[83063.956917] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[83063.956917] task: ffff880083b8ea40 ti: ffff8802d1618000 task.ti: ffff8802d1618000
[83063.956917] RIP: 0010:[<ffffffff810ba28b>]  [<ffffffff810ba28b>] up_write+0xb/0x30
[83063.956917] RSP: 0018:ffff8802d161b8e0  EFLAGS: 00010246
[83063.956917] RAX: 0000000000000000 RBX: ffff88009fc5ce00 RCX: 0000000000000000
[83063.956917] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffff88009fc5ce00
[83063.956917] RBP: ffff8802d161b930 R08: 0000000000000000 R09: 0000000000001cf1
[83063.956917] R10: 0000000000000198 R11: 0000000000000198 R12: ffff88008d8ade00
[83063.956917] R13: 0000000000000000 R14: 00000000fffffffb R15: ffff8802be9f4e40
[83063.956917] FS:  0000000000000000(0000) GS:ffff88033db00000(0000) knlGS:0000000000000000
[83063.956917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[83063.956917] CR2: ffff88009fc5ce30 CR3: 00000000b5452000 CR4: 00000000001607e0
[83063.956917] Call Trace:
[83063.956917]  [<ffffffffa04087aa>] ? llog_cat_add_rec+0x35a/0x8a0 [obdclass]
[83063.956917]  [<ffffffffa04004aa>] llog_add+0x7a/0x1a0 [obdclass]
[83063.956917]  [<ffffffffa042ed09>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[83063.956917]  [<ffffffffa073cae3>] sub_updates_write+0x2b9/0xcb6 [ptlrpc]
[83063.956917]  [<ffffffffa07207ef>] top_trans_stop+0x65f/0xa10 [ptlrpc]
[83063.956917]  [<ffffffffa0df0b3c>] lod_trans_stop+0x25c/0x340 [lod]
[83063.956917]  [<ffffffffa0e1d7bc>] ? lod_xattr_set+0x11c/0xc60 [lod]
[83063.956917]  [<ffffffffa0ce6108>] mdd_trans_stop+0x28/0x16e [mdd]
[83063.956917]  [<ffffffffa0cde701>] mdd_xattr_set+0x371/0x1790 [mdd]
[83063.956917]  [<ffffffffa0cde390>] ? mdd_xattr_split+0xbb0/0xbb0 [mdd]
[83063.956917]  [<ffffffffa0d4f365>] mdt_reint_setxattr+0x445/0xfc0 [mdt]
[83063.956917]  [<ffffffffa0d4cc70>] mdt_reint_rec+0x80/0x210 [mdt]
[83063.956917]  [<ffffffffa0d29860>] mdt_reint_internal+0x770/0xb40 [mdt]
[83063.956917]  [<ffffffffa0d319e7>] ? mdt_thread_info_init+0xa7/0x1e0 [mdt]
[83063.956917]  [<ffffffffa0d349b7>] mdt_reint+0x67/0x140 [mdt]
[83063.956917]  [<ffffffffa070bac5>] tgt_request_handle+0x915/0x15c0 [ptlrpc]
[83063.956917]  [<ffffffffa02befa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[83063.956917]  [<ffffffffa06b2249>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[83063.956917]  [<ffffffff810bfbd8>] ? __wake_up_common+0x58/0x90
[83063.956917]  [<ffffffff813fb7bb>] ? do_raw_spin_unlock+0x4b/0x90
[83063.956917]  [<ffffffffa06b623c>] ptlrpc_main+0xb5c/0x2040 [ptlrpc]
[83063.956917]  [<ffffffff810c33ce>] ? finish_task_switch+0x13e/0x1b0
[83063.956917]  [<ffffffffa06b56e0>] ? ptlrpc_register_service+0xfe0/0xfe0 [ptlrpc]
[83063.956917]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[83063.956917]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[83063.956917]  [<ffffffff817c4c5d>] ret_from_fork_nospec_begin+0x7/0x21
[83063.956917]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140

Important note, in all occurrences we can see server just finished unmounting.

In fact in one case we can see some other unhappiness:

[51785.157945] Lustre: server umount lustre-MDT0001 complete
[51786.124008] LustreError: 24709:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -5: lh=ffff8801bd26ae00
[51786.134062] BUG: sleeping function called from invalid context at kernel/rwsem.c:51
[51786.134064] in_atomic(): 1, irqs_disabled(): 0, pid: 23874, name: umount
[51786.134088] CPU: 3 PID: 23874 Comm: umount Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #1
[51786.134090] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[51786.134092] Call Trace:
[51786.134183]  [<ffffffff817afbf2>] dump_stack+0x19/0x1b
[51786.134221]  [<ffffffff810c3bc9>] __might_sleep+0xd9/0x100
[51786.134226]  [<ffffffff817b6460>] down_write+0x20/0x50
[51786.134363]  [<ffffffffa11b1037>] osp_invalidate+0x177/0x210 [osp]
[51786.134391]  [<ffffffffa11c68a3>] osp_trans_stop_cb+0x133/0x180 [osp]
[51786.134399]  [<ffffffffa11c9647>] osp_trans_callback+0xa7/0xc0 [osp]
[51786.134403]  [<ffffffffa11a72c8>] osp_update_fini+0xc8/0x280 [osp]
[51786.134414]  [<ffffffff810b6050>] ? wake_up_atomic_t+0x30/0x30
[51786.134422]  [<ffffffffa11a76e2>] osp_process_config+0x262/0x560 [osp]
[51786.134559]  [<ffffffffa1124248>] lod_sub_process_config+0xe8/0x1e0 [lod]
[51786.134591]  [<ffffffffa112b650>] lod_process_config+0x4c0/0x1420 [lod]
[51786.134624]  [<ffffffff8120f458>] ? cache_alloc_debugcheck_after.isra.45+0x68/0x2d0
[51786.134626]  [<ffffffff812127f4>] ? __kmalloc+0x634/0x660
[51786.134732]  [<ffffffffa0ff2bb7>] ? mdd_iocontrol+0x327/0x1430 [mdd]
[51786.134739]  [<ffffffffa0fedf68>] mdd_process_config+0x88/0x5d0 [mdd]
[51786.134937]  [<ffffffffa105831f>] mdt_device_fini+0x2df/0xfc0 [mdt]
[51786.135010]  [<ffffffffa036393c>] class_cleanup+0x55c/0xbb0 [obdclass]
[51786.135807]  [<ffffffffa0364c0c>] class_process_config+0x65c/0x2800 [obdclass]
[51786.136295]  [<ffffffff8120f458>] ? cache_alloc_debugcheck_after.isra.45+0x68/0x2d0
[51786.136315]  [<ffffffff812127f4>] ? __kmalloc+0x634/0x660
[51786.136434]  [<ffffffffa020cfa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[51786.136491]  [<ffffffffa0366f76>] class_manual_cleanup+0x1c6/0x6d0 [obdclass]
[51786.136660]  [<ffffffffa03966ee>] server_put_super+0x8ae/0xca0 [obdclass]
[51786.136673]  [<ffffffff81238f5a>] generic_shutdown_super+0x6a/0xf0
[51786.136676]  [<ffffffff81239342>] kill_anon_super+0x12/0x20
[51786.136695]  [<ffffffffa03699b2>] lustre_kill_super+0x32/0x50 [obdclass]
[51786.136697]  [<ffffffff81239739>] deactivate_locked_super+0x49/0x60
[51786.136699]  [<ffffffff81239e86>] deactivate_super+0x46/0x60
[51786.136726]  [<ffffffff81258f6f>] cleanup_mnt+0x3f/0x80
[51786.136729]  [<ffffffff81259002>] __cleanup_mnt+0x12/0x20
[51786.137097]  [<ffffffff810b1885>] task_work_run+0xb5/0xf0
[51786.137168]  [<ffffffff8102bc22>] do_notify_resume+0x92/0xb0
[51786.137185]  [<ffffffff817c5158>] int_signal+0x12/0x17
[51786.202373] BUG: unable to handle kernel paging request at ffff8801bd26ae30
[51786.203262] IP: [<ffffffff810ba28b>] up_write+0xb/0x30
[51786.203262] PGD 241b067 PUD 33f3fe067 PMD 33f214067 PTE 80000001bd26a060
[51786.203262] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[51786.203262] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common pcspkr virtio_balloon i2c_piix4 virtio_console ip_tables rpcsec_gss_krb5 ata_generic pata_acpi drm_kms_helper ttm drm drm_panel_orientation_quirks ata_piix i2c_core virtio_blk serio_raw libata floppy [last unloaded: libcfs]
[51786.203262] CPU: 9 PID: 24709 Comm: mdt04_010 Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #1
[51786.203262] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[51786.203262] task: ffff8800b49cc3c0 ti: ffff88024e038000 task.ti: ffff88024e038000
[51786.203262] RIP: 0010:[<ffffffff810ba28b>]  [<ffffffff810ba28b>] up_write+0xb/0x30
[51786.203262] RSP: 0018:ffff88024e03b8e0  EFLAGS: 00010246
[51786.203262] RAX: 0000000000000000 RBX: ffff8801bd26ae00 RCX: 0000000000000000
[51786.203262] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffff8801bd26ae00
[51786.203262] RBP: ffff88024e03b930 R08: 0000000000000000 R09: 0000000000001933
[51786.203262] R10: 0000000000000198 R11: 0000000000000198 R12: ffff8800ac585e00
[51786.203262] R13: 0000000000000000 R14: 00000000fffffffb R15: ffff880237459540
[51786.203262] FS:  0000000000000000(0000) GS:ffff88033dc40000(0000) knlGS:0000000000000000
[51786.203262] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[51786.203262] CR2: ffff8801bd26ae30 CR3: 0000000001c10000 CR4: 00000000000006e0
[51786.203262] Call Trace:
[51786.203262]  [<ffffffffa0330fba>] ? llog_cat_add_rec+0x35a/0x8a0 [obdclass]
[51786.203262]  [<ffffffffa03294aa>] llog_add+0x7a/0x1a0 [obdclass]
[51786.203262]  [<ffffffffa0357119>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[51786.203262]  [<ffffffffa06407c3>] sub_updates_write+0x2b9/0xcb6 [ptlrpc]
[51786.203262]  [<ffffffffa06244cf>] top_trans_stop+0x65f/0xa10 [ptlrpc]
[51786.203262]  [<ffffffffa1126b3c>] lod_trans_stop+0x25c/0x340 [lod]
[51786.203262]  [<ffffffffa101d108>] mdd_trans_stop+0x28/0x16e [mdd]
[51786.203262]  [<ffffffffa1015701>] mdd_xattr_set+0x371/0x1790 [mdd]
[51786.203262]  [<ffffffffa1076bb7>] ? mdt_version_save+0x67/0x120 [mdt]
[51786.203262]  [<ffffffffa1086215>] mdt_reint_setxattr+0x445/0xf90 [mdt]
[51786.203262]  [<ffffffffa1083ba0>] mdt_reint_rec+0x80/0x210 [mdt]
[51786.203262]  [<ffffffffa10607e3>] mdt_reint_internal+0x6e3/0xab0 [mdt]
[51786.203262]  [<ffffffffa1068967>] ? mdt_thread_info_init+0xa7/0x1e0 [mdt]
[51786.203262]  [<ffffffffa106b927>] mdt_reint+0x67/0x140 [mdt]
[51786.203262]  [<ffffffffa0613525>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
[51786.203262]  [<ffffffffa020cfa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[51786.203262]  [<ffffffffa05b6369>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[51786.311938]  [<ffffffff810bfbd8>] ? __wake_up_common+0x58/0x90
[51786.311938]  [<ffffffff813fb7bb>] ? do_raw_spin_unlock+0x4b/0x90
[51786.311938]  [<ffffffffa05ba32c>] ptlrpc_main+0xb5c/0x2040 [ptlrpc]
[51786.311938]  [<ffffffffa05b97d0>] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[51786.311938]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[51786.311938]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[51786.311938]  [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
[51786.311938]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140

so it seems like the freeing is happening from another thread.


Generated at Sat Feb 10 02:48:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.