Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12008

Accessing freed semaphore in llog_cat_add_rec after error

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.13.0
    • None
    • 3
    • 9223372036854775807

    Description

      I have this crash sporadically happening since Nov 2018 in racer:

      It's pretty clear that by the time we return from that llog_write_rec our log handle is no longer valid for some reason even though it was supposedly valid just as we got it.

      I cannot see any obvious reasons as of why.

      [83062.246113] Lustre: server umount lustre-MDT0000 complete
      [83063.937570] LustreError: 14362:0:(llog_cat.c:597:llog_cat_add_rec()) llog_write_rec -5: lh=ffff88009fc5ce00
      [83063.956038] BUG: unable to handle kernel paging request at ffff88009fc5ce30
      [83063.956917] IP: [<ffffffff810ba28b>] up_write+0xb/0x30
      [83063.956917] PGD 241b067 PUD 33fa01067 PMD 33f902067 PTE 800000009fc5c060
      [83063.956917] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      [83063.956917] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_mod loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic sb_edac edac_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon virtio_console pcspkr i2c_piix4 ip_tables rpcsec_gss_krb5 ata_generic drm_kms_helper crct10dif_pclmul ttm pata_acpi crct10dif_common drm ata_piix crc32c_intel drm_panel_orientation_quirks virtio_blk i2c_core serio_raw libata floppy [last unloaded: dm_flakey]
      [83063.956917] CPU: 4 PID: 14362 Comm: mdt02_004 Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #1
      [83063.956917] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [83063.956917] task: ffff880083b8ea40 ti: ffff8802d1618000 task.ti: ffff8802d1618000
      [83063.956917] RIP: 0010:[<ffffffff810ba28b>]  [<ffffffff810ba28b>] up_write+0xb/0x30
      [83063.956917] RSP: 0018:ffff8802d161b8e0  EFLAGS: 00010246
      [83063.956917] RAX: 0000000000000000 RBX: ffff88009fc5ce00 RCX: 0000000000000000
      [83063.956917] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffff88009fc5ce00
      [83063.956917] RBP: ffff8802d161b930 R08: 0000000000000000 R09: 0000000000001cf1
      [83063.956917] R10: 0000000000000198 R11: 0000000000000198 R12: ffff88008d8ade00
      [83063.956917] R13: 0000000000000000 R14: 00000000fffffffb R15: ffff8802be9f4e40
      [83063.956917] FS:  0000000000000000(0000) GS:ffff88033db00000(0000) knlGS:0000000000000000
      [83063.956917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [83063.956917] CR2: ffff88009fc5ce30 CR3: 00000000b5452000 CR4: 00000000001607e0
      [83063.956917] Call Trace:
      [83063.956917]  [<ffffffffa04087aa>] ? llog_cat_add_rec+0x35a/0x8a0 [obdclass]
      [83063.956917]  [<ffffffffa04004aa>] llog_add+0x7a/0x1a0 [obdclass]
      [83063.956917]  [<ffffffffa042ed09>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [83063.956917]  [<ffffffffa073cae3>] sub_updates_write+0x2b9/0xcb6 [ptlrpc]
      [83063.956917]  [<ffffffffa07207ef>] top_trans_stop+0x65f/0xa10 [ptlrpc]
      [83063.956917]  [<ffffffffa0df0b3c>] lod_trans_stop+0x25c/0x340 [lod]
      [83063.956917]  [<ffffffffa0e1d7bc>] ? lod_xattr_set+0x11c/0xc60 [lod]
      [83063.956917]  [<ffffffffa0ce6108>] mdd_trans_stop+0x28/0x16e [mdd]
      [83063.956917]  [<ffffffffa0cde701>] mdd_xattr_set+0x371/0x1790 [mdd]
      [83063.956917]  [<ffffffffa0cde390>] ? mdd_xattr_split+0xbb0/0xbb0 [mdd]
      [83063.956917]  [<ffffffffa0d4f365>] mdt_reint_setxattr+0x445/0xfc0 [mdt]
      [83063.956917]  [<ffffffffa0d4cc70>] mdt_reint_rec+0x80/0x210 [mdt]
      [83063.956917]  [<ffffffffa0d29860>] mdt_reint_internal+0x770/0xb40 [mdt]
      [83063.956917]  [<ffffffffa0d319e7>] ? mdt_thread_info_init+0xa7/0x1e0 [mdt]
      [83063.956917]  [<ffffffffa0d349b7>] mdt_reint+0x67/0x140 [mdt]
      [83063.956917]  [<ffffffffa070bac5>] tgt_request_handle+0x915/0x15c0 [ptlrpc]
      [83063.956917]  [<ffffffffa02befa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [83063.956917]  [<ffffffffa06b2249>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
      [83063.956917]  [<ffffffff810bfbd8>] ? __wake_up_common+0x58/0x90
      [83063.956917]  [<ffffffff813fb7bb>] ? do_raw_spin_unlock+0x4b/0x90
      [83063.956917]  [<ffffffffa06b623c>] ptlrpc_main+0xb5c/0x2040 [ptlrpc]
      [83063.956917]  [<ffffffff810c33ce>] ? finish_task_switch+0x13e/0x1b0
      [83063.956917]  [<ffffffffa06b56e0>] ? ptlrpc_register_service+0xfe0/0xfe0 [ptlrpc]
      [83063.956917]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
      [83063.956917]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
      [83063.956917]  [<ffffffff817c4c5d>] ret_from_fork_nospec_begin+0x7/0x21
      [83063.956917]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
      

      Important note, in all occurrences we can see server just finished unmounting.

      In fact in one case we can see some other unhappiness:

      [51785.157945] Lustre: server umount lustre-MDT0001 complete
      [51786.124008] LustreError: 24709:0:(llog_cat.c:576:llog_cat_add_rec()) llog_write_rec -5: lh=ffff8801bd26ae00
      [51786.134062] BUG: sleeping function called from invalid context at kernel/rwsem.c:51
      [51786.134064] in_atomic(): 1, irqs_disabled(): 0, pid: 23874, name: umount
      [51786.134088] CPU: 3 PID: 23874 Comm: umount Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #1
      [51786.134090] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [51786.134092] Call Trace:
      [51786.134183]  [<ffffffff817afbf2>] dump_stack+0x19/0x1b
      [51786.134221]  [<ffffffff810c3bc9>] __might_sleep+0xd9/0x100
      [51786.134226]  [<ffffffff817b6460>] down_write+0x20/0x50
      [51786.134363]  [<ffffffffa11b1037>] osp_invalidate+0x177/0x210 [osp]
      [51786.134391]  [<ffffffffa11c68a3>] osp_trans_stop_cb+0x133/0x180 [osp]
      [51786.134399]  [<ffffffffa11c9647>] osp_trans_callback+0xa7/0xc0 [osp]
      [51786.134403]  [<ffffffffa11a72c8>] osp_update_fini+0xc8/0x280 [osp]
      [51786.134414]  [<ffffffff810b6050>] ? wake_up_atomic_t+0x30/0x30
      [51786.134422]  [<ffffffffa11a76e2>] osp_process_config+0x262/0x560 [osp]
      [51786.134559]  [<ffffffffa1124248>] lod_sub_process_config+0xe8/0x1e0 [lod]
      [51786.134591]  [<ffffffffa112b650>] lod_process_config+0x4c0/0x1420 [lod]
      [51786.134624]  [<ffffffff8120f458>] ? cache_alloc_debugcheck_after.isra.45+0x68/0x2d0
      [51786.134626]  [<ffffffff812127f4>] ? __kmalloc+0x634/0x660
      [51786.134732]  [<ffffffffa0ff2bb7>] ? mdd_iocontrol+0x327/0x1430 [mdd]
      [51786.134739]  [<ffffffffa0fedf68>] mdd_process_config+0x88/0x5d0 [mdd]
      [51786.134937]  [<ffffffffa105831f>] mdt_device_fini+0x2df/0xfc0 [mdt]
      [51786.135010]  [<ffffffffa036393c>] class_cleanup+0x55c/0xbb0 [obdclass]
      [51786.135807]  [<ffffffffa0364c0c>] class_process_config+0x65c/0x2800 [obdclass]
      [51786.136295]  [<ffffffff8120f458>] ? cache_alloc_debugcheck_after.isra.45+0x68/0x2d0
      [51786.136315]  [<ffffffff812127f4>] ? __kmalloc+0x634/0x660
      [51786.136434]  [<ffffffffa020cfa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [51786.136491]  [<ffffffffa0366f76>] class_manual_cleanup+0x1c6/0x6d0 [obdclass]
      [51786.136660]  [<ffffffffa03966ee>] server_put_super+0x8ae/0xca0 [obdclass]
      [51786.136673]  [<ffffffff81238f5a>] generic_shutdown_super+0x6a/0xf0
      [51786.136676]  [<ffffffff81239342>] kill_anon_super+0x12/0x20
      [51786.136695]  [<ffffffffa03699b2>] lustre_kill_super+0x32/0x50 [obdclass]
      [51786.136697]  [<ffffffff81239739>] deactivate_locked_super+0x49/0x60
      [51786.136699]  [<ffffffff81239e86>] deactivate_super+0x46/0x60
      [51786.136726]  [<ffffffff81258f6f>] cleanup_mnt+0x3f/0x80
      [51786.136729]  [<ffffffff81259002>] __cleanup_mnt+0x12/0x20
      [51786.137097]  [<ffffffff810b1885>] task_work_run+0xb5/0xf0
      [51786.137168]  [<ffffffff8102bc22>] do_notify_resume+0x92/0xb0
      [51786.137185]  [<ffffffff817c5158>] int_signal+0x12/0x17
      [51786.202373] BUG: unable to handle kernel paging request at ffff8801bd26ae30
      [51786.203262] IP: [<ffffffff810ba28b>] up_write+0xb/0x30
      [51786.203262] PGD 241b067 PUD 33f3fe067 PMD 33f214067 PTE 80000001bd26a060
      [51786.203262] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      [51786.203262] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common pcspkr virtio_balloon i2c_piix4 virtio_console ip_tables rpcsec_gss_krb5 ata_generic pata_acpi drm_kms_helper ttm drm drm_panel_orientation_quirks ata_piix i2c_core virtio_blk serio_raw libata floppy [last unloaded: libcfs]
      [51786.203262] CPU: 9 PID: 24709 Comm: mdt04_010 Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #1
      [51786.203262] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [51786.203262] task: ffff8800b49cc3c0 ti: ffff88024e038000 task.ti: ffff88024e038000
      [51786.203262] RIP: 0010:[<ffffffff810ba28b>]  [<ffffffff810ba28b>] up_write+0xb/0x30
      [51786.203262] RSP: 0018:ffff88024e03b8e0  EFLAGS: 00010246
      [51786.203262] RAX: 0000000000000000 RBX: ffff8801bd26ae00 RCX: 0000000000000000
      [51786.203262] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffff8801bd26ae00
      [51786.203262] RBP: ffff88024e03b930 R08: 0000000000000000 R09: 0000000000001933
      [51786.203262] R10: 0000000000000198 R11: 0000000000000198 R12: ffff8800ac585e00
      [51786.203262] R13: 0000000000000000 R14: 00000000fffffffb R15: ffff880237459540
      [51786.203262] FS:  0000000000000000(0000) GS:ffff88033dc40000(0000) knlGS:0000000000000000
      [51786.203262] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [51786.203262] CR2: ffff8801bd26ae30 CR3: 0000000001c10000 CR4: 00000000000006e0
      [51786.203262] Call Trace:
      [51786.203262]  [<ffffffffa0330fba>] ? llog_cat_add_rec+0x35a/0x8a0 [obdclass]
      [51786.203262]  [<ffffffffa03294aa>] llog_add+0x7a/0x1a0 [obdclass]
      [51786.203262]  [<ffffffffa0357119>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [51786.203262]  [<ffffffffa06407c3>] sub_updates_write+0x2b9/0xcb6 [ptlrpc]
      [51786.203262]  [<ffffffffa06244cf>] top_trans_stop+0x65f/0xa10 [ptlrpc]
      [51786.203262]  [<ffffffffa1126b3c>] lod_trans_stop+0x25c/0x340 [lod]
      [51786.203262]  [<ffffffffa101d108>] mdd_trans_stop+0x28/0x16e [mdd]
      [51786.203262]  [<ffffffffa1015701>] mdd_xattr_set+0x371/0x1790 [mdd]
      [51786.203262]  [<ffffffffa1076bb7>] ? mdt_version_save+0x67/0x120 [mdt]
      [51786.203262]  [<ffffffffa1086215>] mdt_reint_setxattr+0x445/0xf90 [mdt]
      [51786.203262]  [<ffffffffa1083ba0>] mdt_reint_rec+0x80/0x210 [mdt]
      [51786.203262]  [<ffffffffa10607e3>] mdt_reint_internal+0x6e3/0xab0 [mdt]
      [51786.203262]  [<ffffffffa1068967>] ? mdt_thread_info_init+0xa7/0x1e0 [mdt]
      [51786.203262]  [<ffffffffa106b927>] mdt_reint+0x67/0x140 [mdt]
      [51786.203262]  [<ffffffffa0613525>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
      [51786.203262]  [<ffffffffa020cfa7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [51786.203262]  [<ffffffffa05b6369>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
      [51786.311938]  [<ffffffff810bfbd8>] ? __wake_up_common+0x58/0x90
      [51786.311938]  [<ffffffff813fb7bb>] ? do_raw_spin_unlock+0x4b/0x90
      [51786.311938]  [<ffffffffa05ba32c>] ptlrpc_main+0xb5c/0x2040 [ptlrpc]
      [51786.311938]  [<ffffffffa05b97d0>] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
      [51786.311938]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
      [51786.311938]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
      [51786.311938]  [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
      [51786.311938]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
      

      so it seems like the freeing is happening from another thread.

      Attachments

        Activity

          People

            wc-triage WC Triage
            green Oleg Drokin
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: