Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11457

osd_oi_insert(): the FID is used by two objects

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0
    • Lustre 2.12.0, Lustre 2.12.2, Lustre 2.12.4, Lustre 2.12.5
    • 3
    • 9223372036854775807

    Description

      tag-2.11.55

      MDS crash

      [15650.670434] device-mapper: multipath: Failing path 8:96.^M
      [15650.765276] BUG: unable to handle kernel NULL pointer dereference at           (null)^M
      [15650.775741] IP: [<          (null)>]           (null)^M
      [15650.783081] PGD 0 ^M
      [15650.786948] Oops: 0010 [#1] SMP ^M
      [15650.792218] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15650.934649] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15650.952002] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15650.966961] task: ffff8be6a1253f40 ti: ffff8be68a044000 task.ti: ffff8be68a044000^M
      [15650.977791] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)^M
      [15650.988693] RSP: 0018:ffff8be68a047b58  EFLAGS: 00010246^M
      [15650.997167] RAX: 0000000000000000 RBX: ffff8be68b820000 RCX: 0000000000000002^M
      [15651.007733] RDX: ffffffffc164c7b0 RSI: ffff8be68a047b60 RDI: ffff8be68b820008^M
      [15651.018326] RBP: ffff8be68a047b98 R08: 0000000000000004 R09: 0000000000000000^M
      [15651.028930] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8be26f9fab00^M
      [15651.039547] R13: ffff8be279a448a0 R14: ffff8be68a160000 R15: ffff8be68b820008^M
      [15651.050168] FS:  0000000000000000(0000) GS:ffff8be6ad980000(0000) knlGS:0000000000000000^M
      [15651.061880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
      [15651.070980] CR2: 0000000000000000 CR3: 000000042c3b6000 CR4: 00000000000607e0^M
      [15651.081660] Call Trace:^M
      [15651.087091]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.098058]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.108370]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.117554]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.127457]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.137132]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.146553]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.155992]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.166379]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.177493]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.188033]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.199251]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.209399]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.218931]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.228026]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.237575]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.248365]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.256344]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.265688]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.275475]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.284736] Code:  Bad RIP value.^M
      [15651.290946] RIP  [<          (null)>]           (null)^M
      [15651.299236]  RSP <ffff8be68a047b58>^M
      [15651.305543] CR2: 0000000000000000^M
      [15651.315778] ---[ end trace 4ae4238c00f9aeec ]---^M
      [15651.336386] Kernel panic - not syncing: Fatal exception^M
      [15651.344613] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M
      [15651.369289] ------------[ cut here ]------------^M
      [15651.376397] WARNING: CPU: 14 PID: 9491 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x65/0x70^M
      [15651.388915] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15651.529620] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P      D    OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15651.546472] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15651.561156] Call Trace:^M
      [15651.566023]  <IRQ>  [<ffffffff8430e84e>] dump_stack+0x19/0x1b^M
      [15651.574646]  [<ffffffff83c91e18>] __warn+0xd8/0x100^M
      [15651.582224]  [<ffffffff83c91f5d>] warn_slowpath_null+0x1d/0x20^M
      [15651.590851]  [<ffffffff83c54e95>] native_smp_send_reschedule+0x65/0x70^M
      [15651.600279]  [<ffffffff83cddf81>] trigger_load_balance+0x191/0x280^M
      [15651.609280]  [<ffffffff83ccdc0a>] scheduler_tick+0x10a/0x150^M
      [15651.617702]  [<ffffffff83d01c10>] ? tick_sched_do_timer+0x50/0x50^M
      [15651.626619]  [<ffffffff83ca4f65>] update_process_times+0x65/0x80^M
      [15651.635416]  [<ffffffff83d01a10>] tick_sched_handle+0x30/0x70^M
      [15651.643916]  [<ffffffff83d01c49>] tick_sched_timer+0x39/0x80^M
      [15651.652315]  [<ffffffff83cbf7e6>] __hrtimer_run_queues+0xd6/0x260^M
      [15651.661210]  [<ffffffff83cbfd7f>] hrtimer_interrupt+0xaf/0x1d0^M
      [15651.669814]  [<ffffffff83c5847b>] local_apic_timer_interrupt+0x3b/0x60^M
      [15651.679184]  [<ffffffff84325063>] smp_apic_timer_interrupt+0x43/0x60^M
      [15651.688352]  [<ffffffff843217b2>] apic_timer_interrupt+0x162/0x170^M
      [15651.697316]  <EOI>  [<ffffffff84308c3d>] ? panic+0x1d5/0x21f^M
      [15651.705715]  [<ffffffff84308ba1>] ? panic+0x139/0x21f^M
      [15651.713430]  [<ffffffff84318745>] oops_end+0xc5/0xe0^M
      [15651.721020]  [<ffffffff8430807e>] no_context+0x285/0x2a8^M
      [15651.728984]  [<ffffffff84308115>] __bad_area_nosemaphore+0x74/0x1d1^M
      [15651.738014]  [<ffffffff84308286>] bad_area_nosemaphore+0x14/0x16^M
      [15651.746760]  [<ffffffff8431b6e0>] __do_page_fault+0x330/0x4f0^M
      [15651.755199]  [<ffffffff8431b8d5>] do_page_fault+0x35/0x90^M
      [15651.763264]  [<ffffffff84317758>] page_fault+0x28/0x30^M
      [15651.771013]  [<ffffffffc164c7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]^M
      [15651.781105]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.791402]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.801028]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.809517]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.818761]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.827808]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.836533]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.845269]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.854937]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.865302]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.875084]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.885522]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.894881]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.903621]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.911880]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.920585]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.930460]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.937468]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.945794]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.954564]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.962806] ---[ end trace 4ae4238c00f9aeed ]---^M
      
      

      Attachments

        Issue Links

          Activity

            [LU-11457] osd_oi_insert(): the FID is used by two objects
            adilger Andreas Dilger made changes -
            Link New: This issue is related to EX-8253 [ EX-8253 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to DDN-4234 [ DDN-4234 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.16.0 [ 15190 ]
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51601/
            Subject: LU-11457 osd-ldiskfs: scrub FID reuse
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5c59f8551a96ec0b3bbb935a3173c75e878d9ff0

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51601/ Subject: LU-11457 osd-ldiskfs: scrub FID reuse Project: fs/lustre-release Branch: master Current Patch Set: Commit: 5c59f8551a96ec0b3bbb935a3173c75e878d9ff0

            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51601
            Subject: LU-11457 osd-ldiskfs: scrub FID reuse
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a2627b8c26d5b0d5580971f227d1cb11d8ae8669

            gerrit Gerrit Updater added a comment - "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51601 Subject: LU-11457 osd-ldiskfs: scrub FID reuse Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a2627b8c26d5b0d5580971f227d1cb11d8ae8669
            pjones Peter Jones made changes -
            Link New: This issue is related to DDN-3995 [ DDN-3995 ]

            This was hit on a customer OST. It looks like this can happen if a block in the inode table overwrites another in-use block in the inode table, and then there are two inodes that have the same content (lma FID, blocks), but different O/<seq>/dX/<oid> pathnames. Alternately, it is possible that the original object was cleared by e2fsck (maybe due to duplicate block processing) and a new empty object was created in its place with the same FID (it looks like both happened). OI Scrub was modified to do a full O/ tree walk in LU-13124 and check multiply-linked objects, but in this case it should detect the path/FID mismatch and either:

            • replace the "new" object if it has no blocks, but "bad" object has blocks
            • remove the "bad" object has no blocks or the same blocks, it should be removed
            adilger Andreas Dilger added a comment - This was hit on a customer OST. It looks like this can happen if a block in the inode table overwrites another in-use block in the inode table, and then there are two inodes that have the same content ( lma FID, blocks), but different O/<seq>/dX/<oid> pathnames. Alternately, it is possible that the original object was cleared by e2fsck (maybe due to duplicate block processing) and a new empty object was created in its place with the same FID (it looks like both happened). OI Scrub was modified to do a full O/ tree walk in LU-13124 and check multiply-linked objects, but in this case it should detect the path/FID mismatch and either: replace the "new" object if it has no blocks, but "bad" object has blocks remove the "bad" object has no blocks or the same blocks, it should be removed
            adilger Andreas Dilger made changes -
            Link New: This issue is related to DDN-2690 [ DDN-2690 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to DDN-370 [ DDN-370 ]

            People

              ys Yang Sheng
              sarah Sarah Liu
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: