  1. Lustre
  2. LU-1110

MDS Oops in osd_xattr_get() during file open by FID


Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.3.0
    • Lustre 2.1.0
    • 3
    • 4677

    Description

      The MDS crashed repeatedly before we were able to identify the offending file/FID and resolve the situation by unlinking it.

      The panic stack looks like the following:
      ======================================
      BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
      IP: [<ffffffffa0b0f05d>] osd_xattr_get+0x7d/0x170 [osd_ldiskfs]
      PGD 0
      Oops: 0000 [#1] SMP
      last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:86:00.1/host14/rport-14:0-0/target14:0:0/14:0:0:2/timeout
      CPU 4
      Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipmi_devintf ipmi_si ipmi_msghandler nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_round_robin dm_multipath usbhid hid ghes i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma hed sg lpfc scsi_transport_fc scsi_tgt igb dca ext4 jbd2 sd_mod crc_t10dif ahci megaraid_sas dm_mod [last unloaded: microcode]
      Pid: 11774, comm: mdt_58 Tainted: G ---------------- T 2.6.32-131.12.1.bl6.Bull.26.x86_64 #1 bullx super-node
      RSP: 0018:ffff8804d21ef490 EFLAGS: 00010202
      RIP: 0010:[<ffffffffa0b0f05d>] [<ffffffffa0b0f05d>] osd_xattr_get+0x7d/0x170 [osd_ldiskfs]
      RSP: 0018:ffff8804d21ef490 EFLAGS: 00010202
      RAX: ffff8808592058c0 RBX: ffff8804d21efe90 RCX: ffffffffa0a824e6
      RDX: ffff88105b220dc0 RSI: ffffffffa0b1b1e0 RDI: ffff8804d21efe90
      RBP: ffff8804d21ef4e0 R08: 0000000000000000 R09: ffffffffa0b17900
      R10: ffff8808592058c0 R11: ffff8808592058d0 R12: ffff8808599d1300
      R13: 0000000000000000 R14: ffff8808599d1300 R15: ffff8810598ca3a8
      FS: 00007fd36d430700(0000) GS:ffff880044e40000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00000000000000f8 CR3: 00000018561a6000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process mdt_58 (pid: 11774, threadinfo ffff8804d21ec000, task ffff8804d21e9740)
      Stack:
      ffff8808586e1800 ffffffffa0a824e6 ffff88105ad26000 0000000000000000
      <0> ffff88087fc103c0 ffff8804d21efe90 ffff8808599d13c0 0000000000000000
      <0> ffff8804d21ef64c ffff8808586e1800 ffff8804d21ef560 ffffffffa0a564e6
      Call Trace:
      [<ffffffffa0a564e6>] mdd_get_md+0x96/0x350 [mdd]
      [<ffffffff8147d876>] ? down_read+0x16/0x30
      [<ffffffffa0a567ec>] mdd_get_md_locked+0x4c/0x70 [mdd]
      [<ffffffffa0a5a563>] mdd_lov_create+0xc43/0x21f0 [mdd]
      [<ffffffffa0a64a2e>] mdd_create_data+0x37e/0x5c0 [mdd]
      [<ffffffffa0a79926>] ? mdd_read_unlock+0x26/0x30 [mdd]
      [<ffffffffa0b35056>] cml_create_data+0xb6/0x260 [cmm]
      [<ffffffffa0b36c29>] ? cml_xattr_get+0x89/0x1d0 [cmm]
      [<ffffffffa06a3190>] ? lustre_swab_mdt_body+0x0/0x150 [ptlrpc]
      [<ffffffffa0ad0c22>] mdt_finish_open+0x13c2/0x18e0 [mdt]
      [<ffffffffa0b3460f>] ? cml_attr_get+0x7f/0x1c0 [cmm]
      [<ffffffffa0ad2b39>] mdt_reint_open+0x19f9/0x2c50 [mdt]
      [<ffffffffa0a71ff6>] ? md_ucred+0x26/0x60 [mdd]
      [<ffffffffa0a9e5f5>] ? mdt_ucred+0x15/0x20 [mdt]
      [<ffffffffa0ab586f>] ? mdt_root_squash+0x2f/0x450 [mdt]
      [<ffffffffa0abaabf>] mdt_reint_rec+0x3f/0x100 [mdt]
      [<ffffffffa06a0b54>] ? lustre_msg_get_flags+0x34/0xa0 [ptlrpc]
      [<ffffffffa0ab2f64>] mdt_reint_internal+0x6d4/0x9f0 [mdt]
      [<ffffffffa0aa057e>] ? mdt_intent_fixup_resent+0x4e/0x270 [mdt]
      [<ffffffffa0ab35e5>] mdt_intent_reint+0x245/0x600 [mdt]
      [<ffffffffa04f4625>] ? cfs_hash_bd_lookup_intent+0xe5/0x130 [libcfs]
      [<ffffffffa06a1f50>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
      [<ffffffffa0aab770>] mdt_intent_policy+0x3c0/0x6b0 [mdt]
      [<ffffffff81042890>] ? fair_enqueue_task_fair+0x190/0x350
      [<ffffffffa058c521>] ? class_handle_hash+0xa1/0x280 [obdclass]
      [<ffffffffa0659afa>] ldlm_lock_enqueue+0x2da/0xa50 [ptlrpc]
      [<ffffffffa0678305>] ? ldlm_export_lock_get+0x15/0x20 [ptlrpc]
      [<ffffffffa04f3692>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
      [<ffffffffa0680227>] ldlm_handle_enqueue0+0x447/0x1090 [ptlrpc]
      [<ffffffffa0aa6f81>] ? mdt_unpack_req_pack_rep+0x51/0x5d0 [mdt]
      [<ffffffffa0aab2ea>] mdt_enqueue+0x4a/0x110 [mdt]
      [<ffffffffa0aa7dd5>] mdt_handle_common+0x8d5/0x1810 [mdt]
      [<ffffffffa069e2d4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      [<ffffffffa0aa8de5>] mdt_regular_handle+0x15/0x20 [mdt]
      [<ffffffffa06af019>] ptlrpc_main+0xc79/0x19d0 [ptlrpc]
      [<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
      [<ffffffffa06ae3a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
      [<ffffffff810041aa>] child_rip+0xa/0x20
      [<ffffffffa06ae3a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
      [<ffffffffa06ae3a0>] ? ptlrpc_main+0x0/0x19d0 [ptlrpc]
      [<ffffffff810041a0>] ? child_rip+0x0/0x20
      Code: 45 c0 49 8b 04 24 f6 40 20 03 75 1f b9 f9 07 00 00 48 c7 c2 57 81 b1 a0 48 c7 c6 e8 8c b1 a0 48 c7 c7 75 87 b1 a0 e8 83 1d 9e ff <49> 8b 85 f8 00 00 00 48 85 c0 0f 84 84 00 00 00 48 83 b8 90 00
      RIP [<ffffffffa0b0f05d>] osd_xattr_get+0x7d/0x170 [osd_ldiskfs]
      RSP <ffff8804d21ef490>
      CR2: 00000000000000f8
      ======================================

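      The Oops occurs because osd_xattr_get() dereferences obj->oo_inode, which is NULL. This matches the trace: the faulting instruction reads offset 0xf8 from R13, R13 is 0, and CR2 is 00000000000000f8.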
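
The fault address 00000000000000f8 in the trace is what one would expect when a member is read through a NULL structure pointer: the address touched is the base (0) plus the member's offset. The following is only an illustrative sketch; `struct sketch_inode`, its padding, and `sketch_fault_address()` are hypothetical names, and the layout is chosen so that a member lands at 0xf8 rather than taken from the real kernel structure.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* uintptr_t */

/* Hypothetical layout: padding chosen so that `field` sits at offset 0xf8.
 * This is NOT the real struct inode layout, only an illustration. */
struct sketch_inode {
    unsigned char pad[0xf8];
    void *field;
};

/* Compile-time check that the illustrative member really is at 0xf8. */
static_assert(offsetof(struct sketch_inode, field) == 0xf8,
              "illustrative member must sit at offset 0xf8");

/* Address the CPU would touch when reading `field` through a pointer
 * whose numeric value is `base`. No memory is dereferenced here. */
static uintptr_t sketch_fault_address(uintptr_t base)
{
    return base + offsetof(struct sketch_inode, field);
}
```

Calling `sketch_fault_address(0)` yields 0xf8, the same shape as the CR2 value reported above for a NULL base pointer.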

      This state appears to come from osd_fid_lookup(), which returns 0 when osd_oi_lookup() returns ENOENT and leaves oo_inode uninitialized.
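
The pattern described above can be modeled in a few lines of user-space C. This is a minimal sketch under stated assumptions, not the osd_ldiskfs code: every `sketch_*` name, the two structures, and the rule that only FID 1 resolves are hypothetical. It shows a lookup that reports success on ENOENT while leaving the inode pointer NULL, and one possible guard in the caller (not necessarily the fix that was eventually applied).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real osd_ldiskfs types. */
struct sketch_inode  { unsigned long ino; };
struct sketch_object { struct sketch_inode *inode; };

static struct sketch_inode sketch_known_inode = { 42 };

/* Stand-in for an object-index lookup: only FID 1 resolves. */
static int sketch_oi_lookup(unsigned long fid, struct sketch_inode **out)
{
    if (fid != 1)
        return -ENOENT;
    *out = &sketch_known_inode;
    return 0;
}

/* Mirrors the reported behavior: ENOENT from the index lookup is turned
 * into a 0 return, and the inode pointer is left NULL. */
static int sketch_fid_lookup(struct sketch_object *obj, unsigned long fid)
{
    int rc;

    assert(obj != NULL);
    obj->inode = NULL;
    rc = sketch_oi_lookup(fid, &obj->inode);
    if (rc == -ENOENT)
        rc = 0;
    return rc;
}

/* Defensive accessor: refuse to dereference a missing inode instead of
 * faulting on it. */
static int sketch_xattr_get(const struct sketch_object *obj,
                            unsigned long *ino)
{
    assert(obj != NULL);
    if (obj->inode == NULL)
        return -ENOENT;
    *ino = obj->inode->ino;
    return 0;
}
```

With an unresolvable FID, `sketch_fid_lookup()` still returns 0, so a caller that trusts the return code alone would go on to dereference a NULL pointer; the check in `sketch_xattr_get()` turns that into an error return.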

      This stems from the extensive direct "by-FID" operations performed by the customer's tools, which may trigger filesystem inconsistencies (possibly causing a FID not to resolve correctly) that the "by-FID" access-method code does not handle.

      The first question that comes to mind: is the "by-FID" feature/access method already available and safe for customer use?

          People

            laisiyao Lai Siyao
            louveta Alexandre Louvet (Inactive)
