Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Environment: 2.5.3-2.6.32_431.29.2.el6.atlas.x86_64_g57d5785
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      This morning a production MDS hit an assertion:

      <0>[2551157.740086] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) ASSERTION( dir->i_op != ((void *)0) && dir->i_op->lookup != ((void *)0) ) failed: 
      <0>[2551157.756253] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) LBUG
      <4>[2551157.764766] Pid: 14993, comm: mdt01_094
      <4>[2551157.769360] 
      <4>[2551157.769361] Call Trace:
      <4>[2551157.774374]  [<ffffffffa0409895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4>[2551157.782474]  [<ffffffffa0409e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4>[2551157.789707]  [<ffffffffa0ca8fcf>] osd_index_ea_lookup+0x6ff/0x8a0 [osd_ldiskfs]
      <4>[2551157.798308]  [<ffffffffa0d0dde0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      <4>[2551157.805733]  [<ffffffffa088c7c0>] ? lod_index_lookup+0x0/0x30 [lod]
      <4>[2551157.813056]  [<ffffffffa088c7e5>] lod_index_lookup+0x25/0x30 [lod]
      <4>[2551157.820291]  [<ffffffffa0dd0daa>] __mdd_lookup+0x24a/0x440 [mdd]
      <4>[2551157.827325]  [<ffffffffa0dd1599>] mdd_lookup+0x39/0xe0 [mdd]
      <4>[2551157.833977]  [<ffffffffa0d3bee5>] ? mdt_name+0x35/0xc0 [mdt]
      <4>[2551157.840629]  [<ffffffffa0d44b09>] mdt_reint_open+0xb69/0x21a0 [mdt]
      <4>[2551157.847959]  [<ffffffffa0426376>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
      <4>[2551157.856570]  [<ffffffffa05c7a80>] ? lu_ucred+0x20/0x30 [obdclass]
      <4>[2551157.863705]  [<ffffffffa0d2d481>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4>[2551157.870643]  [<ffffffffa0d12ed3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      <4>[2551157.878254]  [<ffffffffa0d1345e>] mdt_intent_reint+0x1ee/0x410 [mdt]
      <4>[2551157.885669]  [<ffffffffa0d10c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      <4>[2551157.893212]  [<ffffffffa06e42e5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      <4>[2551157.901044]  [<ffffffffa070de2b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      <4>[2551157.909336]  [<ffffffffa0d11106>] mdt_enqueue+0x46/0xe0 [mdt]
      <4>[2551157.916083]  [<ffffffffa0d15ada>] mdt_handle_common+0x52a/0x1470 [mdt]
      <4>[2551157.923701]  [<ffffffffa0d52595>] mds_regular_handle+0x15/0x20 [mdt]
      <4>[2551157.931144]  [<ffffffffa073cf25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      <4>[2551157.940128]  [<ffffffffa040a4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4>[2551157.947452]  [<ffffffffa041b7c5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      <4>[2551157.955380]  [<ffffffffa07358f9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      <4>[2551157.963287]  [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4>[2551157.970142]  [<ffffffffa073f6ed>] ptlrpc_main+0xaed/0x1930 [ptlrpc]
      <4>[2551157.977487]  [<ffffffffa073ec00>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
      <4>[2551157.984809]  [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4>[2551157.990580]  [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4>[2551157.998367]  [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4>[2551158.004229]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>[2551158.010272] 
      <0>[2551158.012746] Kernel panic - not syncing: LBUG
      

      Attachments

      Activity

      [LU-6996] osd_ea_lookup_rec assertion
          pjones Peter Jones added a comment -

          Landed for 2.8

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16157/
          Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 7aaa680b7f22e7dfaac8af38b78d89164a94e842

          yujian Jian Yu added a comment -

          Hi Alex,

          Nasf is working on patches to handle the stale OI mapping cache, but he is unsure of the root cause of the original issue in this ticket. Could you please give some more suggestions here?

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16157
          Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: ea2bb60f6b85e53ea43ba240ef1c7e3ef809595c

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16137
          Subject: LU-6996 osd: test b2_5 base
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: f78965727e7a86786cdb9fe7a15389721be1b1bd

          yong.fan nasf (Inactive) added a comment -

          On the server side, an RPC service thread may cache an OI mapping on its stack. That cached mapping becomes invalid if another RPC service thread removes the object in the meantime. If the first thread then uses the stale mapping and finds an inode that has been unlinked and reused by another object whose LMA has not been generated yet, the original osd_check_lma() would wrongly accept it as the expected local object. This is one possible cause of the failure in this ticket.

          That said, it is only a possible cause; from the given stack traces and logs we cannot say the patch will fix the failure completely. Please apply the patch and see what happens.

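          The stale-mapping race nasf describes can be sketched in plain C. This is an illustrative model only, not Lustre code: the `fid`, `inode_sim`, and `oi_cache` structs and the two `check_lma_*` functions below are simplified stand-ins for the real FID/LMA/OI machinery, invented here to show why a recycled inode with no LMA yet could pass the old check.

          ```c
          /*
           * Illustrative sketch only -- NOT actual Lustre code.  Models the
           * stale OI-mapping race: a service thread caches a FID->inode
           * mapping on its stack, the object is unlinked and its inode
           * number is reused before the new object's LMA is written, and a
           * lookup through the stale mapping must not accept the recycled
           * inode as the original object.
           */
          #include <stdbool.h>
          #include <stdio.h>

          struct fid { unsigned long long seq, oid; }; /* simplified FID    */
          struct inode_sim {                           /* simulated inode   */
              bool has_lma;                            /* LMA written yet?  */
              struct fid lma_fid;                      /* FID in the LMA    */
          };
          /* One cached OI mapping held on a thread's stack. */
          struct oi_cache { struct fid fid; unsigned long long ino; };

          static bool fid_eq(const struct fid *a, const struct fid *b)
          {
              return a->seq == b->seq && a->oid == b->oid;
          }

          /*
           * Old behaviour: an inode with no LMA is assumed to be the object
           * we looked up -- so a recycled inode whose LMA has not been
           * written yet is mistaken for the original object.
           */
          static bool check_lma_old(const struct inode_sim *inode,
                                    const struct fid *fid)
          {
              return !inode->has_lma || fid_eq(&inode->lma_fid, fid);
          }

          /*
           * Fixed behaviour (per the patch description): a missing LMA on a
           * cached mapping is suspect, so re-verify against the
           * authoritative OI table before trusting it.
           */
          static bool check_lma_fixed(const struct inode_sim *inode,
                                      const struct fid *fid,
                                      unsigned long long cached_ino,
                                      unsigned long long oi_ino)
          {
              if (inode->has_lma)
                  return fid_eq(&inode->lma_fid, fid);
              return cached_ino == oi_ino; /* stale cache => mismatch */
          }

          int main(void)
          {
              struct fid fid_a = { .seq = 0x200000401ULL, .oid = 0x1234 };

              /* Thread caches the mapping fid_a -> ino 42 on its stack. */
              struct oi_cache cache = { .fid = fid_a, .ino = 42 };

              /* Meanwhile the object is removed and ino 42 is reused by a
               * new object whose LMA has not been written yet. */
              struct inode_sim recycled = { .has_lma = false };

              /* The authoritative OI table no longer maps fid_a to 42. */
              unsigned long long oi_ino = 0; /* 0 = no mapping */

              printf("old check accepts recycled inode: %s\n",
                     check_lma_old(&recycled, &cache.fid) ? "yes" : "no");
              printf("fixed check accepts recycled inode: %s\n",
                     check_lma_fixed(&recycled, &cache.fid,
                                     cache.ino, oi_ino) ? "yes" : "no");
              return 0;
          }
          ```

          Run as written, the old check prints "yes" (it accepts the recycled inode) while the fixed check prints "no", which is the behavioural difference the patch is after.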
          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16026
          Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: 4bb351a4e4c76c8538532dcbfb7829dfea35aed0

          yujian Jian Yu added a comment -

          Thank you, Matt. Alex will look into the stack traces and ask you for help getting more logs if needed.

          ezell Matt Ezell added a comment -

          The output of 'bt -a' and 'ps' from crash. Let me know if you need the backtrace from any idle PIDs.

          pjones Peter Jones added a comment -

          Alex

          Could you please look into this issue?

          Thanks

          Peter

          People

            Assignee: bzzz Alex Zhuravlev
            Reporter: ezell Matt Ezell
            Votes: 0
            Watchers: 11
