Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6719

(osd_object.c:410:osd_object_init()) lsd-MDT0000: lookup [0x200042d88:0x1885d:0x0]/0x45f89372 failed: rc = -17

Details

    • 3
    • 9223372036854775807

    Description

      On at least one of our MDTs, we have gotten the following error many times:

      2015-05-27 14:57:28 LustreError: 14118:0:(osd_object.c:410:osd_object_init()) lsd-MDT0000: lookup [0x200042d88:0x1885d:0x0]/0x45f89372 failed: rc = -17

      -17 is EEXIST.

      There are no other Lustre messages in the console logs anywhere near these messages.

      We are using ZFS OSDs.

      Lustre version 2.5.3-5chaos. (See github.com/chaos/lustre)

      If this is a bug, then please work on a fix. If this is not a bug, then please silence the scary message.

      Attachments

        Activity

          [LU-6719] (osd_object.c:410:osd_object_init()) lsd-MDT0000: lookup [0x200042d88:0x1885d:0x0]/0x45f89372 failed: rc = -17

          Could you please provide a backported patch for Lustre 2.5?

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - Could you please provide a backported patch for Lustre 2.5?

          Patch has landed for 2.8

          jgmitter Joseph Gmitter (Inactive) added a comment - Patch has landed for 2.8

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18054/
          Subject: LU-6719 osd-zfs: Ignore EEXIST during object init
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: d96c38aa85d80f5e0630c28d242f8279987c8d5e

          gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18054/ Subject: LU-6719 osd-zfs: Ignore EEXIST during object init Project: fs/lustre-release Branch: master Current Patch Set: Commit: d96c38aa85d80f5e0630c28d242f8279987c8d5e

          Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/18054
          Subject: LU-6719 obdclass: Retry lookup
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: f6c8ff3e40e8fa403d23192a3b8f819a6f8a0e77

          gerrit Gerrit Updater added a comment - Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/18054 Subject: LU-6719 obdclass: Retry lookup Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: f6c8ff3e40e8fa403d23192a3b8f819a6f8a0e77

          With +info added to debugging, the logs cycle rather quickly. Hopefully there is enough info to find where the code is having problems.

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - With +info added to debugging, the logs cycle rather quickly. Hopefully there is enough info to find where the code is having problems.

          Quick triage of debug_log attached, I find one instance of the message. I've pulled out all the references to that fid from the log:

          Sun Jan  3 20:38:12 2016 [23949] (mdt_handler.c:2488:mdt_object_new()) Allocate object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23949] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23949] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 1
          Sun Jan  3 20:38:12 2016 [23837] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2
          Sun Jan  3 20:38:12 2016 [24058] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1571 pid: 23949 timeout: 0 lvb_type: 0 used 402457368
          Sun Jan  3 20:38:12 2016 [23949] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23949] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2
          Sun Jan  3 20:38:12 2016 [23949] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1571 pid: 23949 timeout: 0 lvb_type: 0 used 402457368
          Sun Jan  3 20:38:12 2016 [23949] (mdt_lib.c:545:mdt_dump_lmm()) objid 0x1c752:444126, magic 0x0BD10BD0, pattern 0X1
          Sun Jan  3 20:38:12 2016 [23911] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23837] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23837] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2
          Sun Jan  3 20:38:12 2016 [24058] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1570 pid: 23949 timeout: 0 lvb_type: 0 used 402457368
          Sun Jan  3 20:38:12 2016 [24058] (mdt_lib.c:545:mdt_dump_lmm()) objid 0x1c752:444126, magic 0x0BD10BD0, pattern 0X1
          Sun Jan  3 20:38:12 2016 [23813] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [21992] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23837] (osd_object.c:410:osd_object_init()) lse-MDT0000: lookup [0x20006c6de:0x1c752:0x0]/0x4649efde failed: rc = -17
          Sun Jan  3 20:38:12 2016 [23837] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          
          utopiabound Nathaniel Clark added a comment - Quick triage of debug_log attached, I find one instance of the message. I've pulled out all the references to that fid from the log: Sun Jan 3 20:38:12 2016 [23949] (mdt_handler.c:2488:mdt_object_new()) Allocate object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23949] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23949] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 1 Sun Jan 3 20:38:12 2016 [23837] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2 Sun Jan 3 20:38:12 2016 [24058] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1571 pid: 23949 timeout: 0 lvb_type: 0 used 402457368 Sun Jan 3 20:38:12 2016 [23949] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23949] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2 Sun Jan 3 20:38:12 2016 [23949] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1571 pid: 23949 timeout: 0 lvb_type: 0 used 402457368 Sun Jan 3 20:38:12 2016 [23949] (mdt_lib.c:545:mdt_dump_lmm()) objid 0x1c752:444126, magic 0x0BD10BD0, pattern 0X1 Sun Jan 3 20:38:12 2016 [23911] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23837] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23837] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2 Sun Jan 3 20:38:12 2016 [24058] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1570 pid: 23949 timeout: 0 lvb_type: 0 used 402457368 Sun Jan 3 20:38:12 2016 [24058] (mdt_lib.c:545:mdt_dump_lmm()) objid 0x1c752:444126, magic 0x0BD10BD0, pattern 0X1 Sun Jan 3 20:38:12 2016 [23813] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [21992] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2 Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23837] (osd_object.c:410:osd_object_init()) lse-MDT0000: lookup [0x20006c6de:0x1c752:0x0]/0x4649efde failed: rc = -17 Sun Jan 3 20:38:12 2016 [23837] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]

          I have attached a debug log with +info added to debugging.

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - I have attached a debug log with +info added to debugging.

          Yes. That would be helpful to figure out how best to resolve this issue.

          utopiabound Nathaniel Clark added a comment - Yes. That would be helpful to figure out how best to resolve this issue.

          This is happening fairly regularly. Would you like me to enable +info for a day and capture some dumps?

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - This is happening fairly regularly. Would you like me to enable +info for a day and capture some dumps?

          It appears from my reading that lustre/obdclass/lu_object.c::lu_object_find_try() does the following:

          1. lock hash
            • Check to see if fid has an object in has
          2. unlock hash
          3. if object not in hash, create an object and initialize it
          4. lock hash
            • Check to see if object is in hash, if not add it, if it is destroy new object
          5. unlock hash
          6. return object that was in cache

          It looks like multiple attempts can be made to create the same FID, thus resulting in an error being returned because the first one succeeds, and later ones can't initialize correctly.

          If you traced with +info, I'm pretty sure you'd see either mdt_object_new() or mdt_object_find() directly before this error message.

          Possible issue: lu_object_find_try() nor lu_object_find_at() above it will retry the lookup on an EEXIST

          utopiabound Nathaniel Clark added a comment - It appears from my reading that lustre/obdclass/lu_object.c::lu_object_find_try() does the following: lock hash Check to see if fid has an object in has unlock hash if object not in hash, create an object and initialize it lock hash Check to see if object is in hash, if not add it, if it is destroy new object unlock hash return object that was in cache It looks like multiple attempts can be made to create the same FID, thus resulting in an error being returned because the first one succeeds, and later ones can't initialize correctly. If you traced with +info, I'm pretty sure you'd see either mdt_object_new() or mdt_object_find() directly before this error message. Possible issue: lu_object_find_try() nor lu_object_find_at() above it will retry the lookup on an EEXIST

          People

            utopiabound Nathaniel Clark
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: