Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6719

(osd_object.c:410:osd_object_init()) lsd-MDT0000: lookup [0x200042d88:0x1885d:0x0]/0x45f89372 failed: rc = -17

Details

    • 3
    • 9223372036854775807

    Description

      On at least one of our MDTs, we have gotten the following error many times:

      2015-05-27 14:57:28 LustreError: 14118:0:(osd_object.c:410:osd_object_init()) lsd-MDT0000: lookup [0x200042d88:0x1885d:0x0]/0x45f89372 failed: rc = -17

      -17 is EEXIST.

      There are no other Lustre messages in the console logs anywhere near these messages.

      We are using ZFS OSDs.

      Lustre version 2.5.3-5chaos. (See github.com/chaos/lustre)

      If this is a bug, then please work on a fix. If this is not a bug, then please silence the scary message.

      Attachments

        Activity

          [LU-6719] (osd_object.c:410:osd_object_init()) lsd-MDT0000: lookup [0x200042d88:0x1885d:0x0]/0x45f89372 failed: rc = -17

          With +info added to debugging, the logs cycle rather quickly. Hopefully there is enough info to find where the code is having problems.

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - With +info added to debugging, the logs cycle rather quickly. Hopefully there is enough info to find where the code is having problems.

          Quick triage of debug_log attached, I find one instance of the message. I've pulled out all the references to that fid from the log:

          Sun Jan  3 20:38:12 2016 [23949] (mdt_handler.c:2488:mdt_object_new()) Allocate object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23949] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23949] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 1
          Sun Jan  3 20:38:12 2016 [23837] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2
          Sun Jan  3 20:38:12 2016 [24058] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1571 pid: 23949 timeout: 0 lvb_type: 0 used 402457368
          Sun Jan  3 20:38:12 2016 [23949] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23949] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2
          Sun Jan  3 20:38:12 2016 [23949] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1571 pid: 23949 timeout: 0 lvb_type: 0 used 402457368
          Sun Jan  3 20:38:12 2016 [23949] (mdt_lib.c:545:mdt_dump_lmm()) objid 0x1c752:444126, magic 0x0BD10BD0, pattern 0X1
          Sun Jan  3 20:38:12 2016 [23911] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23837] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23837] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2
          Sun Jan  3 20:38:12 2016 [24058] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1570 pid: 23949 timeout: 0 lvb_type: 0 used 402457368
          Sun Jan  3 20:38:12 2016 [24058] (mdt_lib.c:545:mdt_dump_lmm()) objid 0x1c752:444126, magic 0x0BD10BD0, pattern 0X1
          Sun Jan  3 20:38:12 2016 [23813] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [21992] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2
          Sun Jan  3 20:38:12 2016 [24058] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          Sun Jan  3 20:38:12 2016 [23837] (osd_object.c:410:osd_object_init()) lse-MDT0000: lookup [0x20006c6de:0x1c752:0x0]/0x4649efde failed: rc = -17
          Sun Jan  3 20:38:12 2016 [23837] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]
          
          utopiabound Nathaniel Clark added a comment - Quick triage of debug_log attached, I find one instance of the message. I've pulled out all the references to that fid from the log: Sun Jan 3 20:38:12 2016 [23949] (mdt_handler.c:2488:mdt_object_new()) Allocate object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23949] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23949] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 1 Sun Jan 3 20:38:12 2016 [23837] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2 Sun Jan 3 20:38:12 2016 [24058] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1571 pid: 23949 timeout: 0 lvb_type: 0 used 402457368 Sun Jan 3 20:38:12 2016 [23949] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23949] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2 Sun Jan 3 20:38:12 2016 [23949] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1571 pid: 23949 timeout: 0 lvb_type: 0 used 402457368 Sun Jan 3 20:38:12 2016 [23949] (mdt_lib.c:545:mdt_dump_lmm()) objid 0x1c752:444126, magic 0x0BD10BD0, pattern 0X1 Sun Jan 3 20:38:12 2016 [23911] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23837] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23837] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2 Sun Jan 3 20:38:12 2016 [24058] (ldlm_resource.c:1521:ldlm_resource_dump()) ### ### ns: mdt-lse-MDT0000_UUID lock: ffff880856255d00/0xee63e92424c3a04e lrc: 2/0,0 mode: CR/CR res: [0x20006c6de:0x1c752:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x40200000000000 nid: 192.168.115.26@o2ib10 remote: 0x1e9fdd93a70839d9 expref: 1570 pid: 23949 timeout: 0 lvb_type: 0 used 402457368 Sun Jan 3 20:38:12 2016 [24058] (mdt_lib.c:545:mdt_dump_lmm()) objid 0x1c752:444126, magic 0x0BD10BD0, pattern 0X1 Sun Jan 3 20:38:12 2016 [23813] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:2505:mdt_object_find()) Find object for [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:5322:mdt_object_init()) object init, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [21992] (ldlm_resource.c:1515:ldlm_resource_dump()) --- Resource: [0x20006c6de:0x1c752:0x0].0 (ffff880f801a1980) refcount = 2 Sun Jan 3 20:38:12 2016 [24058] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0] Sun Jan 3 20:38:12 2016 [23837] (osd_object.c:410:osd_object_init()) lse-MDT0000: lookup [0x20006c6de:0x1c752:0x0]/0x4649efde failed: rc = -17 Sun Jan 3 20:38:12 2016 [23837] (mdt_handler.c:5342:mdt_object_free()) object free, fid = [0x20006c6de:0x1c752:0x0]

          I have attached a debug log with +info added to debugging.

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - I have attached a debug log with +info added to debugging.

          Yes. That would be helpful to figure out how best to resolve this issue.

          utopiabound Nathaniel Clark added a comment - Yes. That would be helpful to figure out how best to resolve this issue.

          This is happening fairly regularly. Would you like me to enable +info for a day and capture some dumps?

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - This is happening fairly regularly. Would you like me to enable +info for a day and capture some dumps?

          It appears from my reading that lustre/obdclass/lu_object.c::lu_object_find_try() does the following:

          1. lock hash
            • Check to see if fid has an object in has
          2. unlock hash
          3. if object not in hash, create an object and initialize it
          4. lock hash
            • Check to see if object is in hash, if not add it, if it is destroy new object
          5. unlock hash
          6. return object that was in cache

          It looks like multiple attempts can be made to create the same FID, thus resulting in an error being returned because the first one succeeds, and later ones can't initialize correctly.

          If you traced with +info, I'm pretty sure you'd see either mdt_object_new() or mdt_object_find() directly before this error message.

          Possible issue: lu_object_find_try() nor lu_object_find_at() above it will retry the lookup on an EEXIST

          utopiabound Nathaniel Clark added a comment - It appears from my reading that lustre/obdclass/lu_object.c::lu_object_find_try() does the following: lock hash Check to see if fid has an object in has unlock hash if object not in hash, create an object and initialize it lock hash Check to see if object is in hash, if not add it, if it is destroy new object unlock hash return object that was in cache It looks like multiple attempts can be made to create the same FID, thus resulting in an error being returned because the first one succeeds, and later ones can't initialize correctly. If you traced with +info, I'm pretty sure you'd see either mdt_object_new() or mdt_object_find() directly before this error message. Possible issue: lu_object_find_try() nor lu_object_find_at() above it will retry the lookup on an EEXIST

          -1 on the MDS is probably not a good idea for very long.

          Nathaniel, what have you learned from the code so far? Can you explain why the server calls osd_object_init() on an existing object? Under what conditions does it fail to check if the object exists before calling the initialization function?

          morrone Christopher Morrone (Inactive) added a comment - -1 on the MDS is probably not a good idea for very long. Nathaniel, what have you learned from the code so far? Can you explain why the server calls osd_object_init() on an existing object? Under what conditions does it fail to check if the object exists before calling the initialization function?

          No, just what I've given you. It seems to happen pretty often. I could increase the debugging level on the servers for a few days and wait for another one to trigger a dump. Do you want any specific levels added, rather then just a "-1"? We are running with:

          ioctl neterror warning error emerg config console

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - No, just what I've given you. It seems to happen pretty often. I could increase the debugging level on the servers for a few days and wait for another one to trigger a dump. Do you want any specific levels added, rather then just a "-1"? We are running with: ioctl neterror warning error emerg config console
          utopiabound Nathaniel Clark added a comment - - edited

          There is amazingly little in the debug log of interest. I don't suppose the messages file or dmesg has anything more interesting when this happens?

          utopiabound Nathaniel Clark added a comment - - edited There is amazingly little in the debug log of interest. I don't suppose the messages file or dmesg has anything more interesting when this happens?

          I have a few more log dumps if you need/want them. Were you able to get anything interesting from the logs I uploaded?

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - I have a few more log dumps if you need/want them. Were you able to get anything interesting from the logs I uploaded?

          Here is the MDS. If you want the OSS logs, I can tar them up and attach as well. I cannot capture the client logs, as I don't have a suitable means to trigger a dump when it happens on the client side. Hopefully this helps.

          marc@llnl.gov D. Marc Stearman (Inactive) added a comment - Here is the MDS. If you want the OSS logs, I can tar them up and attach as well. I cannot capture the client logs, as I don't have a suitable means to trigger a dump when it happens on the client side. Hopefully this helps.

          People

            utopiabound Nathaniel Clark
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: