[LU-61] MDT can't connect to OST after hardware event: oscc recovery failed: -116

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 1.8.6

    Description

      Hi WC,

      There was a hardware failure at Purdue today that took out a 6620 controller. After we fixed the issue, the MDT fails to connect to one OST and has intermittent connections with another. fid2dentry is getting passed an obd_id of 0, which causes it to return an ESTALE to the MDT. I looked in bz, but I couldn't find anything similar. Have you seen anything like this, or do you have any ideas on how to get it online?

      We've tried rebooting the MDS and OSS, but after recovery it still has this issue. Would aborting recovery help? How about the CATALOGS trick? Let me know if other logs would help.

      Thanks,
      Kit

      Relevant MDT logs:
      Feb 4 17:50:53 mds-a01 kernel: [13485616.075071] LustreError: 12085:0:(osc_create.c:585:osc_create()) lustreA-OST0001-osc: oscc recovery failed: -116
      Feb 4 17:50:53 mds-a01 kernel: [13485616.075526] LustreError: 12085:0:(lov_obd.c:1131:lov_clear_orphans()) error in orphan recovery on OST idx 1/36: rc = -116
      Feb 4 17:50:53 mds-a01 kernel: [13485616.076025] LustreError: 12085:0:(mds_lov.c:1062:__mds_lov_synchronize()) lustreA-OST0001_UUID failed at mds_lov_clear_orphans: -116
      Feb 4 17:50:53 mds-a01 kernel: [13485616.076482] LustreError: 12085:0:(mds_lov.c:1071:__mds_lov_synchronize()) lustreA-OST0001_UUID sync failed -116, deactivating
      Feb 4 17:51:39 mds-a01 kernel: [13485661.612612] LustreError: 12408:0:(osc_create.c:585:osc_create()) lustreA-OST0001-osc: oscc recovery failed: -116

      lctl dl
      ...
      28 UP osc lustreA-OST000b-osc lustreA-mdtlov_UUID 5
      29 UP osc lustreA-OST0000-osc lustreA-mdtlov_UUID 5
      30 IN osc lustreA-OST0001-osc lustreA-mdtlov_UUID 5
      31 UP osc lustreA-OST0002-osc lustreA-mdtlov_UUID 5
      32 UP osc lustreA-OST0003-osc lustreA-mdtlov_UUID 5
      ...
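
On a system with many OSTs, the inactive OSC (`IN` in the listing above) can be picked out of `lctl dl` output with a small script. This is a minimal sketch, assuming the six-column layout seen in the listing (index, status, type, name, UUID, and a trailing count); `ObdDevice`, `parse_lctl_dl` and `inactive_oscs` are illustrative names, not part of Lustre.

```python
import re
from typing import List, NamedTuple


class ObdDevice(NamedTuple):
    index: int
    status: str    # e.g. "UP" or "IN", as printed by lctl dl
    dev_type: str  # e.g. "osc"
    name: str
    uuid: str
    count: int     # trailing numeric column (assumed to be a reference count)


# Assumed layout: index, status, type, name, uuid, count
_DEVICE_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s*$")


def parse_lctl_dl(text: str) -> List[ObdDevice]:
    """Parse device lines, skipping anything that does not match (e.g. '...')."""
    devices = []
    for line in text.splitlines():
        m = _DEVICE_RE.match(line)
        if m is None:
            continue
        idx, status, dev_type, name, uuid, count = m.groups()
        devices.append(ObdDevice(int(idx), status, dev_type, name, uuid, int(count)))
    return devices


def inactive_oscs(devices: List[ObdDevice]) -> List[str]:
    """Return the names of osc devices whose status is anything other than UP."""
    return [d.name for d in devices if d.dev_type == "osc" and d.status != "UP"]


# Lines copied from the listing above
SAMPLE = """\
28 UP osc lustreA-OST000b-osc lustreA-mdtlov_UUID 5
29 UP osc lustreA-OST0000-osc lustreA-mdtlov_UUID 5
30 IN osc lustreA-OST0001-osc lustreA-mdtlov_UUID 5
31 UP osc lustreA-OST0002-osc lustreA-mdtlov_UUID 5
32 UP osc lustreA-OST0003-osc lustreA-mdtlov_UUID 5
"""

print(inactive_oscs(parse_lctl_dl(SAMPLE)))
```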

      Relevant OST logs:
      Feb 4 17:43:53 oss-a01 kernel: [ 1333.618994] LustreError: 10635:0:(filter.c:1428:filter_fid2dentry()) lustreA-OST0001: object 2283250:0 lookup error: rc -116
      Feb 4 17:43:53 oss-a01 kernel: [ 1333.619430] LustreError: 10635:0:(filter.c:1428:filter_fid2dentry()) Skipped 1 previous similar message
      Feb 4 17:43:55 oss-a01 kernel: [ 1336.503075] LustreError: 9981:0:(filter_lvb.c:90:filter_lvbo_init()) lustreA-OST0001: bad object 2283250/0: rc -116
      Feb 4 17:43:55 oss-a01 kernel: [ 1336.503630] LustreError: 9981:0:(ldlm_resource.c:860:ldlm_resource_add()) lvbo_init failed for resource 2283250: rc -116
      Feb 4 17:43:55 oss-a01 kernel: [ 1336.504092] LustreError: 9981:0:(ldlm_resource.c:860:ldlm_resource_add()) Skipped 37 previous similar messages
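
The `-116` in both sets of logs is a negated Linux errno; 116 is ESTALE, which matches the fid2dentry behaviour described above. A minimal sketch for pulling the return code out of a LustreError line and naming it follows. The errno table is hard-coded with Linux values so the result does not depend on the machine running the script, and `decode_rc` is an illustrative helper, not a Lustre tool.

```python
import re
from typing import Optional, Tuple

# Linux errno values (hard-coded; other platforms number ESTALE differently)
LINUX_ERRNO = {2: "ENOENT", 5: "EIO", 116: "ESTALE"}

# A negative integer preceded by whitespace, e.g. "rc -116", "rc = -116",
# "failed: -116" or "sync failed -116, deactivating"
_RC_RE = re.compile(r"(?<=\s)-(\d+)\b")


def decode_rc(line: str) -> Optional[Tuple[int, str]]:
    """Return (errno, symbolic name) for the first return code in a log line."""
    m = _RC_RE.search(line)
    if m is None:
        return None  # e.g. "Skipped 1 previous similar message"
    code = int(m.group(1))
    return code, LINUX_ERRNO.get(code, "UNKNOWN")


# Lines copied from the logs above
OST_LINE = ("Feb 4 17:43:53 oss-a01 kernel: [ 1333.618994] LustreError: "
            "10635:0:(filter.c:1428:filter_fid2dentry()) lustreA-OST0001: "
            "object 2283250:0 lookup error: rc -116")
MDT_LINE = ("Feb 4 17:50:53 mds-a01 kernel: [13485616.076482] LustreError: "
            "12085:0:(mds_lov.c:1071:__mds_lov_synchronize()) "
            "lustreA-OST0001_UUID sync failed -116, deactivating")

for line in (OST_LINE, MDT_LINE):
    print(decode_rc(line))
```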

        Activity

          pjones Peter Jones added a comment - As per Kit ok to close

          cliffw Cliff White (Inactive) added a comment - Customer fsck'd OST, problem solved

          cliffw Cliff White (Inactive) added a comment - Great! Glad it got sorted out so easily. I will close this.

          kitwestneat Kit Westneat (Inactive) added a comment - That was it, oops! Sorry for not checking that earlier. Hopefully others can learn from my mistakes... Thanks for your help!

          cliffw Cliff White (Inactive) added a comment - Engineering confirms: you should run fsck. Object 2283250 may be damaged.

          kitwestneat Kit Westneat (Inactive) added a comment - The OSSes see I/O errors around Feb 3 13:37:10. After the first reboot at Feb 3 14:46:41, Lustre isn't started again until Feb 4 17:27:43. That's when you can first see the -ESTALE errors. I'll ask the customer whether they have run an fsck; I thought they had, but maybe not.
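
With several days of logs, the timeline in this comment can be checked mechanically: how long Lustre stayed down after the reboot, and when `-116` first appears. The sketch below makes some assumptions. Syslog timestamps carry no year, so a placeholder year is passed in (any year gives the same difference for these dates), and `parse_stamp` and `first_occurrence` are illustrative helpers.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Leading syslog timestamp, e.g. "Feb 4 17:43:53"
_STAMP_RE = re.compile(r"^\s*([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})")


def parse_stamp(line: str, year: int) -> Optional[datetime]:
    """Parse the leading syslog timestamp; the year must be supplied."""
    m = _STAMP_RE.match(line)
    if m is None or m.group(1) not in MONTHS:
        return None
    mon, day, hh, mm, ss = m.groups()
    return datetime(year, MONTHS[mon], int(day), int(hh), int(mm), int(ss))


def first_occurrence(lines: Iterable[str], needle: str, year: int) -> Optional[datetime]:
    """Earliest timestamp among lines containing needle (input need not be sorted)."""
    stamps = [parse_stamp(line, year) for line in lines if needle in line]
    stamps = [s for s in stamps if s is not None]
    return min(stamps) if stamps else None


PLACEHOLDER_YEAR = 2000  # assumption: the log lines do not state a year

# Times quoted in the comment above
reboot = parse_stamp("Feb 3 14:46:41", PLACEHOLDER_YEAR)
restart = parse_stamp("Feb 4 17:27:43", PLACEHOLDER_YEAR)
downtime = restart - reboot

# Two lines copied from the description, deliberately out of order
LOG_LINES = [
    "Feb 4 17:50:53 mds-a01 kernel: [13485616.075526] LustreError: 12085:0:(lov_obd.c:1131:lov_clear_orphans()) error in orphan recovery on OST idx 1/36: rc = -116",
    "Feb 4 17:43:53 oss-a01 kernel: [ 1333.618994] LustreError: 10635:0:(filter.c:1428:filter_fid2dentry()) lustreA-OST0001: object 2283250:0 lookup error: rc -116",
]

print("Lustre down for:", downtime)
print("First -116 seen:", first_occurrence(LOG_LINES, "-116", PLACEHOLDER_YEAR))
```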

          cliffw Cliff White (Inactive) added a comment - I am consulting with engineering. Have you run fsck on the OSTs? These errors may indicate a problem there.

          cliffw Cliff White (Inactive) added a comment - There are 4 days of logs here. Can you tell me exactly when the issue started? When did you have the hardware failure?

          kitwestneat Kit Westneat (Inactive) added a comment - Logs from the MDS and OSS from before both were rebooted.

          cliffw Cliff White (Inactive) added a comment - Yes, I would try aborting recovery. It would be best to have the full log for a mount attempt.

          People

            cliffw Cliff White (Inactive)
            kitwestneat Kit Westneat (Inactive)