Failure to activate deactivated OST

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • Red Hat Enterprise Linux Server release 5.5 (Tikanga), Kernel 2.6.18-108redsky_chaos, Lustre:
      lustre: 1.8.5
      kernel: patchless_client
      build: 1.8.5-6chaos
    • 3
    • 8366

    Description

      We are unable to activate one Lustre 1.8.5 target, which we need in order to run 'lfs quota -u ...'. Five OSTs were deactivated for performance reasons. Four of the five activate correctly, but one (OST0008) consistently fails to activate, even after running fsck on both the OST and the MDT. The sequence and errors are shown below. We are aware that both the kernel and Lustre versions are very old; we are in the process of moving to version 1.8.9, and then 2.1.x. Any assistance or suggestions are appreciated.

      Deactivated targets:
      7 IN osc scratch1-OST0002-osc scratch1-mdtlov_UUID 5
      9 IN osc scratch1-OST0004-osc scratch1-mdtlov_UUID 5
      13 IN osc scratch1-OST0008-osc scratch1-mdtlov_UUID 5
      15 IN osc scratch1-OST000a-osc scratch1-mdtlov_UUID 5
      28 IN osc scratch1-OST0017-osc scratch1-mdtlov_UUID 5

      Attempt to activate targets on the MDS results in OST0008 remaining inactive:
      13 IN osc scratch1-OST0008-osc scratch1-mdtlov_UUID 5
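
The listings above look like `lctl dl` device-list output, where the second column is the device state. As a minimal sketch, assuming that column layout (index, state, type, name, UUID, refcount) and that `IN` marks an inactive device, the inactive OSCs can be picked out as follows. The helper name `inactive_oscs` is hypothetical and not part of any Lustre tool.

```python
def inactive_oscs(listing):
    """Return (device_index, device_name) for every osc line whose state is IN."""
    found = []
    for line in listing.splitlines():
        fields = line.split()
        # Assumed layout: index state type name uuid refcount
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        index, state, dev_type, name = fields[:4]
        if dev_type == "osc" and state == "IN":
            found.append((int(index), name))
    return found


# The five deactivated targets quoted in this report.
SAMPLE = """\
7 IN osc scratch1-OST0002-osc scratch1-mdtlov_UUID 5
9 IN osc scratch1-OST0004-osc scratch1-mdtlov_UUID 5
13 IN osc scratch1-OST0008-osc scratch1-mdtlov_UUID 5
15 IN osc scratch1-OST000a-osc scratch1-mdtlov_UUID 5
28 IN osc scratch1-OST0017-osc scratch1-mdtlov_UUID 5
"""

for index, name in inactive_oscs(SAMPLE):
    print(index, name)
```

Running the same filter on the listing taken after the activation attempt would leave only device 13, which matches what is reported here.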

      Errors on OSS:
      2013-05-22 12:11:45 xxxxxx [47215.258366] Lustre: scratch1-OST0008: received MDS connection from xx.x.xx.x@o2ib <kern.warning>
      2013-05-22 12:11:45 xxxxx [47215.259576] LustreError: 8201:0:(filter.c:3173:filter_handle_precreate()) scratch1-OST0008: ignoring bogus orphan destroy request: obdid 15039753 last_id 15086581 <kern.err>
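
The OSS message carries two object IDs, and the distance between them is the useful number. The sketch below extracts both from the quoted line and computes the gap. It assumes, without confirmation from this report, that `obdid` is the ID sent by the MDS and `last_id` is the OST's own LAST_ID; the function name `orphan_gap` is hypothetical.

```python
import re

# Message text copied from the OSS log line in this report (timestamp and host omitted).
OSS_LINE = (
    "LustreError: 8201:0:(filter.c:3173:filter_handle_precreate()) "
    "scratch1-OST0008: ignoring bogus orphan destroy request: "
    "obdid 15039753 last_id 15086581"
)

PATTERN = re.compile(
    r"(?P<target>\S+): ignoring bogus orphan destroy request: "
    r"obdid (?P<obdid>\d+) last_id (?P<last_id>\d+)"
)


def orphan_gap(line):
    """Return (target, obdid, last_id, last_id - obdid), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    obdid = int(match.group("obdid"))
    last_id = int(match.group("last_id"))
    return match.group("target"), obdid, last_id, last_id - obdid


print(orphan_gap(OSS_LINE))
```

For the line above the two IDs differ by 46828 objects.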

      Errors on MDS:
      2013-05-22 12:11:45 xxxxx [1792441.005047] LustreError: 10973:0:(osc_create.c:589:osc_create()) scratch1-OST0008-osc: oscc recovery failed: -22 <kern.err>
      2013-05-22 12:11:45 xxxxx [1792441.005056] Lustre: scratch1-OST0008_UUID: Failed to clear orphan objects on OST: -22 <kern.warning>
      2013-05-22 12:11:45 xxxxx [1792441.005058] Lustre: scratch1-OST0008_UUID: Sync failed deactivating: rc -22 <kern.warning>
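
All three MDS messages report the same return code, -22. Kernel return codes of this kind are negated errno values, so a quick lookup with Python's standard `errno` module shows which error is meant. This is only a convenience sketch; the helper name `describe_rc` is hypothetical.

```python
import errno
import os


def describe_rc(rc):
    """Map a (possibly negative) kernel-style return code to its errno name and message."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_rc(-22)
print(name, "-", message)
```

Here -22 resolves to EINVAL, meaning the request was rejected as invalid rather than failing on I/O or a timeout.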

        Activity

          [LU-3380] Failure to activate deactivated OST

          niu Niu Yawei (Inactive) added a comment - Duplicate of LU-2018.

          niu Niu Yawei (Inactive) added a comment - Right, we should use the last object ID as the standard in this case. Thanks, Chris.

          beggio Chirstopher Beggio added a comment - The issue was similar to LU-2018. The difference was that the MDT lov_objid, the last on-disk object ID, and LAST_ID were all different. The MDT lov_objid for that OST and the last on-disk object ID for that target were very close, with the MDT lov_objid slightly ahead of the last on-disk object ID. The LAST_ID on the OST was way off, so I reset it to the last on-disk object ID, and that corrected the issue. If this is the wrong solution to this problem, please let me know. It seems to me that the last on-disk object ID should be the gold standard, so I used that. Unless I have made an incorrect assumption, this may be closed.
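
The reasoning in this comment (two of the three values agree closely, the third is far away) can be sketched as a simple outlier check. This only illustrates the comparison and is not the procedure from LU-2018. The on-disk value below is hypothetical, since the report does not give it, and treating the logged `obdid` as the MDT lov_objid is likewise an assumption. The helper name `find_outlier` is also hypothetical.

```python
def find_outlier(values):
    """Given {name: object_id}, return the name whose value is farthest from the median."""
    ordered = sorted(values.values())
    median = ordered[len(ordered) // 2]
    return max(values, key=lambda name: abs(values[name] - median))


ids = {
    "mdt_lov_objid": 15039753,       # assumption: the obdid from the OSS log line
    "ost_last_id": 15086581,         # last_id from the OSS log line
    "ondisk_last_object": 15039700,  # HYPOTHETICAL: not given in the report
}

print(find_outlier(ids))
```

With these numbers the OST LAST_ID is the value that disagrees with the other two, which matches the choice described above to reset LAST_ID to the last on-disk object ID.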

          niu Niu Yawei (Inactive) added a comment - Yes, this looks like the same problem as LU-2018. You need to determine whether the lov_objid on the MDS or the last_id on the OST is wrong, and correct it by following the instructions described in LU-2018. Thanks.

          beggio Chirstopher Beggio added a comment - It appears this problem and solution are documented in LU-2018. I will review and implement that fix and report the result.

          beggio Chirstopher Beggio added a comment - I should add that the target has also been remounted with recovery aborted (-o abort_recov), to rule that out as a possible fix.

          pjones Peter Jones added a comment - Niu, could you please comment on this one? Thanks, Peter

          People

            niu Niu Yawei (Inactive)
            beggio Chirstopher Beggio