health_check file not updating properly

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.5.4
    • Lustre 2.5.3
    • 3
    • 16321

    Description

      Over the weekend we had an OST abort and get marked read-only:

      [  726.076561] LDISKFS-fs error (device dm-25): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 111692corrupted: 32768 blocks free in bitmap, 1024 - in gd
      [  726.116663] 
      [  726.125085] Aborting journal on device dm-25-8.
      [  726.133359] LustreError: 17032:0:(ofd_obd.c:1095:ofd_destroy()) f1-OST00ff: error destroying object [0x100000000:0x16546e7:0x0]: 0
      [  726.176268] LDISKFS-fs (dm-25): 
      [  726.179457] LDISKFS-fs error (device dm-25): ldiskfs_journal_start_sb: Detected aborted journal
      [  726.179459] LDISKFS-fs (dm-25): Remounting filesystem read-only

      We rely on the /proc/fs/lustre/health_check file to notify us of these situations. Unfortunately, we never got a notification. I found a bug in the b2_5 implementation of the osd-ldiskfs osd_statfs() function. Code inspection leads me to believe it does not affect master, but I haven't tried it there. I will upload a patch momentarily.
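A monitoring check of the kind described above could look roughly like the following. This is only a sketch, not the reporter's actual tooling: the helper names `health_text_is_healthy` and `check_health_file` are hypothetical, and it assumes that a healthy server reports text beginning with `healthy` and that anything else (for example a `NOT HEALTHY` line) should raise an alert.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if the text read from health_check
 * reports a healthy node, 0 otherwise.
 * Assumption: healthy output starts with the word "healthy". */
int health_text_is_healthy(const char *text)
{
	return text != NULL && strncmp(text, "healthy", 7) == 0;
}

/* Hypothetical helper: read the file at `path` and classify it.
 * Returns 1 if healthy, 0 if unhealthy, -1 if the file cannot be read. */
int check_health_file(const char *path)
{
	char buf[256];
	size_t n;
	FILE *fp = fopen(path, "r");

	if (fp == NULL)
		return -1;

	/* Only the beginning of the file is needed for the comparison. */
	n = fread(buf, 1, sizeof(buf) - 1, fp);
	fclose(fp);
	buf[n] = '\0';

	return health_text_is_healthy(buf);
}
```

A caller would pass `/proc/fs/lustre/health_check` to `check_health_file()` and alert on any result other than 1. A check like this can only fire if the OSD actually reports the read-only state, which is the part that was broken here.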

          Activity

            [LU-5822] health_check file not updating properly
            pjones Peter Jones added a comment -

            Landed for 2.5.4

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12463/
            Subject: LU-5822 osd-ldiskfs: Correctly return OS_STATE_READONLY
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 1ff49a78e443f935670daf0c84b5b989c02dca04

            adilger Andreas Dilger added a comment -

            PS: I verified that this patch is not needed for master.

            adilger Andreas Dilger added a comment - - edited

            There was some work done in LU-137 to allow ioctl() pass-through to the underlying filesystem, but this was complicated in the 2.4+ releases by the OSD API and the addition of ZFS. While I would be happy to see that work move forward, it is probably overkill for this.

            It would be easier to use the existing fault injection method for Lustre and add a new FAIL_LOC for this case, like:

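                    /* read-only for real, or forced by the proposed fail_loc
                     * when fail_val matches this target's index */
                    if (sb->s_flags & MS_RDONLY ||
                        (OBD_FAIL_CHECK(OBD_FAIL_OSD_READONLY) &&
                         osd->od_index == libcfs_fail_val))
                            osd->od_statfs.os_state = OS_STATE_READONLY;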

            or similar (this is just from the top of my head so the syntax might not be quite correct).
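The decision logic in that suggestion can be tried outside the kernel with a small stand-alone sketch. Everything here is a stand-in: the `SKETCH_*` constants, `struct sketch_osd`, and `sketch_update_state()` are hypothetical names for illustration, the constant values are placeholders rather than the real header definitions, and the `fail_armed` and `fail_val` parameters play the roles of `OBD_FAIL_CHECK(...)` and `libcfs_fail_val`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration only; the real definitions
 * live in the kernel and Lustre headers. */
#define SKETCH_MS_RDONLY         0x1u
#define SKETCH_OS_STATE_READONLY 0x2u

/* Hypothetical stand-in for the fields of the OSD device used above. */
struct sketch_osd {
	unsigned long sb_flags; /* stands in for sb->s_flags */
	int           index;    /* stands in for the target index */
	uint32_t      os_state; /* stands in for od_statfs.os_state */
};

/* Mark the device read-only if the superblock really is read-only,
 * or if fault injection is armed and targets this device's index.
 * Otherwise os_state is left unchanged. */
void sketch_update_state(struct sketch_osd *osd, bool fail_armed, int fail_val)
{
	if ((osd->sb_flags & SKETCH_MS_RDONLY) ||
	    (fail_armed && osd->index == fail_val))
		osd->os_state = SKETCH_OS_STATE_READONLY;
}
```

Matching on the index is what would let a test force a single target into the read-only state while the other targets on the same server keep reporting normally.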

            jamesanunez James Nunez (Inactive) added a comment -

            Matt,

            Thanks for the patch. I'll look into a test to check that health_check is being updated properly.

            James

            ezell Matt Ezell added a comment - http://review.whamcloud.com/12463
            ezell Matt Ezell added a comment -

            I'd also like to add a test to make sure this doesn't break in the future, but I'm not sure of the best way to make the device go read-only. I started with the read-only infrastructure that the test system uses, but that appears to set the underlying device to read-only, not the ldiskfs file system.

            First, I tried to implement a server_remount_fs() function so you could do 'mount -o remount,ro', but that receives a Lustre superblock, and you would then need to call down into the osd-api for it to do anything to the underlying file system.
            I then looked at adding an IOCTL that lctl could call, but that also appears to require support from the osd-api.
            I have a prototype patch that adds a new osd-api method, dt_abort_device(), that could be called from either remount or lctl, but I'm not sure if it makes sense to add a new method just for testing this. Thoughts?

            Is there an easier way to cause the underlying filesystem to abort or go read-only?

            People

              jamesanunez James Nunez (Inactive)
              ezell Matt Ezell