Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3

    Description

      As mentioned in LU-6683, I ran into a situation where lctl lfsck_stop just hangs indefinitely.

      I have managed to reproduce this twice:

      Start LFSCK (using lctl lfsck_start -M play01-MDT0000 -t layout); this crashes the OSS servers. Reboot the servers and restart the OSTs. Attempting to stop the LFSCK in this state just hangs: I waited more than an hour and it was still hanging. Unmounting the MDT in this situation also appears to hang (after 30 minutes I power cycled the MDS).
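
      For reference, the failing sequence reduces to this sketch (the target name play01-MDT0000 is from this report; the OSS reboot and OST restart steps depend on the local setup):

        # Start a layout LFSCK on the MDT; in this report this crashed the OSS servers.
        lctl lfsck_start -M play01-MDT0000 -t layout

        # After rebooting the OSSes and remounting the OSTs, try to stop the scan.
        # This is the command that hangs indefinitely.
        lctl lfsck_stop -M play01-MDT0000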

      Attachments

        1. 15.lctl.tgz
          631 kB
        2. lustre.dmesg.bz2
          37 kB
        3. lustre.log.bz2
          1.38 MB

          Activity

            [LU-6684] lctl lfsck_stop hangs

            simmonsja James A Simmons added a comment - This is also delaying the landing of several patches.
            bogl Bob Glossman (Inactive) added a comment - another on master: https://testing.hpdd.intel.com/test_sets/85d45ece-c0bc-11e5-9620-5254006e85c2

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/18082
            Subject: LU-6684 lfsck: set the lfsck notify as interruptable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 68c078328be253735658fcf43fa98afff936ec6c

            gerrit Gerrit Updater added a comment -

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/18059
            Subject: Revert "LU-6684 lfsck: stop lfsck even if some servers offline"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2505fd07b29ebfddcd29f16954908f6fe4670276

            jamesanunez James Nunez (Inactive) added a comment - edited

            More failures on master, all of which include the patch previously landed for this ticket:
            2016-01-15 15:29:21 - https://testing.hpdd.intel.com/test_sets/48126330-bbce-11e5-8506-5254006e85c2
            2016-01-15 20:20:20 - https://testing.hpdd.intel.com/test_sets/7ec04c5e-bbfa-11e5-acbb-5254006e85c2
            2016-01-16 00:40:11 - https://testing.hpdd.intel.com/test_sets/4988556c-bc05-11e5-8f65-5254006e85c2
            2016-01-18 22:08:02 - https://testing.hpdd.intel.com/test_sets/3a54dfd8-be63-11e5-92e8-5254006e85c2
            2016-01-18 22:59:29 - https://testing.hpdd.intel.com/test_sets/642d055a-be69-11e5-92e8-5254006e85c2
            2016-01-18 23:21:01 - https://testing.hpdd.intel.com/test_sets/c75e157e-be6e-11e5-b113-5254006e85c2
            2016-01-19 07:37:19 - https://testing.hpdd.intel.com/test_sets/325db7ae-beb4-11e5-8c8a-5254006e85c2
            2016-01-19 12:10:06 - https://testing.hpdd.intel.com/test_sets/144d9d36-bed9-11e5-ad7e-5254006e85c2
            2016-01-19 22:11:45 - https://testing.hpdd.intel.com/test_sets/a2f0fede-bf2e-11e5-a659-5254006e85c2
            2016-01-19 22:26:33 - https://testing.hpdd.intel.com/test_sets/dc0ed974-bf2f-11e5-8f04-5254006e85c2
            2016-01-19 23:59:17 - https://testing.hpdd.intel.com/test_sets/01d6b960-bf3f-11e5-8f04-5254006e85c2
            2016-01-21 11:12:25 - https://testing.hpdd.intel.com/test_sets/cd343b46-c061-11e5-a8e5-5254006e85c2
            2016-01-21 13:03:12 - https://testing.hpdd.intel.com/test_sets/c0b04f0e-c070-11e5-956d-5254006e85c2
            2016-01-21 14:41:59 - https://testing.hpdd.intel.com/test_sets/e4cffce0-c07f-11e5-a8e5-5254006e85c2
            2016-01-21 21:40:44 - https://testing.hpdd.intel.com/test_sets/85d45ece-c0bc-11e5-9620-5254006e85c2
            2016-01-22 03:45:40 - https://testing.hpdd.intel.com/test_sets/abf22bb0-c0ec-11e5-8d88-5254006e85c2

            adilger Andreas Dilger added a comment - And I verified that these two failures are on commits that include the fix that was recently landed here.
            yujian Jian Yu added a comment -

            sanity-lfsck test 32 still hung on master branch:

            stop LFSCK
            CMD: onyx-57vm7 /usr/sbin/lctl lfsck_stop -M lustre-MDT0000
            

            https://testing.hpdd.intel.com/test_sets/e45d9b64-bbac-11e5-acbb-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/34b63ba8-bb61-11e5-acbb-5254006e85c2

            yong.fan nasf (Inactive) added a comment - The patch has been landed to master.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17032/
            Subject: LU-6684 lfsck: stop lfsck even if some servers offline
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: afcf3026c6ad203b9882eaeac76326357f26fe71

            yong.fan nasf (Inactive) added a comment -

            There are several cases:

            1) The LFSCK/OI scrub is running on the MDS that is to be remounted.

            1.1) If the MDT is umounted while the LFSCK/OI scrub is running in the background, the LFSCK/OI scrub status will be marked as paused. When the MDT is mounted again, after recovery completes, the paused LFSCK/OI scrub will be resumed from the latest checkpoint and its status will be restored to what it was before the pause.

            1.2) If the MDT crashes while the LFSCK/OI scrub is running in the background, there is no time for the LFSCK/OI scrub to change its status. When the MDT is mounted again, its status will be marked as crashed, and after recovery completes, the crashed LFSCK/OI scrub will be resumed from the latest checkpoint and its status will be restored to what it was before the crash.

            2) Assume the LFSCK/OI scrub is running on one MDT (MDT_a) and another related server (MDT_b/OST_c) is to be remounted.

            2.1) If the LFSCK on MDT_a needs to talk with the umounted/crashed MDT_b/OST_c for verification, the LFSCK on MDT_a will get a connection failure, from which it knows that the peer server has left the LFSCK. The LFSCK on MDT_a will then go ahead and verify the part of the system it can reach; it neither waits forever nor fails out, unless "-e abort" was specified. So the LFSCK on MDT_a can finish eventually, and its status will be 'partial' if no other failure happened.

            2.2) If we want to stop the LFSCK on MDT_a, then MDT_a needs to notify the related peer MDT_b/OST_c to stop its LFSCK as well. If it finds that the peer server MDT_b/OST_c is already offline, the LFSCK on MDT_a will go ahead and handle the stop process itself.

            In this ticket we hit trouble in case 2.2): because the LFSCK did not detect that OST_c was offline, lfsck_stop was blocked by the reconnection attempts to OST_c (the LFSCK state involved can be inspected as sketched below).
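
            (A minimal sketch of how to inspect that state from userspace, assuming a layout LFSCK on a target named lustre-MDT0000:)

                # Dump the layout LFSCK state on the MDT (status, latest checkpoint, ...).
                lctl get_param -n mdd.lustre-MDT0000.lfsck_layout

                # Per case 2.1) above, a run that skipped an offline OST should end with:
                #   status: partial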

            maximus Ashish Purkar (Inactive) added a comment -

            Andreas and Fan,
            What happens in dry-run mode of the OI scrub if MDS recovery happens, or if an MDT/OST goes down and is reconnecting? Attaching log file 15.lctl.tgz for reference.

            • Here the MDS is going into recovery while the OI scrub operation is underway.
            • The lfsck ns assistant stage 2 is restarted and the post operation is done.
            • The test expects the dry run to complete in 6 seconds, but due to the failover and the MDS undergoing recovery, it takes longer (> 6 sec); see the sketch below.
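
            (A sketch of the dry-run invocation in question, assuming stock lctl options and a target named lustre-MDT0000:)

                # Start a namespace LFSCK in dry-run mode: scan and report, but repair nothing.
                lctl lfsck_start -M lustre-MDT0000 -t namespace --dryrun on

                # Poll its progress; with the MDS still in recovery this can take well
                # over the 6 seconds the test allows.
                lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace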

            People

              Assignee:
              yong.fan nasf (Inactive)
              Reporter:
              ferner Frederik Ferner (Inactive)
              Votes:
              0
              Watchers:
              13

              Dates

                Created:
                Updated:
                Resolved: