
LFSCK fails to start, hangs systems.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.5
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3
    • Environment: Soak performance cluster - Lustre version=2.10.2_4_gb151f34
    • Severity: 3

    Description

      We perform an OSS failover, then trigger LFSCK:

      lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A

      The lfsck_start command hangs and LFSCK is never started. The clients wedge in state 'comp', and then the entire system wedges. I have dumped the Lustre logs from all MDS nodes (attached). I have also crash-dumped all the MDT nodes; the dumps are available on Spirit. lfsck_layout is unkillable.

      Attachments

        1. soak-9.lustre.log.gz
          2.33 MB
        2. soak-8.lustre.log.gz
          2.14 MB
        3. soak-11.lustre.log.gz
          2.22 MB
        4. soak-10.lustre.log.gz
          2.57 MB

          Activity

            [LU-10419] LFSCK fails to start, hangs systems.

            gerrit Gerrit Updater added a comment -

            John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/30831/
            Subject: LU-10419 lfsck: no delay for notify RPC
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 9fef9ad10b26a4338c22105e66308ead5408173e

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31627/
            Subject: LU-10419 lfsck: signal master engine when stop
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1ece380412efd5dba2a8c345830f0456a4922301

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31627
            Subject: LU-10419 lfsck: single master engine when stop
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e3e7d1a41711cfb0a12b941a88bf8c0bf3b4cc89

            yong.fan nasf (Inactive) added a comment -

            With the current patch, lfsck does not stop. We are also currently seeing mount timeouts. I have crash-dumped soak-8 while lfsck was hanging; the logs are available on Spirit.
            /scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2018-03-06-19:16:47

            The LFSCK master engine was blocked while sending an OUT_ATTR_GET RPC to MDT2, which may be offline or in recovery. We expect lfsck_stop() to wake up the blocked LFSCK engines and make them exit, but we only signal (SIGINT) the LFSCK assistant engines and forget to do the same for the LFSCK master engine.

            So the trouble is not related to the patch https://review.whamcloud.com/31475/.

            I will make another patch to notify the master engine when lfsck_stop() is called.
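
            The stop-path problem described in this comment can be illustrated with a small user-space model. This is only a sketch and not Lustre code: the names Engine, stop and run_scenario are hypothetical, Python threads stand in for the LFSCK kernel threads, and a threading.Event stands in for the signal sent from lfsck_stop(). It assumes each engine is blocked on an RPC whose reply never arrives, so the only way out is an explicit wakeup.

```python
import threading


class Engine:
    """Hypothetical stand-in for one LFSCK engine thread (not Lustre code)."""

    def __init__(self, name):
        self.name = name
        self.wakeup = threading.Event()  # set when the stop path signals us
        self.exited = threading.Event()  # set when the engine thread returns
        # Daemon thread, so a still-blocked engine cannot hang the interpreter.
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Models an RPC to an unreachable target: no reply ever arrives,
        # so the engine stays here until it is explicitly woken up.
        self.wakeup.wait()
        self.exited.set()

    def start(self):
        self.thread.start()


def stop(master, assistants, signal_master):
    """Model of the stop path; returns the names of engines still blocked."""
    for engine in assistants:
        engine.wakeup.set()
    if signal_master:
        # The missing step described above: wake the master engine as well.
        master.wakeup.set()
    engines = [master] + assistants
    # Bounded wait, so the model itself never hangs.
    return [e.name for e in engines if not e.exited.wait(timeout=0.5)]


def run_scenario(signal_master):
    master = Engine("master")
    assistants = [Engine("layout_assistant"), Engine("namespace_assistant")]
    for engine in [master] + assistants:
        engine.start()
    return stop(master, assistants, signal_master)


if __name__ == "__main__":
    print("assistants only:", run_scenario(signal_master=False))
    print("master as well: ", run_scenario(signal_master=True))
```

            In this model, signalling only the assistants leaves the master engine blocked, while signalling the master too lets every engine exit.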

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31600/
            Subject: Revert "LU-10419 lfsck: skip dead target"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9ba637b8949b1b8a5f2506e654a9b62d5c0cc245

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/31600
            Subject: Revert "LU-10419 lfsck: skip dead target"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1387fa1c012dfdf5eb4f90efeb06edd45788064f

            pjones Peter Jones added a comment -

            Landed for 2.11

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31475/
            Subject: LU-10419 lfsck: skip dead target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 012834c5e7c7be50ff117cee4ac473d7fee4294d

            cliffw Cliff White (Inactive) added a comment -

            With the current patch, lfsck does not stop. We are also currently seeing mount timeouts. I have crash-dumped soak-8 while lfsck was hanging; the logs are available on Spirit.
            /scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2018-03-06-19:16:47

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31475
            Subject: LU-10419 lfsck: skip dead target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: be9f2eedf5039fa6308460aca6a84daa6b8003b1

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 6
