[LU-10419] LFSCK fails to start, hangs systems.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.5
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3
    • Environment: Soak performance cluster - Lustre version=2.10.2_4_gb151f34
    • Severity: 3

    Description

      We do an OSS failover, then trigger LFSCK:

      lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A

      The lfsck_start command hangs and LFSCK is never started; the clients wedge in state 'comp', and then the entire system wedges. I have dumped the Lustre logs from all MDS nodes (attached). I have also crash-dumped all the MDT nodes; the dumps are available on Spirit. lfsck_layout is unkillable.
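For readers who want to script the reproduction step, the command above can be assembled from its parts. This is only a sketch: the helper name `lfsck_start_argv` and its parameter names are made up for illustration, and the option meanings in the comments reflect the usual `lctl lfsck_start` usage (`-M` target MDT device, `-s` speed limit, `-t` LFSCK type, `-A` all targets) rather than anything stated in this ticket. It only builds and prints the command line; it does not run it.

```python
import shlex


def lfsck_start_argv(mdt, speed_limit, check_type="all", all_targets=True):
    """Build the argv for `lctl lfsck_start` (hypothetical helper)."""
    argv = [
        "lctl", "lfsck_start",
        "-M", mdt,               # MDT device to start LFSCK on
        "-s", str(speed_limit),  # upper speed limit for the scan
        "-t", check_type,        # LFSCK component type(s) to run
    ]
    if all_targets:
        argv.append("-A")        # also start LFSCK on all other targets
    return argv


# Reproduces the exact command used in the description.
cmd = lfsck_start_argv("soaked-MDT0000", 1000)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, timeout=...)` with a timeout would be one way to keep a test harness from blocking forever if the start command hangs as described here.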

      Attachments

        1. soak-10.lustre.log.gz
          2.57 MB
        2. soak-11.lustre.log.gz
          2.22 MB
        3. soak-8.lustre.log.gz
          2.14 MB
        4. soak-9.lustre.log.gz
          2.33 MB

          Activity

            gerrit Gerrit Updater added a comment:

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31627
            Subject: LU-10419 lfsck: single master engine when stop
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e3e7d1a41711cfb0a12b941a88bf8c0bf3b4cc89

            yong.fan nasf (Inactive) added a comment:

            Quoting cliffw: "With the current patch, lfsck does not stop. Currently also having mount timeouts. I have crash-dumped soak-8 while lfsck was hanging, logs are available on spirit.
            /scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2018-03-06-19:16:47"

            The LFSCK master engine was blocked while sending an OUT_ATTR_GET RPC to MDT2, which may be offline or in recovery. We expect lfsck_stop() to wake up the blocked LFSCK engines and make them exit, but it only signals (SIGINT) the LFSCK assistant engines and forgets to do the same for the LFSCK master engine.

            So the trouble is not related to the patch https://review.whamcloud.com/31475/.

            I will make another patch to notify the master engine in lfsck_stop().
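The failure mode described in this comment can be illustrated with a toy model. This is not Lustre code: `Engine`, `stop`, and `run_scenario` are hypothetical names, a `threading.Event` stands in for the signal delivered by the stop path, and the assistant names "layout" and "namespace" are only illustrative. Each engine blocks as if waiting on an RPC to an unreachable target, and the stop path either notifies only the assistants (the behaviour described above) or the master as well (the proposed fix).

```python
import threading


class Engine:
    """Toy stand-in for an LFSCK engine blocked on a remote RPC."""

    def __init__(self, name):
        self.name = name
        self._wakeup = threading.Event()  # plays the role of the stop signal
        # Daemon thread so a stuck engine does not keep the interpreter alive.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Simulates waiting on a reply from an offline or recovering target:
        # the only way out is to be woken by the stop path.
        self._wakeup.wait()

    def start(self):
        self._thread.start()

    def signal(self):
        self._wakeup.set()

    def exited(self, timeout=0.2):
        self._thread.join(timeout)
        return not self._thread.is_alive()


def stop(master, assistants, notify_master):
    """Stop path: always wakes the assistants, optionally the master too."""
    for assistant in assistants:
        assistant.signal()
    if notify_master:
        master.signal()


def run_scenario(notify_master):
    master = Engine("master")
    assistants = [Engine("layout"), Engine("namespace")]
    engines = [master, *assistants]
    for engine in engines:
        engine.start()
    stop(master, assistants, notify_master)
    return {engine.name: engine.exited() for engine in engines}


if __name__ == "__main__":
    print("assistants only:", run_scenario(notify_master=False))
    print("master as well: ", run_scenario(notify_master=True))
```

In the first scenario the master engine is still blocked after the stop call, which matches the observed "lfsck does not stop" symptom; in the second, all engines exit.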

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31600/
            Subject: Revert "LU-10419 lfsck: skip dead target"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9ba637b8949b1b8a5f2506e654a9b62d5c0cc245

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/31600
            Subject: Revert "LU-10419 lfsck: skip dead target"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1387fa1c012dfdf5eb4f90efeb06edd45788064f

            pjones Peter Jones added a comment:

            Landed for 2.11

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31475/
            Subject: LU-10419 lfsck: skip dead target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 012834c5e7c7be50ff117cee4ac473d7fee4294d

            cliffw Cliff White (Inactive) added a comment:

            With the current patch, lfsck does not stop. We are currently also seeing mount timeouts. I crash-dumped soak-8 while lfsck was hanging; the logs are available on spirit:
            /scratch/dumps/soak-8.spirit.hpdd.intel.com/10.10.1.108-2018-03-06-19:16:47

            gerrit Gerrit Updater added a comment:

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31475
            Subject: LU-10419 lfsck: skip dead target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: be9f2eedf5039fa6308460aca6a84daa6b8003b1

            cliffw Cliff White (Inactive) added a comment:

            Logs are on spirit in /scratch/logs/syslogs and /scratch/logs/console. The crash dumps are in /scratch/dumps on spirit.

            yong.fan nasf (Inactive) added a comment:

            cliffw,

            Where can I get the related logs?

            Thanks!

            cliffw Cliff White (Inactive) added a comment:

            Seeing this again on a DNE-enabled system, version=2.10.57_58_gf24340c.
            I can crash-dump the systems if desired.

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 6