Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Affects Version/s: Lustre 2.10.2
    • Environment: Soak stress cluster, MLNX networking stack
    • Severity: 3

    Description

      MDT 2 (soak-10) fails over to soak-11 with the following errors:

      Dec  2 07:27:00 soak-11 kernel: LustreError: 2976:0:(llog_osd.c:960:llog_osd_next_block()) soaked-MDT0003-osp-MDT0002: missed desired record? 2 > 1
      Dec  2 07:27:00 soak-11 kernel: LustreError: 2976:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -2
      Dec  2 07:27:00 soak-11 kernel: LustreError: 2976:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 3 previous similar messages
      Dec  2 07:27:01 soak-11 kernel: LustreError: 2381:0:(mdt_open.c:1167:mdt_cross_open()) soaked-MDT0002: [0x280002b4c:0xa44:0x0] doesn't exist!: rc = -14
      Dec  2 07:27:02 soak-11 kernel: Lustre: 2977:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      Dec  2 07:27:02 soak-11 kernel: Lustre: 2977:0:(ldlm_lib.c:2059:target_recovery_overseer()) Skipped 2 previous similar messages
      Dec  2 07:27:02 soak-11 kernel: Lustre: soaked-MDT0002: disconnecting 31 stale clients
      

      Soak then attempts a umount, which hangs:

      2017-12-02 07:27:16,430:fsmgmt.fsmgmt:INFO     Unmounting soaked-MDT0002 on soak-11 ...
      soak-11
      Dec  2 07:30:16 soak-11 kernel: INFO: task umount:3039 blocked for more than 120 seconds.
      Dec  2 07:30:16 soak-11 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Dec  2 07:30:16 soak-11 kernel: umount          D ffff8803c81f4008     0  3039   3037 0x00000080
      Dec  2 07:30:16 soak-11 kernel: ffff8803ce3afa30 0000000000000086 ffff88081fa50000 ffff8803ce3affd8
      Dec  2 07:30:16 soak-11 kernel: ffff8803ce3affd8 ffff8803ce3affd8 ffff88081fa50000 ffff8803c81f4000
      Dec  2 07:30:16 soak-11 kernel: ffff8803c81f4004 ffff88081fa50000 00000000ffffffff ffff8803c81f4008
      Dec  2 07:30:16 soak-11 kernel: Call Trace:
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff816aa489>] schedule_preempt_disabled+0x29/0x70
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff816a83b7>] __mutex_lock_slowpath+0xc7/0x1d0
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff816a77cf>] mutex_lock+0x1f/0x2f
      Dec  2 07:30:16 soak-11 kernel: [<ffffffffc14560c7>] lfsck_stop+0x167/0x4e0 [lfsck]
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff810c4832>] ? default_wake_function+0x12/0x20
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff811e0593>] ? __kmalloc+0x1e3/0x230
      Dec  2 07:30:16 soak-11 kernel: [<ffffffffc1625aa6>] mdd_iocontrol+0x96/0x16a0 [mdd]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ec9619>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc1500fc1>] mdt_device_fini+0x71/0x920 [mdt]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ed6911>] class_cleanup+0x971/0xcd0 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ed8cad>] class_process_config+0x19cd/0x23b0 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0dc6bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ed9856>] class_manual_cleanup+0x1c6/0x710 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0f07fee>] server_put_super+0x8de/0xcd0 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff81203692>] generic_shutdown_super+0x72/0x100
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff81203a62>] kill_anon_super+0x12/0x20
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0edc152>] lustre_kill_super+0x32/0x50 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff81203e19>] deactivate_locked_super+0x49/0x60
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff81204586>] deactivate_super+0x46/0x60
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff812217cf>] cleanup_mnt+0x3f/0x80
      Dec  2 07:30:18 soak-11 kernel: [<ffffffff81221862>] __cleanup_mnt+0x12/0x20
      Dec  2 07:30:18 soak-11 kernel: [<ffffffff810ad275>] task_work_run+0xc5/0xf0
      Dec  2 07:30:18 soak-11 kernel: [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
      Dec  2 07:30:18 soak-11 kernel: [<ffffffff816b533d>] int_signal+0x12/0x17
      Dec  2 07:30:19 soak-11 kernel: LustreError: 11-0: soaked-OST0016-osc-MDT0002: operation ost_connect to node 192.168.1.106@o2ib failed: rc = -114
      

      This wedges soak: no further faults are attempted and jobs stop scheduling.
      This happened over the weekend. Lustre logs were dumped and a crash dump was forced.
      Logs and crash info are attached.
      The full crash dump is available on Spirit.

      Attachments

        1. soak-10.umount.hang.txt.gz
          9.10 MB
        2. soak-11.stacks.txt.gz
          167 kB
        3. soak-11.umount.hang.txt
          89.00 MB
        4. soak-3.umount.hang.txt.gz
          6.83 MB
        5. soak-6.umount.hang.txt.gz
          225 kB
        6. vmcore-dmesg.txt
          1021 kB
        7. vmcore-dmesg.txt
          1021 kB

          Activity

            [LU-10321] MDS - umount hangs during failback

            gerrit Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30420/
            Subject: LU-10321 lfsck: allow to stop the in-starting lfsck
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9c9a05fee6c0fce557dfa578ff7116b905d4e00a

            cliffw Cliff White (Inactive) added a comment -
            We are still hitting this issue with the patch applied. Soak-10 just hit it; crash dump, vmcore-dmesg, and lustre-log are attached.

            cliffw Cliff White (Inactive) added a comment -
            Okay, we have an IB build so we will test this.

            cliffw Cliff White (Inactive) added a comment -
            We just hit this again on master, so we need a version of the patch for master.

            gerrit Gerrit Updater added a comment -
            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30514
            Subject: LU-10321 lfsck: not start lfsck during umount
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 2022d417ddaf663dc7addb5389acade0390996e5

            gerrit Gerrit Updater added a comment -
            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30513
            Subject: LU-10321 lfsck: not start lfsck during umount
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6263064b369dd4fddbb0dfa9ab49013a0d791629

            yong.fan nasf (Inactive) added a comment -
            According to the logs, a new LFSCK was started just after lfsck_stop during the MDT umount. Nobody would then stop the newly triggered LFSCK, so the MDT could not umount. I will make a patch to resolve the race condition.
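            The race can be pictured with a minimal, self-contained userspace sketch. This is not the Lustre implementation: the names (scan_state, scan_start, scan_stop, the stopping flag) are hypothetical, and only the locking pattern matters. Once stop has been requested from the umount path, any start attempt that races in must be refused under the same mutex; otherwise the newly started scan is never stopped and umount blocks forever in lfsck_stop, as in the stack trace above.

            /*
             * Hypothetical sketch of the race described in the comment above,
             * NOT the actual Lustre code.  scan_start()/scan_stop() stand in
             * for the LFSCK start/stop paths; "stopping" is an illustrative
             * flag only.
             */
            #include <pthread.h>
            #include <stdbool.h>
            #include <stdio.h>

            struct scan_state {
                pthread_mutex_t lock;
                bool            running;   /* a scan instance is active */
                bool            stopping;  /* stop requested (umount); refuse new starts */
            };

            static struct scan_state st = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
            };

            /* Trigger path: something asks for a new scan to start. */
            static int scan_start(void)
            {
                int rc = 0;

                pthread_mutex_lock(&st.lock);
                if (st.stopping) {
                    /* The fix for the race: never start while stop is in progress. */
                    rc = -1;
                } else if (!st.running) {
                    st.running = true;
                }
                pthread_mutex_unlock(&st.lock);
                return rc;
            }

            /* Umount path: stop any scan and block later starts. */
            static void scan_stop(void)
            {
                pthread_mutex_lock(&st.lock);
                st.stopping = true;   /* refuse any start that races with us */
                st.running = false;   /* signal the active scan to terminate */
                pthread_mutex_unlock(&st.lock);
            }

            int main(void)
            {
                scan_stop();               /* umount begins, stop the scan */
                if (scan_start() != 0)     /* a racing trigger is now refused */
                    printf("start refused while stopping; umount can proceed\n");
                return 0;
            }

            Judging by their subjects, the patches above take the same approach: 30420 allows stopping an LFSCK that is still starting, and 30513/30514 refuse to start an LFSCK while the umount is in progress.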

            cliffw Cliff White (Inactive) added a comment -
            We have not seen lfsck hangs with MDT failover after the patch. The two hangs occurred with OST failover. After the most recent hang, I was able to reboot/remount the system and then start and stop lfsck without the hang.

            yong.fan nasf (Inactive) added a comment -
            cliffw, as you said on Skype, there are multiple issues during the current Spirit tests. The original hang happened on the MDT because of a blocked, uninterruptible LFSCK; patch 30420 is meant to resolve that trouble. But that does not mean Spirit will not hang after applying the patch, because other issues may also block the system. The new hang happened on the OST side, different from the original LFSCK hang. So would you please check whether the original LFSCK hang issue is resolved or not? Thanks!

            cliffw Cliff White (Inactive) added a comment -
            Had a similar hang on soak-3 and soak-6 (OSS) during umount. Dumped lustre-logs, attached. Also dumped stack traces and crash-dumped both nodes; output available on Spirit.

            cliffw Cliff White (Inactive) added a comment -
            Dumped lustre logs from all MDS, output in /scratch/results/soak/soak-X-hung.lfsck.Dec08.txt
            Dumped stacks on all MDS, output in console logs. Crash-dumped all MDS, results on Spirit.
            Restarting.

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 7
