LU-10321: MDS - umount hangs during failback

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Affects Version/s: Lustre 2.10.2
    • Environment: Soak stress cluster, MLNX networking stack.
    • Severity: 3

    Description

      MDT0002 (soak-10) fails over to soak-11, with the following errors:

      Dec  2 07:27:00 soak-11 kernel: LustreError: 2976:0:(llog_osd.c:960:llog_osd_next_block()) soaked-MDT0003-osp-MDT0002: missed desired record? 2 > 1
      Dec  2 07:27:00 soak-11 kernel: LustreError: 2976:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -2
      Dec  2 07:27:00 soak-11 kernel: LustreError: 2976:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 3 previous similar messages
      Dec  2 07:27:01 soak-11 kernel: LustreError: 2381:0:(mdt_open.c:1167:mdt_cross_open()) soaked-MDT0002: [0x280002b4c:0xa44:0x0] doesn't exist!: rc = -14
      Dec  2 07:27:02 soak-11 kernel: Lustre: 2977:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      Dec  2 07:27:02 soak-11 kernel: Lustre: 2977:0:(ldlm_lib.c:2059:target_recovery_overseer()) Skipped 2 previous similar messages
      Dec  2 07:27:02 soak-11 kernel: Lustre: soaked-MDT0002: disconnecting 31 stale clients
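
      For triage, the rc values in these messages can be read as negated Linux errno codes. The sketch below is only an illustration, not part of the Lustre tooling: the decode_rc helper and the LINUX_ERRNO table are hypothetical names, and the table covers just the three codes seen in this ticket (-2 and -14 above, -114 in the umount trace further down).

```python
import re

# Linux errno values for the rc codes that appear in this ticket's logs.
# (Hypothetical helper table; extend as needed.)
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    14: ("EFAULT", "Bad address"),
    114: ("EALREADY", "Operation already in progress"),
}

RC_PATTERN = re.compile(r"rc = (-?\d+)")


def decode_rc(line):
    """Return (rc, errno_name, description) for a log line, or None if it has no rc."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    # Lustre reports failures as negative errno values, so negate before lookup.
    name, text = LINUX_ERRNO.get(-rc, ("UNKNOWN", "not in table"))
    return rc, name, text


if __name__ == "__main__":
    sample = (
        "LustreError: 2976:0:(lod_dev.c:419:lod_sub_recovery_thread()) "
        "soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -2"
    )
    print(decode_rc(sample))
```

      Read this way, the update log fetch from MDT0003 failed with ENOENT and the cross-MDT open failed with EFAULT.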
      

      Soak attempts a umount which hangs:

      2017-12-02 07:27:16,430:fsmgmt.fsmgmt:INFO     Unmounting soaked-MDT0002 on soak-11 ...
      soak-11
      Dec  2 07:30:16 soak-11 kernel: INFO: task umount:3039 blocked for more than 120 seconds.
      Dec  2 07:30:16 soak-11 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Dec  2 07:30:16 soak-11 kernel: umount          D ffff8803c81f4008     0  3039   3037 0x00000080
      Dec  2 07:30:16 soak-11 kernel: ffff8803ce3afa30 0000000000000086 ffff88081fa50000 ffff8803ce3affd8
      Dec  2 07:30:16 soak-11 kernel: ffff8803ce3affd8 ffff8803ce3affd8 ffff88081fa50000 ffff8803c81f4000
      Dec  2 07:30:16 soak-11 kernel: ffff8803c81f4004 ffff88081fa50000 00000000ffffffff ffff8803c81f4008
      Dec  2 07:30:16 soak-11 kernel: Call Trace:
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff816aa489>] schedule_preempt_disabled+0x29/0x70
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff816a83b7>] __mutex_lock_slowpath+0xc7/0x1d0
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff816a77cf>] mutex_lock+0x1f/0x2f
      Dec  2 07:30:16 soak-11 kernel: [<ffffffffc14560c7>] lfsck_stop+0x167/0x4e0 [lfsck]
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff810c4832>] ? default_wake_function+0x12/0x20
      Dec  2 07:30:16 soak-11 kernel: [<ffffffff811e0593>] ? __kmalloc+0x1e3/0x230
      Dec  2 07:30:16 soak-11 kernel: [<ffffffffc1625aa6>] mdd_iocontrol+0x96/0x16a0 [mdd]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ec9619>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc1500fc1>] mdt_device_fini+0x71/0x920 [mdt]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ed6911>] class_cleanup+0x971/0xcd0 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ed8cad>] class_process_config+0x19cd/0x23b0 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0dc6bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0ed9856>] class_manual_cleanup+0x1c6/0x710 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0f07fee>] server_put_super+0x8de/0xcd0 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff81203692>] generic_shutdown_super+0x72/0x100
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff81203a62>] kill_anon_super+0x12/0x20
      Dec  2 07:30:17 soak-11 kernel: [<ffffffffc0edc152>] lustre_kill_super+0x32/0x50 [obdclass]
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff81203e19>] deactivate_locked_super+0x49/0x60
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff81204586>] deactivate_super+0x46/0x60
      Dec  2 07:30:17 soak-11 kernel: [<ffffffff812217cf>] cleanup_mnt+0x3f/0x80
      Dec  2 07:30:18 soak-11 kernel: [<ffffffff81221862>] __cleanup_mnt+0x12/0x20
      Dec  2 07:30:18 soak-11 kernel: [<ffffffff810ad275>] task_work_run+0xc5/0xf0
      Dec  2 07:30:18 soak-11 kernel: [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
      Dec  2 07:30:18 soak-11 kernel: [<ffffffff816b533d>] int_signal+0x12/0x17
      Dec  2 07:30:19 soak-11 kernel: LustreError: 11-0: soaked-OST0016-osc-MDT0002: operation ost_connect to node 192.168.1.106@o2ib failed: rc = -114
      

      This wedges soak: no further faults are attempted and jobs stop scheduling.
      This happened over the weekend. I dumped the Lustre logs and forced a crash dump.
      Logs and crash info are attached.
      The full crash dump is available on Spirit.
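
      When going through the attached stack dumps, it can help to reduce each hung-task trace to its reliable frames. The following is a minimal sketch that assumes the syslog format shown in the trace above; reliable_frames and first_module_frame are hypothetical helper names, not existing tools. Frames prefixed with "?" are ones the kernel unwinder is not sure about, so they are skipped.

```python
import re

# Matches frames such as:
#   [<ffffffffc14560c7>] lfsck_stop+0x167/0x4e0 [lfsck]
#   [<ffffffff810c4832>] ? default_wake_function+0x12/0x20
FRAME = re.compile(
    r"\[<[0-9a-f]+>\] (?P<guess>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def reliable_frames(lines):
    """Return (function, module) pairs, innermost first, skipping '?' frames."""
    frames = []
    for line in lines:
        match = FRAME.search(line)
        if match and not match.group("guess"):
            # Frames without a [module] suffix belong to the core kernel.
            frames.append((match.group("func"), match.group("module") or "kernel"))
    return frames


def first_module_frame(frames):
    """Return the innermost frame that lives in a loadable module, or None."""
    for func, module in frames:
        if module != "kernel":
            return func, module
    return None


if __name__ == "__main__":
    trace = [
        "Dec  2 07:30:16 soak-11 kernel: [<ffffffff816a77cf>] mutex_lock+0x1f/0x2f",
        "Dec  2 07:30:16 soak-11 kernel: [<ffffffffc14560c7>] lfsck_stop+0x167/0x4e0 [lfsck]",
        "Dec  2 07:30:16 soak-11 kernel: [<ffffffff810c4832>] ? default_wake_function+0x12/0x20",
    ]
    print(first_module_frame(reliable_frames(trace)))
```

      Applied to the umount trace above, the innermost module frame is lfsck_stop in the lfsck module, called while blocked in mutex_lock.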

      Attachments

        1. soak-10.umount.hang.txt.gz
          9.10 MB
          Cliff White
        2. soak-11.stacks.txt.gz
          167 kB
          Cliff White
        3. soak-11.umount.hang.txt
          89.00 MB
          Cliff White
        4. soak-3.umount.hang.txt.gz
          6.83 MB
          Cliff White
        5. soak-6.umount.hang.txt.gz
          225 kB
          Cliff White
        6. vmcore-dmesg.txt
          1021 kB
          Cliff White
        7. vmcore-dmesg.txt
          1021 kB
          Cliff White

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: cliffw Cliff White (Inactive)