LU-9748: DNE recovery hangs, blocks Lustre recovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.11.0
    • Environment: Soak test cluster, lustre-master build 3606 version=2.9.59_32_g62bc3af
    • Severity: 3

    Description

      Sequence:

      • MDS failover occurs.
      • The failover completes on the nodes.
      • Recovery then blocks across all MDSs:
        Jul  7 15:34:17 soak-9 kernel: LDISKFS-fs warning (device dm-6): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
        Jul  7 15:35:00 soak-9 kernel: LDISKFS-fs (dm-6): recovery complete
        Jul  7 15:35:00 soak-9 kernel: LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
        Jul  7 15:35:06 soak-9 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.128@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
        Jul  7 15:35:06 soak-9 kernel: Lustre: soaked-MDT0001: Not available for connect from 192.168.1.132@o2ib (not set up)
        Jul  7 15:35:06 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
        Jul  7 15:35:07 soak-9 kernel: Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
        Jul  7 15:35:09 soak-9 kernel: Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 37 clients reconnect
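
The repeated `mds_connect` failures with `rc = -114` (errno 114, `EALREADY` on Linux) are the lines worth tracking across the MDS nodes. A minimal sketch for pulling them out of a console log, assuming the `LustreError: 11-0:` message layout shown above; the function name `connect_failures` is hypothetical and not part of any Lustre tool:

```python
import re

# Pattern modelled on the "LustreError: 11-0:" lines quoted in this ticket.
CONNECT_FAIL = re.compile(
    r"LustreError: 11-0: (?P<device>\S+): operation (?P<op>\S+) "
    r"to node (?P<nid>\S+) failed: rc = (?P<rc>-?\d+)"
)


def connect_failures(lines):
    """Return (device, operation, nid, rc) for each failed operation line."""
    failures = []
    for line in lines:
        match = CONNECT_FAIL.search(line)
        if match:
            failures.append(
                (
                    match.group("device"),
                    match.group("op"),
                    match.group("nid"),
                    int(match.group("rc")),
                )
            )
    return failures


# Two lines copied from the console logs in this ticket.
SAMPLE_LINES = [
    "Jul  7 15:35:06 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: "
    "operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114",
    "Jul  7 18:29:12 soak-10 kernel: LustreError: 11-0: soaked-MDT0003-osp-MDT0002: "
    "operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114",
]

for failure in connect_failures(SAMPLE_LINES):
    print(failure)
```

Grouping the result by device shows which OSP connections between MDTs never come up during the recovery window.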
        

        The failover node stays in a WAITING state:

        soak-10
        ----------------
        mdt.soaked-MDT0002.recovery_status=
        status: WAITING
        non-ready MDTs:  0003
        recovery_start: 1499451258
        time_waited: 2147
        
        Jul  7 18:29:12 soak-10 kernel: LustreError: 11-0: soaked-MDT0003-osp-MDT0002: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
        Jul  7 18:29:12 soak-10 kernel: LustreError: Skipped 11 previous similar messages
        Jul  7 18:29:13 soak-10 kernel: Lustre: 3682:0:(ldlm_lib.c:1784:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
        Jul  7 18:29:13 soak-10 kernel: Lustre: 3682:0:(ldlm_lib.c:1784:extend_recovery_timer()) Skipped 9 previous similar messages
        Jul  7 18:29:29 soak-10 kernel: Lustre: soaked-MDT0002: Recovery already passed deadline 0:08, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
        
        

      Lustre logs were dumped on the MDS multiple times while recovery was stuck, along with stack dumps; all are attached.
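
To spot this condition without reading each node by hand, the `recovery_status` output can be checked programmatically. The sketch below is an illustration only: it assumes the text layout shown in the soak-10 excerpt above, takes the 900-second hard limit from the `extend_recovery_timer()` message as its threshold, and uses hypothetical helper names (`parse_recovery_status`, `stuck_targets`).

```python
import re


def parse_recovery_status(text):
    """Parse recovery_status output into one dict per MDT target."""
    targets = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = re.match(r"mdt\.(\S+)\.recovery_status=$", line)
        if header:
            current = {"target": header.group(1)}
            targets.append(current)
            continue
        # Lines before the first header (hostnames, rulers) are ignored.
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return targets


def stuck_targets(targets, hard_limit=900):
    """Return (target, non-ready MDTs, time_waited) for targets still
    WAITING after more than hard_limit seconds (assumed threshold)."""
    stuck = []
    for t in targets:
        waited = int(t.get("time_waited", 0))
        if t.get("status") == "WAITING" and waited > hard_limit:
            stuck.append((t["target"], t.get("non-ready MDTs", "").split(), waited))
    return stuck


# Output copied from the soak-10 excerpt in this ticket.
SAMPLE = """\
soak-10
----------------
mdt.soaked-MDT0002.recovery_status=
status: WAITING
non-ready MDTs:  0003
recovery_start: 1499451258
time_waited: 2147
"""

for target, not_ready, waited in stuck_targets(parse_recovery_status(SAMPLE)):
    print(f"{target}: waiting {waited}s on MDTs {not_ready}")
```

For the excerpt above this reports soaked-MDT0002 as waiting on MDT 0003, matching the "Recovery already passed deadline" console message.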

      Attachments

        1. soak-10.2.log.gz
          1.13 MB
        2. soak-10.lustre.log.3.txt.gz
          2.60 MB
        3. soak-10.lustre.log.4.txt.gz
          4.67 MB
        4. soak-10.lustre.log.5.txt.gz
          3.84 MB
        5. soak-10.lustre.log.6.txt.gz
          3.71 MB
        6. soak-10.lustre.log.7.txt.gz
          3.41 MB
        7. soak-10.lustre.log.txt.gz
          74 kB
        8. soak-10.postMGSreboot.log.gz
          3 kB
        9. soak-10.stacks.and.console.txt.gz
          169 kB
        10. soak-11.2.log.gz
          3.45 MB
        11. soak-11.lustre.log.3.txt.gz
          3.48 MB
        12. soak-11.lustre.log.4.txt.gz
          449 kB
        13. soak-11.lustre.log.5.txt.gz
          20 kB
        14. soak-11.lustre.log.6.txt.gz
          21 kB
        15. soak-11.lustre.log.7.txt.gz
          3.48 MB
        16. soak-11.lustre.log.txt.gz
          12 kB
        17. soak-11.postreboot.log.gz
          3 kB
        18. soak-11.stacks.and.console.txt.gz
          867 kB
        19. soak-8.2.log.gz
          3.19 MB
        20. soak-8.lustre.log.3.txt.gz
          4.20 MB
        21. soak-8.lustre.log.4.txt.gz
          470 kB
        22. soak-8.lustre.log.5.txt.gz
          1.65 MB
        23. soak-8.lustre.log.6.txt.gz
          1.77 MB
        24. soak-8.lustre.log.7.txt.gz
          3.93 MB
        25. soak-8.lustre.log.txt.gz
          12.24 MB
        26. soak-8.postreboot.log.gz
          2 kB
        27. soak-8.stacks.and.console.txt.gz
          761 kB
        28. soak-9.2.log.gz
          2.46 MB
        29. soak-9.lustre.log.3.txt.gz
          3.39 MB
        30. soak-9.lustre.log.4.txt.gz
          5.02 MB
        31. soak-9.lustre.log.5.txt.gz
          3.76 MB
        32. soak-9.lustre.log.6.txt.gz
          3.93 MB
        33. soak-9.lustre.log.7.txt.gz
          3.52 MB
        34. soak-9.lustre.log.txt.gz
          0.5 kB
        35. soak-9.postreboot.log.gz
          4 kB
        36. soak-9.stacks.and.console.txt.gz
          321 kB

            People

              Assignee: laisiyao Lai Siyao
              Reporter: cliffw Cliff White (Inactive)