Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9748

DNE recovery hangs, blocks Lustre recovery

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.11.0
    • Lustre 2.11.0
    • Soak test cluster, lustre-master build 3606 version=2.9.59_32_g62bc3af
    • 3
    • 9223372036854775807

    Description

      Sequence:

      • MDS failover occurs.
      • failover nodes complete.
      • recovery across all MDS blocks
        Jul  7 15:34:17 soak-9 kernel: LDISKFS-fs warning (device dm-6): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
        Jul  7 15:35:00 soak-9 kernel: LDISKFS-fs (dm-6): recovery complete
        Jul  7 15:35:00 soak-9 kernel: LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
        Jul  7 15:35:06 soak-9 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.128@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
        Jul  7 15:35:06 soak-9 kernel: Lustre: soaked-MDT0001: Not available for connect from 192.168.1.132@o2ib (not set up)
        Jul  7 15:35:06 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
        Jul  7 15:35:07 soak-9 kernel: Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
        Jul  7 15:35:09 soak-9 kernel: Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 37 clients reconnect
        

        The failover node stays in a WAITING state:

        soak-10
        ----------------
        mdt.soaked-MDT0002.recovery_status=
        status: WAITING
        non-ready MDTs:  0003
        recovery_start: 1499451258
        time_waited: 2147
        
        Jul  7 18:29:12 soak-10 kernel: LustreError: 11-0: soaked-MDT0003-osp-MDT0002: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
        Jul  7 18:29:12 soak-10 kernel: LustreError: Skipped 11 previous similar messages
        Jul  7 18:29:13 soak-10 kernel: Lustre: 3682:0:(ldlm_lib.c:1784:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
        Jul  7 18:29:13 soak-10 kernel: Lustre: 3682:0:(ldlm_lib.c:1784:extend_recovery_timer()) Skipped 9 previous similar messages
        Jul  7 18:29:29 soak-10 kernel: Lustre: soaked-MDT0002: Recovery already passed deadline 0:08, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
        
        

      dumped lustre-logs on the MDS multiple times during this, dumped stacks, attached

      Attachments

        1. soak-8.stacks.and.console.txt.gz
          761 kB
        2. soak-8.2.log.gz
          3.19 MB
        3. soak-8.lustre.log.3.txt.gz
          4.20 MB
        4. soak-8.lustre.log.txt.gz
          12.24 MB
        5. soak-9.lustre.log.txt.gz
          0.5 kB
        6. soak-9.stacks.and.console.txt.gz
          321 kB
        7. soak-9.2.log.gz
          2.46 MB
        8. soak-9.lustre.log.3.txt.gz
          3.39 MB
        9. soak-10.lustre.log.txt.gz
          74 kB
        10. soak-10.stacks.and.console.txt.gz
          169 kB
        11. soak-11.lustre.log.txt.gz
          12 kB
        12. soak-11.stacks.and.console.txt.gz
          867 kB
        13. soak-10.lustre.log.3.txt.gz
          2.60 MB
        14. soak-10.2.log.gz
          1.13 MB
        15. soak-11.2.log.gz
          3.45 MB
        16. soak-11.lustre.log.3.txt.gz
          3.48 MB
        17. soak-9.postreboot.log.gz
          4 kB
        18. soak-8.postreboot.log.gz
          2 kB
        19. soak-10.postMGSreboot.log.gz
          3 kB
        20. soak-8.lustre.log.4.txt.gz
          470 kB
        21. soak-11.lustre.log.4.txt.gz
          449 kB
        22. soak-11.lustre.log.5.txt.gz
          20 kB
        23. soak-11.postreboot.log.gz
          3 kB
        24. soak-8.lustre.log.5.txt.gz
          1.65 MB
        25. soak-10.lustre.log.5.txt.gz
          3.84 MB
        26. soak-9.lustre.log.5.txt.gz
          3.76 MB
        27. soak-10.lustre.log.4.txt.gz
          4.67 MB
        28. soak-9.lustre.log.4.txt.gz
          5.02 MB
        29. soak-11.lustre.log.6.txt.gz
          21 kB
        30. soak-8.lustre.log.6.txt.gz
          1.77 MB
        31. soak-10.lustre.log.7.txt.gz
          3.41 MB
        32. soak-8.lustre.log.7.txt.gz
          3.93 MB
        33. soak-9.lustre.log.6.txt.gz
          3.93 MB
        34. soak-9.lustre.log.7.txt.gz
          3.52 MB
        35. soak-10.lustre.log.6.txt.gz
          3.71 MB
        36. soak-11.lustre.log.7.txt.gz
          3.48 MB

        Issue Links

          Activity

            [LU-9748] DNE recovery hangs, blocks Lustre recovery
            mdiep Minh Diep made changes -
            Fix Version/s New: Lustre 2.11.0 [ 13091 ]
            Resolution New: Fixed [ 1 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
            mdiep Minh Diep added a comment -

            Landed in 2.11

            mdiep Minh Diep added a comment - Landed in 2.11

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28000/
            Subject: LU-9748 lod: safely access update log stat
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a446cbb8b1e2fa73c30938d043f79f644c13efe7

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28000/ Subject: LU-9748 lod: safely access update log stat Project: fs/lustre-release Branch: master Current Patch Set: Commit: a446cbb8b1e2fa73c30938d043f79f644c13efe7

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/28000
            Subject: LU-9748 lod: safely access update log stat
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ceaae52b221d0986599796420add36305f01eff3

            gerrit Gerrit Updater added a comment - Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/28000 Subject: LU-9748 lod: safely access update log stat Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ceaae52b221d0986599796420add36305f01eff3
            laisiyao Lai Siyao added a comment -

            'grep lod_sub_recovery_ *' shows:

            soak-10.lustre.log.txt:00000004:00080000:16.0:1499451252.837946:0:3678:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0002-osd retrieve update log: rc = 0
            soak-10.2.log:00000004:00080000:19.0:1499351735.916949:0:3600:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 retrieve update log: rc = 0
            soak-10.2.log:00000004:00000010:19.0:1499351735.916950:0:3600:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff8804177b0ba0.
            soak-10.2.log:00000004:00000001:19.0:1499351735.916982:0:3600:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0)
            soak-10.2.log:00000004:00080000:16.0:1499351736.870806:0:3599:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0002 retrieve update log: rc = 0
            soak-10.2.log:00000004:00000010:16.0:1499351736.870813:0:3599:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff8804177b0b80.
            soak-10.2.log:00000004:00000001:16.0:1499351736.870858:0:3599:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0)
            soak-10.lustre.log.4.txt:00000004:00080000:3.0:1499455861.692338:0:3681:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 retrieve update log: rc = 0
            soak-10.lustre.log.4.txt:00000004:00080000:3.0:1499455861.692340:0:3681:0:(lod_dev.c:456:lod_sub_recovery_thread()) soaked-MDT0002 got update logs from all MDTs.
            soak-10.lustre.log.4.txt:00000004:00000010:3.0:1499455861.692358:0:3681:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff88082c4bccc0.
            soak-10.lustre.log.4.txt:00000004:00000001:3.0:1499455861.692412:0:3681:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0)
            

            MDT0002 retrieved remote update log from MDT0000 and MDT0001 at 1499351735, local update log at 1499451252, and from MDT0003 at 1499455861. It means it took 28hours to fetch update logs from other MDTs, this is really unacceptable. Since this last too long, logs of MDT0, MDT1 and MDT3 doesn't contain any useful information.

            laisiyao Lai Siyao added a comment - 'grep lod_sub_recovery_ *' shows: soak-10.lustre.log.txt:00000004:00080000:16.0:1499451252.837946:0:3678:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0002-osd retrieve update log: rc = 0 soak-10.2.log:00000004:00080000:19.0:1499351735.916949:0:3600:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 retrieve update log: rc = 0 soak-10.2.log:00000004:00000010:19.0:1499351735.916950:0:3600:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff8804177b0ba0. soak-10.2.log:00000004:00000001:19.0:1499351735.916982:0:3600:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0) soak-10.2.log:00000004:00080000:16.0:1499351736.870806:0:3599:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0002 retrieve update log: rc = 0 soak-10.2.log:00000004:00000010:16.0:1499351736.870813:0:3599:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff8804177b0b80. soak-10.2.log:00000004:00000001:16.0:1499351736.870858:0:3599:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0) soak-10.lustre.log.4.txt:00000004:00080000:3.0:1499455861.692338:0:3681:0:(lod_dev.c:433:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 retrieve update log: rc = 0 soak-10.lustre.log.4.txt:00000004:00080000:3.0:1499455861.692340:0:3681:0:(lod_dev.c:456:lod_sub_recovery_thread()) soaked-MDT0002 got update logs from all MDTs. soak-10.lustre.log.4.txt:00000004:00000010:3.0:1499455861.692358:0:3681:0:(lod_dev.c:463:lod_sub_recovery_thread()) kfreed 'lrd': 32 at ffff88082c4bccc0. soak-10.lustre.log.4.txt:00000004:00000001:3.0:1499455861.692412:0:3681:0:(lod_dev.c:469:lod_sub_recovery_thread()) Process leaving (rc=0 : 0 : 0) MDT0002 retrieved remote update log from MDT0000 and MDT0001 at 1499351735, local update log at 1499451252, and from MDT0003 at 1499455861. It means it took 28hours to fetch update logs from other MDTs, this is really unacceptable. Since this last too long, logs of MDT0, MDT1 and MDT3 doesn't contain any useful information.
            laisiyao Lai Siyao made changes -
            Status Original: Open [ 1 ] New: In Progress [ 3 ]
            laisiyao Lai Siyao made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Lai Siyao [ laisiyao ]
            cliffw Cliff White (Inactive) made changes -
            Attachment New: soak-8.lustre.log.6.txt.gz [ 27517 ]
            Attachment New: soak-8.lustre.log.7.txt.gz [ 27518 ]
            Attachment New: soak-9.lustre.log.6.txt.gz [ 27519 ]
            Attachment New: soak-9.lustre.log.7.txt.gz [ 27520 ]
            Attachment New: soak-10.lustre.log.6.txt.gz [ 27521 ]
            Attachment New: soak-10.lustre.log.7.txt.gz [ 27522 ]
            Attachment New: soak-11.lustre.log.6.txt.gz [ 27523 ]
            Attachment New: soak-11.lustre.log.7.txt.gz [ 27524 ]

            With system completely idle, recovery finally completes:

            soak-8
            ----------------
            mdt.soaked-MDT0000.recovery_status=
            status: COMPLETE
            recovery_start: 1499455762
            recovery_duration: 1500
            completed_clients: 36/37
            replayed_requests: 0
            last_transno: 1297102100712
            VBR: DISABLED
            IR: DISABLED
            ----------------
            soak-9
            ----------------
            mdt.soaked-MDT0001.recovery_status=
            status: COMPLETE
            recovery_start: 1499446112
            recovery_duration: 47
            completed_clients: 7/37
            replayed_requests: 0
            last_transno: 1245546027187
            VBR: DISABLED
            IR: ENABLED
            ----------------
            soak-10
            ----------------
            mdt.soaked-MDT0002.recovery_status=
            status: COMPLETE
            recovery_start: 1499451258
            recovery_duration: 4604
            completed_clients: 35/37
            replayed_requests: 0
            last_transno: 1189768212435
            VBR: ENABLED
            IR: ENABLED
            ----------------
            soak-11
            ----------------
            mdt.soaked-MDT0003.recovery_status=
            status: COMPLETE
            recovery_start: 1499455317
            recovery_duration: 1678
            completed_clients: 37/37
            replayed_requests: 0
            last_transno: 1262744109605
            VBR: DISABLED
            IR: ENABLED
            
            cliffw Cliff White (Inactive) added a comment - With system completely idle, recovery finally completes: soak-8 ---------------- mdt.soaked-MDT0000.recovery_status= status: COMPLETE recovery_start: 1499455762 recovery_duration: 1500 completed_clients: 36/37 replayed_requests: 0 last_transno: 1297102100712 VBR: DISABLED IR: DISABLED ---------------- soak-9 ---------------- mdt.soaked-MDT0001.recovery_status= status: COMPLETE recovery_start: 1499446112 recovery_duration: 47 completed_clients: 7/37 replayed_requests: 0 last_transno: 1245546027187 VBR: DISABLED IR: ENABLED ---------------- soak-10 ---------------- mdt.soaked-MDT0002.recovery_status= status: COMPLETE recovery_start: 1499451258 recovery_duration: 4604 completed_clients: 35/37 replayed_requests: 0 last_transno: 1189768212435 VBR: ENABLED IR: ENABLED ---------------- soak-11 ---------------- mdt.soaked-MDT0003.recovery_status= status: COMPLETE recovery_start: 1499455317 recovery_duration: 1678 completed_clients: 37/37 replayed_requests: 0 last_transno: 1262744109605 VBR: DISABLED IR: ENABLED
            cliffw Cliff White (Inactive) made changes -
            Attachment New: soak-8.lustre.log.4.txt.gz [ 27505 ]
            Attachment New: soak-8.lustre.log.5.txt.gz [ 27506 ]
            Attachment New: soak-8.postreboot.log.gz [ 27507 ]
            Attachment New: soak-9.lustre.log.4.txt.gz [ 27508 ]
            Attachment New: soak-9.lustre.log.5.txt.gz [ 27509 ]
            Attachment New: soak-9.postreboot.log.gz [ 27510 ]
            Attachment New: soak-10.lustre.log.4.txt.gz [ 27511 ]
            Attachment New: soak-10.lustre.log.5.txt.gz [ 27512 ]
            Attachment New: soak-10.postMGSreboot.log.gz [ 27513 ]
            Attachment New: soak-11.lustre.log.4.txt.gz [ 27514 ]
            Attachment New: soak-11.lustre.log.5.txt.gz [ 27515 ]
            Attachment New: soak-11.postreboot.log.gz [ 27516 ]

            People

              laisiyao Lai Siyao
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: