LU-8753: Recovery already passed deadline with DNE

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.10.0
    • Affects Version: None
    • Environment: lustre-2.8.0_3.chaos-1.ch6.x86_64
      16 MDTs
    • Severity: 3

    Description

      MDT[0-1,6-16] (decimal) have timed out of recovery; approximately 1473 clients recovered, 1 was evicted.
      MDT[2-5] reached the timeout and report in the console log that recovery has hung and should be aborted. After lctl abort_recovery, the nodes begin emitting large numbers of errors in the console log. The nodes are up, but mrsh into them hangs, as if they are too busy to service the mrsh session.

      2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:49:40 [ 1088.899333] Lustre: Skipped 157 previous similar messages
      2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:50:12 [ 1121.033744] Lustre: Skipped 735 previous similar messages
      
      <ConMan> Console [zinc3] departed by <root@localhost> on pts/0 at 10-15 15:50.
      2016-10-15 15:50:52 [ 1161.329645] LustreError: 38991:0:(mdt_handler.c:5737:mdt_iocontrol()) lsh-MDT0002: Aborting recovery for device
      2016-10-15 15:50:52 [ 1161.341983] LustreError: 38991:0:(ldlm_lib.c:2565:target_stop_recovery_thread()) lsh-MDT0002: Aborting recovery
      2016-10-15 15:50:52 [ 1161.343686] LustreError: 18435:0:(lod_dev.c:419:lod_sub_recovery_thread()) lsh-MDT0004-osp-MDT0002 getting update log failed: rc = -108
      2016-10-15 15:50:52 [ 1161.377751] Lustre: 18461:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery
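For reference, the abort sequence above is normally a two-step lctl operation. A minimal sketch follows; the device name lsh-MDT0002 is taken from the log above, and the helper function itself is hypothetical (it only prints the commands, so it is safe to run anywhere):

```shell
#!/bin/sh
# Hypothetical helper: print the lctl commands an operator would run to
# inspect, and if necessary force-abort, recovery on a stuck MDT.
mdt_recovery_cmds() {
    dev="$1"
    # Show per-target recovery state (status, connected clients, time left).
    echo "lctl get_param mdt.${dev}.recovery_status"
    # Force-abort recovery on the device, evicting unrecovered clients.
    echo "lctl --device ${dev} abort_recovery"
}

mdt_recovery_cmds lsh-MDT0002
```

On a live MDS you would run the printed commands directly rather than echoing them.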
      

      The earliest such messages are:

      2016-10-15 15:50:52 [ 1161.390842] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32
      2016-10-15 15:50:52 [ 1161.408040] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056931 batchid = 35542 flags = 0 ops = 42 params = 32
      

      The last few are:

      2016-10-15 15:52:11 [ 1240.343780] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064355 batchid = 39987 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.361375] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064356 batchid = 39999 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.378995] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064357 batchid = 40018 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.396579] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064358 batchid = 40011 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.414180] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32
      

      We have seen this type of behavior on multiple DNE filesystems. Also, is there any way to determine if these errors have been corrected, abandoned, etc?
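As a rough triage step for the question above, the dumped update records in the attached console logs can be summarized by counting the update_records_dump lines and reporting the transno range seen; this is a sketch, and the log filenames you feed it are placeholders:

```shell
#!/bin/sh
# Summarize update_records_dump errors from console-log text on stdin:
# how many records were dumped and the first/last master transno seen.
# Usage (filename is a placeholder): summarize_dumps < console.log
summarize_dumps() {
    grep 'update_records_dump' |
        # Extract the numeric master transno from each matching line.
        sed -n 's/.*master transno = \([0-9][0-9]*\).*/\1/p' |
        sort -n |
        awk 'NR == 1 { first = $0 } { last = $0; n++ }
             END { print n " records, transno " first ".." last }'
}
```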

      Attachments

        1. 0x48000a04b-0x1-0x0.tgz
          106 kB
        2. console_logs.nov28.tgz
          18 kB
        3. console.jet11.2016-12-13-14-47
          14 kB
        4. console.jet7.gz
          1.12 MB
        5. console.since-dec13.tgz
          1.71 MB
        6. console.zinc11.2016-12-19
          169 kB
        7. dk.jet1.1478223101.gz
          596 kB
        8. dk.jet1.1478565846.gz
          681 kB
        9. dk.recovery_stuck.jet7.1477593159.gz
          53 kB
        10. dk.recovery_stuck.jet7.1477593344.gz
          7 kB
        11. dk.zinc1.1480375634.gz
          12.12 MB
        12. dk.zinc13.1480375634.gz
          13.32 MB
        13. dk.zinc7.1480375634.gz
          13.52 MB
        14. logs.2016-11-14.tgz
          12.23 MB
        15. lsh-mdt000c-1b70.nov28.tgz
          6.81 MB
        16. lustre.log.gz
          4.33 MB
        17. mdt09.0x240019a58_0x6_0x0.tgz
          12.23 MB
        18. mdt0b.0x240019a58_0x6_0x0.tgz
          12.23 MB
        19. target_to_node_map.nov28.txt
          0.3 kB

        Issue Links

          Activity

            [LU-8753] Recovery already passed deadline with DNE
            pjones Peter Jones added a comment -

            As per Di, the remaining patch was a debug-only patch. This issue is fixed for 2.10 and the patches will be backported to maintenance releases.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24364/
            Subject: LU-8753 osp: add rpc generation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0844905a308d614c86b56df70c8f03e5d59ee286

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24008/
            Subject: LU-8753 llog: remove lgh_write_offset
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f36daac69fe6e0cd35e2369967f4bae11bd2666f
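To check whether a given build already carries the two fixes merged above, the commit hashes from the Gerrit messages can be tested directly; a sketch, assuming a fs/lustre-release git checkout (outside one, both commits will simply report missing):

```shell
#!/bin/sh
# Check whether the two commits merged for LU-8753 are ancestors of the
# current HEAD of a fs/lustre-release checkout.
check_fix_commits() {
    for c in 0844905a308d614c86b56df70c8f03e5d59ee286 \
             f36daac69fe6e0cd35e2369967f4bae11bd2666f; do
        # merge-base --is-ancestor exits 0 only if $c is reachable from HEAD.
        if git merge-base --is-ancestor "$c" HEAD 2>/dev/null; then
            echo "$c present"
        else
            echo "$c missing"
        fi
    done
}

check_fix_commits
```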

            di.wang Di Wang (Inactive) added a comment -

            Hi Olaf,

            So the current corruption happened when MDT0006 retrieved the update log from MDT000a (from console.since-dec13):

            2016-12-13 13:49:38 [336387.573280] Lustre: 86734:0:(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80
            2016-12-13 13:49:38 [336387.585565] LustreError: 86734:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000a-osp-MDT0006 getting update log failed: rc = -22

            Do you still have the console log from MDT0006 for the last run? I want to check whether this corrupted log was hit in the last recovery. Thanks.

            di.wang Di Wang (Inactive) added a comment -

            According to the console log, it looks like recovery was stuck but succeeded, as you said. This is good but strange:

            2016-12-19 18:09:35 [ 9014.403506] Lustre: lquake-MDT000a: Recovery already passed deadline 12:36. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
            2016-12-19 18:09:35 [ 9014.426106] Lustre: Skipped 60 previous similar messages
            2016-12-19 18:10:40 [ 9079.948790] Lustre: lquake-MDT000a: Recovery already passed deadline 11:30. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
            2016-12-19 18:10:40 [ 9079.971290] Lustre: Skipped 63 previous similar messages
            2016-12-19 18:11:05 [ 9104.433087] Lustre: lquake-MDT000a: Connection restored to 172.19.1.111@o2ib100 (at 172.19.1.111@o2ib100)
            2016-12-19 18:11:05 [ 9104.444864] Lustre: Skipped 130 previous similar messages
            2016-12-19 18:12:35 [ 9194.452689] Lustre: lquake-MDT000a: Recovery already passed deadline 9:35. If you do not want to wait more, please abort the recovery by force.
            2016-12-19 18:12:36 [ 9195.708675] Lustre: lquake-MDT000a: Recovery already passed deadline 9:34. If you do not want to wait more, please abort the recovery by force.
            2016-12-19 18:12:36 [ 9195.725009] Lustre: Skipped 7 previous similar messages
            2016-12-19 18:12:37 [ 9196.061115] Lustre: lquake-MDT000a: Recovery over after 5:26, of 70 clients 70 recovered and 0 were evicted.

            Please try 24364 + 24008; I hope this resolves all of these corrupt update log issues, and that these recovery troubles will go away. Thanks.

            ofaaland Olaf Faaland added a comment -

            Di Wang asked: "Hmm, there is already a corrupted update log. Ah, did you reformat the FS? If not, I would expect recovery to get stuck. Do you have the console log from MDT000a? Thanks."

            This FS (lquake) was last reformatted November 21. I expected recovery to get stuck as well, given the presence of the corrupted update log; I don't know why it did not. I've attached the console log; see console.zinc11.2016-12-19.

            di.wang Di Wang (Inactive) added a comment -

            Olaf: I just added another fix to this ticket, https://review.whamcloud.com/24364; hopefully this resolves the update log corruption issue. Please try it. Thanks.

            People

              Assignee: laisiyao Lai Siyao
              Reporter: dinatale2 Giuseppe Di Natale (Inactive)
              Votes: 0
              Watchers: 11
