
Recovery already passed deadline with DNE

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.0
    • None
    • Environment: lustre-2.8.0_3.chaos-1.ch6.x86_64
      16 MDTs
    • Severity: 3

    Description

      MDT[0-1,6-16] (decimal) have timed out of recovery; approximately 1473 clients recovered, 1 evicted.
      MDT[2-5] reach the timeout, and report in the log that recovery has hung and should be aborted. After lctl abort_recovery, the nodes begin emitting large numbers of errors in the console log. The nodes are up, but mrsh into them hangs, as if they are too busy to service the mrsh session.

      2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:49:40 [ 1088.899333] Lustre: Skipped 157 previous similar messages
      2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:50:12 [ 1121.033744] Lustre: Skipped 735 previous similar messages
      
      <ConMan> Console [zinc3] departed by <root@localhost> on pts/0 at 10-15 15:50.
      2016-10-15 15:50:52 [ 1161.329645] LustreError: 38991:0:(mdt_handler.c:5737:mdt_iocontrol()) lsh-MDT0002: Aborting recovery for device
      2016-10-15 15:50:52 [ 1161.341983] LustreError: 38991:0:(ldlm_lib.c:2565:target_stop_recovery_thread()) lsh-MDT0002: Aborting recovery
      2016-10-15 15:50:52 [ 1161.343686] LustreError: 18435:0:(lod_dev.c:419:lod_sub_recovery_thread()) lsh-MDT0004-osp-MDT0002 getting update log failed: rc = -108
      2016-10-15 15:50:52 [ 1161.377751] Lustre: 18461:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      

      The earliest such messages are:

      2016-10-15 15:50:52 [ 1161.390842] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32
      2016-10-15 15:50:52 [ 1161.408040] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056931 batchid = 35542 flags = 0 ops = 42 params = 32
      

      The last few are:

      2016-10-15 15:52:11 [ 1240.343780] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064355 batchid = 39987 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.361375] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064356 batchid = 39999 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.378995] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064357 batchid = 40018 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.396579] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064358 batchid = 40011 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.414180] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32
      

      We have seen this type of behavior on multiple DNE filesystems. Also, is there any way to determine whether these errors have been corrected, abandoned, etc.?
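
      For reference, a minimal sketch of the commands used above to watch and, if necessary, abort recovery on an MDS; the target name is only an example taken from the logs in this ticket, and parameter names may vary by Lustre version:

      # Show recovery state for every MDT mounted on this MDS.
      lctl get_param mdt.*.recovery_status

      # Watch only the interesting fields until recovery completes or stalls.
      watch -n 30 "lctl get_param mdt.*.recovery_status | egrep 'status|completed_clients|evicted_clients'"

      # If a target is confirmed stuck (as in the console messages above),
      # abort its recovery explicitly:
      lctl --device lsh-MDT0002 abort_recovery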

      Attachments

        1. 0x48000a04b-0x1-0x0.tgz
          106 kB
          Olaf Faaland
        2. console_logs.nov28.tgz
          18 kB
          Olaf Faaland
        3. console.jet11.2016-12-13-14-47
          14 kB
          Olaf Faaland
        4. console.jet7.gz
          1.12 MB
          Olaf Faaland
        5. console.since-dec13.tgz
          1.71 MB
          Olaf Faaland
        6. console.zinc11.2016-12-19
          169 kB
          Olaf Faaland
        7. dk.jet1.1478223101.gz
          596 kB
          Olaf Faaland
        8. dk.jet1.1478565846.gz
          681 kB
          Olaf Faaland
        9. dk.recovery_stuck.jet7.1477593159.gz
          53 kB
          Olaf Faaland
        10. dk.recovery_stuck.jet7.1477593344.gz
          7 kB
          Olaf Faaland
        11. dk.zinc1.1480375634.gz
          12.12 MB
          Olaf Faaland
        12. dk.zinc13.1480375634.gz
          13.32 MB
          Olaf Faaland
        13. dk.zinc7.1480375634.gz
          13.52 MB
          Olaf Faaland
        14. logs.2016-11-14.tgz
          12.23 MB
          Olaf Faaland
        15. lsh-mdt000c-1b70.nov28.tgz
          6.81 MB
          Olaf Faaland
        16. lustre.log.gz
          4.33 MB
          Olaf Faaland
        17. mdt09.0x240019a58_0x6_0x0.tgz
          12.23 MB
          Giuseppe Di Natale
        18. mdt0b.0x240019a58_0x6_0x0.tgz
          12.23 MB
          Olaf Faaland
        19. target_to_node_map.nov28.txt
          0.3 kB
          Olaf Faaland


          Activity

            [LU-8753] Recovery already passed deadline with DNE
            pjones Peter Jones added a comment -

            As per Di, the remaining patch was a debug-only patch. This issue is fixed for 2.10 and the patches will be backported to maintenance releases.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24364/
            Subject: LU-8753 osp: add rpc generation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0844905a308d614c86b56df70c8f03e5d59ee286


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24008/
            Subject: LU-8753 llog: remove lgh_write_offset
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f36daac69fe6e0cd35e2369967f4bae11bd2666f
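
            For anyone checking whether a local build already carries both fixes, a small sketch (assuming a clone of fs/lustre-release) that tests whether the merged commits are ancestors of the current branch:

            # Check that the two LU-8753 commits are present in this checkout.
            for c in 0844905a308d614c86b56df70c8f03e5d59ee286 \
                     f36daac69fe6e0cd35e2369967f4bae11bd2666f; do
                if git merge-base --is-ancestor "$c" HEAD 2>/dev/null; then
                    echo "present: $c"
                else
                    echo "missing: $c"
                fi
            done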


            di.wang Di Wang (Inactive) added a comment -

            Hi, Olaf

            So the current corruption happened when MDT0006 retrieved the update log from MDT000a (from console.since-dec13):

            2016-12-13 13:49:38 [336387.573280] Lustre: 86734:0:(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80
            2016-12-13 13:49:38 [336387.585565] LustreError: 86734:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000a-osp-MDT0006 getting update log failed: rc = -22
            

            Do you still have the console log on MDT0006 for the last run? I want to check whether this corrupt log was hit during the last recovery. Thanks.
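
            In case it helps, a small sketch of the kind of search that can answer this, assuming the MDT0006 console logs are available as plain files (the path is a placeholder); the two patterns are taken verbatim from the messages quoted above:

            # Look for earlier hits of the corrupted update log on MDT0006.
            grep -H "invalid length 0 in llog record" /path/to/console.mdt0006*
            grep -H "lquake-MDT000a-osp-MDT0006 getting update log failed" /path/to/console.mdt0006*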


            di.wang Di Wang (Inactive) added a comment -

            According to the console log, it looks like recovery was stuck but then succeeded, as you said. This is good but strange:

            2016-12-19 18:09:35 [ 9014.403506] Lustre: lquake-MDT000a: Recovery already passed deadline 12:36. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
            2016-12-19 18:09:35 [ 9014.426106] Lustre: Skipped 60 previous similar messages
            2016-12-19 18:10:40 [ 9079.948790] Lustre: lquake-MDT000a: Recovery already passed deadline 11:30. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
            2016-12-19 18:10:40 [ 9079.971290] Lustre: Skipped 63 previous similar messages
            2016-12-19 18:11:05 [ 9104.433087] Lustre: lquake-MDT000a: Connection restored to 172.19.1.111@o2ib100 (at 172.19.1.111@o2ib100)
            2016-12-19 18:11:05 [ 9104.444864] Lustre: Skipped 130 previous similar messages
            2016-12-19 18:12:35 [ 9194.452689] Lustre: lquake-MDT000a: Recovery already passed deadline 9:35. If you do not want to wait more, please abort the recovery by force.
            2016-12-19 18:12:36 [ 9195.708675] Lustre: lquake-MDT000a: Recovery already passed deadline 9:34. If you do not want to wait more, please abort the recovery by force.
            2016-12-19 18:12:36 [ 9195.725009] Lustre: Skipped 7 previous similar messages
            2016-12-19 18:12:37 [ 9196.061115] Lustre: lquake-MDT000a: Recovery over after 5:26, of 70 clients 70 recovered and 0 were evicted.
            

            Please try 24364 + 24008; I hope this resolves all of these corrupt update log issues, and that all of these recovery troubles will go away. Thanks.
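
            A sketch of one way the two changes can be pulled from Gerrit for testing; the patch-set numbers are not listed in this ticket, so they are left as variables to be filled in from the review pages, and the fetch URL form is an assumption:

            # Fetch and cherry-pick the two proposed changes into a lustre-release clone.
            PS_24364=${PS_24364:?set to the current patch set of change 24364}
            PS_24008=${PS_24008:?set to the current patch set of change 24008}
            cd lustre-release || exit 1
            git fetch https://review.whamcloud.com/fs/lustre-release "refs/changes/64/24364/${PS_24364}" \
                && git cherry-pick FETCH_HEAD
            git fetch https://review.whamcloud.com/fs/lustre-release "refs/changes/08/24008/${PS_24008}" \
                && git cherry-pick FETCH_HEAD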


            ofaaland Olaf Faaland added a comment -

            > Hmm, there is already a corrupted update log. Ah, did you reformat the FS? If not, then I would expect recovery to get stuck. Do you have the console log on MDT000a? Thanks.

            This FS (lquake) was last reformatted November 21. I expected recovery to get stuck as well, given the presence of the corrupted update log. I don't know why it did not. I've attached the console log; see console.zinc11.2016-12-19.


            di.wang Di Wang (Inactive) added a comment -

            Olaf: I just added another fix to this ticket, https://review.whamcloud.com/24364 ; hopefully this can resolve the update log corruption issue. Please try it. Thanks.


            di.wang Di Wang (Inactive) added a comment -

            Hmm, there is already a corrupted update log. Ah, did you reformat the FS? If not, then I would expect recovery to get stuck. Do you have the console log on MDT000a? Thanks.

            ofaaland Olaf Faaland added a comment -

            Di,

            I bounced servers or stopped and restarted Lustre several times today, in varying combinations. MDT000a still seems to take longer to connect than the others; several times the recovery_status procfile showed MDTs waiting for MDT000a. However, within a few minutes it seems to complete recovery successfully.

            I have the debug logs from one of these attempts, and can upload them if that's helpful. Let me know.
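
            For reference, a minimal sketch of how such debug logs can be captured on a server (matching the dk.*.gz attachments on this ticket); the debug buffer size is only an example:

            # Enlarge the debug buffer, then dump and compress it on this node.
            lctl set_param debug_mb=512
            out=/tmp/dk.$(hostname -s).$(date +%s)
            lctl dk "$out"      # dumps and clears the kernel debug buffer
            gzip "$out" && echo "wrote ${out}.gz"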


            di.wang Di Wang (Inactive) added a comment -

            Hi, Olaf

            Thanks for uploading the file. It looks like some update log records were not written, which leaves a hole there:

            master transno = 60130033439 batchid = 55834779575 flags = 0 ops = 172 params = 84 rec_len 8520
            Bit 78 of 4 not set offset 491520
            offset 500040 index 79 type 106a0000
            master transno = 60130041273 batchid = 55834779576 flags = 0 ops = 88 params = 21 rec_len 3048
            rec #79 type=106a0000 len=3048 offset 500040, total 1
            offset 503088 index 0 type 0
            master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
            off 503088 skip 8520 to next chunk. test_bit yes      ---->>> skip 8520 bytes, then it is valid again.
            offset 511608 index 81 type 106a0000
            master transno = 60130043788 batchid = 55834779578 flags = 0 ops = 18 params = 3 rec_len 728
            Bit 81 of 4 not set offset 511608
            offset 512336 index 82 type 106a0000
            master transno = 60130043789 batchid = 55834779579 flags = 0 ops = 18 params = 3 rec_len 728
            Bit 82 of 4 not set offset 512336
            offset 513064 index 83 type 106a0000
            master transno = 60130043798 batchid = 55834779580 flags = 0 ops = 88 params = 21 rec_len 2968
            Bit 83 of 4 not set offset 513064
            offset 516032 index 84 type 106a0000
            

            So it looks like an update record (rec_len = 8520) was being written, but the following write was not cancelled, so it caused a hole in the update log, which caused the issue.
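
            To make such holes easier to spot, a small sketch that scans a saved dump in the format pasted above for empty records and skipped chunks; it assumes exactly that text layout:

            # Flag holes in an update_records dump saved as a text file.
            dump=update_records.dump   # placeholder path for the saved dump
            awk '
                /rec_len 0($| )/        { print "empty record near: " prev }
                /skip .* to next chunk/ { print "hole: " $0 }
                { prev = $0 }
            ' "$dump"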


            People

              Assignee: laisiyao Lai Siyao
              Reporter: dinatale2 Giuseppe Di Natale (Inactive)
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: