Lustre / LU-8753

Recovery already passed deadline with DNE

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • None
    • lustre-2.8.0_3.chaos-1.ch6.x86_64
      16 MDTs
    • 3

    Description

      MDT[0-1,6-16] (decimal) have timed out of recovery; approximately 1473 clients recovered and 1 was evicted.
      MDT[2-5] reach the timeout and report in the log that recovery has hung and should be aborted. After lctl abort_recovery, the nodes begin emitting large numbers of errors in the console log. The nodes are up, but mrsh sessions into them hang, as if the nodes are too busy to service them.
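
      The indices above are decimal, while the target names in the console excerpts (for example lsh-MDT0002) carry a four-digit hexadecimal index. As a minimal sketch for mapping one to the other, assuming the fsname lsh seen in the logs: expand_mdt_targets is a hypothetical helper written for illustration, not a Lustre tool.

```python
def expand_mdt_targets(fsname, spec):
    """Expand a decimal index spec such as '0-1,6-16' into MDT target names.

    Hypothetical helper: assumes names follow <fsname>-MDT<4 hex digits>.
    """
    names = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        start = int(lo)
        end = int(hi) if hi else start  # a bare index is a one-element range
        for idx in range(start, end + 1):
            names.append(f"{fsname}-MDT{idx:04x}")
    return names


# The targets reported as stuck in recovery.
print(expand_mdt_targets("lsh", "2-5"))
```

      Decimal index 12 expands to lsh-MDT000c, which matches the naming of the lsh-mdt000c attachment.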

      2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:49:40 [ 1088.899333] Lustre: Skipped 157 previous similar messages
      2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:50:12 [ 1121.033744] Lustre: Skipped 735 previous similar messages
      
      <ConMan> Console [zinc3] departed by <root@localhost> on pts/0 at 10-15 15:50.
      2016-10-15 15:50:52 [ 1161.329645] LustreError: 38991:0:(mdt_handler.c:5737:mdt_iocontrol()) lsh-MDT0002: Aborting recovery for device
      2016-10-15 15:50:52 [ 1161.341983] LustreError: 38991:0:(ldlm_lib.c:2565:target_stop_recovery_thread()) lsh-MDT0002: Aborting recovery
      2016-10-15 15:50:52 [ 1161.343686] LustreError: 18435:0:(lod_dev.c:419:lod_sub_recovery_thread()) lsh-MDT0004-osp-MDT0002 getting update log failed: rc = -108
      2016-10-15 15:50:52 [ 1161.377751] Lustre: 18461:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      

      The earliest such messages are:

      2016-10-15 15:50:52 [ 1161.390842] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32
      2016-10-15 15:50:52 [ 1161.408040] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056931 batchid = 35542 flags = 0 ops = 42 params = 32
      

      The last few are:

      2016-10-15 15:52:11 [ 1240.343780] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064355 batchid = 39987 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.361375] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064356 batchid = 39999 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.378995] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064357 batchid = 40018 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.396579] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064358 batchid = 40011 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.414180] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32
      
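
      To get a feel for how many update records were dumped and what transno range they cover, the update_records_dump() lines can be summarized from a saved console log. This is only a sketch: the regular expression is derived from the lines quoted above, summarize_dump is a hypothetical helper, and the result says nothing about whether the updates were later replayed or abandoned.

```python
import re

# Pattern derived from the console lines quoted in this ticket (an assumption,
# not taken from the Lustre source).
DUMP_RE = re.compile(
    r"update_records_dump\(\)\) master transno = (?P<transno>\d+) "
    r"batchid = (?P<batchid>\d+) flags = (?P<flags>\d+) "
    r"ops = (?P<ops>\d+) params = (?P<params>\d+)"
)


def summarize_dump(lines):
    """Return the record count and transno range, or None if nothing matched."""
    transnos = []
    for line in lines:
        m = DUMP_RE.search(line)
        if m:
            transnos.append(int(m.group("transno")))
    if not transnos:
        return None
    return {
        "count": len(transnos),
        "min_transno": min(transnos),
        "max_transno": max(transnos),
    }


sample = [
    "2016-10-15 15:50:52 [ 1161.390842] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32",
    "2016-10-15 15:52:11 [ 1240.414180] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32",
]
print(summarize_dump(sample))
```

      On a real console log, pass an open file object instead of the sample list.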

      We have seen this type of behavior on multiple DNE filesystems. Also, is there any way to determine whether these errors have been corrected, abandoned, etc.?
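
      When comparing occurrences across filesystems, it can help to extract how far past the deadline each target was when it logged "Recovery already passed deadline". The sketch below assumes the value is printed as minutes:seconds, which is consistent with the two quoted messages (32 seconds apart by timestamp, 0:32 and 1:04 past the deadline); overdue_seconds is a hypothetical helper.

```python
import re

# Assumes the deadline overrun is formatted as minutes:seconds.
DEADLINE_RE = re.compile(
    r"(?P<target>\S+-MDT[0-9a-fA-F]{4}): Recovery already passed deadline "
    r"(?P<min>\d+):(?P<sec>\d+)"
)


def overdue_seconds(lines):
    """Return (target, seconds past deadline) for each matching console line."""
    found = []
    for line in lines:
        m = DEADLINE_RE.search(line)
        if m:
            seconds = int(m.group("min")) * 60 + int(m.group("sec"))
            found.append((m.group("target"), seconds))
    return found


sample = [
    "2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.",
    "2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.",
]
for target, seconds in overdue_seconds(sample):
    print(target, seconds)
```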

      Attachments

        1. target_to_node_map.nov28.txt (0.3 kB)
        2. mdt0b.0x240019a58_0x6_0x0.tgz (12.23 MB)
        3. mdt09.0x240019a58_0x6_0x0.tgz (12.23 MB)
        4. lustre.log.gz (4.33 MB)
        5. lsh-mdt000c-1b70.nov28.tgz (6.81 MB)
        6. logs.2016-11-14.tgz (12.23 MB)
        7. dk.zinc7.1480375634.gz (13.52 MB)
        8. dk.zinc13.1480375634.gz (13.32 MB)
        9. dk.zinc1.1480375634.gz (12.12 MB)
        10. dk.recovery_stuck.jet7.1477593344.gz (7 kB)
        11. dk.recovery_stuck.jet7.1477593159.gz (53 kB)
        12. dk.jet1.1478565846.gz (681 kB)
        13. dk.jet1.1478223101.gz (596 kB)
        14. console.zinc11.2016-12-19 (169 kB)
        15. console.since-dec13.tgz (1.71 MB)
        16. console.jet7.gz (1.12 MB)
        17. console.jet11.2016-12-13-14-47 (14 kB)
        18. console_logs.nov28.tgz (18 kB)
        19. 0x48000a04b-0x1-0x0.tgz (106 kB)

            People

              laisiyao Lai Siyao
              dinatale2 Giuseppe Di Natale (Inactive)