LU-8753

Recovery already passed deadline with DNE

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • None
    • Environment: lustre-2.8.0_3.chaos-1.ch6.x86_64, 16 MDTs
    • 3

    Description

      MDT[0-1,6-16] (decimal) have timed out of recovery; approximately 1473 clients recovered and 1 was evicted.
      MDT[2-5] reach the timeout and report in the log that recovery has hung and should be aborted. After lctl abort_recovery, the nodes begin emitting large numbers of errors in the console log. The nodes are up, but mrsh sessions into them hang, as if the nodes are too busy to service them.

      2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:49:40 [ 1088.899333] Lustre: Skipped 157 previous similar messages
      2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:50:12 [ 1121.033744] Lustre: Skipped 735 previous similar messages
      
      <ConMan> Console [zinc3] departed by <root@localhost> on pts/0 at 10-15 15:50.
      2016-10-15 15:50:52 [ 1161.329645] LustreError: 38991:0:(mdt_handler.c:5737:mdt_iocontrol()) lsh-MDT0002: Aborting recovery for device
      2016-10-15 15:50:52 [ 1161.341983] LustreError: 38991:0:(ldlm_lib.c:2565:target_stop_recovery_thread()) lsh-MDT0002: Aborting recovery
      2016-10-15 15:50:52 [ 1161.343686] LustreError: 18435:0:(lod_dev.c:419:lod_sub_recovery_thread()) lsh-MDT0004-osp-MDT0002 getting update log failed: rc = -108
      2016-10-15 15:50:52 [ 1161.377751] Lustre: 18461:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery
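
When several MDTs are stuck at once, it can help to reduce the "Recovery already passed deadline" messages to one line per target. The following is a minimal sketch, not a Lustre tool: the function name and regular expressions are hypothetical and are based only on the message shape shown in the excerpt above. It assumes the deadline field is minutes:seconds (consistent with the two reports above, which are 32 seconds apart and advance from 0:32 to 1:04) and that each "Skipped N previous similar messages" notice refers to the message printed just before it.

```python
import re

# Captures the target name and the minutes:seconds past the recovery deadline.
DEADLINE_RE = re.compile(
    r"Lustre: (?P<target>\S+): Recovery already passed deadline "
    r"(?P<min>\d+):(?P<sec>\d+)"
)
# Captures the count from the console rate-limiting notice.
SKIPPED_RE = re.compile(r"Lustre: Skipped (?P<count>\d+) previous similar messages")


def summarize_overdue_recovery(lines):
    """Return {target: {"max_overdue_s", "reports", "skipped"}} (hypothetical helper)."""
    summary = {}
    last_target = None
    for line in lines:
        m = DEADLINE_RE.search(line)
        if m:
            last_target = m.group("target")
            overdue = int(m.group("min")) * 60 + int(m.group("sec"))
            entry = summary.setdefault(
                last_target, {"max_overdue_s": 0, "reports": 0, "skipped": 0}
            )
            entry["max_overdue_s"] = max(entry["max_overdue_s"], overdue)
            entry["reports"] += 1
            continue
        m = SKIPPED_RE.search(line)
        if m and last_target is not None:
            # Assumption: the notice belongs to the most recent deadline message.
            summary[last_target]["skipped"] += int(m.group("count"))
    return summary


# Lines from the excerpt above, truncated after the deadline field.
SAMPLE = [
    "2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32,",
    "2016-10-15 15:49:40 [ 1088.899333] Lustre: Skipped 157 previous similar messages",
    "2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04,",
    "2016-10-15 15:50:12 [ 1121.033744] Lustre: Skipped 735 previous similar messages",
]

print(summarize_overdue_recovery(SAMPLE))
# {'lsh-MDT0002': {'max_overdue_s': 64, 'reports': 2, 'skipped': 892}}
```

Run against the full console logs of all MDS nodes, a summary like this would show which targets are still past their deadline and by how much at the time recovery is aborted.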
      

      The earliest of the errors emitted after the abort are:

      2016-10-15 15:50:52 [ 1161.390842] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32
      2016-10-15 15:50:52 [ 1161.408040] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056931 batchid = 35542 flags = 0 ops = 42 params = 32
      

      The last few are:

      2016-10-15 15:52:11 [ 1240.343780] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064355 batchid = 39987 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.361375] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064356 batchid = 39999 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.378995] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064357 batchid = 40018 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.396579] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064358 batchid = 40011 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.414180] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32
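
To gauge the size of the update_records_dump() flood without reading every line, the messages can be summarized by count, transno range, and elapsed time. This is a rough sketch under stated assumptions: the helper name and regular expression are hypothetical, derived only from the message format visible in the excerpts above, and the bracketed number is taken to be kernel uptime in seconds.

```python
import re

# Captures the bracketed kernel timestamp plus the transno and batchid fields
# of an update_records_dump() console message.
UPDATE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\].*?update_records_dump\(\)\) "
    r"master transno = (?P<transno>\d+) batchid = (?P<batchid>\d+)"
)


def summarize_update_dumps(lines):
    """Summarize update_records_dump() messages (hypothetical helper).

    Returns None when no matching line is found.
    """
    uptimes, transnos = [], []
    for line in lines:
        m = UPDATE_RE.search(line)
        if m:
            uptimes.append(float(m.group("uptime")))
            transnos.append(int(m.group("transno")))
    if not transnos:
        return None
    return {
        "count": len(transnos),
        "min_transno": min(transnos),
        "max_transno": max(transnos),
        # Number of transno values between the smallest and largest, inclusive.
        "transno_range": max(transnos) - min(transnos) + 1,
        # Seconds between the first and last matching message.
        "elapsed_s": max(uptimes) - min(uptimes),
    }


# Three of the lines quoted above: the earliest and the last two.
PREFIX = "LustreError: 18461:0:(update_records.c:72:update_records_dump()) "
SAMPLE = [
    "2016-10-15 15:50:52 [ 1161.390842] " + PREFIX
    + "master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32",
    "2016-10-15 15:52:11 [ 1240.396579] " + PREFIX
    + "master transno = 4295064358 batchid = 40011 flags = 0 ops = 42 params = 32",
    "2016-10-15 15:52:11 [ 1240.414180] " + PREFIX
    + "master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32",
]

summary = summarize_update_dumps(SAMPLE)
print(summary["count"], summary["min_transno"], summary["max_transno"])
```

For the excerpts quoted here, the messages span roughly 79 seconds of console output and a transno range of 7435 values; the full log would be needed to tell how many distinct records were actually dumped.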
      

      We have seen this type of behavior on multiple DNE filesystems. Also, is there any way to determine whether these errors have been corrected, abandoned, etc.?

      Attachments

        1. 0x48000a04b-0x1-0x0.tgz (106 kB, Olaf Faaland)
        2. console_logs.nov28.tgz (18 kB, Olaf Faaland)
        3. console.jet11.2016-12-13-14-47 (14 kB, Olaf Faaland)
        4. console.jet7.gz (1.12 MB, Olaf Faaland)
        5. console.since-dec13.tgz (1.71 MB, Olaf Faaland)
        6. console.zinc11.2016-12-19 (169 kB, Olaf Faaland)
        7. dk.jet1.1478223101.gz (596 kB, Olaf Faaland)
        8. dk.jet1.1478565846.gz (681 kB, Olaf Faaland)
        9. dk.recovery_stuck.jet7.1477593159.gz (53 kB, Olaf Faaland)
        10. dk.recovery_stuck.jet7.1477593344.gz (7 kB, Olaf Faaland)
        11. dk.zinc1.1480375634.gz (12.12 MB, Olaf Faaland)
        12. dk.zinc13.1480375634.gz (13.32 MB, Olaf Faaland)
        13. dk.zinc7.1480375634.gz (13.52 MB, Olaf Faaland)
        14. logs.2016-11-14.tgz (12.23 MB, Olaf Faaland)
        15. lsh-mdt000c-1b70.nov28.tgz (6.81 MB, Olaf Faaland)
        16. lustre.log.gz (4.33 MB, Olaf Faaland)
        17. mdt09.0x240019a58_0x6_0x0.tgz (12.23 MB, Giuseppe Di Natale)
        18. mdt0b.0x240019a58_0x6_0x0.tgz (12.23 MB, Olaf Faaland)
        19. target_to_node_map.nov28.txt (0.3 kB, Olaf Faaland)

            People

              laisiyao Lai Siyao
              dinatale2 Giuseppe Di Natale (Inactive)
              Votes: 0
              Watchers: 11
