Lustre / LU-2206

OSS stuck in recovery

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.4.0
    • Older iteration of orion branch, lustre 2.3.49.54-68chaos
    • 3
    • 5251

    Description

      We keep getting OSS nodes stuck in recovery on Sequoia's filesystem. The recovery_status file reports:

      $ cat /proc/fs/lustre/obdfilter/ls1-OST0005/recovery_status 
      status: RECOVERING
      recovery_start: 1350515102
      time_remaining: 0
      connected_clients: 357/787
      req_replay_clients: 32
      lock_repay_clients: 131
      completed_clients: 226
      evicted_clients: 430
      replayed_requests: 23
      queued_requests: 32
      next_transno: 12885381895
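
      For anyone triaging similar reports, here is a minimal sketch of how the output above could be checked automatically. The helper names (parse_recovery_status, looks_stuck) and the "stuck" heuristic are assumptions made for illustration; they are not part of Lustre. The sample text is copied from the output above.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines into a dict; values are kept as strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip blank lines and anything without a colon
            fields[key.strip()] = value.strip()
    return fields


def looks_stuck(fields):
    """Heuristic (an assumption, not a Lustre rule): the target still
    reports RECOVERING although the recovery timer has run out."""
    return (fields.get("status") == "RECOVERING"
            and int(fields.get("time_remaining", "1")) <= 0)


# Sample copied from the recovery_status output in this report.
SAMPLE = """
status: RECOVERING
recovery_start: 1350515102
time_remaining: 0
connected_clients: 357/787
req_replay_clients: 32
lock_repay_clients: 131
completed_clients: 226
evicted_clients: 430
replayed_requests: 23
queued_requests: 32
next_transno: 12885381895
"""

fields = parse_recovery_status(SAMPLE)
connected, total = (int(n) for n in fields["connected_clients"].split("/"))
evicted = int(fields["evicted_clients"])

print(looks_stuck(fields))            # True: RECOVERING with time_remaining 0
print(connected + evicted == total)   # True: 357 + 430 == 787
```

      In this sample the connected and evicted clients already add up to all 787 expected clients, yet the status is still RECOVERING with no time remaining.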
      

      On the console we're seeing clearly bad messages like:

      Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) reconnecting, waiting for 787 clients in recovery for -64:-32
      Lustre: Skipped 2006 previous similar messages
      Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 1908 previous similar messages
      

      Keep in mind, this is still the older orion branch code, our version 2.3.49.54-68chaos.

            People

              bzzz Alex Zhuravlev
              morrone Christopher Morrone (Inactive)