Lustre / LU-2206

OSS stuck in recovery

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Lustre 2.4.0
    • Older iteration of orion branch, lustre 2.3.49.54-68chaos
    • 3
    • 5251

      We keep getting OSS nodes stuck in recovery on Sequoia's filesystem. The recovery_status file reports:

      $ cat /proc/fs/lustre/obdfilter/ls1-OST0005/recovery_status 
      status: RECOVERING
      recovery_start: 1350515102
      time_remaining: 0
      connected_clients: 357/787
      req_replay_clients: 32
      lock_repay_clients: 131
      completed_clients: 226
      evicted_clients: 430
      replayed_requests: 23
      queued_requests: 32
      next_transno: 12885381895
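
A state like this could be flagged automatically by a monitoring script. The following is a minimal sketch, not part of Lustre: the function names `parse_recovery_status` and `looks_stuck` are hypothetical, and the "stuck" heuristic (status still RECOVERING while time_remaining has reached 0) is an assumption drawn only from the output above.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines into a dict; purely numeric values become ints."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        fields[key.strip()] = int(value) if value.isdigit() else value
    return fields


def looks_stuck(fields):
    # Assumed heuristic: the target is still RECOVERING although the
    # recovery timer has already run down to zero.
    return fields.get("status") == "RECOVERING" and fields.get("time_remaining") == 0


# Sample taken from the recovery_status output in this report.
SAMPLE = """\
status: RECOVERING
recovery_start: 1350515102
time_remaining: 0
connected_clients: 357/787
req_replay_clients: 32
lock_repay_clients: 131
completed_clients: 226
evicted_clients: 430
replayed_requests: 23
queued_requests: 32
next_transno: 12885381895
"""

status = parse_recovery_status(SAMPLE)
if looks_stuck(status):
    print("recovery appears stuck:", status["connected_clients"], "clients connected")
```

On a live OSS the text would be read from the recovery_status file shown above instead of the embedded sample.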
      

      On the console we're seeing clearly bad messages like:

      Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) reconnecting, waiting for 787 clients in recovery for -64:-32
      Lustre: Skipped 2006 previous similar messages
      Lustre: ls1-OST0005: Client 6307b568-c73d-a978-48ac-1fc11c345ba7 (at 172.20.11.30@o2ib500) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 1908 previous similar messages
      

      Keep in mind, this is still the older orion branch code, our version 2.3.49.54-68chaos.

            bzzz Alex Zhuravlev
            morrone Christopher Morrone (Inactive)