Lustre / LU-10842

Recovery stalls when target is failed over to failover partner

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.3
    • Severity: 3

    Description

      On our testbed filesystem lquake, I forced an OST to fail over by unmounting OST0000 from jet17 and mounting it on jet18. The target mounted successfully on the failover node, but the node appears to be stuck recovering the newly acquired OST. Below is some information I collected. The system is still stuck in this recovery state, in case anyone needs more information.

      [root@jet18:~]# cat /proc/fs/lustre/obdfilter/lquake-OST0000/recovery_status 
      status: RECOVERING
      recovery_start: 0
      time_remaining: 0
      connected_clients: 0/91
      req_replay_clients: 0
      lock_repay_clients: 0
      completed_clients: 0
      evicted_clients: 0
      replayed_requests: 0
      queued_requests: 0
      next_transno: 352189672717
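
For anyone checking for the same state on other targets, here is a minimal sketch that parses the recovery_status text shown above and flags the pattern seen here (RECOVERING, recovery_start of 0, no clients connected). The helper names and the "stalled" heuristic are illustrative assumptions, not part of Lustre.

```python
# Sample text copied from the recovery_status output in this report.
SAMPLE = """\
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/91
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 352189672717
"""


def parse_recovery_status(text):
    """Turn 'key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_stalled(fields):
    """Heuristic (an assumption, not a Lustre rule): the target reports
    RECOVERING, the recovery timer never started, and none of the
    expected clients has connected."""
    connected, _, expected = fields.get("connected_clients", "0/0").partition("/")
    return (
        fields.get("status") == "RECOVERING"
        and fields.get("recovery_start") == "0"
        and int(connected) == 0
        and int(expected or 0) > 0
    )


if __name__ == "__main__":
    print(looks_stalled(parse_recovery_status(SAMPLE)))  # True for the output above
```

On a live server the text would come from reading the recovery_status file for the target instead of the embedded sample.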
      

      Clients are repeatedly logging the following:

      [Thu Mar 22 10:13:16 2018] Lustre: 18447:0:(client.c:2109:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521738740/real 1521738740]  req@ffff881ff260dd00 x1595482252012848/t0(0) o8->lquake-OST0000-osc-ffff8801688d7800@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521738795 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      

      Some MDSs are seeing the following message:

      [Thu Mar 22 09:38:26 2018] Lustre: lquake-OST0000-osc-MDT0001: Connection to lquake-OST0000 (at 172.19.1.127@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
      

      All MDSs appear to be logging the following message repeatedly:

      [Thu Mar 22 09:41:27 2018] Lustre: 16423:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521736826/real 1521736826]  req@ffff883f416cbc00 x1595592082316800/t0(0) o8->lquake-OST0000-osc-MDT0001@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521736881 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
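
To make the timeout messages easier to read, here is a rough sketch that pulls the send time, deadline, opcode, and target NID out of one of these lines. The regular expressions are my own guess at the layout based on the lines in this report, and the opcode name is my understanding of the value in lustre_idl.h (8 being OST_CONNECT), so treat both as assumptions.

```python
import re

# One of the MDS log lines from this report, unchanged.
LINE = (
    "[Thu Mar 22 09:41:27 2018] Lustre: 16423:0:(client.c:2114:ptlrpc_expire_one_request()) "
    "@@@ Request sent has timed out for slow reply: [sent 1521736826/real 1521736826]  "
    "req@ffff883f416cbc00 x1595592082316800/t0(0) "
    "o8->lquake-OST0000-osc-MDT0001@172.19.1.127@o2ib100:28/4 "
    "lens 520/544 e 0 to 1 dl 1521736881 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1"
)

# Assumed mapping; only the opcode seen in this report is listed.
OPCODE_NAMES = {8: "OST_CONNECT"}

SENT_RE = re.compile(r"\[sent (\d+)/real (\d+)\]")
TARGET_RE = re.compile(r"\bo(\d+)->(\S+?)@(\S+):(\d+)/(\d+)")
DEADLINE_RE = re.compile(r"\bdl (\d+)\b")


def parse_expired_request(line):
    """Extract a few fields from a ptlrpc_expire_one_request() message.

    Returns None if the line does not match the assumed layout."""
    sent = SENT_RE.search(line)
    target = TARGET_RE.search(line)
    deadline = DEADLINE_RE.search(line)
    if not (sent and target and deadline):
        return None
    opcode = int(target.group(1))
    sent_at = int(sent.group(1))
    deadline_at = int(deadline.group(1))
    return {
        "opcode": opcode,
        "opcode_name": OPCODE_NAMES.get(opcode, "unknown"),
        "import": target.group(2),
        "nid": target.group(3),
        "sent": sent_at,
        "deadline": deadline_at,
        "timeout_s": deadline_at - sent_at,
    }


if __name__ == "__main__":
    info = parse_expired_request(LINE)
    print(info["opcode_name"], info["nid"], info["timeout_s"])
```

For the line above this gives a 55 second gap between send time and deadline, with the request addressed to 172.19.1.127@o2ib100, the same NID the MDS reports losing its connection to.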
      

      Is there any other information you need? Will this filesystem ever recover? Will these connections ever time out?

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Giuseppe Di Natale (dinatale2, Inactive)