Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10842

Recovery stalls when target is failed over to failover partner

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • Lustre 2.10.3
    • 3
    • 9223372036854775807

    Description

      On our testbed filesystem lquake, I caused an OST to failover by unmounting OST0000 from jet17 and mounting it on jet18. The target successfully mounted on the failover node, but it appears the node is stuck recovering the newly acquired OST. Below is some information I collected. The system is stuck in the perpetual recovery state if anyone needs more information.

      [root@jet18:~]# cat /proc/fs/lustre/obdfilter/lquake-OST0000/recovery_status 
      status: RECOVERING
      recovery_start: 0
      time_remaining: 0
      connected_clients: 0/91
      req_replay_clients: 0
      lock_repay_clients: 0
      completed_clients: 0
      evicted_clients: 0
      replayed_requests: 0
      queued_requests: 0
      next_transno: 352189672717
      

      Clients are repeatedly logging the following:

      [Thu Mar 22 10:13:16 2018] Lustre: 18447:0:(client.c:2109:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521738740/real 1521738740]  req@ffff881ff260dd00 x1595482252012848/t0(0) o8->lquake-OST0000-osc-ffff8801688d7800@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521738795 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      

      Some MDSs are seeing the following message:

      [Thu Mar 22 09:38:26 2018] Lustre: lquake-OST0000-osc-MDT0001: Connection to lquake-OST0000 (at 172.19.1.127@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
      

      All MDSs appear to be logging the following message repeatedly:

      [Thu Mar 22 09:41:27 2018] Lustre: 16423:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521736826/real 1521736826]  req@ffff883f416cbc00 x1595592082316800/t0(0) o8->lquake-OST0000-osc-MDT0001@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521736881 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      

      Is there any other info that you need? Will this filesystem ever recover? Will these connections ever timeout?

      Attachments

        Activity

          [LU-10842] Recovery stalls when target is failed over to failover partner
          pjones Peter Jones added a comment -

          Yup. Things are totally locked down once tickets are moved to Closed state

          pjones Peter Jones added a comment - Yup. Things are totally locked down once tickets are moved to Closed state

          Looks like I can't edit the tags once an issue is closed. Peter, do you mind removing the topllnl tag? Thanks again for the help!

          dinatale2 Giuseppe Di Natale (Inactive) added a comment - Looks like I can't edit the tags once an issue is closed. Peter, do you mind removing the topllnl tag? Thanks again for the help!

          Andreas, looks like that was it. I set the failover nodes with lctl and all is well again. Sorry for the noise!

           

          dinatale2 Giuseppe Di Natale (Inactive) added a comment - Andreas, looks like that was it. I set the failover nodes with lctl and all is well again. Sorry for the noise!  

          Andreas, looking at the output from lctl, I think you are right. None of my MDTs are listed in the catalog file twice with different nids. I could've sworn we set it up... let me try and configure failover again. I'll post back shortly.

          dinatale2 Giuseppe Di Natale (Inactive) added a comment - - edited Andreas, looking at the output from lctl, I think you are right. None of my MDTs are listed in the catalog file twice with different nids. I could've sworn we set it up... let me try and configure failover again. I'll post back shortly.

          From the messages that are posted here, it appears that the clients and MDS are only trying to connect to jet17 (NID 172.19.1.127) and not to jet18 at all. The recovery status on jet18 also shows that no clients have tried to connect, and recovery hasn't even started for that OST. That could be an incorrect conclusion based on the three lines of console messages available here...

          My guess is that the OST0000 is not configured properly for failover. You should check the failover config for OST0000 on the MGS like: lctl --device MGS llog_print lquake-client

          adilger Andreas Dilger added a comment - From the messages that are posted here, it appears that the clients and MDS are only trying to connect to jet17 (NID 172.19.1.127) and not to jet18 at all. The recovery status on jet18 also shows that no clients have tried to connect, and recovery hasn't even started for that OST. That could be an incorrect conclusion based on the three lines of console messages available here... My guess is that the OST0000 is not configured properly for failover. You should check the failover config for OST0000 on the MGS like: lctl --device MGS llog_print lquake-client
          green Oleg Drokin added a comment -

          So it looks like OST_CONNECT mesasges destined to the failed over node don't get through? And so recovery does not start on OST either. So can you make sure networking andtargets are all doing hte expected thing? Can you do lnet pings.to the target addresses from clients?

          green Oleg Drokin added a comment - So it looks like OST_CONNECT mesasges destined to the failed over node don't get through? And so recovery does not start on OST either. So can you make sure networking andtargets are all doing hte expected thing? Can you do lnet pings.to the target addresses from clients?

          NIDs are as follows:

          jet17            172.19.1.127@o2ib100
          jet18            172.19.1.128@o2ib100 

           

          dinatale2 Giuseppe Di Natale (Inactive) added a comment - NIDs are as follows: jet17 172.19.1.127@o2ib100 jet18 172.19.1.128@o2ib100  
          ofaaland Olaf Faaland added a comment -

          What are the NIDs assigned to jet17 and jet18?

          ofaaland Olaf Faaland added a comment - What are the NIDs assigned to jet17 and jet18?

          Unmounted on jet17 and mounted on jet18 at Mar 22 09:36:33.

          Recovery status for all MDTs:

          [root@jeti:~]# pdsh -g mds 'cat /proc/fs/lustre/mdt/*/recovery_status' | dshbak -c
          ----------------
          ejet1
          ----------------
          status: COMPLETE
          recovery_start: 1521675012
          recovery_duration: 502
          completed_clients: 90/91
          replayed_requests: 3003
          last_transno: 339312468139
          VBR: ENABLED
          IR: DISABLED
          ----------------
          ejet2
          ----------------
          status: COMPLETE
          recovery_start: 1521675193
          recovery_duration: 319
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 365072220160
          VBR: DISABLED
          IR: DISABLED
          ----------------
          ejet3
          ----------------
          status: COMPLETE
          recovery_start: 1521736312
          recovery_duration: 71
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 347892350976
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet4
          ----------------
          status: COMPLETE
          recovery_start: 1521675875
          recovery_duration: 70
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 352187318272
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet5
          ----------------
          status: COMPLETE
          recovery_start: 1521675438
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 335007449088
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet6
          ----------------
          status: COMPLETE
          recovery_start: 1521675438
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 343597383680
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet7
          ----------------
          status: COMPLETE
          recovery_start: 1521675438
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 330712481792
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet8
          ----------------
          status: COMPLETE
          recovery_start: 1521675438
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 339302416384
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet9
          ----------------
          status: COMPLETE
          recovery_start: 1521675439
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 390842023936
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet10
          ----------------
          status: COMPLETE
          recovery_start: 1521675439
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 373662154752
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet[11,14]
          ----------------
          status: COMPLETE
          recovery_start: 1521675439
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 339302416384
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet12
          ----------------
          status: COMPLETE
          recovery_start: 1521675439
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 343597383680
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet13
          ----------------
          status: COMPLETE
          recovery_start: 1521675462
          recovery_duration: 59
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 335007449088
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet15
          ----------------
          status: COMPLETE
          recovery_start: 1521675439
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 468151435264
          VBR: DISABLED
          IR: ENABLED
          ----------------
          ejet16
          ----------------
          status: COMPLETE
          recovery_start: 1521675439
          recovery_duration: 75
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 356482285568
          VBR: DISABLED
          IR: ENABLED
          

          For completeness, OSS recover status:

          ----------------
          ejet18
          ----------------
          status: RECOVERING
          recovery_start: 0
          time_remaining: 0
          connected_clients: 0/91
          req_replay_clients: 0
          lock_repay_clients: 0
          completed_clients: 0
          evicted_clients: 0
          replayed_requests: 0
          queued_requests: 0
          next_transno: 352189672717
          status: COMPLETE
          recovery_start: 1521666573
          recovery_duration: 71
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 343597406813
          VBR: DISABLED
          IR: DISABLED
          ----------------
          ejet19
          ----------------
          status: COMPLETE
          recovery_start: 1521666577
          recovery_duration: 71
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 343597408098
          VBR: DISABLED
          IR: DISABLED
          ----------------
          ejet20
          ----------------
          status: COMPLETE
          recovery_start: 1521666584
          recovery_duration: 64
          completed_clients: 90/90
          replayed_requests: 0
          last_transno: 352187344607
          VBR: DISABLED
          IR: DISABLED
          
          dinatale2 Giuseppe Di Natale (Inactive) added a comment - Unmounted on jet17 and mounted on jet18 at Mar 22 09:36:33. Recovery status for all MDTs: [root@jeti:~]# pdsh -g mds 'cat /proc/fs/lustre/mdt/*/recovery_status' | dshbak -c ---------------- ejet1 ---------------- status: COMPLETE recovery_start: 1521675012 recovery_duration: 502 completed_clients: 90/91 replayed_requests: 3003 last_transno: 339312468139 VBR: ENABLED IR: DISABLED ---------------- ejet2 ---------------- status: COMPLETE recovery_start: 1521675193 recovery_duration: 319 completed_clients: 90/90 replayed_requests: 0 last_transno: 365072220160 VBR: DISABLED IR: DISABLED ---------------- ejet3 ---------------- status: COMPLETE recovery_start: 1521736312 recovery_duration: 71 completed_clients: 90/90 replayed_requests: 0 last_transno: 347892350976 VBR: DISABLED IR: ENABLED ---------------- ejet4 ---------------- status: COMPLETE recovery_start: 1521675875 recovery_duration: 70 completed_clients: 90/90 replayed_requests: 0 last_transno: 352187318272 VBR: DISABLED IR: ENABLED ---------------- ejet5 ---------------- status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 335007449088 VBR: DISABLED IR: ENABLED ---------------- ejet6 ---------------- status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597383680 VBR: DISABLED IR: ENABLED ---------------- ejet7 ---------------- status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 330712481792 VBR: DISABLED IR: ENABLED ---------------- ejet8 ---------------- status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 339302416384 VBR: DISABLED IR: ENABLED ---------------- ejet9 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 390842023936 VBR: DISABLED IR: ENABLED ---------------- ejet10 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 373662154752 VBR: DISABLED IR: ENABLED ---------------- ejet[11,14] ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 339302416384 VBR: DISABLED IR: ENABLED ---------------- ejet12 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597383680 VBR: DISABLED IR: ENABLED ---------------- ejet13 ---------------- status: COMPLETE recovery_start: 1521675462 recovery_duration: 59 completed_clients: 90/90 replayed_requests: 0 last_transno: 335007449088 VBR: DISABLED IR: ENABLED ---------------- ejet15 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 468151435264 VBR: DISABLED IR: ENABLED ---------------- ejet16 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 356482285568 VBR: DISABLED IR: ENABLED For completeness, OSS recover status: ---------------- ejet18 ---------------- status: RECOVERING recovery_start: 0 time_remaining: 0 connected_clients: 0/91 req_replay_clients: 0 lock_repay_clients: 0 completed_clients: 0 evicted_clients: 0 replayed_requests: 0 queued_requests: 0 next_transno: 352189672717 status: COMPLETE recovery_start: 1521666573 recovery_duration: 71 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597406813 VBR: DISABLED IR: DISABLED ---------------- ejet19 ---------------- status: COMPLETE recovery_start: 1521666577 recovery_duration: 71 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597408098 VBR: DISABLED IR: DISABLED ---------------- ejet20 ---------------- status: COMPLETE recovery_start: 1521666584 recovery_duration: 64 completed_clients: 90/90 replayed_requests: 0 last_transno: 352187344607 VBR: DISABLED IR: DISABLED
          ofaaland Olaf Faaland added a comment - - edited

          Joe,

          1. What date and time was it when you umounted on jet17, and when you mounted on jet18?
          2. What does "status: " say in the recovery_status file for all the MDTs?
          ofaaland Olaf Faaland added a comment - - edited Joe, What date and time was it when you umounted on jet17, and when you mounted on jet18? What does "status: " say in the recovery_status file for all the MDTs?

          People

            wc-triage WC Triage
            dinatale2 Giuseppe Di Natale (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: