[LU-10842] Recovery stalls when target is failed over to failover partner Created: 22/Mar/18  Updated: 23/Mar/18  Resolved: 23/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Giuseppe Di Natale (Inactive) Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: llnl

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On our testbed filesystem lquake, I caused an OST to failover by unmounting OST0000 from jet17 and mounting it on jet18. The target successfully mounted on the failover node, but it appears the node is stuck recovering the newly acquired OST. Below is some information I collected. The system is stuck in the perpetual recovery state if anyone needs more information.

[root@jet18:~]# cat /proc/fs/lustre/obdfilter/lquake-OST0000/recovery_status 
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/91
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 352189672717

Clients are repeatedly logging the following:

[Thu Mar 22 10:13:16 2018] Lustre: 18447:0:(client.c:2109:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521738740/real 1521738740]  req@ffff881ff260dd00 x1595482252012848/t0(0) o8->lquake-OST0000-osc-ffff8801688d7800@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521738795 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

Some MDSs are seeing the following message:

[Thu Mar 22 09:38:26 2018] Lustre: lquake-OST0000-osc-MDT0001: Connection to lquake-OST0000 (at 172.19.1.127@o2ib100) was lost; in progress operations using this service will wait for recovery to complete

All MDSs appear to be logging the following message repeatedly:

[Thu Mar 22 09:41:27 2018] Lustre: 16423:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521736826/real 1521736826]  req@ffff883f416cbc00 x1595592082316800/t0(0) o8->lquake-OST0000-osc-MDT0001@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521736881 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

Is there any other info that you need? Will this filesystem ever recover? Will these connections ever timeout?



 Comments   
Comment by Olaf Faaland [ 22/Mar/18 ]

Joe,

  1. What date and time was it when you umounted on jet17, and when you mounted on jet18?
  2. What does "status: " say in the recovery_status file for all the MDTs?
Comment by Giuseppe Di Natale (Inactive) [ 22/Mar/18 ]

Unmounted on jet17 and mounted on jet18 at Mar 22 09:36:33.

Recovery status for all MDTs:

[root@jeti:~]# pdsh -g mds 'cat /proc/fs/lustre/mdt/*/recovery_status' | dshbak -c
----------------
ejet1
----------------
status: COMPLETE
recovery_start: 1521675012
recovery_duration: 502
completed_clients: 90/91
replayed_requests: 3003
last_transno: 339312468139
VBR: ENABLED
IR: DISABLED
----------------
ejet2
----------------
status: COMPLETE
recovery_start: 1521675193
recovery_duration: 319
completed_clients: 90/90
replayed_requests: 0
last_transno: 365072220160
VBR: DISABLED
IR: DISABLED
----------------
ejet3
----------------
status: COMPLETE
recovery_start: 1521736312
recovery_duration: 71
completed_clients: 90/90
replayed_requests: 0
last_transno: 347892350976
VBR: DISABLED
IR: ENABLED
----------------
ejet4
----------------
status: COMPLETE
recovery_start: 1521675875
recovery_duration: 70
completed_clients: 90/90
replayed_requests: 0
last_transno: 352187318272
VBR: DISABLED
IR: ENABLED
----------------
ejet5
----------------
status: COMPLETE
recovery_start: 1521675438
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 335007449088
VBR: DISABLED
IR: ENABLED
----------------
ejet6
----------------
status: COMPLETE
recovery_start: 1521675438
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 343597383680
VBR: DISABLED
IR: ENABLED
----------------
ejet7
----------------
status: COMPLETE
recovery_start: 1521675438
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 330712481792
VBR: DISABLED
IR: ENABLED
----------------
ejet8
----------------
status: COMPLETE
recovery_start: 1521675438
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 339302416384
VBR: DISABLED
IR: ENABLED
----------------
ejet9
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 390842023936
VBR: DISABLED
IR: ENABLED
----------------
ejet10
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 373662154752
VBR: DISABLED
IR: ENABLED
----------------
ejet[11,14]
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 339302416384
VBR: DISABLED
IR: ENABLED
----------------
ejet12
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 343597383680
VBR: DISABLED
IR: ENABLED
----------------
ejet13
----------------
status: COMPLETE
recovery_start: 1521675462
recovery_duration: 59
completed_clients: 90/90
replayed_requests: 0
last_transno: 335007449088
VBR: DISABLED
IR: ENABLED
----------------
ejet15
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 468151435264
VBR: DISABLED
IR: ENABLED
----------------
ejet16
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 356482285568
VBR: DISABLED
IR: ENABLED

For completeness, OSS recover status:

----------------
ejet18
----------------
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/91
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 352189672717
status: COMPLETE
recovery_start: 1521666573
recovery_duration: 71
completed_clients: 90/90
replayed_requests: 0
last_transno: 343597406813
VBR: DISABLED
IR: DISABLED
----------------
ejet19
----------------
status: COMPLETE
recovery_start: 1521666577
recovery_duration: 71
completed_clients: 90/90
replayed_requests: 0
last_transno: 343597408098
VBR: DISABLED
IR: DISABLED
----------------
ejet20
----------------
status: COMPLETE
recovery_start: 1521666584
recovery_duration: 64
completed_clients: 90/90
replayed_requests: 0
last_transno: 352187344607
VBR: DISABLED
IR: DISABLED
Comment by Olaf Faaland [ 22/Mar/18 ]

What are the NIDs assigned to jet17 and jet18?

Comment by Giuseppe Di Natale (Inactive) [ 22/Mar/18 ]

NIDs are as follows:

jet17            172.19.1.127@o2ib100
jet18            172.19.1.128@o2ib100 

 

Comment by Oleg Drokin [ 23/Mar/18 ]

So it looks like OST_CONNECT mesasges destined to the failed over node don't get through? And so recovery does not start on OST either. So can you make sure networking andtargets are all doing hte expected thing? Can you do lnet pings.to the target addresses from clients?

Comment by Andreas Dilger [ 23/Mar/18 ]

From the messages that are posted here, it appears that the clients and MDS are only trying to connect to jet17 (NID 172.19.1.127) and not to jet18 at all. The recovery status on jet18 also shows that no clients have tried to connect, and recovery hasn't even started for that OST. That could be an incorrect conclusion based on the three lines of console messages available here...

My guess is that the OST0000 is not configured properly for failover. You should check the failover config for OST0000 on the MGS like: lctl --device MGS llog_print lquake-client

Comment by Giuseppe Di Natale (Inactive) [ 23/Mar/18 ]

Andreas, looking at the output from lctl, I think you are right. None of my MDTs are listed in the catalog file twice with different nids. I could've sworn we set it up... let me try and configure failover again. I'll post back shortly.

Comment by Giuseppe Di Natale (Inactive) [ 23/Mar/18 ]

Andreas, looks like that was it. I set the failover nodes with lctl and all is well again. Sorry for the noise!

 

Comment by Giuseppe Di Natale (Inactive) [ 23/Mar/18 ]

Looks like I can't edit the tags once an issue is closed. Peter, do you mind removing the topllnl tag? Thanks again for the help!

Comment by Peter Jones [ 23/Mar/18 ]

Yup. Things are totally locked down once tickets are moved to Closed state

Generated at Sat Feb 10 02:38:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.