[LU-10842] Recovery stalls when target is failed over to failover partner - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.10.3
Labels:
- llnl

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

On our testbed filesystem lquake, I caused an OST to failover by unmounting OST0000 from jet17 and mounting it on jet18. The target successfully mounted on the failover node, but it appears the node is stuck recovering the newly acquired OST. Below is some information I collected. The system is stuck in the perpetual recovery state if anyone needs more information.

[root@jet18:~]# cat /proc/fs/lustre/obdfilter/lquake-OST0000/recovery_status 
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/91
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 352189672717

Clients are repeatedly logging the following:

[Thu Mar 22 10:13:16 2018] Lustre: 18447:0:(client.c:2109:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521738740/real 1521738740]  req@ffff881ff260dd00 x1595482252012848/t0(0) o8->lquake-OST0000-osc-ffff8801688d7800@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521738795 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

Some MDSs are seeing the following message:

[Thu Mar 22 09:38:26 2018] Lustre: lquake-OST0000-osc-MDT0001: Connection to lquake-OST0000 (at 172.19.1.127@o2ib100) was lost; in progress operations using this service will wait for recovery to complete

All MDSs appear to be logging the following message repeatedly:

[Thu Mar 22 09:41:27 2018] Lustre: 16423:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521736826/real 1521736826]  req@ffff883f416cbc00 x1595592082316800/t0(0) o8->lquake-OST0000-osc-MDT0001@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521736881 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

Is there any other info that you need? Will this filesystem ever recover? Will these connections ever timeout?

Attachments

Activity

[LU-10842] Recovery stalls when target is failed over to failover partner

Peter Jones added a comment - 23/Mar/18 10:01 PM

Yup. Things are totally locked down once tickets are moved to Closed state

Peter Jones added a comment - 23/Mar/18 10:01 PM Yup. Things are totally locked down once tickets are moved to Closed state

Giuseppe Di Natale (Inactive) added a comment - 23/Mar/18 9:32 PM

Looks like I can't edit the tags once an issue is closed. Peter, do you mind removing the topllnl tag? Thanks again for the help!

Giuseppe Di Natale (Inactive) added a comment - 23/Mar/18 9:32 PM Looks like I can't edit the tags once an issue is closed. Peter, do you mind removing the topllnl tag? Thanks again for the help!

Giuseppe Di Natale (Inactive) added a comment - 23/Mar/18 9:30 PM

Andreas, looks like that was it. I set the failover nodes with lctl and all is well again. Sorry for the noise!

Giuseppe Di Natale (Inactive) added a comment - 23/Mar/18 9:30 PM Andreas, looks like that was it. I set the failover nodes with lctl and all is well again. Sorry for the noise!

Giuseppe Di Natale (Inactive) added a comment - 23/Mar/18 6:14 PM - edited

Andreas, looking at the output from lctl, I think you are right. None of my MDTs are listed in the catalog file twice with different nids. I could've sworn we set it up... let me try and configure failover again. I'll post back shortly.

Giuseppe Di Natale (Inactive) added a comment - 23/Mar/18 6:14 PM - edited Andreas, looking at the output from lctl, I think you are right. None of my MDTs are listed in the catalog file twice with different nids. I could've sworn we set it up... let me try and configure failover again. I'll post back shortly.

Andreas Dilger added a comment - 23/Mar/18 5:26 PM

From the messages that are posted here, it appears that the clients and MDS are only trying to connect to jet17 (NID 172.19.1.127) and not to jet18 at all. The recovery status on jet18 also shows that no clients have tried to connect, and recovery hasn't even started for that OST. That could be an incorrect conclusion based on the three lines of console messages available here...

My guess is that the OST0000 is not configured properly for failover. You should check the failover config for OST0000 on the MGS like: lctl --device MGS llog_print lquake-client

Andreas Dilger added a comment - 23/Mar/18 5:26 PM From the messages that are posted here, it appears that the clients and MDS are only trying to connect to jet17 (NID 172.19.1.127) and not to jet18 at all. The recovery status on jet18 also shows that no clients have tried to connect, and recovery hasn't even started for that OST. That could be an incorrect conclusion based on the three lines of console messages available here... My guess is that the OST0000 is not configured properly for failover. You should check the failover config for OST0000 on the MGS like: lctl --device MGS llog_print lquake-client

Oleg Drokin added a comment - 23/Mar/18 5:13 PM

So it looks like OST_CONNECT mesasges destined to the failed over node don't get through? And so recovery does not start on OST either. So can you make sure networking andtargets are all doing hte expected thing? Can you do lnet pings.to the target addresses from clients?

Oleg Drokin added a comment - 23/Mar/18 5:13 PM So it looks like OST_CONNECT mesasges destined to the failed over node don't get through? And so recovery does not start on OST either. So can you make sure networking andtargets are all doing hte expected thing? Can you do lnet pings.to the target addresses from clients?

Giuseppe Di Natale (Inactive) added a comment - 22/Mar/18 6:13 PM

NIDs are as follows:

jet17            172.19.1.127@o2ib100
jet18            172.19.1.128@o2ib100

Giuseppe Di Natale (Inactive) added a comment - 22/Mar/18 6:13 PM NIDs are as follows: jet17 172.19.1.127@o2ib100 jet18 172.19.1.128@o2ib100

Olaf Faaland added a comment - 22/Mar/18 6:09 PM

What are the NIDs assigned to jet17 and jet18?

Olaf Faaland added a comment - 22/Mar/18 6:09 PM What are the NIDs assigned to jet17 and jet18?

Giuseppe Di Natale (Inactive) added a comment - 22/Mar/18 6:07 PM

Unmounted on jet17 and mounted on jet18 at Mar 22 09:36:33.

Recovery status for all MDTs:

[root@jeti:~]# pdsh -g mds 'cat /proc/fs/lustre/mdt/*/recovery_status' | dshbak -c
----------------
ejet1
----------------
status: COMPLETE
recovery_start: 1521675012
recovery_duration: 502
completed_clients: 90/91
replayed_requests: 3003
last_transno: 339312468139
VBR: ENABLED
IR: DISABLED
----------------
ejet2
----------------
status: COMPLETE
recovery_start: 1521675193
recovery_duration: 319
completed_clients: 90/90
replayed_requests: 0
last_transno: 365072220160
VBR: DISABLED
IR: DISABLED
----------------
ejet3
----------------
status: COMPLETE
recovery_start: 1521736312
recovery_duration: 71
completed_clients: 90/90
replayed_requests: 0
last_transno: 347892350976
VBR: DISABLED
IR: ENABLED
----------------
ejet4
----------------
status: COMPLETE
recovery_start: 1521675875
recovery_duration: 70
completed_clients: 90/90
replayed_requests: 0
last_transno: 352187318272
VBR: DISABLED
IR: ENABLED
----------------
ejet5
----------------
status: COMPLETE
recovery_start: 1521675438
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 335007449088
VBR: DISABLED
IR: ENABLED
----------------
ejet6
----------------
status: COMPLETE
recovery_start: 1521675438
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 343597383680
VBR: DISABLED
IR: ENABLED
----------------
ejet7
----------------
status: COMPLETE
recovery_start: 1521675438
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 330712481792
VBR: DISABLED
IR: ENABLED
----------------
ejet8
----------------
status: COMPLETE
recovery_start: 1521675438
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 339302416384
VBR: DISABLED
IR: ENABLED
----------------
ejet9
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 390842023936
VBR: DISABLED
IR: ENABLED
----------------
ejet10
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 373662154752
VBR: DISABLED
IR: ENABLED
----------------
ejet[11,14]
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 339302416384
VBR: DISABLED
IR: ENABLED
----------------
ejet12
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 343597383680
VBR: DISABLED
IR: ENABLED
----------------
ejet13
----------------
status: COMPLETE
recovery_start: 1521675462
recovery_duration: 59
completed_clients: 90/90
replayed_requests: 0
last_transno: 335007449088
VBR: DISABLED
IR: ENABLED
----------------
ejet15
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 468151435264
VBR: DISABLED
IR: ENABLED
----------------
ejet16
----------------
status: COMPLETE
recovery_start: 1521675439
recovery_duration: 75
completed_clients: 90/90
replayed_requests: 0
last_transno: 356482285568
VBR: DISABLED
IR: ENABLED

For completeness, OSS recover status:

----------------
ejet18
----------------
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/91
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 352189672717
status: COMPLETE
recovery_start: 1521666573
recovery_duration: 71
completed_clients: 90/90
replayed_requests: 0
last_transno: 343597406813
VBR: DISABLED
IR: DISABLED
----------------
ejet19
----------------
status: COMPLETE
recovery_start: 1521666577
recovery_duration: 71
completed_clients: 90/90
replayed_requests: 0
last_transno: 343597408098
VBR: DISABLED
IR: DISABLED
----------------
ejet20
----------------
status: COMPLETE
recovery_start: 1521666584
recovery_duration: 64
completed_clients: 90/90
replayed_requests: 0
last_transno: 352187344607
VBR: DISABLED
IR: DISABLED

Giuseppe Di Natale (Inactive) added a comment - 22/Mar/18 6:07 PM Unmounted on jet17 and mounted on jet18 at Mar 22 09:36:33. Recovery status for all MDTs: [root@jeti:~]# pdsh -g mds 'cat /proc/fs/lustre/mdt/*/recovery_status' | dshbak -c ---------------- ejet1 ---------------- status: COMPLETE recovery_start: 1521675012 recovery_duration: 502 completed_clients: 90/91 replayed_requests: 3003 last_transno: 339312468139 VBR: ENABLED IR: DISABLED ---------------- ejet2 ---------------- status: COMPLETE recovery_start: 1521675193 recovery_duration: 319 completed_clients: 90/90 replayed_requests: 0 last_transno: 365072220160 VBR: DISABLED IR: DISABLED ---------------- ejet3 ---------------- status: COMPLETE recovery_start: 1521736312 recovery_duration: 71 completed_clients: 90/90 replayed_requests: 0 last_transno: 347892350976 VBR: DISABLED IR: ENABLED ---------------- ejet4 ---------------- status: COMPLETE recovery_start: 1521675875 recovery_duration: 70 completed_clients: 90/90 replayed_requests: 0 last_transno: 352187318272 VBR: DISABLED IR: ENABLED ---------------- ejet5 ---------------- status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 335007449088 VBR: DISABLED IR: ENABLED ---------------- ejet6 ---------------- status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597383680 VBR: DISABLED IR: ENABLED ---------------- ejet7 ---------------- status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 330712481792 VBR: DISABLED IR: ENABLED ---------------- ejet8 ---------------- status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 339302416384 VBR: DISABLED IR: ENABLED ---------------- ejet9 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 390842023936 VBR: DISABLED IR: ENABLED ---------------- ejet10 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 373662154752 VBR: DISABLED IR: ENABLED ---------------- ejet[11,14] ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 339302416384 VBR: DISABLED IR: ENABLED ---------------- ejet12 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597383680 VBR: DISABLED IR: ENABLED ---------------- ejet13 ---------------- status: COMPLETE recovery_start: 1521675462 recovery_duration: 59 completed_clients: 90/90 replayed_requests: 0 last_transno: 335007449088 VBR: DISABLED IR: ENABLED ---------------- ejet15 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 468151435264 VBR: DISABLED IR: ENABLED ---------------- ejet16 ---------------- status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 356482285568 VBR: DISABLED IR: ENABLED For completeness, OSS recover status: ---------------- ejet18 ---------------- status: RECOVERING recovery_start: 0 time_remaining: 0 connected_clients: 0/91 req_replay_clients: 0 lock_repay_clients: 0 completed_clients: 0 evicted_clients: 0 replayed_requests: 0 queued_requests: 0 next_transno: 352189672717 status: COMPLETE recovery_start: 1521666573 recovery_duration: 71 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597406813 VBR: DISABLED IR: DISABLED ---------------- ejet19 ---------------- status: COMPLETE recovery_start: 1521666577 recovery_duration: 71 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597408098 VBR: DISABLED IR: DISABLED ---------------- ejet20 ---------------- status: COMPLETE recovery_start: 1521666584 recovery_duration: 64 completed_clients: 90/90 replayed_requests: 0 last_transno: 352187344607 VBR: DISABLED IR: DISABLED

Olaf Faaland added a comment - 22/Mar/18 5:55 PM - edited

Joe,

What date and time was it when you umounted on jet17, and when you mounted on jet18?
What does "status: " say in the recovery_status file for all the MDTs?

Olaf Faaland added a comment - 22/Mar/18 5:55 PM - edited Joe, What date and time was it when you umounted on jet17, and when you mounted on jet18? What does "status: " say in the recovery_status file for all the MDTs?

People

Assignee:: WC Triage

Reporter:: Giuseppe Di Natale (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 22/Mar/18 5:45 PM

Updated:: 23/Mar/18 10:01 PM

Resolved:: 23/Mar/18 10:01 PM