[LU-10842] Recovery stalls when target is failed over to failover partner Created: 22/Mar/18 Updated: 23/Mar/18 Resolved: 23/Mar/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Giuseppe Di Natale (Inactive) | Assignee: | WC Triage |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | llnl | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On our testbed filesystem lquake, I caused an OST to fail over by unmounting OST0000 from jet17 and mounting it on jet18. The target mounted successfully on the failover node, but the node appears to be stuck recovering the newly acquired OST. Below is some of the information I collected. The system remains in this perpetual recovery state if anyone needs more information.

[root@jet18:~]# cat /proc/fs/lustre/obdfilter/lquake-OST0000/recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/91
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 352189672717

Clients are repeatedly logging the following:

[Thu Mar 22 10:13:16 2018] Lustre: 18447:0:(client.c:2109:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521738740/real 1521738740] req@ffff881ff260dd00 x1595482252012848/t0(0) o8->lquake-OST0000-osc-ffff8801688d7800@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521738795 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

Some MDSs are seeing the following message:

[Thu Mar 22 09:38:26 2018] Lustre: lquake-OST0000-osc-MDT0001: Connection to lquake-OST0000 (at 172.19.1.127@o2ib100) was lost; in progress operations using this service will wait for recovery to complete

All MDSs appear to be logging the following message repeatedly:

[Thu Mar 22 09:41:27 2018] Lustre: 16423:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521736826/real 1521736826] req@ffff883f416cbc00 x1595592082316800/t0(0) o8->lquake-OST0000-osc-MDT0001@172.19.1.127@o2ib100:28/4 lens 520/544 e 0 to 1 dl 1521736881 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

Is there any other info that you need? Will this filesystem ever recover? Will these connections ever time out? |
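As a minimal sketch of how the stalled recovery can be watched from the server side (assuming a pdsh group named "oss" exists alongside the "mds" group used in the comments below):

# Poll recovery status for every OST on the OSS nodes
pdsh -g oss 'cat /proc/fs/lustre/obdfilter/*/recovery_status' | dshbak -c

# The same counters are also exposed through lctl
lctl get_param obdfilter.*.recovery_status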
| Comments |
| Comment by Olaf Faaland [ 22/Mar/18 ] |
|
Joe,
|
| Comment by Giuseppe Di Natale (Inactive) [ 22/Mar/18 ] |
|
Unmounted on jet17 and mounted on jet18 at Mar 22 09:36:33.

Recovery status for all MDTs:

[root@jeti:~]# pdsh -g mds 'cat /proc/fs/lustre/mdt/*/recovery_status' | dshbak -c
---------------- ejet1 ----------------
status: COMPLETE recovery_start: 1521675012 recovery_duration: 502 completed_clients: 90/91 replayed_requests: 3003 last_transno: 339312468139 VBR: ENABLED IR: DISABLED
---------------- ejet2 ----------------
status: COMPLETE recovery_start: 1521675193 recovery_duration: 319 completed_clients: 90/90 replayed_requests: 0 last_transno: 365072220160 VBR: DISABLED IR: DISABLED
---------------- ejet3 ----------------
status: COMPLETE recovery_start: 1521736312 recovery_duration: 71 completed_clients: 90/90 replayed_requests: 0 last_transno: 347892350976 VBR: DISABLED IR: ENABLED
---------------- ejet4 ----------------
status: COMPLETE recovery_start: 1521675875 recovery_duration: 70 completed_clients: 90/90 replayed_requests: 0 last_transno: 352187318272 VBR: DISABLED IR: ENABLED
---------------- ejet5 ----------------
status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 335007449088 VBR: DISABLED IR: ENABLED
---------------- ejet6 ----------------
status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597383680 VBR: DISABLED IR: ENABLED
---------------- ejet7 ----------------
status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 330712481792 VBR: DISABLED IR: ENABLED
---------------- ejet8 ----------------
status: COMPLETE recovery_start: 1521675438 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 339302416384 VBR: DISABLED IR: ENABLED
---------------- ejet9 ----------------
status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 390842023936 VBR: DISABLED IR: ENABLED
---------------- ejet10 ----------------
status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 373662154752 VBR: DISABLED IR: ENABLED
---------------- ejet[11,14] ----------------
status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 339302416384 VBR: DISABLED IR: ENABLED
---------------- ejet12 ----------------
status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597383680 VBR: DISABLED IR: ENABLED
---------------- ejet13 ----------------
status: COMPLETE recovery_start: 1521675462 recovery_duration: 59 completed_clients: 90/90 replayed_requests: 0 last_transno: 335007449088 VBR: DISABLED IR: ENABLED
---------------- ejet15 ----------------
status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 468151435264 VBR: DISABLED IR: ENABLED
---------------- ejet16 ----------------
status: COMPLETE recovery_start: 1521675439 recovery_duration: 75 completed_clients: 90/90 replayed_requests: 0 last_transno: 356482285568 VBR: DISABLED IR: ENABLED

For completeness, OSS recovery status:

---------------- ejet18 ----------------
status: RECOVERING recovery_start: 0 time_remaining: 0 connected_clients: 0/91 req_replay_clients: 0 lock_repay_clients: 0 completed_clients: 0 evicted_clients: 0 replayed_requests: 0 queued_requests: 0 next_transno: 352189672717
status: COMPLETE recovery_start: 1521666573 recovery_duration: 71 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597406813 VBR: DISABLED IR: DISABLED
---------------- ejet19 ----------------
status: COMPLETE recovery_start: 1521666577 recovery_duration: 71 completed_clients: 90/90 replayed_requests: 0 last_transno: 343597408098 VBR: DISABLED IR: DISABLED
---------------- ejet20 ----------------
status: COMPLETE recovery_start: 1521666584 recovery_duration: 64 completed_clients: 90/90 replayed_requests: 0 last_transno: 352187344607 VBR: DISABLED IR: DISABLED |
| Comment by Olaf Faaland [ 22/Mar/18 ] |
|
What are the NIDs assigned to jet17 and jet18? |
| Comment by Giuseppe Di Natale (Inactive) [ 22/Mar/18 ] |
|
NIDs are as follows:

jet17: 172.19.1.127@o2ib100
jet18: 172.19.1.128@o2ib100
|
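For reference, a quick way to confirm the NIDs each server exports (a sketch, assuming LNet is up on both nodes and pdsh can reach them):

# List the LNet NIDs configured on each of the two OSS nodes
pdsh -w jet[17-18] 'lctl list_nids'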
| Comment by Oleg Drokin [ 23/Mar/18 ] |
|
So it looks like the OST_CONNECT messages destined for the failed-over node don't get through, and so recovery never starts on the OST either. Can you make sure the networking and the targets are all doing the expected thing? Can you do lnet pings to the target addresses from the clients? |
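A sketch of the kind of connectivity check being asked for, using the NIDs reported above; lctl ping issues an LNet-level ping to a remote NID:

# From a client or MDS node, check that both the original and the failover NID respond
lctl ping 172.19.1.127@o2ib100
lctl ping 172.19.1.128@o2ib100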
| Comment by Andreas Dilger [ 23/Mar/18 ] |
|
From the messages posted here, it appears that the clients and MDSs are only trying to connect to jet17 (NID 172.19.1.127) and not to jet18 at all. The recovery status on jet18 also shows that no clients have tried to connect, and recovery hasn't even started for that OST. That could be an incorrect conclusion based on the three lines of console messages available here... My guess is that OST0000 is not configured properly for failover. You should check the failover config for OST0000 on the MGS with something like:

lctl --device MGS llog_print lquake-client |
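As a hedged illustration of what to look for in that catalog (the exact record text varies between Lustre versions, so the grep pattern is only an assumption): a target with failover configured normally carries setup records naming both NIDs.

# On the MGS: dump the client config log and pick out the OST0000 connection records
lctl --device MGS llog_print lquake-client | grep -E 'OST0000|add_uuid|add_conn'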
| Comment by Giuseppe Di Natale (Inactive) [ 23/Mar/18 ] |
|
Andreas, looking at the output from lctl, I think you are right. None of my MDTs are listed in the catalog file twice with different NIDs. I could've sworn we set it up... let me try to configure failover again. I'll post back shortly. |
| Comment by Giuseppe Di Natale (Inactive) [ 23/Mar/18 ] |
|
Andreas, looks like that was it. I set the failover nodes with lctl and all is well again. Sorry for the noise!
|
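The ticket does not record the exact lctl command that was run; as a hedged sketch, a failover NID is commonly added after the fact from the MGS, or baked into the target with tunefs.lustre:

# On the MGS: declare jet18's NID as a failover node for OST0000 (sketch only)
lctl conf_param lquake-OST0000.failover.node=172.19.1.128@o2ib100

# Alternative: set --servicenode/--failnode on the target with tunefs.lustre
# and re-register the target with --writeconf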
| Comment by Giuseppe Di Natale (Inactive) [ 23/Mar/18 ] |
|
Looks like I can't edit the tags once an issue is closed. Peter, do you mind removing the topllnl tag? Thanks again for the help! |
| Comment by Peter Jones [ 23/Mar/18 ] |
|
Yup. Things are totally locked down once tickets are moved to the Closed state. |