[LU-12990] MDS failed to mount during failover Created: 20/Nov/19  Updated: 21/Nov/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: soak

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

soak triggered mds_failover testing. According to soak.log, MDT0003 from the failing MDS (soak-11) should have been mounted on its failover pair soak-10, but it was not.

soak.log

2019-11-19 01:22:13,931:fsmgmt.fsmgmt:INFO     trying to connect to soak-11 ...
2019-11-19 01:22:20,107:fsmgmt.fsmgmt:INFO     trying to connect to soak-11 ...
2019-11-19 01:22:25,285:fsmgmt.fsmgmt:INFO     trying to connect to soak-11 ...
2019-11-19 01:22:26,296:fsmgmt.fsmgmt:INFO     soak-11 is up!!!
2019-11-19 01:22:37,308:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0003 ...
2019-11-19 01:22:37,308:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on soak-10 ...
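The log shows the "Mounting soaked-MDT0003 on soak-10 ..." attempt with no subsequent success message. One way to flag such stuck failover mounts in an automated pass over soak.log is sketched below; note that the success-message text ("<target> mounted on <node>") is an assumption for illustration, not taken from this ticket.

```python
import re

# Hypothetical sketch: collect "Mounting <target> on <node> ..." events and
# drop them again when a matching success line appears; whatever remains at
# the end never completed. The OK_RE pattern is an assumed message format.
MOUNT_RE = re.compile(r"Mounting (\S+) on (\S+) \.\.\.")
OK_RE = re.compile(r"(\S+) mounted on (\S+)")

def unfinished_mounts(lines):
    pending = set()
    for line in lines:
        m = MOUNT_RE.search(line)
        if m:
            pending.add((m.group(1), m.group(2)))
            continue
        m = OK_RE.search(line)
        if m:
            pending.discard((m.group(1), m.group(2)))
    return pending

# With the excerpt above, the MDT0003 mount on soak-10 would be reported
# as unfinished, since no success line ever follows it.
log = [
    "2019-11-19 01:22:37,308:fsmgmt.fsmgmt:INFO     "
    "Mounting soaked-MDT0003 on soak-10 ...",
]
print(unfinished_mounts(log))
```
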

Here is the console log on soak-10 around that time

[17741.278456] device-mapper: multipath: Failing path 8:128.
[17746.279544] device-mapper: multipath: Reinstating path 8:128.
[17746.286032] device-mapper: multipath: Failing path 8:128.
[17747.871994] LNetError: 6527:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[17747.883281] LNetError: 6527:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 192.168.1.111@o2ib (10): c: 7, oc: 0, rc: 8
[17747.897005] LNetError: 6533:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 192.168.1.111@o2ib added to recovery queue. Health = 900
[17747.911953] LNetError: 20538:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 192.168.1.110@o2ib added to recovery queue. Health = 900
[17747.925462] LNetError: 20538:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 5 previous similar messages
[17747.937096] Lustre: 6550:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1574126428/real 1574126433]  req@ffff899fea070000 x1650580018692480/t0(0) o41->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib:24/4 lens 224/368 e 0 to 1 dl 1574126435 ref 1 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
[17747.970231] Lustre: 6550:0:(client.c:2219:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[17747.980974] Lustre: soaked-MDT0003-osp-MDT0002: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[17751.292982] device-mapper: multipath: Reinstating path 8:128.
[17751.299695] device-mapper: multipath: Failing path 8:128.
[17756.300695] device-mapper: multipath: Reinstating path 8:128.
[17756.307377] device-mapper: multipath: Failing path 8:128.


 Comments   
Comment by Oleg Drokin [ 21/Nov/19 ]

So it sounds like soak-10 cannot connect to storage? Together with the IB timeouts and such, some sort of IB problem? Or is the storage not on IB?

Generated at Sat Feb 10 02:57:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.