[LU-12990] MDS failed to mount during failover Created: 20/Nov/19 Updated: 21/Nov/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | soak | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Soak triggered mds_failover testing. According to soak.log, MDT0003 from the failing MDS (soak-11) should have been mounted on its failover partner soak-10, but it was not.

soak.log:
2019-11-19 01:22:13,931:fsmgmt.fsmgmt:INFO trying to connect to soak-11 ...
2019-11-19 01:22:20,107:fsmgmt.fsmgmt:INFO trying to connect to soak-11 ...
2019-11-19 01:22:25,285:fsmgmt.fsmgmt:INFO trying to connect to soak-11 ...
2019-11-19 01:22:26,296:fsmgmt.fsmgmt:INFO soak-11 is up!!!
2019-11-19 01:22:37,308:fsmgmt.fsmgmt:INFO Failing over soaked-MDT0003 ...
2019-11-19 01:22:37,308:fsmgmt.fsmgmt:INFO Mounting soaked-MDT0003 on soak-10 ...

Here is the console log on soak-10 around that time:
[17741.278456] device-mapper: multipath: Failing path 8:128.
[17746.279544] device-mapper: multipath: Reinstating path 8:128.
[17746.286032] device-mapper: multipath: Failing path 8:128.
[17747.871994] LNetError: 6527:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[17747.883281] LNetError: 6527:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 192.168.1.111@o2ib (10): c: 7, oc: 0, rc: 8
[17747.897005] LNetError: 6533:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 192.168.1.111@o2ib added to recovery queue. Health = 900
[17747.911953] LNetError: 20538:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 192.168.1.110@o2ib added to recovery queue. Health = 900
[17747.925462] LNetError: 20538:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 5 previous similar messages
[17747.937096] Lustre: 6550:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1574126428/real 1574126433] req@ffff899fea070000 x1650580018692480/t0(0) o41->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib:24/4 lens 224/368 e 0 to 1 dl 1574126435 ref 1 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
[17747.970231] Lustre: 6550:0:(client.c:2219:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[17747.980974] Lustre: soaked-MDT0003-osp-MDT0002: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[17751.292982] device-mapper: multipath: Reinstating path 8:128.
[17751.299695] device-mapper: multipath: Failing path 8:128.
[17756.300695] device-mapper: multipath: Reinstating path 8:128.
[17756.307377] device-mapper: multipath: Failing path 8:128. |
| Comments |
| Comment by Oleg Drokin [ 21/Nov/19 ] |
|
So it sounds like soak-10 cannot connect to storage? Together with the IB timeouts and such, is this some sort of IB problem? Or is the storage not on IB? |
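
One way to check whether the storage path on soak-10 is flapping is to tally the device-mapper multipath events in the console log. The following is a minimal Python sketch, not part of the soak framework: the function name `summarize_multipath_flaps` and the regular expression are illustrative assumptions based only on the message format seen in the console log above.

```python
import re
from collections import Counter

# Matches kernel console lines such as:
#   [17741.278456] device-mapper: multipath: Failing path 8:128.
# The pattern is an assumption derived from the log excerpt in this ticket.
EVENT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+device-mapper: multipath: "
    r"(?P<action>Failing|Reinstating) path (?P<path>\d+:\d+)\."
)


def summarize_multipath_flaps(console_text):
    """Count 'Failing' and 'Reinstating' events per major:minor path.

    Returns a dict mapping each path to a Counter keyed by action.
    """
    summary = {}
    for match in EVENT_RE.finditer(console_text):
        counts = summary.setdefault(match["path"], Counter())
        counts[match["action"]] += 1
    return summary


# Sample lines copied from the soak-10 console log in the description.
SAMPLE = """\
[17741.278456] device-mapper: multipath: Failing path 8:128.
[17746.279544] device-mapper: multipath: Reinstating path 8:128.
[17746.286032] device-mapper: multipath: Failing path 8:128.
[17747.871994] LNetError: 6527:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
"""

if __name__ == "__main__":
    for path, counts in summarize_multipath_flaps(SAMPLE).items():
        print(path, dict(counts))
```

A path that is failed again right after each reinstatement, as 8:128 is here, suggests the device is not staying reachable; the script only counts events and does not identify the cause.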