[LU-12684] MDT failed to mount during failover due to LNetError Created: 22/Aug/19  Updated: 22/Aug/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Sarah Liu
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: soak
Environment: lustre-b2_12-ib #35


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

SOAK hit the following error during MDS failover after running for 4 days.

During failover, MDT3 failed to mount on soak-10 due to a network error.

syslog on soak-10

Aug 19 21:26:21 soak-10 kernel: LNetError: 12284:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Aug 19 21:26:21 soak-10 kernel: LNetError: 12284:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 192.168.1.111@o2ib (9): c: 5, oc: 0, rc: 8
Aug 19 21:26:21 soak-10 kernel: Lustre: 12317:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1566249976/real 1566249981]  req@ffff9678ecee8900 x1642318302104592/t0(0) o41->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib:24/4 lens 224/368 e 0 to 1 dl 1566250020 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
Aug 19 21:26:21 soak-10 kernel: Lustre: soaked-MDT0003-osp-MDT0002: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug 19 21:26:21 soak-10 kernel: Lustre: Skipped 3 previous similar messages
Aug 19 21:26:21 soak-10 kernel: Lustre: 12317:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Aug 19 21:26:24 soak-10 multipathd: 360080e50001fedb80000015952012962: sdi - rdac checker reports path is ghost

console log on soak-10

[13420.972492] LNetError: 12284:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[13420.983876] LNetError: 12284:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 192.168.1.111@o2ib (9): c: 5, oc: 0, rc: 8
[13420.997683] Lustre: 12317:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1566249976/real 1566249981]  req@ffff9678ecee8900 x1642318302104592/t0(0) o41->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib:24/4 lens 224/368 e 0 to 1 dl 1566250020 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
[13420.997711] Lustre: soaked-MDT0003-osp-MDT0002: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[13420.997714] Lustre: Skipped 3 previous similar messages
[13421.054508] Lustre: 12317:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[13423.880342] device-mapper: multipath: Reinstating path 8:128.
[13423.887079] device-mapper: multipath: Failing path 8:128.
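
When correlating the two logs, it can help to turn the epoch values in the ptlrpc_expire_one_request() message into wall-clock times. The following is a minimal sketch, not part of the original report. The regular expression and the helper names request_times and to_utc are hypothetical, and it assumes that "sent", "real" and "dl" are the request's queued time, actual send time and deadline, all in seconds since the Unix epoch.

```python
import re
from datetime import datetime, timezone

# Message text copied from the console log above (prefix and thread id dropped).
LINE = (
    "Request sent has failed due to network error: "
    "[sent 1566249976/real 1566249981]  req@ffff9678ecee8900 "
    "x1642318302104592/t0(0) "
    "o41->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib:24/4 "
    "lens 224/368 e 0 to 1 dl 1566250020 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1"
)

# Hypothetical pattern; the meaning of each field is an assumption (see lead-in).
PATTERN = re.compile(r"\[sent (?P<sent>\d+)/real (?P<real>\d+)\].*? dl (?P<dl>\d+)")


def request_times(line):
    """Return the sent/real/dl epoch values as ints, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def to_utc(epoch):
    """Format an epoch value (seconds) as a UTC timestamp string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


times = request_times(LINE)
if times is not None:
    for name, epoch in times.items():
        print(f"{name:>4}: {epoch} = {to_utc(epoch)} UTC")
    print("seconds from sent to deadline:", times["dl"] - times["sent"])
```

Under these assumptions, the "real" value converts to 21:26:21 on 19 August 2019 in UTC, which lines up with the syslog timestamp if soak-10 logs in UTC.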

Generated at Sat Feb 10 02:54:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.