Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.16.1
Labels: None
Severity: 3
A simulated switch-failure test was run on a Lustre fs using a Slingshot fabric and kfilnd Multi-Rail servers, by pulling the cable from one interface on each of the servers. Each server has two links, cxi0 and cxi1.
The test used two MDS servers (n02, n03) and two OSS servers (n04, n05). The cxi1 cable was pulled on n02 and n05, and the cxi0 cable was pulled on n03 and n04.
The Lustre fs was started and a client mount was attempted, but that mount hung. On closer examination, we discovered the problem was between the two MDS nodes: mdt0 could not connect to mdt1. The mdt recovery_status on n02 showed:
[root@kjlmo702 ~]# lctl get_param mdt.*.recovery_status
mdt.kjlmo7fs-MDT0000.recovery_status=
status: WAITING
non-ready MDTs: 0001
recovery_start: 1782420558
time_waited: 68395
[root@kjlmo702 ~]#
Sample osp import and state output from n02 for mdt1:
[root@kjlmo702 ~]# lctl get_param osp.*.import
osp.kjlmo7fs-MDT0001-osp-MDT0000.import=
import:
    name: kjlmo7fs-MDT0001-osp-MDT0000
    target: kjlmo7fs-MDT0001_UUID
    state: DISCONN
    connect_flags: [ lov_index, version, acl, inode_bit_locks, adaptive_timeouts, mds_mds_connection, fid_is_enabled, full20, lfsck, multi_mod_rpcs, bulk_mbits, second_flags ]
    connect_data:
       flags: 0xa2400010450010a2
       instance: 0
       target_version: 2.16.1.1
       target_index: 1
       ibits_known: 0x2
       max_mod_rpcs: 0
    import_flags: [ replayable, connect_tried ]
    connection:
       failover_nids: [ "82@kfi", "0@lo" ]
       nids_stats:
         "82@kfi": { connects: 101, replied: 0, uptodate: false, sec_ago: 2 }
         "0@lo": { connects: 99, replied: 0, uptodate: true, sec_ago: 1 }
       current_connection: "0@lo"
       connection_attempts: 200
       generation: 2
       in-progress_invalidations: 0
       idle: 0 sec
    rpcs:
       inflight: 1
       unregistering: 0
       timeouts: 101
       avg_waittime: 275 usecs
    service_estimates:
       services: 66 sec
       network: 66 sec
    transactions:
       last_replay: 0
       peer_committed: 0
       last_checked: 0
osp.kjlmo7fs-MDT0001-osp-MDT0000.state=
current_state: DISCONN
state_history:
 - [ 1782421874, CONNECTING ]
 - [ 1782421875, DISCONN ]
 - [ 1782421875, CONNECTING ]
 - [ 1782421875, DISCONN ]
 - [ 1782421880, CONNECTING ]
 - [ 1782421881, DISCONN ]
 - [ 1782421881, CONNECTING ]
 - [ 1782421881, DISCONN ]
 - [ 1782421899, CONNECTING ]
 - [ 1782421900, DISCONN ]
 - [ 1782421900, CONNECTING ]
 - [ 1782421900, DISCONN ]
 - [ 1782421905, CONNECTING ]
 - [ 1782421907, DISCONN ]
 - [ 1782421907, CONNECTING ]
 - [ 1782421907, DISCONN ]
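For anyone triaging similar output, the pattern above (connection attempts with zero replies, and a state that flips from CONNECTING back to DISCONN within a second or two) can be summarized with a small script. This is only a rough sketch, not a Lustre tool: the function names are hypothetical, and the regular expressions assume the exact text layout pasted in this ticket.

```python
import re

# A few lines copied from the import/state output in this ticket.
SAMPLE = '''\
"82@kfi": { connects: 101, replied: 0, uptodate: false, sec_ago: 2 }
"0@lo": { connects: 99, replied: 0, uptodate: true, sec_ago: 1 }
- [ 1782421874, CONNECTING ]
- [ 1782421875, DISCONN ]
- [ 1782421875, CONNECTING ]
- [ 1782421875, DISCONN ]
'''

# Assumed layout of a nids_stats entry and of a state_history entry.
NID_RE = re.compile(
    r'"(?P<nid>[^"]+)":\s*\{\s*connects:\s*(?P<connects>\d+),'
    r'\s*replied:\s*(?P<replied>\d+)'
)
STATE_RE = re.compile(r'-\s*\[\s*(?P<ts>\d+),\s*(?P<state>\w+)\s*\]')


def nid_stats(text):
    """Return {nid: (connects, replied)} for every nids_stats entry found."""
    return {
        m['nid']: (int(m['connects']), int(m['replied']))
        for m in NID_RE.finditer(text)
    }


def failed_attempts(text):
    """Return (start_timestamp, seconds) for each CONNECTING -> DISCONN pair."""
    events = [(int(m['ts']), m['state']) for m in STATE_RE.finditer(text)]
    pairs = []
    for (t0, s0), (t1, s1) in zip(events, events[1:]):
        if s0 == 'CONNECTING' and s1 == 'DISCONN':
            pairs.append((t0, t1 - t0))
    return pairs


if __name__ == '__main__':
    for nid, (connects, replied) in nid_stats(SAMPLE).items():
        print(f'{nid}: {connects} connects, {replied} replies')
    for start, secs in failed_attempts(SAMPLE):
        print(f'attempt at {start} dropped after {secs}s')
```

A NID with many connects and no replies, combined with attempts that drop almost immediately, suggests the connect requests are failing locally rather than timing out on the wire, which matches the peer-addition errors in the debug log.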
The fs uses a monitoring service that pings server peers (IP and LNet) to assess the health of the host fabric links and, if a server is determined to have lost connectivity, potentially fail over its targets. That service is started when the nodes boot, and it loads lnet. For this test, the nodes were rebooted after the cables were pulled, so each server had only a single LNet NID set up.
n02 and n03 are an HA pair. n02 has a single target, the combined mgt/mdt0, and n03 has a single target, mdt1. After the monitoring service runs its ping checks, each server shows peer entries for the other servers, with a single NID per server (the one on the 'UP' interface). On n02, the n03 peer entry shows 19@kfi (the cxi1 link); the primary LNet NID for n03, 82@kfi, is the one that is down. After the server targets are mounted, n02 shows separate peer entries for 19@kfi and 82@kfi. Both show Multi-Rail "True", but each has only a single peer NID.
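The inconsistency described above, one node's NIDs ending up in two separate peer entries, can be checked mechanically. The following is a minimal sketch under stated assumptions: the `PEERS` list is hand-built from the description in this ticket (it is not captured `lnetctl peer show` output), and `split_nodes` and the dictionary keys are hypothetical names chosen for illustration.

```python
def split_nodes(peers, node_nids):
    """Return {node: sorted primary NIDs} for nodes whose NIDs are spread
    across more than one peer entry."""
    # Map every known peer NID to the primary NID of the entry that owns it.
    owner = {}
    for peer in peers:
        for nid in peer["nids"]:
            owner[nid] = peer["primary"]

    result = {}
    for node, nids in node_nids.items():
        primaries = sorted({owner[n] for n in nids if n in owner})
        if len(primaries) > 1:
            result[node] = primaries
    return result


# Peer table on n02 as described in the ticket: two entries, each flagged
# Multi-Rail, each holding a single NID (hand-built, not real tool output).
PEERS = [
    {"primary": "19@kfi", "multi_rail": True, "nids": ["19@kfi"]},
    {"primary": "82@kfi", "multi_rail": True, "nids": ["82@kfi"]},
]

# NIDs that belong to the same physical server (n03: cxi0 and cxi1).
NODE_NIDS = {"n03": ["82@kfi", "19@kfi"]}

if __name__ == "__main__":
    for node, primaries in split_nodes(PEERS, NODE_NIDS).items():
        print(f"{node} is split across peer entries: {', '.join(primaries)}")
```

In a healthy Multi-Rail setup one would expect both NIDs to sit under a single peer entry, in which case `split_nodes` returns an empty dictionary.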
Full debug logs were captured. They show what look like problems with peer addition for mdt1: the primary nid 82@kfi is created, and later the log shows:
00000100:00000040:26.0:1782752887.365754:0:2013903:0:(lustre_peer.c:119:class_add_uuid()) found uuid 82@kfi 19@kfi cnt=2
00000400:00001000:26.0:1782752887.365755:0:2013903:0:(api-ni.c:3222:LNetNIInit()) refs 2
00000400:00000200:26.0:1782752887.365756:0:2013903:0:(api-ni.c:1680:lnet_nid4_cpt_hash()) Match nid 82@kfi to cpt 4
00000400:00000200:26.0:1782752887.365757:0:2013903:0:(peer.c:1735:lnet_peer_add()) peer 82@kfi NID flags 0x100001: -17
00000400:00000200:26.0:1782752887.365758:0:2013903:0:(api-ni.c:1680:lnet_nid4_cpt_hash()) Match nid 19@kfi to cpt 4
00000400:00000200:26.0:1782752887.365759:0:2013903:0:(peer.c:1735:lnet_peer_add()) peer 19@kfi NID flags 0x1: -17
00000100:00000040:26.0:1782752887.365760:0:2013903:0:(lustre_peer.c:122:class_add_uuid()) Add peer 19@kfi rc = 0
Both lnet_peer_add() calls return error -17 (EEXIST), and mdt0 never establishes a connection to mdt1.
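When scanning a full debug log for the same symptom, it can help to pull out every lnet_peer_add() result and decode the return code. The sketch below is an illustration only: the helper name is hypothetical, and the regular expressions assume the "(file:line:function()) message" layout and the "peer <nid> NID flags <hex>: <rc>" message text seen in the excerpt above. The flag bits are reported as-is and are not interpreted.

```python
import errno
import re

# Two lnet_peer_add() lines and one unrelated line, copied from this ticket.
LOG = """\
00000400:00000200:26.0:1782752887.365757:0:2013903:0:(peer.c:1735:lnet_peer_add()) peer 82@kfi NID flags 0x100001: -17
00000400:00000200:26.0:1782752887.365758:0:2013903:0:(api-ni.c:1680:lnet_nid4_cpt_hash()) Match nid 19@kfi to cpt 4
00000400:00000200:26.0:1782752887.365759:0:2013903:0:(peer.c:1735:lnet_peer_add()) peer 19@kfi NID flags 0x1: -17
"""

# "(file:line:function()) message" portion of a debug log entry.
LINE_RE = re.compile(
    r'\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+(?P<msg>.*)'
)
# Message text emitted by lnet_peer_add() in the excerpt.
PEER_ADD_RE = re.compile(
    r'peer (?P<nid>\S+) NID flags (?P<flags>0x[0-9a-fA-F]+): (?P<rc>-?\d+)'
)


def peer_add_results(log_text):
    """Return (nid, flags, rc, errno_name) for each lnet_peer_add() entry."""
    results = []
    for raw in log_text.splitlines():
        entry = LINE_RE.search(raw)
        if not entry or entry['func'] != 'lnet_peer_add':
            continue
        msg = PEER_ADD_RE.search(entry['msg'])
        if not msg:
            continue
        rc = int(msg['rc'])
        # Kernel code returns negative errno values; 0 means success.
        name = errno.errorcode.get(-rc, 'UNKNOWN') if rc < 0 else 'OK'
        results.append((msg['nid'], msg['flags'], rc, name))
    return results


if __name__ == '__main__':
    for nid, flags, rc, name in peer_add_results(LOG):
        print(f'{nid}: flags={flags} rc={rc} ({name})')
```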
I'll attach n02 debug log.