So on the MDS I see the following lctl dump:
00000100:00080000:5.0:1508798556.578627:0:4131:0:(pinger.c:405:ptlrpc_pinger_add_import()) adding pingable import 19afc095-abef-a794-2f84-9099c3e67329->MGS
00000020:01000004:5.0:1508798556.578635:0:4131:0:(obd_mount_server.c:1303:server_start_targets()) starting target sultan-MDT0000
00000020:01000004:5.0:1508798556.578694:0:4131:0:(obd_mount.c:193:lustre_start_simple()) Starting obd MDS (typ=mds)
00000020:00000080:5.0:1508798556.578696:0:4131:0:(obd_config.c:1144:class_process_config()) processing cmd: cf001
00000020:00000080:5.0:1508798556.623854:0:4131:0:(genops.c:414:class_newdev()) Allocate new device MDS (ffff8817d14c8000)
00000020:00000080:5.0:1508798556.623939:0:4131:0:(obd_config.c:431:class_attach()) OBD: dev 2 attached type mds with refcount 1
00000020:00000080:5.0:1508798556.623945:0:4131:0:(obd_config.c:1144:class_process_config()) processing cmd: cf003
00000020:00000080:7.0:1508798556.670941:0:4131:0:(obd_config.c:542:class_setup()) finished setup of obd MDS (uuid MDS_uuid)
00000020:01000004:7.0:1508798556.670956:0:4131:0:(obd_mount_server.c:294:server_mgc_set_fs()) Set mgc disk for /dev/sda
00000040:01000000:7.0:1508798556.673106:0:4131:0:(llog_obd.c:210:llog_setup()) obd MGC10.37.248.67@o2ib1 ctxt 0 is initialized
00000020:01000004:7.0:1508798556.673119:0:4131:0:(obd_mount_server.c:1208:server_register_target()) Registration sultan-MDT0000, fs=sultan, 10.37.248.155@o2ib1
, index=0000, flags=0x1
10000000:01000000:7.0:1508798556.673122:0:4131:0:(mgc_request.c:1253:mgc_set_info_async()) register_target sultan-MDT0000 0x10000001
10000000:01000000:7.0:1508798556.673144:0:4131:0:(mgc_request.c:1203:mgc_target_register()) register sultan-MDT0000
00000100:00080000:7.0:1508798556.673152:0:4131:0:(client.c:1562:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff8817cc750000 x1582089946267680/t0(0) o253->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00000400:3.0F:1508798561.578402:0:3989:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1508798556/real 0] req@ffff8817d1b40000 x1582089946267664/t0(0) o250->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 520/544 e 0 to 1 dl 1508798561 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
00000100:00080000:7.0:1508798567.672382:0:4131:0:(client.c:1170:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8817cc750000 x1582089946267680/t0(0) o253->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000020:00080000:7.0:1508798567.672401:0:4131:0:(obd_mount_server.c:1233:server_register_target()) sultan-MDT0000: error registering with the MGS: rc = -110 (not fatal)
00000020:01000004:7.0:1508798567.672406:0:4131:0:(obd_mount_server.c:117:server_register_mount()) register mount ffff8817df73f800 from sultan-MDT0000
10000000:01000000:7.0:1508798567.672412:0:4131:0:(mgc_request.c:2197:mgc_process_config()) parse_log sultan-MDT0000 from 0
10000000:01000000:7.0:1508798567.672413:0:4131:0:(mgc_request.c:331:config_log_add()) adding config log sultan-MDT0000: (null)
10000000:01000000:7.0:1508798567.672416:0:4131:0:(mgc_request.c:211:do_config_log_add()) do adding config log sultan-sptlrpc: (null)
10000000:01000000:7.0:1508798567.672419:0:4131:0:(mgc_request.c:90:mgc_name2resid()) log sultan-sptlrpc to resid 0x6e61746c7573/0x0 (sultan)
10000000:01000000:7.0:1508798567.672425:0:4131:0:(mgc_request.c:2062:mgc_process_log()) Process log sultan-sptlrpc: (null) from 1
10000000:01000000:7.0:1508798567.672427:0:4131:0:(mgc_request.c:1130:mgc_enqueue()) Enqueue for sultan-sptlrpc (res 0x6e61746c7573)
00000100:00080000:7.0:1508798567.672459:0:4131:0:(client.c:1562:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff8817cc750000 x1582089946267696/t0(0) o101->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000800:00000400:6.0:1508798569.567382:0:3842:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.37.248.67@o2ib1: 4295778 seconds
00000100:00080000:1.0:1508798569.578255:0:3989:0:(import.c:1289:ptlrpc_connect_interpret()) ffff8817e66a2800 MGS: changing import state from CONNECTING to DISCONN
00000100:00080000:1.0:1508798569.578260:0:3989:0:(import.c:1336:ptlrpc_connect_interpret()) recovery of MGS on MGC10.37.248.67@o2ib1_0 failed (-110)
Okay, I ran git bisect to find when this failure started, and it's due to the multi-rail support landing. As things stand, people moving to Lustre 2.10 might find they can't mount Lustre at all when deploying a production system.
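
For anyone comparing dumps across bisect steps, here is a rough Python sketch for splitting a debug line like the ones above into fields, so timestamps and return codes can be lined up. The field names (subsystem, mask, cpu, and so on) are my own labels for the colon-separated header as it appears in this dump, and `parse_debug_line` and `elapsed` are hypothetical helpers, not part of lctl or any Lustre tool. It assumes one log entry per line.

```python
import re

# Assumed layout, inferred from the dump above:
# subsystem:mask:cpu:seconds.usec:stack:pid:extpid:(file:line:func()) message
LINE_RE = re.compile(
    r"^(?P<subsystem>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<time>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s?(?P<msg>.*)$"
)


def parse_debug_line(text):
    """Return a dict of header fields plus the message, or None if no match."""
    m = LINE_RE.match(text)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = float(rec["time"])
    rec["pid"] = int(rec["pid"])
    rec["line"] = int(rec["line"])
    return rec


def elapsed(first, second):
    """Seconds between two parsed records (second minus first)."""
    return second["time"] - first["time"]


if __name__ == "__main__":
    start = parse_debug_line(
        "10000000:01000000:7.0:1508798556.673144:0:4131:0:"
        "(mgc_request.c:1203:mgc_target_register()) register sultan-MDT0000"
    )
    fail = parse_debug_line(
        "00000020:00080000:7.0:1508798567.672401:0:4131:0:"
        "(obd_mount_server.c:1233:server_register_target()) "
        "sultan-MDT0000: error registering with the MGS: rc = -110 (not fatal)"
    )
    print(start["func"], "->", fail["func"], round(elapsed(start, fail), 3), "s")
```

Run against the two lines in the example, this shows the gap between the register request and the rc = -110 failure, which is roughly 11 seconds in this dump.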