So on the MDS I see the following lctl dump:
00000100:00080000:5.0:1508798556.578627:0:4131:0:(pinger.c:405:ptlrpc_pinger_add_import()) adding pingable import 19afc095-abef-a794-2f84-9099c3e67329->MGS
00000020:01000004:5.0:1508798556.578635:0:4131:0:(obd_mount_server.c:1303:server_start_targets()) starting target sultan-MDT0000
00000020:01000004:5.0:1508798556.578694:0:4131:0:(obd_mount.c:193:lustre_start_simple()) Starting obd MDS (typ=mds)
00000020:00000080:5.0:1508798556.578696:0:4131:0:(obd_config.c:1144:class_process_config()) processing cmd: cf001
00000020:00000080:5.0:1508798556.623854:0:4131:0:(genops.c:414:class_newdev()) Allocate new device MDS (ffff8817d14c8000)
00000020:00000080:5.0:1508798556.623939:0:4131:0:(obd_config.c:431:class_attach()) OBD: dev 2 attached type mds with refcount 1
00000020:00000080:5.0:1508798556.623945:0:4131:0:(obd_config.c:1144:class_process_config()) processing cmd: cf003
00000020:00000080:7.0:1508798556.670941:0:4131:0:(obd_config.c:542:class_setup()) finished setup of obd MDS (uuid MDS_uuid)
00000020:01000004:7.0:1508798556.670956:0:4131:0:(obd_mount_server.c:294:server_mgc_set_fs()) Set mgc disk for /dev/sda
00000040:01000000:7.0:1508798556.673106:0:4131:0:(llog_obd.c:210:llog_setup()) obd MGC10.37.248.67@o2ib1 ctxt 0 is initialized
00000020:01000004:7.0:1508798556.673119:0:4131:0:(obd_mount_server.c:1208:server_register_target()) Registration sultan-MDT0000, fs=sultan, 10.37.248.155@o2ib1
, index=0000, flags=0x1
10000000:01000000:7.0:1508798556.673122:0:4131:0:(mgc_request.c:1253:mgc_set_info_async()) register_target sultan-MDT0000 0x10000001
10000000:01000000:7.0:1508798556.673144:0:4131:0:(mgc_request.c:1203:mgc_target_register()) register sultan-MDT0000
00000100:00080000:7.0:1508798556.673152:0:4131:0:(client.c:1562:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff8817cc750000 x1582089946267680/t0(0) o253->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00000400:3.0F:1508798561.578402:0:3989:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1508798556/real 0] req@ffff8817d1b40000 x1582089946267664/t0(0) o250->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 520/544 e 0 to 1 dl 1508798561 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
00000100:00080000:7.0:1508798567.672382:0:4131:0:(client.c:1170:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8817cc750000 x1582089946267680/t0(0) o253->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000020:00080000:7.0:1508798567.672401:0:4131:0:(obd_mount_server.c:1233:server_register_target()) sultan-MDT0000: error registering with the MGS: rc = -110 (not fatal)
00000020:01000004:7.0:1508798567.672406:0:4131:0:(obd_mount_server.c:117:server_register_mount()) register mount ffff8817df73f800 from sultan-MDT0000
10000000:01000000:7.0:1508798567.672412:0:4131:0:(mgc_request.c:2197:mgc_process_config()) parse_log sultan-MDT0000 from 0
10000000:01000000:7.0:1508798567.672413:0:4131:0:(mgc_request.c:331:config_log_add()) adding config log sultan-MDT0000: (null)
10000000:01000000:7.0:1508798567.672416:0:4131:0:(mgc_request.c:211:do_config_log_add()) do adding config log sultan-sptlrpc: (null)
10000000:01000000:7.0:1508798567.672419:0:4131:0:(mgc_request.c:90:mgc_name2resid()) log sultan-sptlrpc to resid 0x6e61746c7573/0x0 (sultan)
10000000:01000000:7.0:1508798567.672425:0:4131:0:(mgc_request.c:2062:mgc_process_log()) Process log sultan-sptlrpc: (null) from 1
10000000:01000000:7.0:1508798567.672427:0:4131:0:(mgc_request.c:1130:mgc_enqueue()) Enqueue for sultan-sptlrpc (res 0x6e61746c7573)
00000100:00080000:7.0:1508798567.672459:0:4131:0:(client.c:1562:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff8817cc750000 x1582089946267696/t0(0) o101->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000800:00000400:6.0:1508798569.567382:0:3842:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.37.248.67@o2ib1: 4295778 seconds
00000100:00080000:1.0:1508798569.578255:0:3989:0:(import.c:1289:ptlrpc_connect_interpret()) ffff8817e66a2800 MGS: changing import state from CONNECTING to DISCONN
00000100:00080000:1.0:1508798569.578260:0:3989:0:(import.c:1336:ptlrpc_connect_interpret()) recovery of MGS on MGC10.37.248.67@o2ib1_0 failed (-110)
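
For anyone reading the dump above, here is a rough Python sketch of how the records can be picked apart. It assumes the usual colon-separated header (subsystem mask, debug mask, CPU, timestamp, then pid and file:line:function()); the field names, the `RECORD` pattern, `LINUX_ERRNO` and `parse` are my own illustrative choices, not anything Lustre provides. The two sample records are copied from the dump with the terminal line wrap undone. The only hard fact used is that 110 is ETIMEDOUT on Linux, so rc = -110 is a timeout.

```python
import re
from decimal import Decimal

# Assumed header layout: subsys:mask:cpu:timestamp:N:pid:N:(file:line:func()) message
RECORD = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>\d+\.\d+F?):"
    r"(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)

# Only the errno seen in this dump; values are the Linux ones.
LINUX_ERRNO = {110: "ETIMEDOUT"}


def parse(record):
    """Split one debug record into fields; return None if it does not match."""
    m = RECORD.match(record)
    if m is None:
        return None
    fields = m.groupdict()
    # Decimal keeps the microsecond part exact when subtracting timestamps.
    fields["ts"] = Decimal(fields["ts"])
    rc = re.search(r"rc = (-\d+)", fields["msg"])
    # Kernel return codes are negative errno values, so flip the sign.
    fields["errno"] = LINUX_ERRNO.get(-int(rc.group(1))) if rc else None
    return fields


# Records copied from the dump above (trailing request details omitted on the first).
QUEUED = ("00000100:00080000:7.0:1508798556.673152:0:4131:0:"
          "(client.c:1562:ptlrpc_send_new_req()) @@@ req waiting for recovery: "
          "(FULL != CONNECTING) req@ffff8817cc750000")
FAILED = ("00000020:00080000:7.0:1508798567.672401:0:4131:0:"
          "(obd_mount_server.c:1233:server_register_target()) sultan-MDT0000: "
          "error registering with the MGS: rc = -110 (not fatal)")

queued, failed = parse(QUEUED), parse(FAILED)
print(failed["func"], failed["errno"])
print("seconds between queueing and failure:", failed["ts"] - queued["ts"])
```

So the registration request sits waiting on the MGS import for roughly 11 seconds and then gives up with a timeout, which fits the o2iblnd "Timed out tx" message a couple of seconds later.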
Do any of the nodes have multiple interfaces? If so, can you please make sure you follow this general Linux routing guideline:
https://wiki.hpdd.intel.com/display/LNet/MR+Cluster+Setup