Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.5
- Severity: 3
Description
sanity-hsm test_300 fails. From the suite_log for https://testing.whamcloud.com/test_sets/cb44d736-a0de-11e8-8ee3-52540065bddc, it looks like hsm_control on mds1 (vm4) never reaches the 'enabled' state:
CMD: trevis-11vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
Update not seen after 20s: wanted 'enabled' got 'stopped'
sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'enabled' on mds1
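For context, the failing step is simply the test polling hsm_control on mds1 until it reads 'enabled'. A minimal sketch of that check follows; the real test goes through the sanity-hsm/test-framework helpers, and the ssh invocation of trevis-11vm4 here is only an assumption about how the remote command is issued:

# Poll mds1 for up to 20s, matching the 'Update not seen after 20s' message above
for i in $(seq 20); do
    state=$(ssh trevis-11vm4 "/usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control")
    [ "$state" = "enabled" ] && break
    sleep 1
done
[ "$state" = "enabled" ] || echo "FAIL: hsm_control state is not 'enabled' on mds1"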
In the console log for MDS1 and MDS3 (vm4), we see some errors after the first unmount:
[23667.706326] Lustre: DEBUG MARKER: umount /mnt/lustre-mds1
[23667.728001] Lustre: Failing over lustre-MDT0000
[23668.148236] Lustre: server umount lustre-MDT0000 complete
[23668.254268] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[23668.254268] lctl dl | grep ' ST ' || true
[23668.528308] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.4.122@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[23668.528313] LustreError: Skipped 46 previous similar messages
[23669.539537] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation obd_ping to node 0@lo failed: rc = -107
[23669.539542] LustreError: Skipped 1 previous similar message
[23669.539547] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[23669.539548] Lustre: Skipped 1 previous similar message
[23678.359170] Lustre: DEBUG MARKER: hostname
[23678.465628] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
[23678.505905] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1
[23678.543360] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
[23678.613251] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[23678.631296] LustreError: 166-1: MGC10.9.4.123@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[23678.631415] LustreError: 26046:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
[23678.631485] Lustre: MGS: Connection restored to 10.9.4.123@tcp (at 0@lo)
[23678.631486] Lustre: Skipped 27 previous similar messages
[23678.631517] Lustre: Evicted from MGS (at MGC10.9.4.123@tcp_0) after server handle changed from 0x842e27191f8a85d1 to 0x842e27191f8ace0f
[23678.730587] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[23678.779913] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
[23678.812256] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
[23678.850860] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests//usr/lib64/lustre/tests/utils:/opt/iozone/bin:/usr/lib64/lustre/test
[23679.059833] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 7 clients reconnect
[23679.110502] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.152771] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.184653] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
[23679.223601] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[23679.261448] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[23679.304631] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
[23679.637253] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.641791] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.696035] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.698094] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23683.686751] LustreError: 167-0: lustre-MDT0000-lwp-MDT0002: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[23683.789089] Lustre: lustre-MDT0000: Recovery over after 0:05, of 7 clients 7 recovered and 0 were evicted.
[23684.820279] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.832756] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.876149] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.884301] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23685.942378] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23685.977910] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23686.095857] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
[23686.132021] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
[23686.252173] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0000.hsm_control=enabled
[23686.270502] Lustre: Setting parameter lustre-MDT0000.mdt.lustre-MDT0000.hsm_control in log params
[23686.270512] Lustre: Skipped 3 previous similar messages
[23686.299219] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0001.hsm_control=enabled
[23686.339891] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0002.hsm_control=enabled
[23686.380162] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0003.hsm_control=enabled
[23686.850519] LustreError: 17275:0:(mdt_coordinator.c:1005:mdt_hsm_cdt_start()) lustre-MDT0000: Coordinator already started or stopping
[23706.425954] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23706.466717] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23706.590506] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
[23706.631145] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
[23706.761098] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[23706.798299] Lustre: DEBUG MARKER: umount /mnt/lustre-mds1
[23707.172546] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[23707.172546] lctl dl | grep ' ST ' || true
[23717.088225] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1534314513/real 1534314513] req@ffff88004da74c40 x1608820623035712/t0(0) o400->MGC10.9.4.123@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1534314520 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[23717.088234] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[23717.280798] Lustre: DEBUG MARKER: hostname
[23717.389356] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
[23717.431911] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1
[23717.468621] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
[23717.523820] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[23717.595775] Lustre: MGS: Connection restored to 10.9.4.124@tcp (at 10.9.4.124@tcp)
[23717.595781] Lustre: Skipped 27 previous similar messages
[23723.327601] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
[23723.365896] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests//usr/lib64/lustre/tests/utils:/opt/iozone/bin:/usr/lib64/lustre/test
[23723.621406] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23723.665936] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23723.695487] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
[23723.732499] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[23723.766970] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[23723.807809] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
[23724.110534] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.115679] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.163653] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.168911] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23728.297558] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.298167] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.354498] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.355846] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23729.415037] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23729.451571] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23730.491135] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23731.531240] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23732.570962] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23733.609891] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23734.648627] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23735.686065] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23736.723799] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23737.762782] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23738.801398] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23739.839530] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23740.878463] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23741.917538] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23742.956405] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23743.994575] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23745.034297] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23746.074069] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23747.111516] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23748.149274] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23749.187205] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23750.223793] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23750.331633] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not \'enabled\' on mds1
[23750.375715] Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'enabled' on mds1
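Piecing together the MDS1 console log above, the part of the sequence that fails is roughly the following; this is only a reconstruction from the debug markers, not the actual sanity-hsm script:

# persist the coordinator state on each MDT (MDT0000 shown)
lctl set_param -P mdt.lustre-MDT0000.hsm_control=enabled
# restart the MDT so the on-disk state has to be re-read
umount /mnt/lustre-mds1
mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
# the state should come back as 'enabled', but every poll above returns 'stopped'
lctl get_param -n mdt.lustre-MDT0000.hsm_control

Note the "Coordinator already started or stopping" error from mdt_hsm_cdt_start() right after the set_param -P calls; the same message appears for lustre-MDT0001 on MDS2/MDS4 below.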
Similarly, we see some errors on MDS2 and MDS4 (vm5):
[23670.655388] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation obd_ping to node 10.9.4.123@tcp failed: rc = -107
[23670.655392] LustreError: Skipped 3 previous similar messages
[23670.655401] Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 10.9.4.123@tcp) was lost; in progress operations using this service will wait for recovery to complete
[23670.655402] Lustre: Skipped 3 previous similar messages
[23678.733466] Lustre: lustre-MDT0001: Connection restored to 10.9.4.123@tcp (at 10.9.4.123@tcp)
[23678.733473] Lustre: Skipped 6 previous similar messages
[23679.171633] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.211468] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.702724] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.703291] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.751953] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.753448] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23680.656535] LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[23680.656543] LustreError: Skipped 1 previous similar message
[23680.656598] LustreError: 166-1: MGC10.9.4.123@tcp: Connection to MGS (at 10.9.4.123@tcp) was lost; in progress operations using this service will fail
[23680.660167] LustreError: 16750:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
[23680.660984] Lustre: Evicted from MGS (at MGC10.9.4.123@tcp_0) after server handle changed from 0x842e27191f8a85c3 to 0x842e27191f8ad142
[23684.885402] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.891508] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.937278] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.946568] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23686.078066] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
[23686.114999] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
[23686.231366] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
[23686.270114] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
[23689.384702] LustreError: 8711:0:(mdt_coordinator.c:1005:mdt_hsm_cdt_start()) lustre-MDT0001: Coordinator already started or stopping
[23689.384707] LustreError: 8711:0:(mdt_coordinator.c:1005:mdt_hsm_cdt_start()) Skipped 1 previous similar message
[23706.570364] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
[23706.607395] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
[23706.731616] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
[23706.766678] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
[23717.651504] Lustre: 2894:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1534314515/real 1534314515] req@ffff88005b13f040 x1608820615336448/t0(0) o400->MGC10.9.4.123@tcp@10.9.4.123@tcp:26/25 lens 224/224 e 0 to 1 dl 1534314522 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[23717.656643] Lustre: MGC10.9.4.123@tcp: Connection restored to MGC10.9.4.123@tcp_0 (at 10.9.4.123@tcp)
[23717.656646] Lustre: Skipped 6 previous similar messages
[23723.678167] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23723.720086] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23724.169208] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.181342] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.226365] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.233859] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23728.355778] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.361315] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.409092] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.415257] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23750.389384] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not \'enabled\' on mds1
The OSS also has some connection issues:
[23666.783662] Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount ====================== 23:27:51 (1534314471)
[23668.572258] LustreError: 11-0: lustre-MDT0000-lwp-OST0000: operation obd_ping to node 10.9.4.123@tcp failed: rc = -107
[23668.572275] LustreError: Skipped 7 previous similar messages
[23668.572283] Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.9.4.123@tcp) was lost; in progress operations using this service will wait for recovery to complete
[23668.572284] Lustre: Skipped 7 previous similar messages
[23678.727567] Lustre: lustre-OST0000: Connection restored to 10.9.4.123@tcp (at 10.9.4.123@tcp)
[23678.727571] Lustre: Skipped 8 previous similar messages
[23679.158294] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.203043] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.688823] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.695586] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.741636] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.741868] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23680.570151] Lustre: 3979:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1534314478/real 1534314478] req@ffff8800739c5c80 x1608820631664480/t0(0) o400->MGC10.9.4.123@tcp@10.9.4.123@tcp:26/25 lens 224/224 e 0 to 1 dl 1534314485 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[23680.570154] Lustre: 3979:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[23680.570167] LustreError: 166-1: MGC10.9.4.123@tcp: Connection to MGS (at 10.9.4.123@tcp) was lost; in progress operations using this service will fail
[23680.570268] LustreError: 16611:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
[23680.570269] LustreError: 16611:0:(mgc_request.c:603:do_requeue()) Skipped 1 previous similar message
[23680.573305] Lustre: Evicted from MGS (at 10.9.4.123@tcp) after server handle changed from 0x842e27191f8a8935 to 0x842e27191f8ad085
We’ve seen two instances of this failure:
2018-08-04 2.10.5 RC1 CentOS7 DNE/ZFS
https://testing.whamcloud.com/test_sets/797d69de-97a8-11e8-a9f7-52540065bddc
2018-08-15 2.10.5 RC2 SLES12 SP3 DNE/ldiskfs
https://testing.whamcloud.com/test_sets/cb44d736-a0de-11e8-8ee3-52540065bddc