Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.10.5
Severity: 3
Description
sanity-hsm test_300 fails. From the suite_log for https://testing.whamcloud.com/test_sets/cb44d736-a0de-11e8-8ee3-52540065bddc, it looks like mdt.lustre-MDT0000.hsm_control on vm4 never reaches the 'enabled' state:
CMD: trevis-11vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
Update not seen after 20s: wanted 'enabled' got 'stopped'
sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'enabled' on mds1
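For context, the check that times out here is just a poll of that parameter. Below is a minimal sketch of such a wait loop, using the parameter name and 20s limit from the log above; the helper name and 1-second poll interval are assumptions for illustration, not the actual sanity-hsm wait_update code.

# Minimal sketch (assumption): poll hsm_control until it reads 'enabled',
# giving up after the 20s limit seen in the suite_log.
wait_hsm_enabled() {
        local param="mdt.lustre-MDT0000.hsm_control"
        local state deadline=$((SECONDS + 20))
        while [ "$SECONDS" -lt "$deadline" ]; do
                state=$(lctl get_param -n "$param" 2>/dev/null)
                [ "$state" = "enabled" ] && return 0
                sleep 1
        done
        echo "Update not seen after 20s: wanted 'enabled' got '$state'"
        return 1
}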
In the console logs for MDS1 and MDS3 (vm4), we see some errors after the first unmount:
[23667.706326] Lustre: DEBUG MARKER: umount /mnt/lustre-mds1
[23667.728001] Lustre: Failing over lustre-MDT0000
[23668.148236] Lustre: server umount lustre-MDT0000 complete
[23668.254268] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[23668.254268] lctl dl | grep ' ST ' || true
[23668.528308] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.4.122@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[23668.528313] LustreError: Skipped 46 previous similar messages
[23669.539537] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation obd_ping to node 0@lo failed: rc = -107
[23669.539542] LustreError: Skipped 1 previous similar message
[23669.539547] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[23669.539548] Lustre: Skipped 1 previous similar message
[23678.359170] Lustre: DEBUG MARKER: hostname
[23678.465628] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
[23678.505905] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1
[23678.543360] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
[23678.613251] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[23678.631296] LustreError: 166-1: MGC10.9.4.123@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[23678.631415] LustreError: 26046:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
[23678.631485] Lustre: MGS: Connection restored to 10.9.4.123@tcp (at 0@lo)
[23678.631486] Lustre: Skipped 27 previous similar messages
[23678.631517] Lustre: Evicted from MGS (at MGC10.9.4.123@tcp_0) after server handle changed from 0x842e27191f8a85d1 to 0x842e27191f8ace0f
[23678.730587] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[23678.779913] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
[23678.812256] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
[23678.850860] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests//usr/lib64/lustre/tests/utils:/opt/iozone/bin:/usr/lib64/lustre/test
[23679.059833] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 7 clients reconnect
[23679.110502] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.152771] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.184653] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
[23679.223601] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[23679.261448] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[23679.304631] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
[23679.637253] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.641791] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.696035] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.698094] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23683.686751] LustreError: 167-0: lustre-MDT0000-lwp-MDT0002: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[23683.789089] Lustre: lustre-MDT0000: Recovery over after 0:05, of 7 clients 7 recovered and 0 were evicted.
[23684.820279] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.832756] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.876149] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.884301] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23685.942378] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23685.977910] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23686.095857] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
[23686.132021] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
[23686.252173] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0000.hsm_control=enabled
[23686.270502] Lustre: Setting parameter lustre-MDT0000.mdt.lustre-MDT0000.hsm_control in log params
[23686.270512] Lustre: Skipped 3 previous similar messages
[23686.299219] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0001.hsm_control=enabled
[23686.339891] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0002.hsm_control=enabled
[23686.380162] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0003.hsm_control=enabled
[23686.850519] LustreError: 17275:0:(mdt_coordinator.c:1005:mdt_hsm_cdt_start()) lustre-MDT0000: Coordinator already started or stopping
[23706.425954] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23706.466717] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23706.590506] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
[23706.631145] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
[23706.761098] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[23706.798299] Lustre: DEBUG MARKER: umount /mnt/lustre-mds1
[23707.172546] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[23707.172546] lctl dl | grep ' ST ' || true
[23717.088225] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1534314513/real 1534314513] req@ffff88004da74c40 x1608820623035712/t0(0) o400->MGC10.9.4.123@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1534314520 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[23717.088234] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[23717.280798] Lustre: DEBUG MARKER: hostname
[23717.389356] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
[23717.431911] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1
[23717.468621] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
[23717.523820] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[23717.595775] Lustre: MGS: Connection restored to 10.9.4.124@tcp (at 10.9.4.124@tcp)
[23717.595781] Lustre: Skipped 27 previous similar messages
[23723.327601] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
[23723.365896] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests//usr/lib64/lustre/tests/utils:/opt/iozone/bin:/usr/lib64/lustre/test
[23723.621406] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23723.665936] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23723.695487] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
[23723.732499] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[23723.766970] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[23723.807809] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
[23724.110534] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.115679] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.163653] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.168911] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23728.297558] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.298167] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.354498] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.355846] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23729.415037] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23729.451571] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23730.491135] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23731.531240] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23732.570962] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23733.609891] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23734.648627] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23735.686065] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23736.723799] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23737.762782] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23738.801398] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23739.839530] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23740.878463] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23741.917538] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23742.956405] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23743.994575] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23745.034297] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23746.074069] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23747.111516] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23748.149274] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23749.187205] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23750.223793] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
[23750.331633] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not \'enabled\' on mds1
[23750.375715] Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'enabled' on mds1
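Reading the markers back, the mds1-side sequence the test drives appears to be roughly the following. This is a reconstruction from the console log above, not the test_300 source; the device and mount point are taken verbatim from the log.

# Rough reconstruction of the sequence visible in the mds1 console log
# (assumption: this mirrors test_300's intent, it is not the test source).
umount /mnt/lustre-mds1                                    # first MDT umount
mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1      # remount, recovery completes
lctl set_param -P mdt.lustre-MDT0000.hsm_control=enabled   # persistent HSM coordinator setting
umount /mnt/lustre-mds1                                    # second umount
mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1      # remount again
lctl get_param -n mdt.lustre-MDT0000.hsm_control           # expected 'enabled', still reads 'stopped'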
Similarly, we see some errors on MDS2 and MDS4 (vm5):
[23670.655388] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation obd_ping to node 10.9.4.123@tcp failed: rc = -107
[23670.655392] LustreError: Skipped 3 previous similar messages
[23670.655401] Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 10.9.4.123@tcp) was lost; in progress operations using this service will wait for recovery to complete
[23670.655402] Lustre: Skipped 3 previous similar messages
[23678.733466] Lustre: lustre-MDT0001: Connection restored to 10.9.4.123@tcp (at 10.9.4.123@tcp)
[23678.733473] Lustre: Skipped 6 previous similar messages
[23679.171633] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.211468] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.702724] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.703291] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.751953] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.753448] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23680.656535] LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[23680.656543] LustreError: Skipped 1 previous similar message
[23680.656598] LustreError: 166-1: MGC10.9.4.123@tcp: Connection to MGS (at 10.9.4.123@tcp) was lost; in progress operations using this service will fail
[23680.660167] LustreError: 16750:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
[23680.660984] Lustre: Evicted from MGS (at MGC10.9.4.123@tcp_0) after server handle changed from 0x842e27191f8a85c3 to 0x842e27191f8ad142
[23684.885402] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.891508] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.937278] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23684.946568] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
[23686.078066] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
[23686.114999] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
[23686.231366] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
[23686.270114] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
[23689.384702] LustreError: 8711:0:(mdt_coordinator.c:1005:mdt_hsm_cdt_start()) lustre-MDT0001: Coordinator already started or stopping
[23689.384707] LustreError: 8711:0:(mdt_coordinator.c:1005:mdt_hsm_cdt_start()) Skipped 1 previous similar message
[23706.570364] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
[23706.607395] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
[23706.731616] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
[23706.766678] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
[23717.651504] Lustre: 2894:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1534314515/real 1534314515] req@ffff88005b13f040 x1608820615336448/t0(0) o400->MGC10.9.4.123@tcp@10.9.4.123@tcp:26/25 lens 224/224 e 0 to 1 dl 1534314522 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[23717.656643] Lustre: MGC10.9.4.123@tcp: Connection restored to MGC10.9.4.123@tcp_0 (at 10.9.4.123@tcp)
[23717.656646] Lustre: Skipped 6 previous similar messages
[23723.678167] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23723.720086] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23724.169208] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.181342] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.226365] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23724.233859] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23728.355778] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.361315] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.409092] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23728.415257] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
[23750.389384] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not \'enabled\' on mds1
The OSS also has some connection issues:
[23666.783662] Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount ====================== 23:27:51 (1534314471)
[23668.572258] LustreError: 11-0: lustre-MDT0000-lwp-OST0000: operation obd_ping to node 10.9.4.123@tcp failed: rc = -107
[23668.572275] LustreError: Skipped 7 previous similar messages
[23668.572283] Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.9.4.123@tcp) was lost; in progress operations using this service will wait for recovery to complete
[23668.572284] Lustre: Skipped 7 previous similar messages
[23678.727567] Lustre: lustre-OST0000: Connection restored to 10.9.4.123@tcp (at 10.9.4.123@tcp)
[23678.727571] Lustre: Skipped 8 previous similar messages
[23679.158294] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.203043] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[23679.688823] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.695586] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.741636] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23679.741868] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[23680.570151] Lustre: 3979:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1534314478/real 1534314478] req@ffff8800739c5c80 x1608820631664480/t0(0) o400->MGC10.9.4.123@tcp@10.9.4.123@tcp:26/25 lens 224/224 e 0 to 1 dl 1534314485 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[23680.570154] Lustre: 3979:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[23680.570167] LustreError: 166-1: MGC10.9.4.123@tcp: Connection to MGS (at 10.9.4.123@tcp) was lost; in progress operations using this service will fail
[23680.570268] LustreError: 16611:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
[23680.570269] LustreError: 16611:0:(mgc_request.c:603:do_requeue()) Skipped 1 previous similar message
[23680.573305] Lustre: Evicted from MGS (at 10.9.4.123@tcp) after server handle changed from 0x842e27191f8a8935 to 0x842e27191f8ad085
We’ve seen two instances of this failure:
2018-08-04 2.10.5 RC1 CentOS7 DNE/ZFS
https://testing.whamcloud.com/test_sets/797d69de-97a8-11e8-a9f7-52540065bddc
2018-08-15 2.10.5 RC2 SLES12 SP3 DNE/ldiskfs
https://testing.whamcloud.com/test_sets/cb44d736-a0de-11e8-8ee3-52540065bddc