Lustre / LU-11263

sanity-hsm test 300 fails with 'hsm_control state is not 'enabled' on mds1'


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.5
    • Severity: 3

    Description

      sanity-hsm test_300 fails. From the suite_log for https://testing.whamcloud.com/test_sets/cb44d736-a0de-11e8-8ee3-52540065bddc, it looks like vm4 is not in the 'enabled' state:

      CMD: trevis-11vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      Update not seen after 20s: wanted 'enabled' got 'stopped'
       sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'enabled' on mds1 
      

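      The check that times out is a poll of the hsm_control parameter on the MDT. As a rough illustration only (a minimal sketch assuming plain lctl usage and the 20-second window shown above, not the actual test-framework helper):

      # Hypothetical polling loop; parameter path and timeout taken from the log above.
      state=""
      for i in $(seq 1 20); do
          state=$(lctl get_param -n mdt.lustre-MDT0000.hsm_control)
          [ "$state" = "enabled" ] && break
          sleep 1
      done
      echo "hsm_control state: $state"    # the failing run still reports 'stopped' here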
      In the console log for MDS1 and MDS3 (vm4), we see some errors after the first unmount:

      [23667.706326] Lustre: DEBUG MARKER: umount /mnt/lustre-mds1
      [23667.728001] Lustre: Failing over lustre-MDT0000
      [23668.148236] Lustre: server umount lustre-MDT0000 complete
      [23668.254268] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
      [23668.254268] lctl dl | grep ' ST ' || true
      [23668.528308] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.4.122@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [23668.528313] LustreError: Skipped 46 previous similar messages
      [23669.539537] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation obd_ping to node 0@lo failed: rc = -107
      [23669.539542] LustreError: Skipped 1 previous similar message
      [23669.539547] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [23669.539548] Lustre: Skipped 1 previous similar message
      [23678.359170] Lustre: DEBUG MARKER: hostname
      [23678.465628] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
      [23678.505905] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1
      [23678.543360] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      [23678.613251] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [23678.631296] LustreError: 166-1: MGC10.9.4.123@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      [23678.631415] LustreError: 26046:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
      [23678.631485] Lustre: MGS: Connection restored to 10.9.4.123@tcp (at 0@lo)
      [23678.631486] Lustre: Skipped 27 previous similar messages
      [23678.631517] Lustre: Evicted from MGS (at MGC10.9.4.123@tcp_0) after server handle changed from 0x842e27191f8a85d1 to 0x842e27191f8ace0f
      [23678.730587] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
      [23678.779913] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
      [23678.812256] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
      [23678.850860] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests//usr/lib64/lustre/tests/utils:/opt/iozone/bin:/usr/lib64/lustre/test
      [23679.059833] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 7 clients reconnect
      [23679.110502] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23679.152771] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23679.184653] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      [23679.223601] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      [23679.261448] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      [23679.304631] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
      [23679.637253] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23679.641791] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23679.696035] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23679.698094] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23683.686751] LustreError: 167-0: lustre-MDT0000-lwp-MDT0002: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [23683.789089] Lustre: lustre-MDT0000: Recovery over after 0:05, of 7 clients 7 recovered and 0 were evicted.
      [23684.820279] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      [23684.832756] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      [23684.876149] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      [23684.884301] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      [23685.942378] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23685.977910] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23686.095857] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
      [23686.132021] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
      [23686.252173] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0000.hsm_control=enabled
      [23686.270502] Lustre: Setting parameter lustre-MDT0000.mdt.lustre-MDT0000.hsm_control in log params
      [23686.270512] Lustre: Skipped 3 previous similar messages
      [23686.299219] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0001.hsm_control=enabled
      [23686.339891] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0002.hsm_control=enabled
      [23686.380162] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -P mdt.lustre-MDT0003.hsm_control=enabled
      [23686.850519] LustreError: 17275:0:(mdt_coordinator.c:1005:mdt_hsm_cdt_start()) lustre-MDT0000: Coordinator already started or stopping
      [23706.425954] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23706.466717] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23706.590506] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
      [23706.631145] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0002.hsm_control
      [23706.761098] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      [23706.798299] Lustre: DEBUG MARKER: umount /mnt/lustre-mds1
      [23707.172546] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
      [23707.172546] lctl dl | grep ' ST ' || true
      [23717.088225] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1534314513/real 1534314513]  req@ffff88004da74c40 x1608820623035712/t0(0) o400->MGC10.9.4.123@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1534314520 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [23717.088234] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [23717.280798] Lustre: DEBUG MARKER: hostname
      [23717.389356] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
      [23717.431911] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1
      [23717.468621] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      [23717.523820] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [23717.595775] Lustre: MGS: Connection restored to 10.9.4.124@tcp (at 10.9.4.124@tcp)
      [23717.595781] Lustre: Skipped 27 previous similar messages
      [23723.327601] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
      [23723.365896] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests//usr/lib64/lustre/tests/utils:/opt/iozone/bin:/usr/lib64/lustre/test
      [23723.621406] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23723.665936] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23723.695487] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      [23723.732499] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      [23723.766970] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      [23723.807809] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
      [23724.110534] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23724.115679] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23724.163653] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23724.168911] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23728.297558] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      [23728.298167] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      [23728.354498] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      [23728.355846] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      [23729.415037] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23729.451571] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23730.491135] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23731.531240] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23732.570962] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23733.609891] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23734.648627] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23735.686065] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23736.723799] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23737.762782] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23738.801398] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23739.839530] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23740.878463] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23741.917538] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23742.956405] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23743.994575] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23745.034297] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23746.074069] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23747.111516] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23748.149274] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23749.187205] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23750.223793] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm_control
      [23750.331633] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not \'enabled\' on mds1 
      [23750.375715] Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'enabled' on mds1
      

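      The relevant sequence from the log above can be summarized as follows; this is a hedged sketch assembled from the commands visible in the console output (device path included), not the test script itself:

      umount /mnt/lustre-mds1                                     # failover umount of MDT0000
      mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1       # remount the MDT
      lctl set_param -P mdt.lustre-MDT0000.hsm_control=enabled    # logged alongside "Coordinator already started or stopping"
      lctl get_param -n mdt.lustre-MDT0000.hsm_control            # continues to return 'stopped' until the test times out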
      Similarly, there are some errors on MDS2 and MDS4 (vm5):

      [23670.655388] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation obd_ping to node 10.9.4.123@tcp failed: rc = -107
      [23670.655392] LustreError: Skipped 3 previous similar messages
      [23670.655401] Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 10.9.4.123@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [23670.655402] Lustre: Skipped 3 previous similar messages
      [23678.733466] Lustre: lustre-MDT0001: Connection restored to 10.9.4.123@tcp (at 10.9.4.123@tcp)
      [23678.733473] Lustre: Skipped 6 previous similar messages
      [23679.171633] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23679.211468] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23679.702724] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23679.703291] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23679.751953] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23679.753448] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23680.656535] LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [23680.656543] LustreError: Skipped 1 previous similar message
      [23680.656598] LustreError: 166-1: MGC10.9.4.123@tcp: Connection to MGS (at 10.9.4.123@tcp) was lost; in progress operations using this service will fail
      [23680.660167] LustreError: 16750:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
      [23680.660984] Lustre: Evicted from MGS (at MGC10.9.4.123@tcp_0) after server handle changed from 0x842e27191f8a85c3 to 0x842e27191f8ad142
      [23684.885402] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      [23684.891508] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      [23684.937278] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      [23684.946568] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      [23686.078066] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
      [23686.114999] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
      [23686.231366] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
      [23686.270114] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
      [23689.384702] LustreError: 8711:0:(mdt_coordinator.c:1005:mdt_hsm_cdt_start()) lustre-MDT0001: Coordinator already started or stopping
      [23689.384707] LustreError: 8711:0:(mdt_coordinator.c:1005:mdt_hsm_cdt_start()) Skipped 1 previous similar message
      [23706.570364] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
      [23706.607395] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0001.hsm_control
      [23706.731616] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
      [23706.766678] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0003.hsm_control
      [23717.651504] Lustre: 2894:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1534314515/real 1534314515]  req@ffff88005b13f040 x1608820615336448/t0(0) o400->MGC10.9.4.123@tcp@10.9.4.123@tcp:26/25 lens 224/224 e 0 to 1 dl 1534314522 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [23717.656643] Lustre: MGC10.9.4.123@tcp: Connection restored to MGC10.9.4.123@tcp_0 (at 10.9.4.123@tcp)
      [23717.656646] Lustre: Skipped 6 previous similar messages
      [23723.678167] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23723.720086] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23724.169208] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23724.181342] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23724.226365] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23724.233859] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23728.355778] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      [23728.361315] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      [23728.409092] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      [23728.415257] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      [23750.389384] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not \'enabled\' on mds1 
      

      The OSS also shows some connection issues:

      [23666.783662] Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount ====================== 23:27:51 (1534314471)
      [23668.572258] LustreError: 11-0: lustre-MDT0000-lwp-OST0000: operation obd_ping to node 10.9.4.123@tcp failed: rc = -107
      [23668.572275] LustreError: Skipped 7 previous similar messages
      [23668.572283] Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.9.4.123@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [23668.572284] Lustre: Skipped 7 previous similar messages
      [23678.727567] Lustre: lustre-OST0000: Connection restored to 10.9.4.123@tcp (at 10.9.4.123@tcp)
      [23678.727571] Lustre: Skipped 8 previous similar messages
      [23679.158294] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23679.203043] Lustre: DEBUG MARKER: trevis-11vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [23679.688823] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23679.695586] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23679.741636] Lustre: DEBUG MARKER: trevis-11vm2: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23679.741868] Lustre: DEBUG MARKER: trevis-11vm1: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [23680.570151] Lustre: 3979:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1534314478/real 1534314478]  req@ffff8800739c5c80 x1608820631664480/t0(0) o400->MGC10.9.4.123@tcp@10.9.4.123@tcp:26/25 lens 224/224 e 0 to 1 dl 1534314485 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [23680.570154] Lustre: 3979:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [23680.570167] LustreError: 166-1: MGC10.9.4.123@tcp: Connection to MGS (at 10.9.4.123@tcp) was lost; in progress operations using this service will fail
      [23680.570268] LustreError: 16611:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
      [23680.570269] LustreError: 16611:0:(mgc_request.c:603:do_requeue()) Skipped 1 previous similar message
      [23680.573305] Lustre: Evicted from MGS (at 10.9.4.123@tcp) after server handle changed from 0x842e27191f8a8935 to 0x842e27191f8ad085
      

      We’ve seen two instances of this failure:
      2018-08-04 2.10.5 RC1 CentOS7 DNE/ZFS
      https://testing.whamcloud.com/test_sets/797d69de-97a8-11e8-a9f7-52540065bddc
      2018-08-15 2.10.5 RC2 SLES12 SP3 DNE/ldiskfs
      https://testing.whamcloud.com/test_sets/cb44d736-a0de-11e8-8ee3-52540065bddc

            People

              Assignee: wc-triage (WC Triage)
              Reporter: jamesanunez (James Nunez, Inactive)
