Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
- Component/s: None
- Labels: DNE
- Severity: 3
Description
sanity test_208 fails in test 4, 'lease can sustain over recovery', with the error 'lease broken over recovery'. These failures started on 8 April 2019. So far, this test is only failing in DNE testing.
Looking at the test_suite log for a recent failure, https://testing.whamcloud.com/test_sets/03bcb85a-5ace-11e9-9720-52540065bddc, we see:
== sanity test 208: Exclusive open =================================================================== 12:01:04 (1554811264)
==== test 1: verify get lease work
read lease(1) has applied.
==== test 2: verify lease can be broken by upcoming open
no lease applied.
==== test 3: verify lease can't be granted if an open already exists
multiop: cannot get READ lease, ext 0: Device or resource busy (16)
multiop: apply/unlock lease error: Device or resource busy
==== test 4: lease can sustain over recovery
Failing mds1 on trevis-40vm9
…
trevis-40vm6: trevis-40vm6.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
trevis-40vm7: trevis-40vm7.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
trevis-40vm7: CMD: trevis-40vm7.trevis.whamcloud.com lctl get_param -n at_max
trevis-40vm7: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
trevis-40vm6: CMD: trevis-40vm6.trevis.whamcloud.com lctl get_param -n at_max
trevis-40vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
multiop: expect lease exists
no lease applied.
sanity test_208: @@@@@@ FAIL: lease broken over recovery
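For reference, the failing sequence condenses to roughly the sketch below. This is not the exact sanity.sh test_208 code: the multiop option string, the file name, and the environment setup are placeholders and assumptions, while fail() and $MULTIOP are the test-framework pieces visible in the log above.

#!/bin/bash
# Hedged sketch of sanity test_208 subtest 4. Assumes a running Lustre test
# environment; the sourcing path and the multiop option string are
# placeholders, not the exact sanity.sh code.
. ${LUSTRE:-/usr/lib64/lustre}/tests/test-framework.sh
init_test_env $@

tfile=$MOUNT/f208.sanity

# Take a read lease in the background and have multiop wait for a signal
# before re-checking it (placeholder option string).
$MULTIOP $tfile oO_CREAT:O_RDWR:eR_E-eUc &
multiop_pid=$!
sleep 1

# Fail MDT0000. fail() remounts the MDT and, as the log above shows, waits
# for the clients' mdc.lustre-MDT0000-mdc-*.mds_server_uuid imports to
# return to FULL|IDLE before it returns.
fail mds1

# Ask multiop to verify the lease after recovery. A passing run prints
# "lease(1) has applied"; the failing runs print "no lease applied" and the
# test then reports "lease broken over recovery".
kill -USR1 $multiop_pid
wait $multiop_pid || echo "lease broken over recovery"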
Looking at the MDS1,3 (vm9) dmesg log, we see some errors. In particular, reading the update llog for lustre-MDT0002-osp-MDT0000 fails ('invalid length 0 in llog', 'get update log failed: rc = -22'), recovery is aborted, and all five known clients are evicted, which would account for the client's lease being broken:
[ 7777.539199] Lustre: DEBUG MARKER: == sanity test 208: Exclusive open =================================================================== 12:01:04 (1554811264)
[ 7780.799167] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[ 7781.106991] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 7781.266880] Lustre: Failing over lustre-MDT0000
[ 7781.267747] Lustre: Skipped 1 previous similar message
[ 7781.457907] Lustre: lustre-MDT0000: Not available for connect from 10.9.3.150@tcp (stopping)
[ 7781.459369] Lustre: Skipped 19 previous similar messages
[ 7781.460162] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107
[ 7781.460167] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[ 7781.460168] Lustre: Skipped 1 previous similar message
[ 7781.612729] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.234@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 7781.615597] LustreError: Skipped 5 previous similar messages
[ 7781.631586] Lustre: server umount lustre-MDT0000 complete
[ 7782.702609] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
[ 7783.012452] Lustre: DEBUG MARKER: modprobe dm-flakey; dmsetup targets | grep -q flakey
[ 7793.042085] Lustre: 17623:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554811273/real 1554811273] req@ffff8bf39b3e0900 x1630329757439344/t0(0) o400->MGC10.9.3.149@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1554811280 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 7793.046530] LustreError: 166-1: MGC10.9.3.149@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[ 7793.384485] Lustre: DEBUG MARKER: hostname
[ 7793.752701] Lustre: DEBUG MARKER: modprobe dm-flakey; dmsetup targets | grep -q flakey
[ 7794.077024] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
[ 7794.380875] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey 2>&1
[ 7794.685614] Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
[ 7794.986331] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
[ 7795.394940] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/mapper/mds1_flakey /mnt/lustre-mds1
[ 7795.730148] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 7795.756692] Lustre: osd-ldiskfs create tunables for lustre-MDT0000
[ 7798.648231] Lustre: MGS: Connection restored to bfee9086-166d-21e0-ae35-a479c8b23d83 (at 10.9.5.234@tcp)
[ 7798.650043] Lustre: Skipped 66 previous similar messages
[ 7799.066744] Lustre: Evicted from MGS (at 10.9.3.149@tcp) after server handle changed from 0x8eaf7145dd9f3b79 to 0x8eaf7145ddb23d36
[ 7799.411459] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 7799.801070] Lustre: 25291:0:(llog.c:606:llog_process_thread()) lustre-MDT0002-osp-MDT0000: invalid length 0 in llog [0x1:0x80000402:0x2]record for index 0/6
[ 7799.804286] LustreError: 25291:0:(lod_dev.c:434:lod_sub_recovery_thread()) lustre-MDT0002-osp-MDT0000 get update log failed: rc = -22
[ 7799.808197] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
[ 7799.810304] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[ 7799.812067] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) Skipped 1 previous similar message
[ 7800.264478] Lustre: lustre-MDT0000: disconnecting 5 stale clients
[ 7800.289476] Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0001-mdtlov_UUID (at 10.9.3.150@tcp), waiting for 5 known clients (0 recovered, 0 in progress, and 5 evicted) already passed deadline 130:00
[ 7800.596879] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
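When triaging a node after one of these runs, the aborted recovery can be confirmed on the restarted MDS with something like the commands below. recovery_status is the standard per-MDT procfs parameter (field names vary slightly by release), and the grep patterns simply match the messages quoted above.

# On the MDS hosting lustre-MDT0000, after the failover:

# Recovery state of the MDT; in the failing runs this shows the known
# clients evicted rather than recovered.
lctl get_param mdt.lustre-MDT0000.recovery_status

# The update-llog error and the eviction messages quoted above.
dmesg | grep -E 'get update log failed|recovery is aborted|disconnecting .* stale clients'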
Although LU-8466 is for the same test failure with the same error message, there was
Here are links to logs for recent failures:
https://testing.whamcloud.com/test_sets/19290212-5b04-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/d9ae92b0-5a76-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/d84fe1d8-5a3c-11e9-8e92-52540065bddc
Attachments
Issue Links
- is duplicated by
  - LU-12171 sanity test_133g: Timeout occurred after 161 mins (Resolved)
  - LU-12176 recovery-small test 107 fails with 'test_107 failed with 1' (Resolved)
- is related to
  - LU-12210 Test failures associated with DNE index and stripe count randomization (Open)
  - LU-8466 sanity test_208: @@@@@@ FAIL: lease broken over recovery (Resolved)
  - LU-11636 t-f test_mkdir() does not support interop with non DNEII servers (Resolved)