Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
- Component/s: None
- Labels: DNE
- Severity: 3
Description
sanity test_208 fails in test 4, 'lease can sustain over recovery', with the error 'lease broken over recovery'. These failures started on 8 April 2019. So far, this test is only failing in DNE testing.
Looking at the test_suite log for a recent failure, https://testing.whamcloud.com/test_sets/03bcb85a-5ace-11e9-9720-52540065bddc, we see:
== sanity test 208: Exclusive open =================================================================== 12:01:04 (1554811264)
==== test 1: verify get lease work
read lease(1) has applied.
==== test 2: verify lease can be broken by upcoming open
no lease applied.
==== test 3: verify lease can't be granted if an open already exists
multiop: cannot get READ lease, ext 0: Device or resource busy (16)
multiop: apply/unlock lease error: Device or resource busy
==== test 4: lease can sustain over recovery
Failing mds1 on trevis-40vm9
…
trevis-40vm6: trevis-40vm6.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
trevis-40vm7: trevis-40vm7.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
trevis-40vm7: CMD: trevis-40vm7.trevis.whamcloud.com lctl get_param -n at_max
trevis-40vm7: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
trevis-40vm6: CMD: trevis-40vm6.trevis.whamcloud.com lctl get_param -n at_max
trevis-40vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
multiop: expect lease exists
no lease applied.
sanity test_208: @@@@@@ FAIL: lease broken over recovery
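For reference, the failing sequence condenses to roughly the sketch below. This is not the exact sanity.sh test_208 code: the multiop option string, the file name, and the environment setup are placeholders and assumptions, while fail() and $MULTIOP are the test-framework pieces visible in the log above.

#!/bin/bash
# Hedged sketch of sanity test_208 subtest 4. Assumes a running Lustre test
# environment; the sourcing path and the multiop option string are
# placeholders, not the exact sanity.sh code.
. ${LUSTRE:-/usr/lib64/lustre}/tests/test-framework.sh
init_test_env $@

tfile=$MOUNT/f208.sanity

# Take a read lease in the background and have multiop wait for a signal
# before re-checking it (placeholder option string).
$MULTIOP $tfile oO_CREAT:O_RDWR:eR_E-eUc &
multiop_pid=$!
sleep 1

# Fail MDT0000. fail() remounts the MDT and, as the log above shows, waits
# for the clients' mdc.lustre-MDT0000-mdc-*.mds_server_uuid imports to
# return to FULL|IDLE before it returns.
fail mds1

# Ask multiop to verify the lease after recovery. A passing run prints
# "lease(1) has applied"; the failing runs print "no lease applied" and the
# test then reports "lease broken over recovery".
kill -USR1 $multiop_pid
wait $multiop_pid || echo "lease broken over recovery"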
Looking at the MDS1,3 (vm9) dmesg log, we see some errors. In particular, reading the update llog for lustre-MDT0002-osp-MDT0000 fails ('invalid length 0 in llog', 'get update log failed: rc = -22'), recovery is aborted, and all five known clients are evicted, which would account for the client's lease being broken:
[ 7777.539199] Lustre: DEBUG MARKER: == sanity test 208: Exclusive open =================================================================== 12:01:04 (1554811264)
[ 7780.799167] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[ 7781.106991] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 7781.266880] Lustre: Failing over lustre-MDT0000
[ 7781.267747] Lustre: Skipped 1 previous similar message
[ 7781.457907] Lustre: lustre-MDT0000: Not available for connect from 10.9.3.150@tcp (stopping)
[ 7781.459369] Lustre: Skipped 19 previous similar messages
[ 7781.460162] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107
[ 7781.460167] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[ 7781.460168] Lustre: Skipped 1 previous similar message
[ 7781.612729] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.234@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 7781.615597] LustreError: Skipped 5 previous similar messages
[ 7781.631586] Lustre: server umount lustre-MDT0000 complete
[ 7782.702609] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
[ 7783.012452] Lustre: DEBUG MARKER: modprobe dm-flakey; dmsetup targets | grep -q flakey
[ 7793.042085] Lustre: 17623:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554811273/real 1554811273] req@ffff8bf39b3e0900 x1630329757439344/t0(0) o400->MGC10.9.3.149@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1554811280 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 7793.046530] LustreError: 166-1: MGC10.9.3.149@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[ 7793.384485] Lustre: DEBUG MARKER: hostname
[ 7793.752701] Lustre: DEBUG MARKER: modprobe dm-flakey; dmsetup targets | grep -q flakey
[ 7794.077024] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
[ 7794.380875] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey 2>&1
[ 7794.685614] Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
[ 7794.986331] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
[ 7795.394940] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/mapper/mds1_flakey /mnt/lustre-mds1
[ 7795.730148] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 7795.756692] Lustre: osd-ldiskfs create tunables for lustre-MDT0000
[ 7798.648231] Lustre: MGS: Connection restored to bfee9086-166d-21e0-ae35-a479c8b23d83 (at 10.9.5.234@tcp)
[ 7798.650043] Lustre: Skipped 66 previous similar messages
[ 7799.066744] Lustre: Evicted from MGS (at 10.9.3.149@tcp) after server handle changed from 0x8eaf7145dd9f3b79 to 0x8eaf7145ddb23d36
[ 7799.411459] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 7799.801070] Lustre: 25291:0:(llog.c:606:llog_process_thread()) lustre-MDT0002-osp-MDT0000: invalid length 0 in llog [0x1:0x80000402:0x2]record for index 0/6
[ 7799.804286] LustreError: 25291:0:(lod_dev.c:434:lod_sub_recovery_thread()) lustre-MDT0002-osp-MDT0000 get update log failed: rc = -22
[ 7799.808197] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
[ 7799.810304] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[ 7799.812067] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) Skipped 1 previous similar message
[ 7800.264478] Lustre: lustre-MDT0000: disconnecting 5 stale clients
[ 7800.289476] Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0001-mdtlov_UUID (at 10.9.3.150@tcp), waiting for 5 known clients (0 recovered, 0 in progress, and 5 evicted) already passed deadline 130:00
[ 7800.596879] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
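When triaging a node after one of these runs, the aborted recovery can be confirmed on the restarted MDS with something like the commands below. recovery_status is the standard per-MDT procfs parameter (field names vary slightly by release), and the grep patterns simply match the messages quoted above.

# On the MDS hosting lustre-MDT0000, after the failover:

# Recovery state of the MDT; in the failing runs this shows the known
# clients evicted rather than recovered.
lctl get_param mdt.lustre-MDT0000.recovery_status

# The update-llog error and the eviction messages quoted above.
dmesg | grep -E 'get update log failed|recovery is aborted|disconnecting .* stale clients'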
Although LU-8466 is for the same test failure with the same error message, there was
Here are links to logs for recent failures:
https://testing.whamcloud.com/test_sets/19290212-5b04-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/d9ae92b0-5a76-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/d84fe1d8-5a3c-11e9-8e92-52540065bddc
Attachments
Issue Links
- is duplicated by
  - LU-12171 sanity test_133g: Timeout occurred after 161 mins (Resolved)
  - LU-12176 recovery-small test 107 fails with 'test_107 failed with 1' (Resolved)
- is related to
  - LU-12210 Test failures associated with DNE index and stripe count randomization (Open)
  - LU-8466 sanity test_208: @@@@@@ FAIL: lease broken over recovery (Resolved)
  - LU-11636 t-f test_mkdir() does not support interop with non DNEII servers (Resolved)