[LU-12175] sanity test 208 fails with 'lease broken over recovery' Created: 09/Apr/19  Updated: 10/Feb/21

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

DNE


Issue Links:
Duplicate
is duplicated by LU-12171 sanity test_133g: Timeout occurred af... Resolved
is duplicated by LU-12176 recovery-small test 107 fails with 't... Resolved
Related
is related to LU-8466 sanity test_208: @@@@@@ FAIL: lease b... Resolved
is related to LU-11636 t-f test_mkdir() does not support int... Resolved
is related to LU-12210 Test failures associated with DNE ind... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_208 fails in subtest 4, 'lease can sustain over recovery', with the error 'lease broken over recovery'. These failures started on 8 April 2019. So far, this test only fails in DNE testing.
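
For context, subtest 4 amounts to: open a file on the Lustre mount, take a read lease on it, fail over MDS1, wait for the client import to return to FULL, then verify the lease is still held. A minimal userspace sketch of that sequence follows; it assumes liblustreapi's llapi_lease_get()/llapi_lease_check() helpers and the LL_LEASE_RDLCK mode (helper names differ between branches - newer trees may expose llapi_lease_acquire() instead - so confirm against lustreapi.h for the tree under test). In the actual test, multiop performs the lease operations and the test framework drives the MDS failover.

/* Hypothetical reproducer sketch for sanity test_208 subtest 4:
 * hold a read lease across an MDS failover and check it survives.
 * liblustreapi helper names and return conventions are assumptions;
 * confirm against lustreapi.h / lustre_user.h for the branch under test.
 * Build (roughly): gcc lease208.c -o lease208 -llustreapi
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/lustre/f208";
	int fd, rc;

	fd = open(path, O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Take a read lease on the open file ("test 1: verify get lease work"). */
	rc = llapi_lease_get(fd, LL_LEASE_RDLCK);
	if (rc < 0) {
		fprintf(stderr, "cannot get READ lease: rc = %d\n", rc);
		return 1;
	}

	/* Here the real test fails over MDS1 ("Failing mds1 on ...") and waits
	 * for mdc.*.mds_server_uuid to reach FULL again; the sleep is only a
	 * stand-in for that failover/recovery window. */
	sleep(60);

	/* llapi_lease_check() is assumed to return the mode still held;
	 * multiop's "expect lease exists" check is the authoritative path. */
	rc = llapi_lease_check(fd);
	if (rc != LL_LEASE_RDLCK) {
		fprintf(stderr, "lease broken over recovery: rc = %d\n", rc);
		return 1;
	}

	printf("lease survived recovery\n");
	close(fd);
	return 0;
}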

Looking at the test_suite log for a recent failure, https://testing.whamcloud.com/test_sets/03bcb85a-5ace-11e9-9720-52540065bddc , we see

== sanity test 208: Exclusive open =================================================================== 12:01:04 (1554811264)
==== test 1: verify get lease work
read lease(1) has applied.
==== test 2: verify lease can be broken by upcoming open
no lease applied.
==== test 3: verify lease can't be granted if an open already exists
multiop: cannot get READ lease, ext 0: Device or resource busy (16)
multiop: apply/unlock lease error: Device or resource busy
==== test 4: lease can sustain over recovery
Failing mds1 on trevis-40vm9
…
trevis-40vm6: trevis-40vm6.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
trevis-40vm7: trevis-40vm7.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
trevis-40vm7: CMD: trevis-40vm7.trevis.whamcloud.com lctl get_param -n at_max
trevis-40vm7: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
trevis-40vm6: CMD: trevis-40vm6.trevis.whamcloud.com lctl get_param -n at_max
trevis-40vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
multiop: expect lease exists
no lease applied.
 sanity test_208: @@@@@@ FAIL: lease broken over recovery 

Looking at the dmesg log for MDS1 and MDS3 (vm9), we see some errors:

[ 7777.539199] Lustre: DEBUG MARKER: == sanity test 208: Exclusive open =================================================================== 12:01:04 (1554811264)
[ 7780.799167] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[ 7781.106991] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 7781.266880] Lustre: Failing over lustre-MDT0000
[ 7781.267747] Lustre: Skipped 1 previous similar message
[ 7781.457907] Lustre: lustre-MDT0000: Not available for connect from 10.9.3.150@tcp (stopping)
[ 7781.459369] Lustre: Skipped 19 previous similar messages
[ 7781.460162] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107
[ 7781.460167] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[ 7781.460168] Lustre: Skipped 1 previous similar message
[ 7781.612729] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.234@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 7781.615597] LustreError: Skipped 5 previous similar messages
[ 7781.631586] Lustre: server umount lustre-MDT0000 complete
[ 7782.702609] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
[ 7783.012452] Lustre: DEBUG MARKER: modprobe dm-flakey;
			 dmsetup targets | grep -q flakey
[ 7793.042085] Lustre: 17623:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554811273/real 1554811273]  req@ffff8bf39b3e0900 x1630329757439344/t0(0) o400->MGC10.9.3.149@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1554811280 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 7793.046530] LustreError: 166-1: MGC10.9.3.149@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[ 7793.384485] Lustre: DEBUG MARKER: hostname
[ 7793.752701] Lustre: DEBUG MARKER: modprobe dm-flakey;
			 dmsetup targets | grep -q flakey
[ 7794.077024] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
[ 7794.380875] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey 2>&1
[ 7794.685614] Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
[ 7794.986331] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
[ 7795.394940] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre   /dev/mapper/mds1_flakey /mnt/lustre-mds1
[ 7795.730148] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 7795.756692] Lustre: osd-ldiskfs create tunables for lustre-MDT0000
[ 7798.648231] Lustre: MGS: Connection restored to bfee9086-166d-21e0-ae35-a479c8b23d83 (at 10.9.5.234@tcp)
[ 7798.650043] Lustre: Skipped 66 previous similar messages
[ 7799.066744] Lustre: Evicted from MGS (at 10.9.3.149@tcp) after server handle changed from 0x8eaf7145dd9f3b79 to 0x8eaf7145ddb23d36
[ 7799.411459] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 7799.801070] Lustre: 25291:0:(llog.c:606:llog_process_thread()) lustre-MDT0002-osp-MDT0000: invalid length 0 in llog [0x1:0x80000402:0x2]record for index 0/6
[ 7799.804286] LustreError: 25291:0:(lod_dev.c:434:lod_sub_recovery_thread()) lustre-MDT0002-osp-MDT0000 get update log failed: rc = -22
[ 7799.808197] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
[ 7799.810304] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[ 7799.812067] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) Skipped 1 previous similar message
[ 7800.264478] Lustre: lustre-MDT0000: disconnecting 5 stale clients
[ 7800.289476] Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0001-mdtlov_UUID (at 10.9.3.150@tcp), waiting for 5 known clients (0 recovered, 0 in progress, and 5 evicted) already passed deadline 130:00
[ 7800.596879] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check

LU-8466 reports the same test failure with the same error message; it is linked as related above.
Here are links to logs for recent failures:
https://testing.whamcloud.com/test_sets/19290212-5b04-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/d9ae92b0-5a76-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/d84fe1d8-5a3c-11e9-8e92-52540065bddc



 Comments   
Comment by Patrick Farrell (Inactive) [ 09/Apr/19 ]

Huh. Well, this is the fatal bit - this llog issue and the evictions resulting from it are going to break this:

[ 7799.801070] Lustre: 25291:0:(llog.c:606:llog_process_thread()) lustre-MDT0002-osp-MDT0000: invalid length 0 in llog [0x1:0x80000402:0x2]record for index 0/6
[ 7799.804286] LustreError: 25291:0:(lod_dev.c:434:lod_sub_recovery_thread()) lustre-MDT0002-osp-MDT0000 get update log failed: rc = -22
[ 7799.808197] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
[ 7799.810304] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[ 7799.812067] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) Skipped 1 previous similar message 
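
For reference, the message quoted above comes from a record-length sanity check during llog processing: a record header claiming length 0 cannot be stepped over, so processing stops with -EINVAL, which lod_sub_recovery_thread() reports as "get update log failed: rc = -22", after which recovery is aborted and the remaining clients are evicted (the "disconnecting 5 stale clients" / "5 evicted" lines in the description). A simplified, userspace-compilable paraphrase of that check is below; it is not the actual Lustre source, which lives in llog_process_thread() and uses the real struct llog_rec_hdr from the wire headers.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for the on-disk llog record header; the real
 * struct llog_rec_hdr is defined in the Lustre wire headers. */
struct llog_rec_hdr {
	uint32_t lrh_len;	/* total record length, header included */
	uint32_t lrh_index;	/* record index inside the llog */
	uint32_t lrh_type;
	uint32_t lrh_id;
};

/* Simplified paraphrase of the length check: a record of length 0 (or
 * longer than the read chunk) is invalid, so processing fails with
 * -EINVAL (-22), matching "invalid length 0 in llog ... record for
 * index 0/6" on MDS1. */
static int check_llog_record(const struct llog_rec_hdr *rec,
			     uint32_t chunk_size, int index, int last_index)
{
	if (rec->lrh_len == 0 || rec->lrh_len > chunk_size) {
		fprintf(stderr,
			"invalid length %u in llog record for index %d/%d\n",
			rec->lrh_len, index, last_index);
		return -EINVAL;
	}
	return 0;
}

int main(void)
{
	/* Reproduce the shape of the failure seen above: index 0/6, length 0. */
	struct llog_rec_hdr bad = { .lrh_len = 0, .lrh_index = 6 };

	return check_llog_record(&bad, 32768, 0, 6) ? 1 : 0;
}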
Comment by Gerrit Updater [ 10/Apr/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34632
Subject: LU-12175 ldlm: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 23ab13bd556f81c940d6a78d7cf07856d45d4aae

Comment by James A Simmons [ 10/Apr/19 ]

From what I can tell, this bug predates the hrtimer patch that landed, but the hrtimer change really brings it out. The strange thing is that the LU-11771 patch on 2.12 LTS doesn't produce the same issues. I can't reproduce this locally at all, so I pushed a debug patch. I expect the LU-11771 patch will be reverted, but that will only hide the real problem.

Comment by James A Simmons [ 11/Apr/19 ]

I would not close this out. There is a real bug hidden in the code that has not been dealt with yet.

Comment by James A Simmons [ 15/Apr/19 ]

Still seeing this, just not as often.

https://testing.whamcloud.com/test_sessions/7599c2ef-25c8-4479-8e24-4b1fb966502e

Comment by Gerrit Updater [ 15/Apr/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34664
Subject: LU-12175 tests: Revert LU-11636
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d3bfa57c88c17492c3c985e8079a07ce6750b678

Comment by Gerrit Updater [ 15/Apr/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34666
Subject: LU-12175 tests: Revert LU-11566
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 41155e7f9395a098c368a1acbdda06d8334e49eb

Comment by Gerrit Updater [ 17/Apr/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34697
Subject: LU-12175 tests: Change to test_mkdir
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9409532dfe667d672a1bb002d9e4fc6555c308da

Comment by Gerrit Updater [ 17/Apr/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34699
Subject: LU-12175 tests: Partial revert of LU-11636
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d893f34824c5c5472653b19f557d079b66266a06

Comment by Arshad Hussain [ 18/Apr/19 ]

Seen under https://testing.whamcloud.com/test_sessions/8c83753a-320b-42b6-aad1-59b44e9c2dde

Comment by Gerrit Updater [ 18/Apr/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34705
Subject: LU-12175 tests: Partial revert of LU-11636
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 379bd66d6849643e65bea6b688780e4a2e9980b8

Comment by Gerrit Updater [ 18/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34705/
Subject: LU-12175 tests: Partial revert of LU-11636
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 669bcc3fc9ddf12f4c7aee0b347deeb1dd269347

Comment by Peter Jones [ 18/Apr/19 ]

Landed for 2.13

Comment by James Nunez (Inactive) [ 25/Nov/19 ]

I think we're seeing the same issue on master (b2_14). Please see https://testing.whamcloud.com/test_sets/5c065f78-0fd4-11ea-98f1-52540065bddc for logs.

Comment by Alex Zhuravlev [ 12/Feb/20 ]

Test 208 times out very often in my testing:
Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 12. Is it stuck?
Lustre: lustre-MDT0000: UNLINKED 00000000ce791f59 lustre-MDT0001-mdtlov_UUID 192.168.122.146@tcp 1 (1 0 0) 1 0 0 0: (null) 42949709990 stale:0
Lustre: lustre-MDT0000: UNLINKED 00000000daedb981 lustre-MDT0000-lwp-MDT0001_UUID 192.168.122.146@tcp 1 (1 0 0) 1 0 0 0: (null) 0 stale:0
Lustre: lustre-MDT0000: UNLINKED 000000007e196a79 lustre-MDT0000-lwp-OST0000_UUID 192.168.122.146@tcp 1 (1 0 0) 1 0 0 0: (null) 0 stale:0
Lustre: lustre-MDT0000: UNLINKED 00000000923c2bde e30080ae-dab6-4 192.168.122.146@tcp 1 (1 0 0) 1 0 0 0: (null) 42949710030 stale:0
Lustre: lustre-MDT0000: UNLINKED 0000000022d6877c lustre-MDT0000-lwp-MDT0000_UUID 192.168.122.146@tcp 1 (1 0 0) 1 0 0 0: (null) 0 stale:0


Comment by James Nunez (Inactive) [ 09/Feb/21 ]

I think we're seeing a similar issue in replay-dual test 23d. Looking at the failure at https://testing.whamcloud.com/test_sets/17948bab-e647-4f32-874a-0fe07a464353 for DNE/ZFS testing, we see the following in the MDT2/MDT4 (vm9) console log:

[11816.199115] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-66vm9.trevis.whamcloud.com: executing set_default_debug -1 all 4
[11816.837718] Lustre: 180077:0:(llog.c:620:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x401:0x2]record for index 0/9
[11816.841991] LustreError: 180077:0:(lod_dev.c:425:lod_sub_recovery_thread()) lustre-MDT0000-osp-MDT0001 get update log failed: rc = -22
[11817.063950] Lustre: DEBUG MARKER: trevis-66vm9.trevis.whamcloud.com: executing set_default_debug -1 all 4