[LU-12769] replay-dual test 0b hangs in client mount Created: 16/Sep/19  Updated: 03/Jan/20  Resolved: 09/Oct/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.4
Fix Version/s: Lustre 2.13.0, Lustre 2.12.4

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11762 replay-single test 0d fails with 'po... Resolved
is related to LU-11771 bad output in target_handle_reconnect... Resolved
is related to LU-10950 replay-single test_0c: post-failover ... Reopened
is related to LU-12340 replay-dual test 0b timeouts Resolved
Severity: 3

 Description   

replay-dual test_0b hangs when the client tries to mount the file system. Looking at the logs at https://testing.whamcloud.com/test_sets/2165cd40-d70d-11e9-9fc9-52540065bddc, the last lines seen in the suite_log are:

Started lustre-MDT0000
Starting client: trevis-16vm11.trevis.whamcloud.com:  -o user_xattr,flock trevis-20vm9@tcp:/lustre /mnt/lustre
CMD: trevis-16vm11.trevis.whamcloud.com mkdir -p /mnt/lustre
CMD: trevis-16vm11.trevis.whamcloud.com mount -t lustre -o user_xattr,flock trevis-20vm9@tcp:/lustre /mnt/lustre

On the MDS (vm11), we see the following in the console log:

[ 1298.607167] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null
[ 1309.346832] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:43
[ 1314.381684] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:38
[ 1314.383960] Lustre: Skipped 1 previous similar message
[ 1319.501601] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:33
[ 1324.621716] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:28
[ 1329.741784] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:23
[ 1339.982308] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:12
[ 1339.984388] Lustre: Skipped 1 previous similar message
[ 1352.877227] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[ 1352.878085] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[ 1360.462321] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) to recover in 0:27
[ 1360.464241] Lustre: Skipped 3 previous similar messages
[ 1396.303505] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 0:07
[ 1396.305383] Lustre: Skipped 6 previous similar messages
[ 1462.863612] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 1:14
[ 1462.865659] Lustre: Skipped 12 previous similar messages
[ 1590.865592] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 3:22
[ 1590.867668] Lustre: Skipped 24 previous similar messages
[ 1846.869597] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 7:38
[ 1846.871619] Lustre: Skipped 49 previous similar messages
[ 2358.877590] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 16:10
[ 2358.879608] Lustre: Skipped 99 previous similar messages
[ 2963.046705] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 26:14
[ 2963.048899] Lustre: Skipped 117 previous similar messages
[ 3567.216150] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 36:18
[ 3567.218114] Lustre: Skipped 117 previous similar messages
[ 4171.385464] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 46:23
[ 4171.387400] Lustre: Skipped 117 previous similar messages
[ 4775.554195] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 56:27
[ 4775.556078] Lustre: Skipped 117 previous similar messages
[ 5379.723493] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 66:31
[ 5379.725448] Lustre: Skipped 117 previous similar messages
[ 5983.892872] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 76:35
[ 5983.894832] Lustre: Skipped 117 previous similar messages
[ 6588.062279] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 86:39
[ 6588.064310] Lustre: Skipped 117 previous similar messages
[ 7192.231245] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 96:43
[ 7192.233232] Lustre: Skipped 117 previous similar messages
[ 7796.400758] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 106:48
[ 7796.403030] Lustre: Skipped 117 previous similar messages
[ 8400.570060] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 116:52
[ 8400.572101] Lustre: Skipped 117 previous similar messages

In the OSS (vm4) console log, we see:

[16003.344264] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-20vm9.trevis.whamcloud.com: executing set_default_debug -1 all 4
[16003.724818] Lustre: DEBUG MARKER: trevis-20vm9.trevis.whamcloud.com: executing set_default_debug -1 all 4
[16009.720340] Lustre: Evicted from MGS (at 10.9.4.245@tcp) after server handle changed from 0xb142ad53c4c0bc64 to 0xb142ad53c4c0d363
[16014.732310] LustreError: 10096:0:(ldlm_resource.c:1147:ldlm_resource_complain()) lustre-MDT0000-lwp-OST0001: namespace resource [0x200000006:0x20000:0x0].0x0 (ffff8f95b862c0c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[16014.735727] LustreError: 10096:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message

This may be related to LU-11038 and/or LU-12340.
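
To make the console log above easier to read, here is a hypothetical sketch of the countdown logic behind the "Denying connection" messages; the function and variable names are illustrative, not the actual Lustre symbols. The point is that once the deadline passes without recovery completing, the "already passed deadline" value simply keeps growing, which is consistent with a recovery timer that never fires:

#include <stdio.h>
#include <time.h>

/* Hypothetical sketch, not the actual Lustre code: the server compares
 * the recovery deadline against the current time and reports either a
 * countdown or how long ago the deadline was missed. */
static void report_recovery_window(time_t deadline, time_t now,
                                   int known, int recovered,
                                   int in_progress, int evicted)
{
    long remaining = (long)(deadline - now);

    if (remaining >= 0)
        /* countdown phase: "to recover in 0:43", "0:38", ... */
        printf("waiting for %d known clients (%d recovered, %d in progress, "
               "and %d evicted) to recover in %ld:%02ld\n",
               known, recovered, in_progress, evicted,
               remaining / 60, remaining % 60);
    else
        /* if the timer never fires, this phase grows without bound:
         * "already passed deadline 0:07", "1:14", "3:22", ... */
        printf("waiting for %d known clients (%d recovered, %d in progress, "
               "and %d evicted) already passed deadline %ld:%02ld\n",
               known, recovered, in_progress, evicted,
               -remaining / 60, -remaining % 60);
}

int main(void)
{
    time_t now = time(NULL);

    report_recovery_window(now + 43, now, 4, 0, 3, 0); /* still counting down */
    report_recovery_window(now - 67, now, 4, 0, 3, 1); /* deadline missed: 1:07 */
    return 0;
}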



 Comments   
Comment by Alex Zhuravlev [ 19/Sep/19 ]

In my local testing (single VM), replay-dual 0a times out very frequently.

Comment by Gerrit Updater [ 23/Sep/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36274
Subject: LU-12769 recovery: restart recovery timer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dd856535f4b98f4edee42a504c62a2766c1557c6

Comment by Andreas Dilger [ 23/Sep/19 ]

The broken code fixed by Alex's patch was introduced by commit v2_12_53-53-g9334f1d512 (patch https://review.whamcloud.com/34710, "LU-11771 ldlm: use hrtimer for recovery to fix timeout messages"), so this needs to be fixed for 2.13.

Comment by James A Simmons [ 23/Sep/19 ]

Actually, this was broken before LU-11771; it is a duplicate of https://jira.whamcloud.com/browse/LU-11762. The landing of LU-11771 reduced how often this bug showed up, to the point that I could no longer reproduce it in order to fix it. It might be good to land https://review.whamcloud.com/#/c/35627 as well. We will need this for 2.12 too.

Comment by James A Simmons [ 01/Oct/19 ]

Talking with Alex, this appears to be a problem of clock drift, which would explain why older tickets similar to this one exist: jiffies experiences clock drift as well. In the case of the hrtimer, the drift appears to happen only when using the high-resolution wall clock on VMs. Using the high-resolution monotonic clock appears to avoid this problem.
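
As a minimal userspace sketch of that idea (assuming a POSIX environment; this is an illustration only, not the kernel patch itself, which switches the recovery hrtimer to the monotonic clock):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec mono, real;

    clock_gettime(CLOCK_MONOTONIC, &mono); /* steady; never jumps */
    clock_gettime(CLOCK_REALTIME, &real);  /* can step when the wall clock is adjusted */

    /* A 60-second recovery window anchored to the monotonic clock:
     * "deadline - now" shrinks at a steady rate and reaches zero exactly
     * once, unlike a wall-clock deadline on a VM whose clock drifts or
     * is stepped by NTP/the hypervisor. */
    time_t deadline = mono.tv_sec + 60;

    printf("monotonic now=%ld, deadline=%ld, remaining=%ld s\n",
           (long)mono.tv_sec, (long)deadline,
           (long)(deadline - mono.tv_sec));
    return 0;
}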

Comment by Gerrit Updater [ 09/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36274/
Subject: LU-12769 recovery: use monotonic timer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 06408a4ef381121fa58783026a0cf0a6b0fa479c

Comment by Peter Jones [ 09/Oct/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 05/Dec/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36937
Subject: LU-12769 recovery: use monotonic timer
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: e46386159b7afe349bd90e845d69adafb1ab62fc

Comment by Gerrit Updater [ 03/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36937/
Subject: LU-12769 recovery: use monotonic timer
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: cf4cdefec8145b2b7a2eb4c2de4e1e08ebc862b3
