[LU-12769] replay-dual test 0b hangs in client mount Created: 16/Sep/19 Updated: 03/Jan/20 Resolved: 09/Oct/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-dual test_0b hangs when the client tries to mount the file system. Looking at the logs at https://testing.whamcloud.com/test_sets/2165cd40-d70d-11e9-9fc9-52540065bddc, the last lines seen in the suite_log are:

Started lustre-MDT0000
Starting client: trevis-16vm11.trevis.whamcloud.com: -o user_xattr,flock trevis-20vm9@tcp:/lustre /mnt/lustre
CMD: trevis-16vm11.trevis.whamcloud.com mkdir -p /mnt/lustre
CMD: trevis-16vm11.trevis.whamcloud.com mount -t lustre -o user_xattr,flock trevis-20vm9@tcp:/lustre /mnt/lustre

On the MDS (vm11), we see the following in the console log:

[ 1298.607167] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null
[ 1309.346832] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:43
[ 1314.381684] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:38
[ 1314.383960] Lustre: Skipped 1 previous similar message
[ 1319.501601] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:33
[ 1324.621716] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:28
[ 1329.741784] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:23
[ 1339.982308] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:12
[ 1339.984388] Lustre: Skipped 1 previous similar message
[ 1352.877227] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[ 1352.878085] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[ 1360.462321] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) to recover in 0:27
[ 1360.464241] Lustre: Skipped 3 previous similar messages
[ 1396.303505] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 0:07
[ 1396.305383] Lustre: Skipped 6 previous similar messages
[ 1462.863612] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 1:14
[ 1462.865659] Lustre: Skipped 12 previous similar messages
[ 1590.865592] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 3:22
[ 1590.867668] Lustre: Skipped 24 previous similar messages
[ 1846.869597] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 7:38
[ 1846.871619] Lustre: Skipped 49 previous similar messages
[ 2358.877590] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 16:10
[ 2358.879608] Lustre: Skipped 99 previous similar messages
[ 2963.046705] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 26:14
[ 2963.048899] Lustre: Skipped 117 previous similar messages
[ 3567.216150] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 36:18
[ 3567.218114] Lustre: Skipped 117 previous similar messages
[ 4171.385464] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 46:23
[ 4171.387400] Lustre: Skipped 117 previous similar messages
[ 4775.554195] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 56:27
[ 4775.556078] Lustre: Skipped 117 previous similar messages
[ 5379.723493] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 66:31
[ 5379.725448] Lustre: Skipped 117 previous similar messages
[ 5983.892872] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 76:35
[ 5983.894832] Lustre: Skipped 117 previous similar messages
[ 6588.062279] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 86:39
[ 6588.064310] Lustre: Skipped 117 previous similar messages
[ 7192.231245] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 96:43
[ 7192.233232] Lustre: Skipped 117 previous similar messages
[ 7796.400758] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 106:48
[ 7796.403030] Lustre: Skipped 117 previous similar messages
[ 8400.570060] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 116:52
[ 8400.572101] Lustre: Skipped 117 previous similar messages

On the OSS (vm4) console log, we see:

[16003.344264] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-20vm9.trevis.whamcloud.com: executing set_default_debug -1 all 4
[16003.724818] Lustre: DEBUG MARKER: trevis-20vm9.trevis.whamcloud.com: executing set_default_debug -1 all 4
[16009.720340] Lustre: Evicted from MGS (at 10.9.4.245@tcp) after server handle changed from 0xb142ad53c4c0bc64 to 0xb142ad53c4c0d363
[16014.732310] LustreError: 10096:0:(ldlm_resource.c:1147:ldlm_resource_complain()) lustre-MDT0000-lwp-OST0001: namespace resource [0x200000006:0x20000:0x0].0x0 (ffff8f95b862c0c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[16014.735727] LustreError: 10096:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message

This may be related to LU-11038 and/or |
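The denial messages above imply a simple gate: during recovery the MDT refuses brand-new clients until every client known before the restart has either recovered or been evicted. Here the recovery deadline passed, one stale client was evicted, but three clients stayed "in progress", so the gate never opened and the new client's mount hung. Below is a hypothetical C sketch of that gate, reconstructed only from the log text above (not Lustre source; all names are made up): |

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct recovery_counts {
	int known;        /* clients connected before the restart */
	int recovered;    /* clients that finished replay */
	int in_progress;  /* clients still replaying */
	int evicted;      /* stale clients evicted at the timeout */
};

/* Return true when a brand-new client may connect: only once every
 * known client has either recovered or been evicted. */
static bool allow_new_client(const struct recovery_counts *rc,
			     long secs_to_deadline)
{
	if (rc->recovered + rc->evicted >= rc->known)
		return true;

	printf("Denying connection: waiting for %d known clients "
	       "(%d recovered, %d in progress, and %d evicted) %s %ld:%02ld\n",
	       rc->known, rc->recovered, rc->in_progress, rc->evicted,
	       secs_to_deadline >= 0 ? "to recover in"
				     : "already passed deadline",
	       labs(secs_to_deadline) / 60, labs(secs_to_deadline) % 60);
	return false;
}

int main(void)
{
	/* The state reported at the end of the MDS log: 4 known clients,
	 * 1 evicted at the timeout, 3 stuck "in progress" -- the gate
	 * never opens, matching the "already passed deadline 116:52"
	 * message. */
	struct recovery_counts rc = { .known = 4, .recovered = 0,
				      .in_progress = 3, .evicted = 1 };
	allow_new_client(&rc, -(116L * 60 + 52));
	return 0;
}
```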
| Comments |
| Comment by Alex Zhuravlev [ 19/Sep/19 ] |
|
In my local testing (single VM), replay-dual test 0a times out very frequently |
| Comment by Gerrit Updater [ 23/Sep/19 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36274 |
| Comment by Andreas Dilger [ 23/Sep/19 ] |
|
The broken code fixed by Alex's patch was introduced by commit v2_12_53-53-g9334f1d512 (patch https://review.whamcloud.com/34710). |
| Comment by James A Simmons [ 23/Sep/19 ] |
|
Actually, this was broken before |
| Comment by James A Simmons [ 01/Oct/19 ] |
|
Talking with Alex, this appears to be a problem of clock drift, which would explain why older tickets similar to this one exist: jiffies experiences clock drift as well. In the case of the hrtimer, the drift appears to happen only when using the high-resolution wall clock on VMs; using the high-resolution monotonic clock avoids the problem. |
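To make the clock-drift point concrete, here is a minimal standalone userspace sketch (an illustration, not the Lustre patch itself): CLOCK_REALTIME tracks wall time and can be stepped by NTP or by the hypervisor, while CLOCK_MONOTONIC only moves forward at a steady rate, so a timer deadline derived from the wall clock on a drifting VM can land far from where it was intended. |

```c
/* Standalone demo (not Lustre code): measure the same interval with
 * the wall clock and the monotonic clock. On a VM whose wall clock is
 * being stepped by NTP or the host, the two elapsed times disagree. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double ts_sec(struct timespec ts)
{
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	struct timespec real0, mono0, real1, mono1;

	clock_gettime(CLOCK_REALTIME, &real0);  /* wall clock: may be stepped */
	clock_gettime(CLOCK_MONOTONIC, &mono0); /* monotonic: never jumps */

	sleep(30); /* step the system clock during this window to see drift */

	clock_gettime(CLOCK_REALTIME, &real1);
	clock_gettime(CLOCK_MONOTONIC, &mono1);

	/* A recovery deadline armed against the wall clock would be off by
	 * exactly the difference between these two figures. */
	printf("wall-clock elapsed: %.6f s\n", ts_sec(real1) - ts_sec(real0));
	printf("monotonic elapsed:  %.6f s\n", ts_sec(mono1) - ts_sec(mono0));
	return 0;
}
```

Run it and step the clock (e.g. with date -s) mid-sleep: the wall-clock figure moves while the monotonic one stays at about 30 s, which is why, per the comment above, the high-resolution monotonic clock avoids the problem. |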
| Comment by Gerrit Updater [ 09/Oct/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36274/ |
| Comment by Peter Jones [ 09/Oct/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 05/Dec/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36937 |
| Comment by Gerrit Updater [ 03/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36937/ |