[LU-11388] replay-single test_131b: test timeout Created: 17/Sep/18 Updated: 20/Jan/24 Resolved: 09/Jun/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.7, Lustre 2.15.0, Lustre 2.15.3 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | failing_tests | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/41a16040-b988-11e8-9df3-52540065bddc

test_131b failed with the following error:

Timeout occurred after 969 mins, last suite running was replay-single, restarting cluster to continue tests

This test starts to fail from tag 2.11.55.

== replay-single test 131b: DoM file write replay ==================================================== 11:35:59 (1537011359)
CMD: trevis-33vm4 /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: trevis-33vm4 sync; sync; sync
UUID                   1K-blocks      Used  Available Use% Mounted on
lustre-MDT0000_UUID      1165900      8732    1053972   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276     79708    1731404   4% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276     25880    1786156   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276     25808    1786228   1% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276     31488    1780548   2% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276     41772    1770264   2% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276     25888    1786148   1% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276     25840    1786196   1% /mnt/lustre[OST:6]
filesystem_summary:     13532932    256384   12426944   2% /mnt/lustre
CMD: trevis-33vm1.trevis.whamcloud.com,trevis-33vm2 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: trevis-33vm1.trevis.whamcloud.com,trevis-33vm2 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: trevis-33vm4 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: trevis-33vm4 dmsetup table /dev/mapper/mds1_flakey
CMD: trevis-33vm4 dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
CMD: trevis-33vm4 dmsetup load /dev/mapper/mds1_flakey --table \"0 4194304 flakey 252:0 0 0 1800 1 drop_writes\"
CMD: trevis-33vm4 dmsetup resume /dev/mapper/mds1_flakey
CMD: trevis-33vm4 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000 |
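For context, the CMD lines above are the test framework's replay_barrier step: the MDT stops assigning transaction numbers and its backing device is flipped into dm-flakey's drop_writes mode, so every change made after this point lives only on the clients and must be recovered by replay after failover. Below is a condensed sketch of that sequence, reconstructed from this log; the node name, device name, and the 252:0 major:minor are specific to this run and not authoritative.

# Replay-barrier sequence as seen in the CMD lines above, assuming an
# ldiskfs MDT already wrapped in the dm-flakey target "mds1_flakey".
mds=trevis-33vm4                     # MDS node in this particular run
dev=/dev/mapper/mds1_flakey

# Freeze the replay point: the MDT stops assigning transaction numbers.
ssh "$mds" "/usr/sbin/lctl --device lustre-MDT0000 notransno"

# Swap in a flakey table with drop_writes: reads still succeed, but all
# writes are silently discarded, emulating an MDS crash at this instant.
# Table format: <start> <len> flakey <dev> <offset> <up_s> <down_s> <nfeat> <feature>
len=$(ssh "$mds" "dmsetup table $dev" | awk '{print $2}')
ssh "$mds" "dmsetup suspend --nolockfs --noflush $dev"
ssh "$mds" "dmsetup load $dev --table '0 $len flakey 252:0 0 0 1800 1 drop_writes'"
ssh "$mds" "dmsetup resume $dev"

# Drop a marker into the Lustre debug log for later triage.
ssh "$mds" "/usr/sbin/lctl mark 'mds1 REPLAY BARRIER on lustre-MDT0000'"

The timeout reported here means the test never got past the step following this barrier; the comments below narrow the cause down to DoM writes falling back to sync IO.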
| Comments |
| Comment by James Nunez (Inactive) [ 19/Sep/18 ] |
|
Mike, Thank you |
| Comment by Mikhail Pershin [ 28/Sep/18 ] |
|
This timeout happens when the IO switches to sync mode, which cannot be replayed properly with replay_barrier. The reason for that switch appears to be a shortage of grants. I am thinking about a possible workaround, so at the moment this is a test issue. |
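A hedged note on checking this: the client-side grant counters show whether a node has run out of write grant and is therefore liable to fall back to sync IO. The OSC parameter below is a standard one; whether the MDC exposes the same counter for DoM files is an assumption that may vary by release.

# Bytes of write grant this client holds per target; a value near zero
# forces cache-less (sync) writes, which replay_barrier cannot capture.
lctl get_param osc.*.cur_grant_bytes
# DoM data is granted by the MDT through the MDC device; this parameter
# name mirrors the OSC one and is an assumption, hence the suppression.
lctl get_param mdc.*.cur_grant_bytes 2>/dev/null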
| Comment by Gerrit Updater [ 03/Oct/18 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33279 |
| Comment by Mikhail Pershin [ 03/Oct/18 ] |
|
Disabling the test because of its unstable behavior. While this is a test issue, it may still signal a possible bug in the MDC-MDT grants code. |
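For reference, disabling a test in this framework means adding it to the suite's ALWAYS_EXCEPT list. A minimal sketch of what such a change looks like; the exact condition and placement in the landed patch may differ.

# In lustre/tests/replay-single.sh: skip test_131b on ZFS servers, where
# the sync-IO fallback makes the replay barrier unreliable.
# facet_fstype is a helper provided by test-framework.sh.
if [ "$(facet_fstype mds1)" = "zfs" ]; then
	ALWAYS_EXCEPT+=" 131b"
fi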
| Comment by Gerrit Updater [ 10/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33279/ |
| Comment by Gerrit Updater [ 27/Oct/20 ] |
|
Vikentsi Lapa (vlapa@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40421 |
| Comment by James Nunez (Inactive) [ 16/Jun/21 ] |
|
Just a note that we are still seeing replay-single test 131b time out for ldiskfs servers in master branch testing, full test group, https://testing.whamcloud.com/test_sets/7322774c-afbc-4e20-ac2e-7b86cfcf251c and in failover testing https://testing.whamcloud.com/test_sets/57e50c3e-e790-4966-b834-ccc00fa41a81. The patch above only disables this test for ZFS servers. |
| Comment by Alena Nikitenko [ 03/Dec/21 ] |
|
Found something similar on 2.12.8 testing: https://testing.whamcloud.com/test_sets/c76df3be-faea-4f20-bb36-f44818a6a7bf

== replay-single test 131b: DoM file write replay ==================================================== 13:33:22 (1637415202)
CMD: onyx-109vm10 /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: onyx-109vm10 sync; sync; sync
UUID                   1K-blocks      Used  Available Use% Mounted on
lustre-MDT0000_UUID      5781172      3020    5255320   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1908940     17728    1769720   1% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1908940      1332    1786368   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1908940      1324    1786376   1% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1908940     11568    1776132   1% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1908940     11568    1776132   1% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1908940     11564    1776136   1% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1908940      1328    1786372   1% /mnt/lustre[OST:6]
filesystem_summary:     13362580     56412   12457236   1% /mnt/lustre
CMD: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: onyx-109vm10 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: onyx-109vm10 dmsetup table /dev/mapper/mds1_flakey
CMD: onyx-109vm10 dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
CMD: onyx-109vm10 dmsetup load /dev/mapper/mds1_flakey --table \"0 20971520 flakey 252:0 0 0 1800 1 drop_writes\"
CMD: onyx-109vm10 dmsetup resume /dev/mapper/mds1_flakey
CMD: onyx-109vm10 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000

'Timeout occurred after 692 mins, last suite running was replay-single'
|
| Comment by Gerrit Updater [ 23/Dec/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/40421/ |
| Comment by Peter Jones [ 23/Dec/21 ] |
|
Does this landing mean that this ticket can be closed or does further work remain? |
| Comment by Sergey Cheremencev [ 30/Dec/21 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/a0db5427-6afe-4709-91e7-ed111a3ce01f It failed even with " |
| Comment by Patrick Farrell [ 08/Mar/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/b0074d9c-d7fd-45c1-a5b9-79c61aad0f20 |
| Comment by Etienne Aujames [ 13/Apr/22 ] |
|
+1 on master (ZFS): https://testing.whamcloud.com/test_sets/4c0e7862-d495-4f2e-ab4f-8c30e3a3dc59 |
| Comment by Alex Zhuravlev [ 16/Jan/23 ] |
|
probably this is |
| Comment by Gerrit Updater [ 17/Apr/23 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50661 |
| Comment by Gerrit Updater [ 09/Jun/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50661/ |
| Comment by Peter Jones [ 09/Jun/23 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 12/Jun/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51289 |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51289/ |