[LU-11388] replay-single test_131b: test timeout Created: 17/Sep/18  Updated: 20/Jan/24  Resolved: 09/Jun/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.7, Lustre 2.15.0, Lustre 2.15.3
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Mikhail Pershin
Resolution: Fixed
Votes: 0
Labels: failing_tests

Issue Links:
Related
is related to LU-16478 faulty MDT connection can leak a refe... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/41a16040-b988-11e8-9df3-52540065bddc

test_131b failed with the following error:

Timeout occurred after 969 mins, last suite running was replay-single, restarting cluster to continue tests

This test has been failing since tag-2.11.55.
Test log:

== replay-single test 131b: DoM file write replay ==================================================== 11:35:59 (1537011359)
CMD: trevis-33vm4 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: trevis-33vm4 sync; sync; sync
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1165900        8732     1053972   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276       79708     1731404   4% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       25880     1786156   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       25808     1786228   1% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       31488     1780548   2% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       41772     1770264   2% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       25888     1786148   1% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       25840     1786196   1% /mnt/lustre[OST:6]

filesystem_summary:     13532932      256384    12426944   2% /mnt/lustre

CMD: trevis-33vm1.trevis.whamcloud.com,trevis-33vm2 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: trevis-33vm1.trevis.whamcloud.com,trevis-33vm2 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: trevis-33vm4 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: trevis-33vm4 dmsetup table /dev/mapper/mds1_flakey
CMD: trevis-33vm4 dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
CMD: trevis-33vm4 dmsetup load /dev/mapper/mds1_flakey --table \"0 4194304 flakey 252:0 0 0 1800 1 drop_writes\"
CMD: trevis-33vm4 dmsetup resume /dev/mapper/mds1_flakey
CMD: trevis-33vm4 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
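
The `dmsetup load` line in the log above installs a dm-flakey table on the MDT device. As a reading aid, here is a minimal sketch that splits such a table string into its fields, assuming the field order given in the kernel dm-flakey documentation (start, length, target, device, offset, up interval, down interval, optional feature count and features). The function name `decode_flakey_table` is hypothetical and not part of the Lustre test framework.

```shell
#!/bin/sh
# Decode a dm-flakey table line as printed in the test log.
# Assumed field layout (kernel dm-flakey documentation):
#   <start> <length> flakey <dev> <offset> <up_interval> <down_interval>
#   [<num_features> <feature args...>]
decode_flakey_table() {
    # Word-split the table string into positional parameters.
    set -- $1
    if [ "$#" -lt 7 ] || [ "$3" != "flakey" ]; then
        echo "not a flakey table: $*" >&2
        return 1
    fi
    start=$1 length=$2 dev=$4 offset=$5 up=$6 down=$7
    shift 7
    nfeat=${1:-0}
    # Drop the feature count so that only the feature names remain.
    if [ "$#" -gt 0 ]; then shift; fi
    echo "device=$dev start=$start offset=$offset length_sectors=$length"
    # Device-mapper lengths are counted in 512-byte sectors.
    echo "size_bytes=$((length * 512))"
    echo "up_interval_s=$up down_interval_s=$down"
    echo "num_features=$nfeat features=${*:-none}"
}

# Table string taken from the log above.
decode_flakey_table "0 4194304 flakey 252:0 0 0 1800 1 drop_writes"
```

For the table in this log, the length of 4194304 sectors corresponds to a 2 GiB device, the up interval is 0 seconds and the down interval is 1800 seconds, with `drop_writes` as the only feature, so writes to the MDT device are silently discarded after the replay barrier.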




Comments
Comment by James Nunez (Inactive) [ 19/Sep/18 ]

Mike,
Would you please investigate this timeout?

Thank you

Comment by Mikhail Pershin [ 28/Sep/18 ]

This timeout happens when IO switches to sync mode, which cannot be replayed properly with replay_barrier. The switch seems to be caused by insufficient grants. I am thinking about a possible workaround, so at the moment this is a test issue.
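
To illustrate the condition described in this comment, here is a deliberately simplified model of the decision, not Lustre's actual grant accounting: a client can cache a write only if it holds enough grant to cover it, and otherwise falls back to a synchronous write. The function name `write_mode` and the byte values are hypothetical.

```shell
#!/bin/sh
# Simplified model only (NOT Lustre's real grant accounting):
# a write is cached when the available grant covers it, otherwise
# it is sent synchronously.
# Usage: write_mode <available_grant_bytes> <write_size_bytes>
write_mode() {
    if [ "$1" -ge "$2" ]; then
        echo cached
    else
        echo sync
    fi
}

# Hypothetical values for illustration.
write_mode 0 8          # no grant left: the write becomes synchronous
write_mode 1048576 8    # enough grant: the write can be cached
```

In this model, the small write issued after the replay barrier only stays in cached mode if grant was obtained beforehand; with no grant it becomes the sync IO that, per the comment above, cannot be replayed properly.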

Comment by Gerrit Updater [ 03/Oct/18 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33279
Subject: LU-11388 test: disable replay-single test_131b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 17ff9307a282d2cfc0d2a753746854c44a649a0e

Comment by Mikhail Pershin [ 03/Oct/18 ]

Disabling the test because of its unstable behavior. While this is a test issue, it may still point to a possible bug in the MDC-MDT grants code.

Comment by Gerrit Updater [ 10/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33279/
Subject: LU-11388 test: disable replay-single test_131b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 02b6b6746af7e032df51001926fe1d59143520da

Comment by Gerrit Updater [ 27/Oct/20 ]

Vikentsi Lapa (vlapa@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40421
Subject: LU-11388 test: enable replay-single test_131b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4ba0faacc5ae81bea5cd3fada7298c5563a4c219

Comment by James Nunez (Inactive) [ 16/Jun/21 ]

Just a note that we are still seeing replay-single test 131b time out for ldiskfs servers on master.
Branch testing, full test group: https://testing.whamcloud.com/test_sets/7322774c-afbc-4e20-ac2e-7b86cfcf251c
Failover testing: https://testing.whamcloud.com/test_sets/57e50c3e-e790-4966-b834-ccc00fa41a81

The patch above only disables this test for ZFS servers.

Comment by Alena Nikitenko [ 03/Dec/21 ]

Found something similar on 2.12.8 testing: https://testing.whamcloud.com/test_sets/c76df3be-faea-4f20-bb36-f44818a6a7bf

== replay-single test 131b: DoM file write replay ==================================================== 13:33:22 (1637415202)
CMD: onyx-109vm10 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: onyx-109vm10 sync; sync; sync
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      5781172        3020     5255320   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1908940       17728     1769720   1% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1908940        1332     1786368   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1908940        1324     1786376   1% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1908940       11568     1776132   1% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1908940       11568     1776132   1% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1908940       11564     1776136   1% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1908940        1328     1786372   1% /mnt/lustre[OST:6]

filesystem_summary:     13362580       56412    12457236   1% /mnt/lustre

CMD: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: onyx-109vm10 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: onyx-109vm10 dmsetup table /dev/mapper/mds1_flakey
CMD: onyx-109vm10 dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
CMD: onyx-109vm10 dmsetup load /dev/mapper/mds1_flakey --table \"0 20971520 flakey 252:0 0 0 1800 1 drop_writes\"
CMD: onyx-109vm10 dmsetup resume /dev/mapper/mds1_flakey
CMD: onyx-109vm10 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000 
'Timeout occurred after 692 mins, last suite running was replay-single'

Comment by Gerrit Updater [ 23/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/40421/
Subject: LU-11388 test: enable replay-single test_131b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cb3b2bb683ce128d5d9dacebbe01b23c183cbf4d

Comment by Peter Jones [ 23/Dec/21 ]

Does this landing mean that this ticket can be closed or does further work remain?

Comment by Sergey Cheremencev [ 30/Dec/21 ]

+1 on master: https://testing.whamcloud.com/test_sets/a0db5427-6afe-4709-91e7-ed111a3ce01f

It failed even with "LU-11388 test: enable replay-single test_131b".

Comment by Patrick Farrell [ 08/Mar/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/b0074d9c-d7fd-45c1-a5b9-79c61aad0f20

Comment by Etienne Aujames [ 13/Apr/22 ]

+1 on master (ZFS): https://testing.whamcloud.com/test_sets/4c0e7862-d495-4f2e-ab4f-8c30e3a3dc59

Comment by Alex Zhuravlev [ 16/Jan/23 ]

This is probably LU-16478.

Comment by Gerrit Updater [ 17/Apr/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50661
Subject: LU-11388 tests: replay-single/131b to refresh grants
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6dc4285f738158a90c2ff6b6bd3cbc430b580654

Comment by Gerrit Updater [ 09/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50661/
Subject: LU-11388 tests: replay-single/131b to refresh grants
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 384e1e858eef826677bfa6913074a83c4fab37d3

Comment by Peter Jones [ 09/Jun/23 ]

Landed for 2.16

Comment by Gerrit Updater [ 12/Jun/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51289
Subject: LU-11388 tests: replay-single/131b to refresh grants
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 119867a8add693b8dd7165456fefb7327dd4ed02

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51289/
Subject: LU-11388 tests: replay-single/131b to refresh grants
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 653ae754fa93ecf8b9d290675122956eaf63b6af
