LU-17151: sanity: test_411b Error: '(3) failed to write successfully'


    Description

      This issue was created by maloo for Serguei Smirnov <ssmirnov@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2a3fbe0b-f784-4875-bd67-6ab32aa223a3

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/99011 - 4.18.0-425.10.1.el8_7.aarch64
      servers: https://build.whamcloud.com/job/lustre-reviews/99011 - 4.18.0-477.21.1.el8_lustre.x86_64

      sanity test_411b: @@@@@@ FAIL: (3) failed to write successfully
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:6700:error()
      = /usr/lib64/lustre/tests/sanity.sh:27545:test_411b()
      = /usr/lib64/lustre/tests/test-framework.sh:7040:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:7096:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:6926:run_test()
      = /usr/lib64/lustre/tests/sanity.sh:27592:main()
      Dumping lctl log to /autotest/autotest-2/2023-09-27/lustre-reviews_review-ldiskfs-dne-arm_99011_29_0b05909e-9d3c-46a3-9f81-125f5c37cc5d//sanity.test_411b.*.1695797605.log
      CMD: trevis-108vm17.trevis.whamcloud.com,trevis-108vm18,trevis-72vm4,trevis-72vm5,trevis-72vm6 /usr/sbin/lctl dk > /autotest/autotest-2/2023-09-27/lustre-reviews_review-ldiskfs-dne-arm_99011_29_0b05909e-9d3c-46a3-9f81-125f5c37cc5d//sanity.test_411b.debug_log.$(hostname -s).1695797605.log;
      dmesg > /autotest/autotest-2/2023-09-27/lustre-reviews_review-ldiskfs-dne-arm_99011_29_0b05909e-9d3c-46a3-9f81-125f5c37cc5d//sanity.test_411b.dmesg.$(hostname -s).1695797605.log
      cache 19660800
      rss 0
      rss_huge 0
      shmem 0
      mapped_file 0
      dirty 2883584
      writeback 0
      swap 2949120
      pgpgin 21206
      pgpgout 20906
      pgfault 4034
      pgmajfault 417
      inactive_anon 0
      active_anon 0
      inactive_file 18022400
      active_file 1638400
      unevictable 0
      hierarchical_memory_limit 268435456
      hierarchical_memsw_limit 9223372036854710272
      total_cache 19660800
      total_rss 0
      total_rss_huge 0
      total_shmem 0
      total_mapped_file 0
      total_dirty 2883584
      total_writeback 0
      total_swap 2949120
      total_pgpgin 21206
      total_pgpgout 20906
      total_pgfault 4034
      total_pgmajfault 417
      total_inactive_anon 0
      total_active_anon 0
      total_inactive_file 18022400
      total_active_file 1638400
      total_unevictable 0
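
      The counters above follow the cgroup v1 memory.stat layout; hierarchical_memory_limit 268435456 is the 256 MiB hard limit the writer ran under. As a minimal sketch (the cgroup name below is an assumption, not necessarily the one sanity.sh uses), such a dump can be collected with:

      # cgroup v1 memory controller; path and group name are illustrative
      CG=/sys/fs/cgroup/memory/test_411b
      cat $CG/memory.stat                # per-group counters, as dumped above
      cat $CG/memory.limit_in_bytes      # hard limit (hierarchical_memory_limit)
      cat $CG/memory.max_usage_in_bytes  # peak usage seen inside the group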

          Activity

            nangelinas Nikitas Angelinas added a comment - edited

            There is a failure in a branch based on master at https://testing.whamcloud.com/test_sets/702d252c-180a-4b1a-a1cd-d8ab56d4d11f, which includes the fixes from this ticket, but I'm not sure whether it's due to the same issue.

            pjones Peter Jones added a comment -

            Merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55568/
            Subject: LU-17151 tests: increase memcg limit on x86_64
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a26b49279c86a08e84215698d5bbc3d1ebf4d939

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55568/ Subject: LU-17151 tests: increase memcg limit on x86_64 Project: fs/lustre-release Branch: master Current Patch Set: Commit: a26b49279c86a08e84215698d5bbc3d1ebf4d939

            "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55568
            Subject: LU-17151 tests: increase memcg limit on x86_64
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 78fccbdb60d4686afc1c3d004e069bbd71e1f3ff

            gerrit Gerrit Updater added a comment - "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55568 Subject: LU-17151 tests: increase memcg limit on x86_64 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 78fccbdb60d4686afc1c3d004e069bbd71e1f3ff

            adilger Andreas Dilger added a comment -

            The x86_64 limit of 384MB is quite small and could be increased. That said, it shouldn't be too large, or we won't know if cgroups is actually working properly.

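            As a minimal sketch of what "increasing the memcg limit" amounts to, assuming cgroup v1; the path, the 512M value and the file name are illustrative and not taken from the merged patch:

            CG=/sys/fs/cgroup/memory/test_411b
            mkdir -p $CG
            # pick a larger (but still bounded) limit; 512M is only an example
            echo 512M > $CG/memory.limit_in_bytes
            # run the writer inside the limited group; a write failing under this
            # limit is roughly what "(3) failed to write successfully" reports
            sh -c "echo \$\$ > $CG/cgroup.procs && \
                   dd if=/dev/zero of=/mnt/lustre/f411b.0 bs=1M count=100 conv=fsync"
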
            qian_wc Qian Yingjin added a comment -

            11 failures in the past week...

            adilger Andreas Dilger added a comment -

            22 failures in the past week


            bzzz Alex Zhuravlev added a comment -

            Does trim get passed from the VM down to the host to free the memory?

            yes, here are my local runs:

            • sanity on ZFS w/o trim: 9785 MBs allocated from the host by sanity's completion
            • sanity on ZFS w/ zfs trim in run_one_logged: 8159 MBs

            adilger Andreas Dilger added a comment -

            Does trim get passed from the VM down to the host to free the memory? You could try adding a patch to this subtest to run fstrim or zfs trim on all of the targets before or during the test. However, if only the MDT is on tmpfs then I don't think these tests are using much memory there.

            It would be useful to add some debugging to see where all of the memory is used (slabinfo/meminfo) so that we can reduce the size. Some of the internal data structures are allocated/limited at mount time based on the total RAM size and not limited by the cgroup size (e.g. lu cache, max_dirty_mb, etc), and the client needs to do a better job to free this memory under pressure (e.g. registered shrinker). Should we do anything to flush the cache at the start of the test, so that the process in the cgroup is not penalized by previous allocations outside its control?
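
            A minimal sketch of the kind of debugging and pre-test flush suggested above; where it is hooked into the test is an assumption, but the interfaces are standard /proc files and Lustre lctl parameters:

            # record where client memory sits before the cgroup-limited writes start
            cat /proc/meminfo > $TMP/meminfo.before
            slabtop -o -s c | head -25 > $TMP/slab.before   # largest slab caches first
            # flush page cache, dentries and inodes so earlier subtests do not
            # penalize the process running inside the cgroup
            sync; echo 3 > /proc/sys/vm/drop_caches
            # also drop client DLM locks and the pages they pin
            $LCTL set_param ldlm.namespaces.*.lru_size=clear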

            It would also be good to improve the test scripts slightly to match proper test script style (see the sketch after this list):

            • no need for "trap 0" in cleanup_test411_cgroup() since this would clobber any other registered stack_trap calls
            • in test_411a use "stack_trap cleanup_test411_cgroup" to do the cleanup instead of calling it explicitly
            • in test_411b add "stack_trap 'rm -f $DIR/$tfile.*'" to clean up the files even if the test fails
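
            A minimal sketch of the cleanup structure these points describe; names follow sanity.sh conventions, but this is illustrative rather than the actual subtest:

            cleanup_test411_cgroup() {
                # no "trap 0" here: that would clobber other registered stack_trap handlers
                rmdir "$1" 2>/dev/null || true
            }

            test_411_sketch() {    # hypothetical subtest, not the real test_411a/b
                local cgdir=/sys/fs/cgroup/memory/$tfile
                mkdir -p $cgdir || skip "no memory cgroup support"
                # register cleanups up front so they run even if the test fails part way
                stack_trap "cleanup_test411_cgroup $cgdir"
                stack_trap "rm -f $DIR/$tfile.*"
                echo 256M > $cgdir/memory.limit_in_bytes
                sh -c "echo \$\$ > $cgdir/cgroup.procs && \
                       dd if=/dev/zero of=$DIR/$tfile.0 bs=1M count=100" ||
                        error "(3) failed to write successfully"
            }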

            bzzz Alex Zhuravlev added a comment -

            My observation with tmpfs-based ZFS is that it tends not to reuse blocks but instead allocates new ones ("inner"), and this leads to memory overuse. It can be fixed by running "zfs trim ..." once every few minutes/subtests.

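            A minimal sketch of such a periodic trim, e.g. called from run_one_logged(); the facet and pool names are assumptions, and "zfs trim" here maps to the actual "zpool trim" CLI command (OpenZFS 0.8+):

            trim_zfs_targets() {
                local facet
                [ "$FSTYPE" = zfs ] || return 0
                for facet in mds1 ost1 ost2; do
                    # pool naming is illustrative; adjust to the real pool per target
                    do_facet $facet "zpool trim lustre-$facet" || true
                done
            }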

            adilger Andreas Dilger added a comment -

            Yes, it has been using tmpfs on the VM host for the MDT since August or so. It is exported to the VM guest as a block device. The OSTs are still on HDD, I believe.

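            A minimal sketch of the setup described here, run on the VM host; the paths, sizes and libvirt domain/target names are assumptions about the autotest configuration:

            # back the MDT with a file that lives on tmpfs on the VM host
            mount -t tmpfs -o size=20g tmpfs /mnt/mdt-tmpfs
            truncate -s 16G /mnt/mdt-tmpfs/mdt0.img
            # hand it to the guest as a raw virtio block device via libvirt
            virsh attach-disk guest-mds1 /mnt/mdt-tmpfs/mdt0.img vdb \
                --driver qemu --subdriver raw --targetbus virtio --persistent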

            People

              Assignee: qian_wc Qian Yingjin
              Reporter: maloo Maloo