Details
- Type: Improvement
- Priority: Minor
- Resolution: Fixed
Description
The sanity.sh test_27m, sanity.sh test_64b, and sanityn.sh test_15 subtests are almost always skipped during testing because they try to fill up an OST completely, and usually the OST is too large to fill with "dd". However, it is possible to fill an OST very quickly with "fallocate" (to within about 100 MB of full) and then use "dd" for the last bit of writing to better simulate normal file IO.
It would also be useful to merge the oos.sh and oos2.sh scripts to avoid code duplication.
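As a rough sketch of the approach (the file names, OST index, and 100 MB margin here are illustrative only, assuming a client mount at $MOUNT and an ldiskfs-backed OST that supports fallocate, not the actual test code):
lfs setstripe -c 1 -i 0 $MOUNT/oos_fill        # pin both files to OST0000
lfs setstripe -c 1 -i 0 $MOUNT/oos_tail
free_kb=$(lfs df $MOUNT | awk '/OST0000/ { print $4 }')
# reserve all but ~100 MB of the OST's free space in one fast fallocate call
fallocate -l $(((free_kb - 100000) * 1024)) $MOUNT/oos_fill
# fill the remainder with ordinary buffered writes; ENOSPC is the expected result
dd if=/dev/zero of=$MOUNT/oos_tail bs=1M || true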
Issue Links
- is related to LU-18955 sanityn test_15: Timeout occurred (Resolved)
Activity
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/59105/
Subject: LU-17224 tests: reduce MAXFREE in oos2.sh for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d37c231995c75d039b93c718f8fe76c41540f357
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/59105
Subject: LU-17224 tests: reduce MAXFREE in oos2.sh for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c7b2d4f7a9be26f2fde64fbf1b62fb58d21e6852
I think this test should just be disabled on ZFS for large volumes again.
The intent was to change the ldiskfs testing to use fallocate to speed up testing, but unintentionally it changed the limit from 400MB x OSTCOUNT to 1TB x OSTCOUNT for the filesystem. This is almost certainly going to fill up $TMP if that is where the ZFS OSTs are located, because ZFS doesn't support fallocate and will try to write the full data size, and you don't have 4TB of RAM on your test system...
I don't think there is any mystery here.
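For illustration, the kind of cap being discussed might look roughly like this in oos2.sh terms, assuming the usual $FSTYPE and $OSTCOUNT from test-framework.sh and MAXFREE expressed in KB (the values are illustrative, not the landed patch):
# keep the old, small per-OST limit on ZFS, where fallocate cannot
# reserve space and the test would otherwise write the full data size
if [ "$FSTYPE" = "zfs" ]; then
        MAXFREE=$((400000 * OSTCOUNT))    # ~400 MB (in KB) per OST
fi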
OK, locally this problem can be solved with:
mount -o remount,size=90% /tmp
to let tmpfs use more memory.
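A quick check should confirm the enlarged tmpfs before rerunning the test, e.g.:
df -h /tmp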
We still need to understand the problem with ZFS reported above.
sanityn/15 with zfs fails in AT as well:
https://testing.whamcloud.com/test_sets/64cd026c-f682-4623-b89a-670506e7720d
[10939.570286] Lustre: DEBUG MARKER: == sanityn test 15: test out-of-space with multiple writers ===================================================================== 22:15:48 (1745878548)
[10940.836112] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
[10942.812494] Lustre: DEBUG MARKER: cat /etc/system-release
[10943.078516] Lustre: DEBUG MARKER: test -r /etc/os-release
[10943.353119] Lustre: DEBUG MARKER: cat /etc/os-release
[11076.964479] Lustre: ll_ost_io00_011: service thread pid 271087 was inactive for 40.737 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[11076.964515] Pid: 271088, comm: ll_ost_io00_012 4.18.0-553.46.1.el8_lustre.x86_64 #1 SMP Thu Apr 10 07:27:37 UTC 2025
[11076.972741] Lustre: Skipped 1 previous similar message
[11076.975352] Call Trace TBD:
[11076.975627] [<0>] cv_wait_common+0xaf/0x130 [spl]
[11076.978302] [<0>] txg_wait_synced_impl+0xc6/0x110 [zfs]
[11076.979858] [<0>] txg_wait_synced+0xc/0x40 [zfs]
[11076.981202] [<0>] dmu_tx_wait+0x208/0x3f0 [zfs]
[11076.982521] [<0>] dmu_tx_assign+0x157/0x4d0 [zfs]
[11076.983927] [<0>] osd_trans_start+0xb8/0x450 [osd_zfs]
[11076.985363] [<0>] ofd_write_attr_set+0x18c/0x1290 [ofd]
[11076.986807] [<0>] ofd_commitrw_write+0x200/0x1d80 [ofd]
[11076.988192] [<0>] ofd_commitrw+0x5f6/0xda0 [ofd]
[11076.989452] [<0>] obd_commitrw+0x15e/0x2a0 [ptlrpc]
[11076.991431] [<0>] tgt_brw_write+0xf86/0x1f80 [ptlrpc]
[11076.992901] [<0>] tgt_request_handle+0x3f4/0x1b80 [ptlrpc]
[11076.994418] [<0>] ptlrpc_server_handle_request+0x27b/0xcd0 [ptlrpc]
[11076.996121] [<0>] ptlrpc_main+0xc81/0x1560 [ptlrpc]
[11076.997509] [<0>] kthread+0x134/0x150
[11076.998499] [<0>] ret_from_fork+0x35/0x40
it's good that the test exposes this problem though.
the last landed patch breaks local testing:
== sanityn test 15: test out-of-space with multiple writers ===================================================================== 04:33:12 (1745555592)
[ 39.413547] Lustre: DEBUG MARKER: == sanityn test 15: test out-of-space with multiple writers ===================================================================== 04:33:12 (1745555592)
PATH=/mnt/build/lustre/tests/../tests/mpi:/mnt/build/lustre/tests/../tests/racer:/mnt/build/lustre/tests/../../lustre-iokit/sgpdd-survey:/mnt/build/lustre/tests/../tests:/mnt/build/lustre/tests/../utils/gss:/mnt/build/lustre/tests/../utils:/root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/mnt/build/lustre/utils:/mnt/build/lustre/tests::/mnt/build/lustre/scripts:/mnt/build/lustre-iokit/mds-survey:/mnt/build/lustre-iokit/obdfilter-survey:/mnt/build/lipe/src:/opt/iozone/bin:/usr/lib64/openmpi/bin:
Reading test skip list from /tmp/ltest.config
EXCEPT="$EXCEPT 14 55d 78 106"
mgs: Rocky Linux release 9.3 (Blue Onyx)
MGS_OS_VERSION_ID=9.3 MGS_OS_ID=rocky MGS_OS_VERSION_CODE=151191552 MGS_OS_ID_LIKE=rhel centos fedora rocky
mds1: Rocky Linux release 9.3 (Blue Onyx)
MDS1_OS_ID=rocky MDS1_OS_VERSION_CODE=151191552 MDS1_OS_ID_LIKE=rhel centos fedora rocky MDS1_OS_VERSION_ID=9.3
ost1: Rocky Linux release 9.3 (Blue Onyx)
OST1_OS_VERSION_ID=9.3 OST1_OS_VERSION_CODE=151191552 OST1_OS_ID=rocky OST1_OS_ID_LIKE=rhel centos fedora rocky
client: Rocky Linux release 9.3 (Blue Onyx)
CLIENT_OS_VERSION_ID=9.3 CLIENT_OS_ID_LIKE=rhel centos fedora rocky CLIENT_OS_ID=rocky CLIENT_OS_VERSION_CODE=151191552
STRIPECOUNT=2 ORIGFREE=4796328 MAXFREE=2097152000
[ 61.914712] loop: Write error at byte offset 2191167488, length 4096.
[ 61.915295] loop: Write error at byte offset 2189426688, length 4096.
[ 61.915295] blk_print_req_error: 49 callbacks suppressed
[ 61.915295] I/O error, dev loop2, sector 4278784 op 0x1:(WRITE) flags 0x4000 phys_seg 20 prio class 2
[ 61.915573] I/O error, dev loop2, sector 4276224 op 0x1:(WRITE) flags 0x4000 phys_seg 20 prio class 2
[ 61.915759] LustreError: 3415:0:(osc_request.c:2438:osc_brw_redo_request()) @@@ redo for recoverable error -5 req@ffff9a07b47f9580 x1830347671373568/t4294967849(4294967849) o4->lustre-OST0001-osc-ffff9a0745d6c000@0@lo:6/4 lens 488/448 e 0 to 0 dl 1745555630 ref 3 fl Interpret:RQU/604/0 rc -5/-5 job:'dd.0' uid:0 gid:0 projid:0
[ 61.931445] loop: Write error at byte offset 2189164544, length 4096.
[ 61.931831] loop: Write error at byte offset 2187853824, length 4096.
[ 61.931831] I/O error, dev loop2, sector 4275712 op 0x1:(WRITE) flags 0x0 phys_seg 4 prio class 2
[ 61.931875] loop: Write error at byte offset 2186543104, length 4096.
[ 61.931912] I/O error, dev loop2, sector 4273152 op 0x1:(WRITE) flags 0x4000 phys_seg 20 prio class 2
[ 61.931999] loop: Write error at byte offset 2185232384, length 4096.
[ 61.932170] I/O error, dev loop2, sector 4270592 op 0x1:(WRITE) flags 0x4000 phys_seg 20 prio class 2
[ 61.932249] I/O error, dev loop2, sector 4268032 op 0x1:(WRITE) flags 0x4000 phys_seg 20 prio class 2
[ 61.960158] loop: Write error at byte offset 2193358848, length 4096.
[ 61.960239] loop: Write error at byte offset 2192048128, length 4096.
[ 61.960239] I/O error, dev loop1, sector 4283904 op 0x1:(WRITE) flags 0x0 phys_seg 4 prio class 2
[ 61.960279] loop: Write error at byte offset 2190737408, length 4096.
[ 61.960523] loop: Write error at byte offset 2189426688, length 4096.
[ 61.960523] I/O error, dev loop1, sector 4281344 op 0x1:(WRITE) flags 0x4000 phys_seg 20 prio class 2
[ 61.960718] I/O error, dev loop1, sector 4278784 op 0x1:(WRITE) flags 0x4000 phys_seg 20 prio class 2
[ 61.960756] I/O error, dev loop1, sector 4276224 op 0x1:(WRITE) flags 0x4000 phys_seg 20 prio class 2
[ 62.458045] LustreError: 3416:0:(osc_request.c:2438:osc_brw_redo_request()) @@@ redo for recoverable error -5 req@ffff9a079f412e00 x1830347671379840/t4294967773(4294967773) o4->lustre-OST0000-osc-ffff9a0745d6c000@0@lo:6/4 lens 488/448 e 0 to 0 dl 1745555631 ref 3 fl Interpret:RQU/604/0 rc -5/-5 job:'dd.0' uid:0 gid:0 projid:0
[ 62.458240] LustreError: 3416:0:(osc_request.c:2438:osc_brw_redo_request()) Skipped 20 previous similar messages
[ 64.144991] LustreError: 3416:0:(osc_request.c:2438:osc_brw_redo_request()) @@@ redo for recoverable error -5 req@ffff9a077f0a9700 x1830347671377152/t4294967870(4294967870) o4->lustre-OST0001-osc-ffff9a074acee000@0@lo:6/4 lens 488/448 e 0 to 0 dl 1745555633 ref 3 fl Interpret:RQU/604/0 rc -5/-5 job:'ptlrpcd_00_01.0' uid:0 gid:0 projid:0
[ 64.145104] LustreError: 3416:0:(osc_request.c:2438:osc_brw_redo_request()) Skipped 16 previous similar messages
[ 65.957220] Aborting journal on device dm-2-8.
[ 65.957235] Aborting journal on device dm-1-8.
[ 65.957317] LustreError: 5988:0:(osd_handler.c:1880:osd_trans_commit_cb()) transaction @0xffff9a079f7b3200 commit error: 2
[ 66.353992] LDISKFS-fs error (device dm-1): ldiskfs_journal_check_start:83: comm ll_ost_io00_002: Detected aborted journal
[ 66.354864] LDISKFS-fs error (device dm-2): ldiskfs_journal_check_start:83: comm ll_ost_io00_000: Detected aborted journal
[ 66.354972] LDISKFS-fs (dm-1): Remounting filesystem read-only
[ 66.355151] LustreError: 3416:0:(osc_request.c:2438:osc_brw_redo_request()) @@@ redo for recoverable error -30 req@ffff9a0877cc1200 x1830347682812288/t0(0) o4->lustre-OST0000-osc-ffff9a074acee000@0@lo:6/4 lens 568/464 e 0 to 0 dl 1745555635 ref 2 fl Interpret:RMQU/600/0 rc -30/-30 job:'dd.0' uid:0 gid:0 projid:0
[ 66.355166] LDISKFS-fs (dm-2): Remounting filesystem read-only
[ 66.355166] LustreError: 3416:0:(osc_request.c:2438:osc_brw_redo_request()) Skipped 18 previous similar messages
^C
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/57260/
Subject: LU-17224 tests: improve OST out-of-space testing
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dcd5f6c0a0139c9d0f338509b19d9e4dce01a29e
Second patch merged