    Description

      The sanity.sh test_27m, sanity.sh test_64b, and sanityn.sh test_15 subtests are almost always skipped during testing because they try to fill an OST completely, and the OST is usually too large to fill with "dd" in a reasonable time. However, an OST can be filled very quickly with "fallocate" (to within 100 MB of full), with "dd" then handling the last bit of writing to better simulate normal file IO.
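The two-phase fill described above can be sketched roughly as follows. This is only an illustration, not the code from sanity.sh or oos.sh: the helper names plan_fill and fill_dir, the file names, and the default 100 MB margin handling are assumptions, and a real Lustre test would also need to pin the files to a specific OST, which this sketch does not attempt.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not taken from the Lustre test scripts.

# plan_fill AVAIL_KB [MARGIN_KB]
# Print how many KiB to preallocate so that roughly MARGIN_KB
# (default 102400 KiB = 100 MiB) is left over for dd to write.
plan_fill() {
	local avail_kb=$1
	local margin_kb=${2:-102400}
	local prealloc_kb=$((avail_kb - margin_kb))

	if ((prealloc_kb < 0)); then
		prealloc_kb=0
	fi
	echo "$prealloc_kb"
}

# fill_dir DIR
# Fill the filesystem holding DIR: fallocate for the bulk, dd for the tail.
fill_dir() {
	local dir=$1
	local avail_kb prealloc_kb

	# With -P -k, the 4th column of the 2nd line is the available space in KiB.
	avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
	prealloc_kb=$(plan_fill "$avail_kb")

	# Fast phase: reserve most of the space without writing any data.
	if ((prealloc_kb > 0)); then
		fallocate -l "${prealloc_kb}KiB" "$dir/fill.prealloc" || return 1
	fi

	# Slow phase: real writes until the filesystem reports it is full.
	# dd is expected to stop with an out-of-space error, so its status is ignored.
	dd if=/dev/zero of="$dir/fill.dd" bs=1M 2>/dev/null
	return 0
}
```

Calling fill_dir on a scratch mount point would run both phases; plan_fill is kept separate so the size calculation can be checked without touching a filesystem.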

      It would also be useful to merge the oos.sh and oos2.sh scripts to avoid code duplication.

          Activity

            [LU-17224] improve OST out-of-space testing
            pjones Peter Jones added a comment -

            Second patch merged

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/59105/
            Subject: LU-17224 tests: reduce MAXFREE in oos2.sh for ZFS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d37c231995c75d039b93c718f8fe76c41540f357

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/59105
            Subject: LU-17224 tests: reduce MAXFREE in oos2.sh for ZFS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c7b2d4f7a9be26f2fde64fbf1b62fb58d21e6852

            adilger Andreas Dilger added a comment -

            I think this test should just be disabled on ZFS for large volumes again.

            The intent was to change the ldiskfs testing to use fallocate to speed up testing, but unintentionally it changed the limit from 400MB x OSTCOUNT to 1TB x OSTCOUNT for the filesystem. This is almost certainly going to fill up $TMP if that is where the ZFS OSTs are located, because ZFS doesn't support fallocate and will try to write the full data size, and you don't have 4TB of RAM on your test system...

            I don't think there is any mystery here.
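The arithmetic behind the comment above can be illustrated with a small sketch. The function names are hypothetical and the per-OST values are simply the 400MB and 1TB figures quoted in the comment. They are not the MAXFREE values from the merged patch, which are not shown here.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; values come from the comment, not the patch.

# per_ost_limit_mb FSTYPE
# Per-OST fill limit in MB: keep the old 400 MB cap where fallocate is
# unavailable (ZFS), allow 1 TB where it is available (ldiskfs and others).
per_ost_limit_mb() {
	case "$1" in
	zfs) echo 400 ;;
	*) echo $((1024 * 1024)) ;;
	esac
}

# total_fill_mb FSTYPE OSTCOUNT
# Total amount of data the test may have to write for the whole filesystem.
total_fill_mb() {
	local fstype=$1
	local ostcount=$2

	echo $(($(per_ost_limit_mb "$fstype") * ostcount))
}
```

With 4 OSTs, total_fill_mb prints 1600 for zfs and 4194304 (4 TB expressed in MB) for ldiskfs, which is the gap between the old and the unintended new limit.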

            bzzz Alex Zhuravlev added a comment -

            OK, locally this problem can be solved with:

            mount -o remount,size=90% /tmp

            which lets tmpfs use more memory.
            We still need to understand the problem with ZFS reported above.
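Before applying the remount above, it may help to check what actually backs the directory and how much space it has. The following is a minimal sketch: the function name tmp_backing_report is hypothetical, and it assumes a df that accepts -P, -T and -k together (GNU coreutils and BusyBox do).

```shell
#!/bin/bash
# Hypothetical helper for illustration.

# tmp_backing_report [PATH]
# Print the filesystem type and available KiB for PATH (default /tmp).
tmp_backing_report() {
	local path=${1:-/tmp}
	local type avail_kb

	# With -P -T -k the 2nd line holds: device, type, size, used, avail, ...
	type=$(df -PTk "$path" | awk 'NR == 2 { print $2 }')
	avail_kb=$(df -PTk "$path" | awk 'NR == 2 { print $5 }')
	echo "$path: type=$type avail_kb=$avail_kb"
}
```

If the reported type is tmpfs, the available space is bounded by the tmpfs size limit, which is what the size=90% remount option raises.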

            bzzz Alex Zhuravlev added a comment -

            sanityn/15 with zfs fails in AT as well:
            https://testing.whamcloud.com/test_sets/64cd026c-f682-4623-b89a-670506e7720d

            [10939.570286] Lustre: DEBUG MARKER: == sanityn test 15: test out-of-space with multiple writers ===================================================================== 22:15:48 (1745878548)
            [10940.836112] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
            [10942.812494] Lustre: DEBUG MARKER: cat /etc/system-release
            [10943.078516] Lustre: DEBUG MARKER: test -r /etc/os-release
            [10943.353119] Lustre: DEBUG MARKER: cat /etc/os-release
            [11076.964479] Lustre: ll_ost_io00_011: service thread pid 271087 was inactive for 40.737 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [11076.964515] Pid: 271088, comm: ll_ost_io00_012 4.18.0-553.46.1.el8_lustre.x86_64 #1 SMP Thu Apr 10 07:27:37 UTC 2025
            [11076.972741] Lustre: Skipped 1 previous similar message
            [11076.975352] Call Trace TBD:
            [11076.975627] [<0>] cv_wait_common+0xaf/0x130 [spl]
            [11076.978302] [<0>] txg_wait_synced_impl+0xc6/0x110 [zfs]
            [11076.979858] [<0>] txg_wait_synced+0xc/0x40 [zfs]
            [11076.981202] [<0>] dmu_tx_wait+0x208/0x3f0 [zfs]
            [11076.982521] [<0>] dmu_tx_assign+0x157/0x4d0 [zfs]
            [11076.983927] [<0>] osd_trans_start+0xb8/0x450 [osd_zfs]
            [11076.985363] [<0>] ofd_write_attr_set+0x18c/0x1290 [ofd]
            [11076.986807] [<0>] ofd_commitrw_write+0x200/0x1d80 [ofd]
            [11076.988192] [<0>] ofd_commitrw+0x5f6/0xda0 [ofd]
            [11076.989452] [<0>] obd_commitrw+0x15e/0x2a0 [ptlrpc]
            [11076.991431] [<0>] tgt_brw_write+0xf86/0x1f80 [ptlrpc]
            [11076.992901] [<0>] tgt_request_handle+0x3f4/0x1b80 [ptlrpc]
            [11076.994418] [<0>] ptlrpc_server_handle_request+0x27b/0xcd0 [ptlrpc]
            [11076.996121] [<0>] ptlrpc_main+0xc81/0x1560 [ptlrpc]
            [11076.997509] [<0>] kthread+0x134/0x150
            [11076.998499] [<0>] ret_from_fork+0x35/0x40

            it's good that the test exposes this problem though.
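When scanning a long console log for this kind of stall, the "service thread ... was inactive" lines can be pulled out with a small filter. This is a sketch only: the function name is hypothetical and the pattern is derived from the single message format shown in the log above, so other Lustre versions may word it differently.

```shell
#!/bin/bash
# Hypothetical helper for illustration.

# hung_service_threads
# Read a console log on stdin and print "THREAD PID SECONDS" for each
# "service thread pid N was inactive for S seconds" message found.
hung_service_threads() {
	sed -n -E 's/.*Lustre: ([^:]+): service thread pid ([0-9]+) was inactive for ([0-9.]+) seconds.*/\1 \2 \3/p'
}
```

For example, `dmesg | hung_service_threads` on the OSS would list the affected threads; for the log above it reports ll_ost_io00_011 with pid 271087 and 40.737 seconds.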

            bzzz Alex Zhuravlev added a comment -

            pjones, with this patch landed, local testing is broken.
            pjones Peter Jones added a comment -

            Merged for 2.17

            People

              vilapa Vikentsi Lapa
              adilger Andreas Dilger