
[LU-2049] add support for OBD_CONNECT_GRANT_PARAM Created: 28/Sep/12  Updated: 15/Oct/21  Resolved: 24/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Fix Version/s: Lustre 2.10.0

Type: Technical task Priority: Critical
Reporter: Johann Lombardi (Inactive) Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: HB, llnl, prz

Issue Links:
Blocker
is blocking LU-3522 sanity-benchmark test_iozone: "no spa... Resolved
Duplicate
is duplicated by LU-7970 intermittent ENOSPC on osd_write_commit Resolved
is duplicated by LU-1507 Fix confusing code in obd_connect_data Resolved
Related
is related to LU-7510 (vvp_io.c:1088:vvp_io_commit_write())... Resolved
is related to LU-7803 sanity test 78 failures in interop Resolved
is related to LU-8007 Kernel: LustreError: 191208:0:(vvp_io... Resolved
Rank (Obsolete): 4278

 Description   

Currently, grant is still inflated if the backend block size > page size (which is the case with the ZFS OSD).
OBD_CONNECT_GRANT_PARAM was added to address this, and we need to develop the code in osc & ofd to implement support for this feature.



 Comments   
Comment by Andreas Dilger [ 22/Aug/13 ]

To further clarify - the whole reason there is grant inflation with ZFS is that clients currently only consume grant in PAGE_SIZE chunks (i.e. typically 4kB units), since there are no native Linux filesystems with blocksize > PAGE_SIZE. This problem could also be hit with IA64/PPC/SPARC servers having PAGE_SIZE = 64kB and ext4 data blocks this large, or with ext4's "bigalloc" feature, so this is not necessarily a ZFS-only bug, just that ZFS is the first OSD used with a larger blocksize.

Having a server blocksize > client PAGE_SIZE means that, in the worst case, if some client is writing sparse PAGE_SIZE chunks into files, each of the client's smaller writes might consume a full block of space on the OST. Without the server-side grant inflation this could lead to the OST incorrectly running out of space before the client runs out of grant, losing writeback-cached data on the client.
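
To make that worst case concrete, here is a small stand-alone illustration (not Lustre code; the constants are just the 128KB-block / 4KB-page example used elsewhere in this ticket):

    #include <stdio.h>

    int main(void)
    {
        /* Typical x86 client page size and an example 128KB ZFS recordsize. */
        const unsigned int client_page_size = 4096;
        const unsigned int ost_blocksize = 131072;

        /* A sparse 4KB write may allocate a full 128KB block on the OST, so a
         * server that only sees PAGE_SIZE-based grant accounting must inflate
         * grant by this ratio to stay safe. */
        printf("worst-case inflation factor: %u\n",
               ost_blocksize / client_page_size);   /* 128/4 = 32 */
        return 0;
    }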

In order to fix this problem, client RPCs need to be modified to consume grant in units as given by ocd_inodespace, ocd_blocksize, and ocd_grant_extent when OBD_CONNECT_GRANT_PARAM is set. For each object modified by a write, ocd_inodespace is consumed. For data, the minimum chunk is sized and aligned on ocd_blocksize. Additionally, for each discontiguous extent of data (including the first one) consume ocd_grant_extent worth of space.
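
As a rough sketch of that accounting (not the actual osc implementation; the grant_params struct and grant_for_extent() helper are hypothetical, and the connect parameters are assumed to be already decoded into byte units):

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical decoded connect parameters, in bytes. */
    struct grant_params {
        size_t gp_inodespace;    /* per-object charge (ocd_inodespace) */
        size_t gp_blocksize;     /* backend blocksize (ocd_blocksize) */
        size_t gp_extent_tax;    /* per-extent overhead (ocd_grant_extent) */
    };

    /* Grant charged for one discontiguous extent [start, start + len):
     * expand the extent to blocksize alignment, then add the per-extent
     * overhead. */
    static size_t grant_for_extent(const struct grant_params *gp,
                                   size_t start, size_t len)
    {
        size_t first = start / gp->gp_blocksize * gp->gp_blocksize;
        size_t last = (start + len + gp->gp_blocksize - 1) /
                      gp->gp_blocksize * gp->gp_blocksize;

        return (last - first) + gp->gp_extent_tax;
    }

    int main(void)
    {
        struct grant_params gp = {
            .gp_inodespace = 4096,
            .gp_blocksize = 131072,    /* 128KB ZFS recordsize */
            .gp_extent_tax = 4096,
        };
        /* One object, two discontiguous 4KB chunks: charge the per-object
         * space once, then a full block plus the extent tax per chunk. */
        size_t grant = gp.gp_inodespace +
                       grant_for_extent(&gp, 0, 4096) +
                       grant_for_extent(&gp, 1 << 20, 4096);

        printf("grant consumed: %zu bytes\n", grant);
        return 0;
    }

With these example numbers each sparse 4KB chunk is charged a full 128KB block plus the extent tax, which is the consumption the server-side inflation scheme can only approximate.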

I tried printing out the current values assigned to these fields for a ZFS filesystem using http://review.whamcloud.com/6588, but this showed all of these fields as zero on the client, even after removing the OBD_CONNECT_GRANT_PARAM conditional check, so it looks like some work is needed on the server as well.

Comment by Andreas Dilger [ 22/Aug/13 ]

Johann Lombardi previously wrote, and I replied:

Here is a short summary of the grant work status:

1) The patch to estimate the overhead on the server side (as discussed in Breckenridge) has landed. We have this in place for ldiskfs-osd.
2) While testing this patch, I found that even for ldiskfs the overhead estimate can sometimes be defeated; that is ORI-237. The problem is that we need a per-fragment overhead which only the client can add.

It makes sense to add some larger reservation even for ldiskfs, one that can only be consumed by sync writes when the filesystem is nearly full. Reserving 1% is not unreasonable. Performance with a 99%-full filesystem will already suck, so forcing the client to do sync RPCs is not any worse... It would be good if a client doing sync RPCs would still send a full-sized RPC when the write() syscall was large enough, rather than doing it one page at a time, but that is not related to grant, per se.

3) While chatting with Alex & Andreas, I realized that ZFS reports a blocksize of 1MB, which is > PAGE_SIZE. This causes a problem because the Lustre client consumes PAGE_SIZE of grant space when dirtying one page, but that page might end up consuming 1MB instead of 4KB on the backend filesystem (a factor of 256).

ext4 will soon also get 1MB+ block sizes, courtesy of Google. There is a feature called "bigalloc" which increases the "cluster" allocation unit to be a power-of-two multiple of the blocksize (which itself is still limited to 4kB by PAGE_SIZE). This reduces the overhead of bitmap searching/modification when doing large IOs, at the expense of wasting space within each cluster for smaller writes.

The cluster size of the filesystem is fixed at format time and is constant for the whole filesystem, but that is fine for Lustre, since we already separate metadata from the data so we won't have problems with each directory consuming 1MB of space. The OST object directories will still consume space in 1MB chunks, but that is fine because we expect a million files to be created in each directory.

4) The 1.8 & 2.x client code has some basic support for blocksize < PAGE_SIZE (it moves the over-consumed space to lost_grant). The problem is that this code goes mad if the server reports a blocksize > PAGE_SIZE.

To address problems #2, #3 & #4, I am working on a patch set to implement the following:

  • add a new connect flag OBD_CONNECT_LARGE_BSIZE. When this flag is set:
      • the server reports the blocksize and the per-fragment overhead at connect time

Is there an overhead beyond just rounding up the consumed grant to the blocksize? Do we want to take indirect, etc. blocks into account on the client, or is that entirely handled on the server in the grant overhead? The only other amount I can think of is the per-inode space consumption, which is 0 for ldiskfs (due to static allocation and write-in-place) but non-zero for COW filesystems like ZFS with CROW. If we are adding new connect fields, we may as well add both under a single OBD_CONNECT flag.

Also, it seems that using 2 __u8 fields should be enough, since the blocksize and inode size are always going to be power-of-two values, so sending the log2 value is sufficient, and that allows block/inode sizes up to 2^256 bytes. Since we are near the end of the fields in obd_connect_data that 1.8 can easily use, I'd prefer to use them sparingly for any feature that needs compatibility with 1.8 or 2.0. The >2GB object size patch adds more space to obd_connect_data in a compatible way, but the new fields will only be usable by patched clients, so I'd rather avoid that complexity if not needed.

      • the client is able to consume grant more intelligently, taking into account the blocksize and the number of fragments (e.g. writing 256 contiguous pages consumes 1MB + per-fragment overhead)
  • when the client does not support this flag, there are 2 options (can be toggled through /proc):
      • the default behavior is to emulate a 4KB blocksize (to address problem #4) and the server uses grant inflation assuming that the client consumes 4KB of grant per OSD block (to address problem #3).
        OR
      • the server refuses to grant space to those clients (disable writeback cache).
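
A minimal sketch of the log2 encoding suggested above for the two __u8 fields, assuming a hypothetical struct rather than the real obd_connect_data layout:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical connect fields: a __u8 holding log2(size) covers any
     * power-of-two blocksize or per-inode overhead in a single byte. */
    struct grant_connect_params {
        uint8_t blocksize_bits;     /* log2(backend blocksize) */
        uint8_t inodespace_bits;    /* log2(per-inode space consumption) */
    };

    int main(void)
    {
        /* Server side: advertise a 128KB blocksize and 4KB inode overhead. */
        struct grant_connect_params p = {
            .blocksize_bits = 17,    /* 1 << 17 = 131072 */
            .inodespace_bits = 12,   /* 1 << 12 = 4096 */
        };

        /* Client side: decode back into byte units before accounting grant. */
        unsigned long blocksize = 1UL << p.blocksize_bits;
        unsigned long inodespace = 1UL << p.inodespace_bits;

        printf("blocksize=%lu inodespace=%lu\n", blocksize, inodespace);
        return 0;
    }
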
Comment by Andreas Dilger [ 18/Sep/13 ]

I don't think Oleg assigned this to himself on purpose.

Comment by Johann Lombardi (Inactive) [ 28/Sep/13 ]

Draft patch attached here: http://review.whamcloud.com/7793

Comment by Sarah Liu [ 21/Oct/14 ]

Here are the results for an unpatched server (ldiskfs) and a patched client, with sanity test_64b and sanityn test_15 enabled:
https://testing.hpdd.intel.com/test_sessions/ec4adf00-58eb-11e4-9a9c-5254006e85c2

Results for an unpatched server (zfs) and a patched client:
https://testing.hpdd.intel.com/test_sessions/eba4dd82-59d4-11e4-8dbb-5254006e85c2

I will change the test parameters, test other configurations, and keep you updated.

Comment by Sarah Liu [ 23/Oct/14 ]

Here is the test-only patch for a patched server with an unpatched client:
http://review.whamcloud.com/#/c/12404/

Comment by Gerrit Updater [ 27/Jan/15 ]

Johann Lombardi (johann.lombardi@intel.com) uploaded a new patch: http://review.whamcloud.com/13531
Subject: LU-2049 grant: delay grant releasing until commit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: edcd417f753864279a439ef8e80eff100fde3f72

Comment by Jinshan Xiong (Inactive) [ 09/Jun/15 ]

I will work on this.

Comment by Andreas Dilger [ 30/Jun/15 ]

Jinshan, are you able to refresh this patch?

Comment by Gerrit Updater [ 14/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13531/
Subject: LU-2049 grant: delay grant releasing until commit
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 31b7404f436241436fb0abdec2b6cd678c674d82

Comment by Andreas Dilger [ 03/Dec/15 ]

Patch http://review.whamcloud.com/7793 needs to be refreshed and landed.

Comment by Andreas Dilger [ 03/Dec/15 ]

The main goal of this patch is to reduce the grant over-provisioning for clients that do not understand large blocks on ZFS. It would be useful to run a manual test, or better, to write a sanity subtest that compares the grant on the client with the grant on the server, to ensure they roughly match rather than being inflated by a factor of (128/4).

For ZFS OSTs, the test should be skipped if this feature is not available in the OSC import file:

        [ "$(facet_fstype ost1)" = "ZFS" ] && $LCTL get_param osc.$FSNAME-OST0000*.import |
                grep -q "connect_flags:.*grant_param" ||
                { skip "grant_param not available" && return }
Comment by Gerrit Updater [ 20/Feb/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/7793/
Subject: LU-2049 grant: add support for OBD_CONNECT_GRANT_PARAM
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bd1e41672c974b97148b65115185a57ca4b7bbde

Comment by Oleg Drokin [ 23/Feb/16 ]

I filed LU-7803 for a potential interop issue that I am experiencing now.

Comment by Nathaniel Clark [ 23/Feb/16 ]

It looks like everything has landed for this; can this bug be resolved?

Comment by Andreas Dilger [ 25/Feb/16 ]

It doesn't appear that there was a test in the last patch to verify that the new grant code is working properly. I haven't looked in detail at whether it is practical to make a test or not, but that should at least be given a few minutes' attention before closing the bug.

Comment by Cameron Harr [ 11/Apr/16 ]

We started hitting the symptoms of LU-7510, for which this patch is marked as a fix, in the last couple of weeks or so. We're running a 2.5-5 branch. Of our 80 OSTs, we had 32 that were ~90% full and the rest were closer to 65% full. Deactivating those fuller OSTs appears to have worked around the issue for now, though we think it is starting to happen on a sister file system.

Comment by Gerrit Updater [ 01/Aug/16 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/21619
Subject: LU-2049 tests: FOR TEST ONLY GRANT_CHECK
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 64d3edb56be5b8af451a7b6947aad623fccf01ca

Comment by Nathaniel Clark [ 16/Aug/16 ]

After enabling grant checking:
https://testing.hpdd.intel.com/test_sets/275100e8-5ff2-11e6-b5b1-5254006e85c2

All the tests that were checked failed.

Comment by Gerrit Updater [ 07/Mar/17 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/25853
Subject: LU-2049 grant: Fix grant interop with pre-GRANT_PARAM clients
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 15044478ce5f96a6bc80d8209e7fa9fed3f1a8a0

Comment by Gerrit Updater [ 24/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25853/
Subject: LU-2049 grant: Fix grant interop with pre-GRANT_PARAM clients
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 03f24e6f786459b3dd8a37ced7fb3842b864613d

Comment by Peter Jones [ 24/May/17 ]

Landed for 2.10

Comment by Gerrit Updater [ 03/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21619/
Subject: LU-2049 tests: Add GRANT_CHECK_LIST to sanity
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 44c672e6aca39acbcca2aeb1b5b5b61a45265ce4
