sanity-benchmark test_iozone: "no space left on device" on ZFS (LU-3522)
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Technical task | Priority: | Critical |
| Reporter: | Johann Lombardi (Inactive) | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB, llnl, prz |
| Issue Links: | |
| Rank (Obsolete): | 4278 |
| Description |
|
Currently, grant is still inflated if the backend block size is larger than the client page size (which is the case with the ZFS OSD).
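For a sense of scale, a minimal sketch assuming a 4kB client PAGE_SIZE and the default 128kB ZFS recordsize (the numbers are illustrative, not taken from a particular system):

    /* Sketch: worst-case grant inflation when every dirty client page
     * could land in its own backend block. Values are illustrative. */
    #include <stdio.h>

    int main(void)
    {
            unsigned long page_size = 4096;    /* assumed client PAGE_SIZE */
            unsigned long blocksize = 131072;  /* assumed ZFS recordsize */

            /* a 4kB page may pin a full 128kB block: 32x over-provisioning */
            printf("inflation factor: %lux\n", blocksize / page_size);
            return 0;
    }
|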
| Comments |
| Comment by Andreas Dilger [ 22/Aug/13 ] |
|
To further clarify - the whole reason that there is grant inflation with ZFS is because clients currently only consume grant in PAGE_SIZE chunks (i.e. typically 4kB units), since there are no native Linux filesystems with blocksize > PAGE_SIZE. This problem could also be hit with IA64/PPC/SPARC servers with PAGE_SIZE = 64kB and ext4 having data blocks this large, or with ext4's "bigalloc" feature, so this is not necessarily a ZFS-only bug; ZFS is just the first OSD used with a larger blocksize.

Having server blocksize > client PAGE_SIZE means that (in the worst case) if some client is writing sparse PAGE_SIZE chunks into files, each of the client's smaller writes might consume a full block of space on the OST. Without the server-side grant inflation this could lead to the OST incorrectly running out of space before the client runs out of grant, losing writeback cached data on the client.

In order to fix this problem, client RPCs need to be modified to consume grant in units given by ocd_inodespace, ocd_blocksize, and ocd_grant_extent when OBD_CONNECT_GRANT_PARAM is set. For each object modified by a write, ocd_inodespace is consumed. For data, the minimum chunk is sized and aligned on ocd_blocksize. Additionally, for each discontiguous extent of data (including the first one), ocd_grant_extent worth of space is consumed.

I tried printing out the current values assigned to these fields for a ZFS filesystem using http://review.whamcloud.com/6588, but this showed all of these fields as zero on the client, even after removing the OBD_CONNECT_GRANT_PARAM conditional check, so it looks like some work is needed on the server as well.
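As a concrete illustration of the accounting rule above, here is a minimal C sketch, assuming the ocd_inodespace/ocd_blocksize/ocd_grant_extent values are plain byte counts (per this comment; the log2 encoding proposed later would change that) and using a made-up write_extent structure. This is not the actual client code:

    /*
     * Hypothetical sketch of the per-RPC grant accounting described above.
     * The ocd_* names come from this comment; struct write_extent and the
     * helpers are made up for illustration.
     */
    #include <stddef.h>

    struct write_extent {
            unsigned long long start;   /* byte offset in the object */
            unsigned long long length;  /* bytes written, must be > 0 */
    };

    /* number of ocd_blocksize-sized blocks touched by one extent */
    static unsigned long long blocks_spanned(const struct write_extent *e,
                                             unsigned int blocksize)
    {
            unsigned long long first = e->start / blocksize;
            unsigned long long last = (e->start + e->length - 1) / blocksize;

            return last - first + 1;
    }

    /*
     * Grant consumed by one write RPC modifying one object:
     * ocd_inodespace once for the object, plus, for every discontiguous
     * extent (including the first), the extent rounded out to
     * ocd_blocksize boundaries plus ocd_grant_extent of extent overhead.
     */
    unsigned long long grant_for_write(const struct write_extent *ext,
                                       size_t nr_extents,
                                       unsigned int ocd_inodespace,
                                       unsigned int ocd_blocksize,
                                       unsigned int ocd_grant_extent)
    {
            unsigned long long grant = ocd_inodespace;
            size_t i;

            for (i = 0; i < nr_extents; i++)
                    grant += blocks_spanned(&ext[i], ocd_blocksize) *
                             ocd_blocksize + ocd_grant_extent;

            return grant;
    }
|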
| Comment by Andreas Dilger [ 22/Aug/13 ] |
|
Johann Lombardi previously wrote, and I replied:
It makes sense to add some larger reservation even for ldiskfs; the reserved space can then only be accessed by sync writes when the filesystem is nearly full. 1% is not an unreasonable amount to reserve. Performance with a 99% full filesystem will already suck, so forcing the client to do sync RPCs is not any worse... It would be good if a client doing sync RPCs would still send a full-sized RPC if the write() syscall was large enough, rather than doing it 1 page at a time, but that is not related to grant, per se.
ext4 will soon also get 1MB+ block sizes, courtesy of Google. There is a feature called "bigalloc" which increases the "cluster" allocation unit to be a power-of-two multiple of the blocksize (still 4kB limited by PAGE_SIZE). This reduces overhead from bitmap searching/modification when doing large IOs, at the expense of wasting space within each cluster for smaller writes. The cluster size of the filesystem is fixed at format time and is constant for the whole filesystem, but that is fine for Lustre, since we already separate metadata from the data so we won't have problems with each directory consuming 1MB of space. The OST object directories will still consume space in 1MB chunks, but that is fine because we expect a million files to be created in each directory.
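As a rough illustration of the bigalloc relationship described above (the helper and its log-ratio parameter are hypothetical; ext4 itself derives the actual value from superblock fields):

    /* Sketch: a bigalloc "cluster" is a power-of-two multiple of the
     * blocksize; the log-ratio parameter here is hypothetical. */
    unsigned long cluster_size(unsigned long blocksize, unsigned int log_ratio)
    {
            /* e.g. blocksize = 4096, log_ratio = 8 -> 1MB clusters */
            return blocksize << log_ratio;
    }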
Is there an overhead beyond just rounding up the consumed grant to the blocksize? Do we want to take indirect, etc. blocks into account on the client, or is that entirely handled on the server in the grant overhead? The only other amount I can think of is the per-inode space consumption, which is 0 for ldiskfs (due to static allocation and write-in-place) but non-zero for COW filesystems like ZFS with CROW.

If we are adding new connect fields, we may as well add both under a single OBD_CONNECT flag. Also, it seems that using 2 __u8 fields should be enough, since the blocksize and inode size are always going to be power-of-two values, so sending the log2 value is enough, and that allows block/inode sizes up to 2^255 bytes. Since we are near the end of the fields in obd_connect_data that 1.8 can easily use, I'd prefer to use them sparingly for any feature that needs compatibility with 1.8 or 2.0. The >2GB object size patch adds more space to obd_connect_data in a compatible way, but those fields will only be usable by patched clients, so I'd rather avoid that complexity if not needed.
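A small sketch of the log2 encoding suggested above (only the idea of storing the exponent in a __u8 comes from this comment; the helper names are made up):

    /*
     * Sketch of the proposed log2 encoding: a __u8 exponent covers any
     * power-of-two block/inode size.
     */
    #include <assert.h>

    typedef unsigned char __u8;  /* stand-in for the kernel type */

    static __u8 size_to_bits(unsigned long long size)
    {
            __u8 bits = 0;

            assert(size != 0 && (size & (size - 1)) == 0); /* power of two */
            while (size > 1) {
                    size >>= 1;
                    bits++;
            }
            return bits;
    }

    static unsigned long long bits_to_size(__u8 bits)
    {
            /* only valid for exponents below 64 in this userspace sketch */
            return 1ULL << bits;
    }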
|
| Comment by Andreas Dilger [ 18/Sep/13 ] |
|
I don't think Oleg assigned this to himself on purpose. |
| Comment by Johann Lombardi (Inactive) [ 28/Sep/13 ] |
|
Draft patch attached here: http://review.whamcloud.com/7793 |
| Comment by Sarah Liu [ 21/Oct/14 ] |
|
Here is the result for an unpatched server (ldiskfs) and a patched client, with sanity test_64b and sanityn test_15 enabled: https://testing.hpdd.intel.com/test_sessions/ec4adf00-58eb-11e4-9a9c-5254006e85c2

Result for an unpatched server (zfs) and a patched client: |
| Comment by Sarah Liu [ 23/Oct/14 ] |
|
Here is the test-only patch of a patched server with an unpatched client: |
| Comment by Gerrit Updater [ 27/Jan/15 ] |
|
Johann Lombardi (johann.lombardi@intel.com) uploaded a new patch: http://review.whamcloud.com/13531 |
| Comment by Jinshan Xiong (Inactive) [ 09/Jun/15 ] |
|
I will work on this. |
| Comment by Andreas Dilger [ 30/Jun/15 ] |
|
Jinshan, are you able to refresh this patch? |
| Comment by Gerrit Updater [ 14/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13531/ |
| Comment by Andreas Dilger [ 03/Dec/15 ] |
|
Patch http://review.whamcloud.com/7793 needs to be refreshed and landed. |
| Comment by Andreas Dilger [ 03/Dec/15 ] |
|
The main goal of this patch is to reduce the grant over-provisioning for clients that do not understand large blocks on ZFS. It would be useful to run a manual test, or better, to write a sanity subtest that compares the grant on the client with the grant on the server to ensure they roughly match rather than being inflated by a factor of (128/4). For ZFS OSTs the test should be skipped if this feature is not available on the OSC:

[ "$(facet_fstype ost1)" = "zfs" ] &&
	$LCTL get_param osc.$FSNAME-OST0000*.import |
	grep -q "connect_flags:.*grant_param" ||
	{ skip "grant_param not available" && return; }
|
| Comment by Gerrit Updater [ 20/Feb/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/7793/ |
| Comment by Oleg Drokin [ 23/Feb/16 ] |
|
I filed |
| Comment by Nathaniel Clark [ 23/Feb/16 ] |
|
It looks like everything has landed for this; can this bug be resolved? |
| Comment by Andreas Dilger [ 25/Feb/16 ] |
|
It doesn't appear that there was a test in the last patch to verify that the new grant code is working properly. I haven't looked in detail at whether it is practical to make a test or not, but that should at least be given a few minutes' attention before closing the bug. |
| Comment by Cameron Harr [ 11/Apr/16 ] |
|
In the last couple of weeks we started hitting the symptoms of LU-7510, for which this patch is marked as a fix. We're running a 2.5-5 branch. Of our 80 OSTs, 32 were ~90% full and the rest were closer to 65% full. Deactivating those fuller OSTs appears to have worked around the issue for now, though we think it's starting to happen on a sister file system. |
| Comment by Gerrit Updater [ 01/Aug/16 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/21619 |
| Comment by Nathaniel Clark [ 16/Aug/16 ] |
|
After enabling grant checking, all of the tests that were checked failed. |
| Comment by Gerrit Updater [ 07/Mar/17 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/25853 |
| Comment by Gerrit Updater [ 24/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25853/ |
| Comment by Peter Jones [ 24/May/17 ] |
|
Landed for 2.10 |
| Comment by Gerrit Updater [ 03/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21619/ |