[LU-11868] ZFS ea size limited to 32K Created: 16/Jan/19  Updated: 07/Feb/20  Resolved: 30/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Minor
Reporter: Patrick Farrell (Inactive) Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11910 Improve repbuf/easize/mdsize handling... Open
is related to LU-9846 Overstriping - more than stripe per O... Resolved
is related to LU-12481 conf-sanity test_61: setfattr: /mnt/l... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The ZFS OSD limits the ea size to DXATTR_MAX_ENTRY_SIZE, which defaults to 32K.

This is done when ddp_max_ea_size is set:

param->ddp_max_ea_size = DXATTR_MAX_ENTRY_SIZE;

 

Per Alex Z., this is probably incorrect, since ZFS can use dedicated objects for EAs.

This was discovered and confirmed in testing overstriping (LU-9846):
https://testing.whamcloud.com/test_sets/ea834356-1552-11e9-9ed8-52540065bddc

Specifically, test 27ci:
This is a test for overstriping, which is > 1 stripe per OST. This tries to create 2000 stripes.
It works on ldiskfs (with ea_inode enabled), but on ZFS, we only get 1363 total stripes.

32768 bytes / 24 bytes per stripe = 1365 (rounded down)

So, minus a little for the rest of the layout EA, this matches.
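The arithmetic above can be sketched in C as follows. This is only an illustration: the function name and the `header_bytes` parameter are hypothetical, and the 48-byte header mentioned below is an assumed value for the fixed part of the layout EA, not something stated in this ticket.

```c
#include <assert.h>
#include <stddef.h>

/* Per-stripe cost of the layout EA, taken from the arithmetic above. */
#define BYTES_PER_STRIPE 24

/*
 * Largest stripe count whose layout fits in ea_limit bytes.
 * header_bytes is a hypothetical parameter for the fixed part of the
 * layout EA; pass 0 to reproduce the raw 32768/24 estimate.
 */
static size_t max_stripes_for_ea(size_t ea_limit, size_t header_bytes)
{
	assert(header_bytes <= ea_limit);
	return (ea_limit - header_bytes) / BYTES_PER_STRIPE;
}
```

With `ea_limit = 32768` and no header this gives the 1365 estimate; with an assumed 48-byte header it gives 1363, which would line up with the stripe count seen on ZFS.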

 

So 32K is too small, especially if we increase the stripe limit to 10K, as the overstriping patch series does in a later patch.  The question is what the limit should be.

 

I would suggest as a possible value:
/* Maximum EA size is limited by LNET_MTU for remote objects */
#define OSD_MAX_EA_SIZE 1048364

Which is currently in the ldiskfs OSD, but is clearly not ldiskfs specific.

I'm curious to get feedback here.



 Comments   
Comment by Patrick Farrell (Inactive) [ 16/Jan/19 ]

The problem is, if we do this, I think we'll hit the same OOM issues in autotest as when we try to enable ea_inode on ldiskfs:
https://review.whamcloud.com/#/c/33706/

Comment there explains a bit:

I have a theory on this that I'm hoping to be able to confirm from the dump.

ea_inode has been the config in many (most?) deployed Cray systems for a few years, with no issues with OOM.

I think we're seeing OOM not because of a bug, but because various buffers are allocated to maximum ea size, which with ea_inode is ~ 1 MiB for ldiskfs.

I think we're just running the VMs out of memory because of this. The autotest VMs are tiny, ~1.6 GB.

Not sure what to do about that. I'll look at the buffers and see if it's possible to change how they're allocated - there are a bunch of them that depend indirectly on ea_size.

Comment by Patrick Farrell (Inactive) [ 16/Jan/19 ]

One other possibility would be to limit the OSD_MAX_EA_SIZE to some smaller value.  10,000 stripes requires ~ 240K of EA, so we could probably limit it to a 256 KiB size.  But that's arbitrary, not directly connected to any specific limitation.

Comment by Andreas Dilger [ 16/Jan/19 ]

Can you please verify that tools like getfattr, setfattr, cp, rsync, tar, etc. can work with xattrs larger than 32KB or 64KB? AFAIR, there is a hard limit in the kernel for the xattr size that the VFS will even accept, so allowing files with a larger layout internally may cause a lot of problems later.

Comment by Patrick Farrell (Inactive) [ 17/Jan/19 ]

Good to know - I'll check on that.  I suspect we've got a 64 KiB limit - That is in /usr/include/linux/limits.h:

#define XATTR_SIZE_MAX 65536 /* size of an extended attribute value (64k) */

And is used (inconsistently) in the ACL code as an upper limit.

That raises a challenging question, then, if we wish to raise stripe count much beyond 2K (Even 2K is over 32 KiB, so...).  Interesting - I'll noodle on it.

Comment by Patrick Farrell (Inactive) [ 17/Jan/19 ]

Yes, trying to access an ea_size greater than 64K causes E2BIG from getfattr (specifically, the getxattr syscall).

Here's an example with 2,730 stripes, trusted.lov is 65792 bytes:
getxattr("2730file", "trusted.lov", 0x23e2f70, 65792) = -1 E2BIG (Argument list too long)

2720 stripes is fine:

getxattr("2720file", "trusted.lov", [...], 65536) = 65312

I'm going to assume this limit applies to at least some of the other tools as well, so we'll have to respect it.
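As a rough check of which stripe counts stay under the VFS limit, here is a minimal sketch. It assumes a 32-byte fixed header, inferred from the 65312 bytes returned for 2720 stripes (65312 - 2720 * 24 = 32); other layout formats may use a different header size. The macro and function names are hypothetical, not Lustre identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VFS_XATTR_SIZE_MAX  65536 /* same value as XATTR_SIZE_MAX quoted above */
#define LAYOUT_HEADER_BYTES 32    /* assumed: inferred from 65312 bytes at 2720 stripes */
#define BYTES_PER_STRIPE    24    /* per-stripe cost used earlier in this ticket */

/* Size in bytes of a layout EA holding the given number of stripes. */
static size_t layout_ea_bytes(size_t stripes)
{
	/* Guard against overflow in the multiplication below. */
	assert(stripes <= (SIZE_MAX - LAYOUT_HEADER_BYTES) / BYTES_PER_STRIPE);
	return LAYOUT_HEADER_BYTES + stripes * BYTES_PER_STRIPE;
}

/* True if a layout with this many stripes fits within the 64 KiB VFS limit. */
static bool layout_fits_vfs_limit(size_t stripes)
{
	return layout_ea_bytes(stripes) <= VFS_XATTR_SIZE_MAX;
}
```

Under these assumptions, `layout_fits_vfs_limit(2720)` is true and `layout_fits_vfs_limit(2730)` is false, consistent with the getxattr results above, and `layout_ea_bytes(10000)` comes out at roughly the 240K mentioned earlier.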

Just to confirm: This matters for manual editing and backups of MDTs, right?

Comment by Gerrit Updater [ 17/Jan/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34058
Subject: LU-11868 osd: Set max ea size to XATTR_SIZE_MAX
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b730ac71c6461b34b196a90066a38d94d3baf87d

Comment by Gerrit Updater [ 17/Jan/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34059
Subject: LU-11868 mdc: Improve xattr buffer allocations
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 055702a13537c99d7f09364a350d8027359c694e

Comment by Gerrit Updater [ 30/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34059/
Subject: LU-11868 mdc: Improve xattr buffer allocations
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4f78164f8748cf8013331637ba33388e83fbd627

Comment by Gerrit Updater [ 30/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34058/
Subject: LU-11868 osd: Set max ea size to XATTR_SIZE_MAX
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3ec712bd183a859a7bb09280b8a5a1776ec5e2c2

Comment by Peter Jones [ 30/Apr/19 ]

Landed for 2.13

Generated at Sat Feb 10 02:47:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.