Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.13.0

    Description

      The ZFS OSD limits the ea size to DXATTR_MAX_ENTRY_SIZE, which defaults to 32K.

      This is done when ddp_max_ea_size is set:

      param->ddp_max_ea_size = DXATTR_MAX_ENTRY_SIZE;

      Per Alex Z., this is probably incorrect, since ZFS can use dedicated objects for EAs.

      This was discovered and confirmed in testing overstriping (LU-9846):
      https://testing.whamcloud.com/test_sets/ea834356-1552-11e9-9ed8-52540065bddc

      Specifically, test 27ci:
      This is a test for overstriping, i.e. more than 1 stripe per OST. It tries to create 2000 stripes.
      It works on ldiskfs (with ea_inode enabled), but on ZFS we only get 1363 total stripes.

      32768 bytes / 24 bytes per stripe = 1365 stripes (rounded down)

      So, minus a little for the rest of the layout EA, this matches.
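
The arithmetic above can be reproduced with a small helper. This is only a sketch, not Lustre code: the 24 bytes per stripe comes from the description, while the function name and the header_bytes parameter are hypothetical and stand in for "the rest of the layout EA".

```c
#include <assert.h>
#include <stddef.h>

/* Per-stripe entry size in the layout EA, as quoted in the description. */
#define STRIPE_ENTRY_BYTES 24

/*
 * Hypothetical helper: how many stripe entries fit in an EA of ea_size
 * bytes once header_bytes are reserved for the rest of the layout EA.
 */
static size_t max_stripes_for_ea(size_t ea_size, size_t header_bytes)
{
	/* The header cannot be larger than the EA itself. */
	assert(header_bytes <= ea_size);

	/* Integer division rounds down to whole stripe entries. */
	return (ea_size - header_bytes) / STRIPE_ENTRY_BYTES;
}
```

With ea_size = 32768 and no header this gives the 1365 upper bound quoted above. Reserving a few tens of bytes for the header lowers the result by one or two stripes, which is in line with the 1363 observed.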

      So 32K is too small, especially if we increase the stripe limit to 10K, as a later patch in the overstriping series does.  The question is what the limit should be.

      I would suggest as a possible value:
      /* Maximum EA size is limited by LNET_MTU for remote objects */
      #define OSD_MAX_EA_SIZE 1048364

      Which is currently in the ldiskfs OSD, but is clearly not ldiskfs specific.

      I'm curious to get feedback here.

          Activity

            [LU-11868] ZFS ea size limited to 32K
            pjones Peter Jones added a comment -

            Landed for 2.13

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34058/
            Subject: LU-11868 osd: Set max ea size to XATTR_SIZE_MAX
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3ec712bd183a859a7bb09280b8a5a1776ec5e2c2

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34059/
            Subject: LU-11868 mdc: Improve xattr buffer allocations
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4f78164f8748cf8013331637ba33388e83fbd627

            gerrit Gerrit Updater added a comment -

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34059
            Subject: LU-11868 mdc: Improve xattr buffer allocations
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 055702a13537c99d7f09364a350d8027359c694e

            gerrit Gerrit Updater added a comment -

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34058
            Subject: LU-11868 osd: Set max ea size to XATTR_SIZE_MAX
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b730ac71c6461b34b196a90066a38d94d3baf87d

            pfarrell Patrick Farrell (Inactive) added a comment -

            Yes, trying to access an ea_size greater than 64K causes E2BIG from getfattr (specifically, the getxattr syscall).

            Here's an example with 2,730 stripes, trusted.lov is 65792 bytes:
            getxattr("2730file", "trusted.lov", 0x23e2f70, 65792) = -1 E2BIG (Argument list too long)

            2720 stripes is fine:

            getxattr("2720file", "trusted.lov", [...], 65536) = 65312

            I'm going to assume this limit applies to at least some of the other tools as well, so we'll have to respect it.

            Just to confirm: This matters for manual editing and backups of MDTs, right?

            pfarrell Patrick Farrell (Inactive) added a comment -

            Good to know - I'll check on that.  I suspect we've got a 64 KiB limit - That is in /usr/include/linux/limits.h:

            #define XATTR_SIZE_MAX 65536 /* size of an extended attribute value (64k) */

            And is used (inconsistently) in the ACL code as an upper limit.

            That raises a challenging question, then, if we wish to raise stripe count much beyond 2K (Even 2K is over 32 KiB, so...).  Interesting - I'll noodle on it.

            adilger Andreas Dilger added a comment -

            Can you please verify that tools like getfattr, setfattr, cp, rsync, tar, etc. can work with xattrs larger than 32KB or 64KB? AFAIR, there is a hard limit in the kernel for the xattr size that the VFS will even accept, so allowing files with a larger layout internally may cause a lot of problems later.

            pfarrell Patrick Farrell (Inactive) added a comment -

            One other possibility would be to limit the OSD_MAX_EA_SIZE to some smaller value.  10,000 stripes requires ~ 240K of EA, so we could probably limit it to a 256 KiB size.  But that's arbitrary, not directly connected to any specific limitation.
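
The sizing estimate in this comment can be checked with a short sketch. It assumes the 24 bytes per stripe from the description. The helper names and the header_bytes parameter are hypothetical; a 32-byte header is only inferred from the 2720-stripe example in the thread (65312 = 32 + 2720 * 24).

```c
#include <assert.h>
#include <stddef.h>

#define STRIPE_ENTRY_BYTES 24	/* per-stripe size used in this ticket */

/* Hypothetical helper: bytes of layout EA needed for stripe_count stripes. */
static size_t layout_ea_bytes(size_t stripe_count, size_t header_bytes)
{
	return header_bytes + stripe_count * STRIPE_ENTRY_BYTES;
}

/* Returns 1 if a layout with stripe_count stripes fits in limit bytes. */
static int layout_fits(size_t stripe_count, size_t header_bytes, size_t limit)
{
	assert(limit > 0);
	return layout_ea_bytes(stripe_count, header_bytes) <= limit;
}
```

For 10,000 stripes and a 32-byte header this gives 240032 bytes, which fits in 256 KiB (262144 bytes) but not in the 65536 bytes of XATTR_SIZE_MAX.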

            pfarrell Patrick Farrell (Inactive) added a comment -

            The problem is, if we do this, I think we'll hit the same OOM issues in autotest as when we try to enable ea_inode on ldiskfs:
            https://review.whamcloud.com/#/c/33706/

            Comment there explains a bit:

            I have a theory on this that I'm hoping to be able to confirm from the dump.

            ea_inode has been the config in many (most?) deployed Cray systems for a few years, with no issues with OOM.

            I think we're seeing OOM not because of a bug, but because various buffers are allocated to maximum ea size, which with ea_inode is ~ 1 MiB for ldiskfs.

            I think we're just running the VMs out of memory because of this. The autotest VMs are tiny, ~1.6 GB.

            Not sure what to do about that. I'll look at the buffers and see if it's possible to change how they're allocated - there are a bunch of them that depend indirectly on ea_size.

            People

              pfarrell Patrick Farrell (Inactive)
              pfarrell Patrick Farrell (Inactive)