[LU-9384] conf-sanity test 32b fails with 'list verification failed ' Created: 21/Apr/17  Updated: 16/Jan/18  Resolved: 07/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9349 PFL known issues tracking ticket Resolved
is related to LU-9207 Create new conf-sanity test_32 disk i... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After landing the patch for LU-9207, conf-sanity test 32b is failing in a Lustre DNE configuration. In particular, the failures are for the upgrade from disk2_9-ldiskfs.tar.bz2.
From the test log, we can see

CMD: trevis-35vm7 cat /tmp/t32/list
/usr/lib64/lustre/tests
--- /tmp/t32/list.orig	2017-04-21 07:14:02.501123374 +0000
+++ /tmp/t32/list	2017-04-21 07:14:02.587123374 +0000
@@ -22,8 +22,8 @@
 total 0
 total 0
 total 0
-total 10276
-total 80
+total 10280
+total 96
 144115205272502295 -rw-r--r-- 1 60000 60000 10485760 1490901645 t32_qf_old
 144115205272502293 -rw-r--r-- 1 0 0  1160 1488511399 README
 144115205272502292 -rwxr-xr-x 1 0 0 14167 1456757151 powerman
 conf-sanity test_32b: @@@@@@ FAIL: list verification failed 

So far, this looks like it is only impacting review-dne-part-1 test sessions.

conf-sanity test 32b started failing with the ‘list verification’ error on April 17, 2017. Logs for recent failures areat:
https://testing.hpdd.intel.com/test_sets/0821ecfe-2686-11e7-8920-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e18c9b86-2688-11e7-b742-5254006e85c2
https://testing.hpdd.intel.com/test_sets/8fe37e7c-2696-11e7-8920-5254006e85c2



 Comments   
Comment by Sarah Liu [ 21/Apr/17 ]

It seems a lustre upgrade issue. I tried to mount the same disk2_9-ldisk image on both 2.9.0 system and 2.10 system and found

on 2.9, untar the image and mount as loop

[root@onyx-69 2.9]# mount -t lustre -o loop,writeconf,mgsnode=10.2.2.59 /tmp/2.9/mdt /mnt/mds1
[root@onyx-69 2.9]# mount -t lustre -o loop,writeconf,mgsnode=10.2.2.59 /tmp/2.9/ost /mnt/ost1/
[root@onyx-69 2.9]# mount -t lustre 10.2.2.59:/t32fs /mnt/lustre/
[root@onyx-69 2.9]# lfs df /mnt/lustre/
UUID                   1K-blocks        Used   Available Use% Mounted on
t32fs-MDT0000_UUID        125368        1732      114276   1% /mnt/lustre[MDT:0]
t32fs-OST0000_UUID        162888       19732      129156  13% /mnt/lustre[OST:0]

filesystem summary:       162888       19732      129156  13% /mnt/lustre    <---------------------different line

[root@onyx-69 2.9]# ll /mnt/lustre/init.d/
total 80   <-------------------------------------------------------different line
-rw-r--r-- 1 root root 15131 Sep 12  2016 functions
-rwxr-xr-x 1 root root  4608 Dec  7 16:41 lnet
-rwxr-xr-x 1 root root   868 Dec  7 16:41 lsvcgss
-rwxr-xr-x 1 root root 16783 Dec  7 16:41 lustre
-rwxr-xr-x 1 root root  2989 Sep 12  2016 netconsole
-rwxr-xr-x 1 root root  6643 Sep 12  2016 network
-rwxr-xr-x 1 root root 14167 Feb 29  2016 powerman
-rw-r--r-- 1 root root  1160 Mar  2 19:23 README

On 2.10, did the same thing but found the size is different, marked in arrow

[root@onyx-70 ~]# lfs df /mnt/lustre/
UUID                   1K-blocks        Used   Available Use% Mounted on
t32fs-MDT0000_UUID        125368        1892      114116   2% /mnt/lustre[MDT:0]
t32fs-OST0000_UUID        162888       19768      129120  13% /mnt/lustre[OST:0]

filesystem_summary:       162888       19768      129120  13% /mnt/lustre <-------different line

[root@onyx-70 ~]# ll /mnt/lustre/init.d/
total 96   <-------------different line
-rw-r--r-- 1 root root 15131 Sep 12  2016 functions
-rwxr-xr-x 1 root root  4608 Dec  7 16:41 lnet
-rwxr-xr-x 1 root root   868 Dec  7 16:41 lsvcgss
-rwxr-xr-x 1 root root 16783 Dec  7 16:41 lustre
-rwxr-xr-x 1 root root  2989 Sep 12  2016 netconsole
-rwxr-xr-x 1 root root  6643 Sep 12  2016 network
-rwxr-xr-x 1 root root 14167 Feb 29  2016 powerman
-rw-r--r-- 1 root root  1160 Mar  2 19:23 README
Comment by Andreas Dilger [ 21/Apr/17 ]

The values that are different are the allocated filesystem blocks on the OST (in lfs df output) and the allocated file blocks (in ls -l output). I don't think those differences are reasons to consider this test failed, since there may be any number of reasons for this difference (e.g. allocation of internal log files, slight changes to on-disk format, etc).

It would be interesting to find out why these differences exist to see if there is really a bug. Firstly, check "ls -ls" on the directory to see if the block allocation of any file has changed by 16KB, or if several files have changed by 4KB?

A good candidate may also be that the extra PFL xattr data that is being stored on the OST objects is (incorrectly) causing an external xattr block to be created for each file. That would cause a severe performance drop for ldiskfs filesystems upgraded to 2.10 with 256-byte OST inodes and would need to be fixed. We had similar problems many years ago when OSTs first started storing the "fid" xattr on 128-byte OST inodes, causing every file to get an external 4KB xattr block.

To check this, please do the following steps on the OST after the 2.10 upgrade:

  • lfs getstripe /mnt/lustre/init.d/* to get the OST object IDs of those files
  • run debugfs -c /tmp/2.9/ost/ on the device
  • use stat O/0/d(<objid> % 32)/<objid> to list each OST object to see if it has external xattr blocks (shown with a non-zero value for File ACL:)

The main suspect here is patch https://review.whamcloud.com/24882 LU-8998 pfl: enhance PFID EA for PFL which increases the size of the "fid" xattr on the OST objects, but was supposed to fit into the in-inode xattr space.

Comment by Gerrit Updater [ 22/Apr/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/26787
Subject: LU-9384 test: Add conf-sanity 32b,c to ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1bdccfb92110c5148e75b9add4e42e8e89766214

Comment by Gerrit Updater [ 23/Apr/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26789
Subject: LU-9384 tests: skip 2.9 filesystem images on 2.10
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 74dc91c14513227b75958357030115b95d9a79bf

Comment by Gerrit Updater [ 23/Apr/17 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch https://review.whamcloud.com/26789/
Subject: LU-9384 tests: skip 2.9 filesystem images on 2.10
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bfa524f46b1506919fb6dc5aa68679568a5f04a3

Comment by Andreas Dilger [ 24/Apr/17 ]

Patch landed has only skipped the failing test, ticket should not be closed yet.

Any patches to fix this issue need to remove the change from patch https://review.whamcloud.com/26789 so that the 2.9 images are used.

Comment by Sarah Liu [ 24/Apr/17 ]

Running ls -ls on both 2.9 and 2.10 got following. there are 4 files changed by 4KB:
on 2.9

[root@onyx-69 init.d]# ls -ls
total 80
16 -rw-r--r-- 1 root root 15131 Sep 12  2016 functions
 8 -rwxr-xr-x 1 root root  4608 Dec  7 16:41 lnet
 4 -rwxr-xr-x 1 root root   868 Dec  7 16:41 lsvcgss    <-----------
20 -rwxr-xr-x 1 root root 16783 Dec  7 16:41 lustre
 4 -rwxr-xr-x 1 root root  2989 Sep 12  2016 netconsole
 8 -rwxr-xr-x 1 root root  6643 Sep 12  2016 network <-----------
16 -rwxr-xr-x 1 root root 14167 Feb 29  2016 powerman  <-----------
 4 -rw-r--r-- 1 root root  1160 Mar  2 19:23 README  <--------------

on 2.10

[root@onyx-70 init.d]# ls -ls
total 96
16 -rw-r--r-- 1 root root 15131 Sep 12  2016 functions
 8 -rwxr-xr-x 1 root root  4608 Dec  7 16:41 lnet
 8 -rwxr-xr-x 1 root root   868 Dec  7 16:41 lsvcgss
20 -rwxr-xr-x 1 root root 16783 Dec  7 16:41 lustre
 4 -rwxr-xr-x 1 root root  2989 Sep 12  2016 netconsole
12 -rwxr-xr-x 1 root root  6643 Sep 12  2016 network
20 -rwxr-xr-x 1 root root 14167 Feb 29  2016 powerman
 8 -rw-r--r-- 1 root root  1160 Mar  2 19:23 README

Then on 2.10, for these 4 changed files, the File ACL shows non-zero value. For example for file lsvcgss:

/mnt/lustre/init.d/lsvcgss
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 0
	obdidx		 objid		 objid		 group
	     0	            15	          0xf	             0

debugfs:  stat O/0/d15/15
Inode: 154   Type: regular    Mode:  0666   Flags: 0x80000
Generation: 2932758177    Version: 0x00000001:0000004c
User:     0   Group:     0   Project:     0   Size: 868
File ACL: 8398    Directory ACL: 0
Links: 1   Blockcount: 16
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x58dd5aa9:00000000 -- Thu Mar 30 12:21:13 2017
 atime: 0x58dd5aa9:00000000 -- Thu Mar 30 12:21:13 2017
 mtime: 0x5848ac2e:00000000 -- Wed Dec  7 16:41:18 2016
crtime: 0x58dd5a9d:a2bcc980 -- Thu Mar 30 12:21:01 2017
Size of extra inode fields: 32
Extended attributes stored in inode body: 
  lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 0f 00 00 00 00 00 00 00
 " (24)
  lma: fid=[0x100000000:0xf:0x0] compat=8 incompat=0
EXTENTS:
(0):39936
debugfs:  

unchanged file shows 0, for example file functions

[root@onyx-70 ~]# lfs getstripe /mnt/lustre/init.d/*
/mnt/lustre/init.d/functions
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 0
	obdidx		 objid		 objid		 group
	     0	            12	          0xc	             0

debugfs:  stat O/0/d12/12
Inode: 151   Type: regular    Mode:  0666   Flags: 0x80000
Generation: 2932758174    Version: 0x00000001:00000049
User:     0   Group:     0   Size: 15131
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 32
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x58dd5aa9:00000000 -- Thu Mar 30 12:21:13 2017
 atime: 0x58dd5aa9:00000000 -- Thu Mar 30 12:21:13 2017
 mtime: 0x57d687d9:00000000 -- Mon Sep 12 03:47:53 2016
crtime: 0x58dd5a9d:a2bcc980 -- Thu Mar 30 12:21:01 2017
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 0c 00 00 00 00 00 00 00
 " (24)
  lma: fid=[0x100000000:0xc:0x0] compat=8 incompat=0
  fid = "01 04 00 00 02 00 00 00 0f 00 00 00 00 00 00 00 " (16)
  fid: parent=[0x200000401:0xf:0x0] stripe=0
EXTENTS:
(0-3):38912-38915
debugfs:  

Comment by Andreas Dilger [ 24/Apr/17 ]

Hmm, it looks like the core inode size may have also changed in this release due to project quota. The larger inodes have 32 bytes of extended inode attributes:

Size of extra inode fields: 32
Extended attributes stored in inode body: 
  lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 0f 00 00 00 00 00 00 00
 " (24)
  lma: fid=[0x100000000:0xf:0x0] compat=8 incompat=0

vs. only 28 bytes for the smaller inodes, and the fid xattr still fits in the inode:

Size of extra inode fields: 28
Extended attributes stored in inode body: 
  lma = "08 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 0c 00 00 00 00 00 00 00
 " (24)
  lma: fid=[0x100000000:0xc:0x0] compat=8 incompat=0
  fid = "01 04 00 00 02 00 00 00 0f 00 00 00 00 00 00 00 " (16)
  fid: parent=[0x200000401:0xf:0x0] stripe=0
Comment by Andreas Dilger [ 24/Apr/17 ]
struct ext4_inode {
        :
        :
/*80*/  __le16  i_extra_isize;
        __le16  i_checksum_hi;  /* crc32c(uuid+inum+inode) BE */
/*84*/  __le32  i_ctime_extra;  /* extra Change time      (nsec << 2 | epoch) */
/*88*/  __le32  i_mtime_extra;  /* extra Modification time(nsec << 2 | epoch) */
/*8c*/ __le32  i_atime_extra;  /* extra Access time      (nsec << 2 | epoch) */
/*90*/  __le32  i_crtime;       /* File Creation time */
/*94*/  __le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
/*98*/  __le32  i_version_hi;   /* high 32 bits for 64-bit version */
/*9c*/  __le32  i_projid;       /* Project ID */
}; /* = 0xa0 = 160 bytes */

struct ext4_xattr_ibody_header {
        __le32  h_magic;        /* magic number for identification */
};

struct ext4_xattr_entry {
/*00*/  __u8    e_name_len;     /* length of name */
        __u8    e_name_index;   /* attribute name index */
        __le16  e_value_offs;   /* offset in disk block of value */
/*04*/  __le32  e_value_inum;   /* inode in which the value is stored */
/*08*/  __le32  e_value_size;   /* size of attribute value */
/*0c*/  __le32  e_hash;         /* hash value of name and value */
/*10*/  char    e_name[0];      /* attribute name */
}; /* = 0x14 = 20 bytes including 3-byte xattr name */

struct lustre_mdt_attrs {
        /**     
         * Bitfield for supported data in this structure. From enum lma_compat.
         * lma_self_fid and lma_flags are always available.
         */
/*00*/  __u32   lma_compat;
        /**
         * Per-file incompat feature list. Lustre version should support all
         * flags set in this field. The supported feature mask is available in
         * LMA_INCOMPAT_SUPP.
         */
/*04*/ __u32   lma_incompat;   
        /** FID of this inode */
/*08*/  struct lu_fid  lma_self_fid;
}; /* = 0x18 = 24 bytes */

struct lustre_ost_attrs {       
        /* Use lustre_mdt_attrs directly for now, need a common header
         * structure if want to change lustre_mdt_attrs in future. */
/*00*/  struct lustre_mdt_attrs loa_lma;
        
        /* Below five elements are for OST-object's PFID EA, the
         * lma_parent_fid::f_ver is composed of the stripe_count (high 16 bits)
         * and the stripe_index (low 16 bits), the size should not exceed
         * 5 * sizeof(__u64)) to be accessable by old Lustre. If the flag
         * LMAC_STRIPE_INFO is set, then loa_parent_fid and loa_stripe_size
         * are valid; if the flag LMAC_COMP_INFO is set, then the next three
         * loa_comp_* elements are valid. */
/*18*/  struct lu_fid   loa_parent_fid;
/*28*/  __u32           loa_stripe_size;
/*2c*/  __u32           loa_comp_id;
/*30*/  __u64           loa_comp_start;
/*38*/  __u64           loa_comp_end;
}; /* = 0x40 = 64 bytes */

This means we have 256 - 160 - 4 = 92 bytes for xattrs, 20 bytes for the loa ext4_xattr_entry, and 64 bytes for lustre_ost_attrs, and 4 bytes for the NULL separator between the last ext4_xattr_entry and the first value. That should be enough to fit with the packed "lma+fid in a single xattr" format. If the "fid" xattr is using the larger struct filter_fid that contains ff_layout then it will not fit:

struct filter_fid {
       struct lu_fid           ff_parent;
       struct ost_layout       ff_layout;
} __attribute__((packed));
Comment by nasf (Inactive) [ 25/Apr/17 ]

For 2.10 OST 256-bytes on-disk inode space, the usage is as following:
1) 160 (156 + 4 bytes for project quota) bytes used by inode itself
2) 4 bytes for xatter header
3) 20 bytes for LMA EA entry
4) 4 bytes for NULL between xattr entries and values
5) 24 bytes for LMA EA value
6) 40 (5 * sizeof(__u64)) bytes for PFID in LMA EA value
7) 4 bytes unused.

So the 256 bytes should be enough for the OST inode with LMA + PFID + project quota. Missed anything?

Comment by nasf (Inactive) [ 25/Apr/17 ]

For old OST inode with neither PFL nor project quota, the usage is as following:
1) 156 bytes for inode itself
2) 4 bytes for xatter header
3) 20 bytes for LMA EA entry
4) 20 bytes for PFID EA entry
5) 4 bytes for NULL between xattr entries and values
6) 24 bytes for LMA EA value
7) 16 bytes PFID EA value
8) 12 bytes unused.

Comment by nasf (Inactive) [ 25/Apr/17 ]

Here is my local test with disk2_9-ldiskfs.tar.bz2 mounted against the latest master:

root@RHEL6:~/Work/Lustre/L95/lustre-release/lustre/tests/
# ls -ls /mnt/lustre/init.d/
total 80K
4.0K -rw-r--r-- 1 root root 1.2K Mar  3 11:23 README
 16K -rw-r--r-- 1 root root  15K Sep 12  2016 functions
8.0K -rwxr-xr-x 1 root root 4.5K Dec  8 08:41 lnet*
4.0K -rwxr-xr-x 1 root root  868 Dec  8 08:41 lsvcgss*
 20K -rwxr-xr-x 1 root root  17K Dec  8 08:41 lustre*
4.0K -rwxr-xr-x 1 root root 3.0K Sep 12  2016 netconsole*
8.0K -rwxr-xr-x 1 root root 6.5K Sep 12  2016 network*
 16K -rwxr-xr-x 1 root root  14K Feb 29  2016 powerman*

root@RHEL6:~/Work/Lustre/L95/lustre-release/lustre/tests/
# 
Comment by nasf (Inactive) [ 25/Apr/17 ]

The top is:

commit bfa524f46b1506919fb6dc5aa68679568a5f04a3
Author: Andreas Dilger <andreas.dilger@intel.com>
Date:   Sat Apr 22 22:19:00 2017 -0600

    LU-9384 tests: skip 2.9 filesystem images on 2.10
    
    The 2.9 disk images upgraded to 2.10 are causing common test failures
    due to small differences in space usage.  Disable testing these images
    until a proper fix is available.
    
    Test-Parameters: trivial envdefinitions=ONLY=32 testlist=conf-sanity
    Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
    Change-Id: I64d14ac6b079c50ee234ff886c9a80d9213ebbe5
    Reviewed-on: https://review.whamcloud.com/26789
    Tested-by: Jenkins
    Tested-by: Maloo <hpdd-maloo@intel.com>
Comment by nasf (Inactive) [ 25/Apr/17 ]

So in my local test, the "4.0K -rwxr-xr-x 1 root root 868 Dec 8 08:41 lsvcgss*" is expected result. Have I missed anything?

Comment by Andreas Dilger [ 25/Apr/17 ]

Is this with or without conf-sanity.sh test_32b running? Have you mounted the filesystem and allowed LFSCK to run? Possibly the test_32b test is also accessing or modifying these files? You would need to revert/undo the above patch so that the disk2_9-ldiskfs.tar.bz2 images are used.

Comment by Sarah Liu [ 25/Apr/17 ]

FanYong, I have an environment on onyx-23vm5 with the tip of master #3565, it still shows 96 for init.d, please feel free to use the node

Comment by nasf (Inactive) [ 27/Apr/17 ]

The reason is known now. The image of disk2_9-ldiskfs.tar.bz2 is NOT clean, there are some pending async OST_SETATTR operations on the MDT. So when system mount up, the MDT will send async OST_SETATTR RPCs from the MDT to the OST, one of the target is the "init.d/lsvcgss"'s OST-object. Then it triggers the following calls:

ofd_setattr_hdl() => ofd_attr_set() => osd_attr_set() => ldiskfs_dirty_inode() => ldiskfs_mark_inode_dirty().

Inside the ldiskfs_mark_inode_dirty(), because the on-disk inode for such OST-object is old formatted, means without project quota support, so its size of extra inode fields is 28, but the running kernel is new, support project quota, the default size of extra inode fields is 32. So as following:

int ldiskfs_mark_inode_dirty(handle_t *handle, struct inode *inode)
{
        struct ldiskfs_iloc iloc;
        struct ldiskfs_sb_info *sbi = LDISKFS_SB(inode->i_sb);
        static unsigned int mnt_count;
        int err, ret;

        might_sleep();
        trace_ldiskfs_mark_inode_dirty(inode, _RET_IP_);
        err = ldiskfs_reserve_inode_write(handle, inode, &iloc);
        if (ldiskfs_handle_valid(handle) &&
            LDISKFS_I(inode)->i_extra_isize < sbi->s_want_extra_isize &&
            !ldiskfs_test_inode_state(inode, LDISKFS_STATE_NO_EXPAND)) {
                /*
                 * We need extra buffer credits since we may write into EA block
                 * with this same handle. If journal_extend fails, then it will
                 * only result in a minor loss of functionality for that inode.
                 * If this is felt to be critical, then e2fsck should be run to
                 * force a large enough s_min_extra_isize.
                 */
                if ((jbd2_journal_extend(handle,
                             LDISKFS_DATA_TRANS_BLOCKS(inode->i_sb))) == 0) {
                        ret = ldiskfs_expand_extra_isize(inode,
                                                      sbi->s_want_extra_isize,
                                                      iloc, handle);

Means the async OST_SETATTR RPC trigger ldiskfs_expand_extra_isize(). Such function will enlarge the on-disk inode, and move the EA afterward. On the other hand, the 2.9 image does not support PFL, means the PFID EA and LMA are separated, then such EA moving caused the PFID EA cannot be held inside inode body, and has to be moved to separated block. That is why you saw "File ACL:" is assigned.

Comment by nasf (Inactive) [ 27/Apr/17 ]

One possible solution is to remake the MDT image in disk2_9-ldiskfs.tar.bz2 to cleanup the pending operations (technical debt of LU-9207). That will make conf-sanity test_32b pass Maloo tests. But it seems not a perfect solution. Because for the real upgrade case from Lustre-2.9 to Lustre-2.10, anytime when we modify the existing OST-object, it will cause the ldiskfs to extend inode and allocate new block for PFID EA. That will affect both the disk space and the performance. Maybe we should enhance OI scrub on the OST to merge PFID EA and LMA EA.

Andreas, how do you think?

Comment by Eric Bergland (Inactive) [ 05/May/17 ]

In talking with Andreas issues exposes a significant performance issue for anyone upgrading their file system to PFL, issue needs to be addressed

Comment by Andreas Dilger [ 05/May/17 ]

Fan Yong, I agree that this is a pretty serious problem if all of the existing inodes will bump the xattrs to an external block. We need to be 100% sure whether this is caused by the call to ldiskfs_expand_extra_isize() or later osd-ldiskfs changes to store the extra stripe/component data on the OST object in order to make the correct fix.

I think there are a few different possible solutions:

  • fix ldiskfs to expand the inode properly rather than push the xattrs out of the inode if that is not needed. According to your earlier comments, there should be enough space in the inode even with separate lma/fid xattrs when the project quota field is added. This is something that could even be pushed upstream since this may affect other users.
  • fix ldiskfs so that it doesn't expand the inode if project quota is not enabled if it would cause an xattr to be pushed out of the inode. This could likely also go upstream. I don't expect project quota will be used immediately by many users.
  • change osd-ldiskfs to avoid storing the extra loa component fields if they are zero (i.e. PFL is not in use). This would mean LFSCK doesn't work fully for composite files, but this shouldn't affect existing files anyway since they are not composite files (at least not until FLR adds components to existing plain files)
  • fix LFSCK to merge the lma/fid xattrs on any files on the OST with en external lma/fid if it would fit into the inode space. That won't really be needed if we can avoid it happening in the first place with the previous two fixes, so I'd rather avoid this if it isn't needed.
Comment by nasf (Inactive) [ 08/May/17 ]

It is verified that the ldiskfs_expand_extra_isize_ea() logic is wrong. The original expectation is that such function will move the inode inline EA 4-bytes (32 - 28) afterward. But it handles the input parameter @new_extra_isize as the isize diff directly. Means it tries to move 32 bytes afterward. There is not inode inline enough, so it allocates another EA block, as to the block used by the inode is increased. That is what we observed as described above. I will make ldiskfs patch to fix it.

Comment by Gerrit Updater [ 08/May/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26989
Subject: LU-9384 ldiskfs: handle ldiskfs_expand_extra_isize_ea properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 74fe1ec2f9736bc3f526df83df4f27bb525dbdf7

Comment by Bob Glossman (Inactive) [ 08/May/17 ]

I only see fixes here for el7. what about el6, sles11, sles12?

Comment by nasf (Inactive) [ 08/May/17 ]

project quota is only supported on el7.
Project quota caused the isize changes.

Comment by Andreas Dilger [ 09/May/17 ]

I found that this problem has been fixed in the upstream kernel with the following patches:

commit d0141191a20289f8955c1e03dad08e42e6f71ca9 "ext4: fix xattr shifting when expanding inodes"
commit 418c12d08dc64a45107c467ec1ba29b5e69b0715 "ext4: fix xattr shifting when expanding inodes part 2"
commit 443a8c41cd49de66a3fda45b32b9860ea0292b84 "ext4: properly align shifted xattrs when expanding inodes"
commit 2e81a4eeedcaa66e35f58b81e0755b87057ce392 "ext4: avoid deadlock when expanding inode size"
commit e3014d14a81edde488d9a6758eea8afc41752d2d "ext4: fixup free space calculations when expanding inodes"
commit 94405713889d4a9d341b4ad92956e4e2ec8ec2c2 "ext4: replace bogus assertion in ext4_xattr_shift_entries()"

It looks like there are a number of problems with this code that need to be fixed. They were landed in kernels 4.8 and 4.9.

Comment by Andreas Dilger [ 09/May/17 ]

Yang Sheng, could you please port the referenced patches to RHEL7.

Comment by Gerrit Updater [ 10/May/17 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/27045
Subject: LU-9384 ldiskfs: port upstream patches for project quota
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9b176c0082380ff0f209bfe95aeef61ce76a4789

Comment by Yang Sheng [ 10/May/17 ]

Patch commit 2e81a4eeedcaa66e35f58b81e0755b87057ce392 "ext4: avoid deadlock when expanding inode size"
Already has included since LU-9146. So we just need porting 5 patches.

Thanks,
YangSheng

Comment by Gerrit Updater [ 20/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27045/
Subject: LU-9384 ldiskfs: port upstream patches for changing extra isize
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 10283bfb35cff9980fc7a3f83f39007a22b21d0c

Comment by Peter Jones [ 20/May/17 ]

Landed for 2.10

Comment by Gerrit Updater [ 23/May/17 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/27244
Subject: LU-9384 ldiskfs: port extra isize patches to sles12sp2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fbbfe2cb89578b6b39f102cabc24387f050c6462

Comment by James Nunez (Inactive) [ 24/May/17 ]

Reopening since there is another patch outstanding for SLES12SP2. We also need to add back the 2.9 disk image upgrade to 2.10 that was removed in https://review.whamcloud.com/26789 .

Comment by Gerrit Updater [ 25/May/17 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/27285
Subject: LU-9384 ldiskfs: extra patch for changing extra isize
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 808356d5e5f4c141dac81672ed0d201bac8b0d20

Comment by Gerrit Updater [ 25/May/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/27295
Subject: LU-9384 tests: restore 2.9 filesystem images on 2.10
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d9346e24627933aefb34feb3a4e681eff033ef6d

Comment by Gerrit Updater [ 03/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27244/
Subject: LU-9384 ldiskfs: port extra isize patches to sles12sp2
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 01d7ddd0edc1517b05136cb36314d7a39dbfeff3

Comment by Gerrit Updater [ 03/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27285/
Subject: LU-9384 ldiskfs: extra patch for changing extra isize
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 06159785986aef19a06a69d02b7f203fc98b377a

Comment by Gerrit Updater [ 07/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27295/
Subject: LU-9384 tests: restore 2.9 filesystem images on 2.10
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 50d6943184322b56d7f2243930b035d2a7c37e1e

Comment by Peter Jones [ 07/Jun/17 ]

Landed for 2.10

Comment by Chris Hunter (Inactive) [ 23/Jun/17 ]

To confirm my understanding
Let say I have an OST formatted with el6 kernel (2.6.32) using lustre 2.5 that has 256b inodes where dumpe2fs report extra isize is 28
eg)

##### Command: dumpe2fs -h /dev/mapper/dtemp_ost0077 ##########################################
Filesystem volume name:   dtemp-OST004d
..
Filesystem created:       Tue Oct 20 21:02:29 2015
...
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
...

If I mount the OST on host with kernel (eg. 3.10.0) that is patched with project quotas, it will cause additional xattr block to be allocated (ie. call _expand_extra_isize()) for OST inodes since the newer patched kernel expects extra isize=32b. The additional xattr block per inode may cause a performance problem. Additionally there are bugs in the upstream code that migrates inodes xattrs to the new block (eg. LU-9146) so we need backport some upstream kernel patches.
Alternative is the ldiskfs patches here that will accommodate new xattrs for PFL & project quotas (within the 256b inode size) without requiring allocation of a new xattr block.

Comment by Andreas Dilger [ 24/Jun/17 ]

There shouldn't really be a need to have an external xattr block, even with project quota enabled. There is just a bug in the xattr relocation code that pushes the xattr out of the j ode when it isn't really needed. We've backported patches to fix this.

Generated at Sat Feb 10 02:25:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.