Lustre / LU-17943

conf-sanity test_32d: FAIL: set project failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.5
    • Affects Version/s: None
    • Severity: 3

    Description

      This issue was created by maloo for Minh Diep <mdiep@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/0101ce36-e2a5-4868-a945-bceb058a322f

      test_32d failed with the following error:

      Timeout occurred after 483 minutes, last suite running was conf-sanity
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-b2_15/88 - 4.18.0-553.el8_10.x86_64
      servers: https://build.whamcloud.com/job/lustre-b2_15/88 - 4.18.0-553.el8_lustre.x86_64


      onyx-24vm12: Pool t32fs.interop created
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/init.d': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc0.d': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc1.d': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc2.d': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc3.d': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc4.d': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc5.d': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc6.d': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc.local': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/rc.sysinit': Value too large for defined data type
      lfs: failed to set xattr for '/tmp/t32/mnt/lustre/t32_qf_old': Value too large for defined data type
       conf-sanity test_32d: @@@@@@ FAIL: set project failed 
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_32d - Timeout occurred after 483 minutes, last suite running was conf-sanity

    Attachments

    Issue Links

    Activity

            [LU-17943] conf-sanity test_32d: FAIL: set project failed
            pjones Peter Jones added a comment -

            Merged for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55673/
            Subject: LU-17943 osd-ldiskfs: initialize dquot before expanding inode size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3fd57f81fddc604aa94bc7797cc211c7e393b3d0

            gerrit Gerrit Updater added a comment -

            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55673
            Subject: LU-17943 osd-ldiskfs: initialize dquot before expanding inode size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 04f7854d9321bd72bc484e5ba78ed9099536bde2
            pjones Peter Jones added a comment -

            Sounds good - and I think it's ok to just tidy this up for 2.15.6 vs delaying 2.15.5

            dongyang Dongyang Li added a comment - - edited

            The inode size from the 2.4 and the 2.5 image is ok, the issue is the extra_isize:

            Inode size:               512
            Required extra isize:     28
            Desired extra isize:      28
            

            The extra_isize set in the superblock is 28; I think the image was created with mke2fs without project quota support.
            It should be sizeof(struct ext2_inode_large) - EXT2_GOOD_OLD_INODE_SIZE, which is now 32. Using 28 as extra_isize means we just miss out on saving the project_id in the inode, as it sits at the very end of ext2_inode_large.
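            As a minimal sketch of the layout described above (the field list is taken from struct ext4_inode in the Linux kernel sources; this is an illustration, not Lustre code), the extra inode fields and the extra_isize needed to cover each of them can be tallied like this:

```python
# Extra fields of the on-disk ext4/ldiskfs inode, in declaration order,
# with their sizes in bytes (per struct ext4_inode in the Linux kernel).
# Offsets are relative to the end of the classic 128-byte inode.
EXTRA_FIELDS = [
    ("i_extra_isize", 2),
    ("i_checksum_hi", 2),
    ("i_ctime_extra", 4),
    ("i_mtime_extra", 4),
    ("i_atime_extra", 4),
    ("i_crtime", 4),
    ("i_crtime_extra", 4),
    ("i_version_hi", 4),
    ("i_projid", 4),  # the very last field: needs extra_isize >= 32
]

def fields_covered(extra_isize):
    """Return the extra fields that fit entirely within extra_isize bytes."""
    covered, offset = [], 0
    for name, size in EXTRA_FIELDS:
        if offset + size > extra_isize:
            break
        covered.append(name)
        offset += size
    return covered

# extra_isize = 28 covers everything up to and including i_version_hi
# (which ends at offset 28), so i_projid is cut off and the project ID
# cannot be stored; extra_isize = 32 includes it.
assert "i_projid" not in fields_covered(28)
assert "i_projid" in fields_covered(32)
```

            This matches the dumpe2fs output quoted above: with "Required extra isize: 28" the inode has room for everything except the trailing project ID field.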

            The images for 2.7+ are all ok.

            So to fix this I think we need to set the new extra_isize when turning on project quota in tune2fs, and then run e2fsck to expand the i_size for every inode in use. Rather than doing that, I would prefer to just port the LU-10215 patch ("tests: remove disk2_4 disk2_5 images") to b2_15.


            adilger Andreas Dilger added a comment -

            If you can confirm that this test is only having problems with an upgrade from a 2.4 filesystem that doesn't have larger MDT or OST inodes, then I don't think it is a real concern for us. I was only worried that it might also have some impact on newer systems.
            dongyang Dongyang Li added a comment -

            log from mdt:

            [23670.629105] LustreError: 516086:0:(osd_handler.c:3151:osd_quota_transfer()) t32fs-MDT0000: quota transfer failed. Is project enforcement enabled on the ldiskfs filesystem? rc = -75
            [23675.572899] LustreError: 514936:0:(osd_handler.c:3151:osd_quota_transfer()) t32fs-MDT0000: quota transfer failed. Is project enforcement enabled on the ldiskfs filesystem? rc = -75
            [23675.864983] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_32d: @@@@@@ FAIL: set project failed 
            [23676.097944] Lustre: DEBUG MARKER: conf-sanity test_32d: @@@@@@ FAIL: set project failed
            

            75 is EOVERFLOW, looks like we failed to expand isize? checking the details.
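            As a quick cross-check (assuming a Linux/glibc host), errno 75 is indeed EOVERFLOW, and its strerror text is exactly the "Value too large for defined data type" message that lfs printed for each failed xattr:

```python
import errno
import os

# rc = -75 in the osd_handler.c log corresponds to -EOVERFLOW on Linux;
# os.strerror() returns the same text lfs printed for each xattr failure.
assert errno.EOVERFLOW == 75
print(os.strerror(errno.EOVERFLOW))  # "Value too large for defined data type" on glibc
```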

            pjones Peter Jones added a comment -

            I'm flagging the fix version as 2.15.5 since you've indicated that it warrants investigation, but based on the comments I am not sure whether that is justified - there should be no expectation of upgrading from something as old as 2.4; even 2.10 would be a push...


            adilger Andreas Dilger added a comment -

            Hi Dongyang, could you please take a closer look into this? I wonder if something in the new el8 kernel ext4 is causing this to fail. I'm not so much worried about the Lustre 2.4 MDT upgrade, but possibly it could affect newer versions, since this subtest is only run with project_upgrade=yes for this kernel version and then exits, so it may be skipping other tests.

            adilger Andreas Dilger added a comment -

            This looks like potentially a real bug, but it is only affecting upgrades from 2.4 MDT images, so I'm not sure how critical it is?

            There have been only 4 timeouts in the past 6 months, and 3 of them were in the past week on b2_15 testing, so it seems possible that something which landed to b2_15 is causing a regression in this test? The servers are either (once) el8.9 or (twice) el8.10, so there may be some issue with the xattr format or projid values being stored by ldiskfs.

            People

              Assignee: Dongyang Li
              Reporter: Maloo
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: