[LU-16101] sanity test_27J: read should fail Created: 23/Aug/22  Updated: 08/Feb/24  Resolved: 17/Jul/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.2, Lustre 2.15.3, Lustre 2.15.4
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Andreas Dilger
Resolution: Fixed
Votes: 0
Labels: None
Environment: SLES15 SP4 client


Issue Links:
Related
is related to LU-17146 sanity-lfsck test_38: read should fail Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4751c6e1-6efd-4bec-8150-035008e531bb

test_27J failed with the following error:

lov_foreign_magic: 0x0BD70BD0
lov_xattr_size: 89
lov_foreign_size: 73
lov_foreign_type: 1
lov_foreign_flags: 0x0000DA08
lfm_magic:         0x0BD70BD0
lfm_length:          73
lfm_type:          0x00000000 (none)
lfm_flags:          0x0000DA08
lfm_value:     '138822a8-8810-46e8-9f71-e80e38c85596@4921d343-f166-41b7-83de-3ada0c94dfbd'
lfs setstripe: setstripe error for '/mnt/lustre/d27J.sanity/f27J.sanity': stripe already set
lfs setstripe: setstripe error for '/mnt/lustre/d27J.sanity/f27J.sanity2': stripe already set
 sanity test_27J: @@@@@@ FAIL: /mnt/lustre/d27J.sanity/f27J.sanity: read should fail 
  Trace dump:

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_27J - /mnt/lustre/d27J.sanity/f27J.sanity: read should fail



 Comments   
Comment by Minh Diep [ 20/Sep/22 ]

It seems this started failing on master after LU-15959 (SLES15 SP4 kernel support) landed.

Comment by Peter Jones [ 29/Oct/22 ]

neilb, stancheff, simmonsja: is this issue on your radar?

Comment by James A Simmons [ 29/Oct/22 ]

The only change was in the vvp layer, due to an export issue. I doubt it's due to this patch. Instead, it's a regression showing up on this platform.

Comment by Neil Brown [ 06/Dec/22 ]

This test failure is due to upstream Commit 8c8387ee3f55 ("mm: stop filemap_read() from grabbing a superfluous page").

This landed in v5.16, and SUSE has backported it to our SLE-15-SP4 kernels.

In earlier kernels the read will fail because filemap_read() will call the ->readpage method which detects the problem and reports -ENODATA.

In later kernels filemap_read() doesn't bother calling ->readpage because the size of the file is recorded in the inode as zero.  As ->readpage is not called, -ENODATA is not reported - there is no error.

To trigger a read error, we would need to make the file appear to be larger than 0.

If the file size isn't conveniently available, we might have to bypass generic_file_read_iter(), for foreign files at least. Or maybe make the i_size of foreign files MAX_INT.
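
The behaviour change described above can be illustrated with a small Python model. This is not kernel code: `Inode`, `readpage_foreign`, `filemap_read_old`, and `filemap_read_new` are hypothetical names invented for this sketch, and the model only captures the order of the i_size check relative to the ->readpage call.

```python
import errno
from dataclasses import dataclass
from typing import Callable

PAGE_SIZE = 4096  # assumed page size, for the model only


@dataclass
class Inode:
    i_size: int                        # file size as recorded in the inode
    readpage: Callable[[int], bytes]   # stand-in for the ->readpage method


def readpage_foreign(index: int) -> bytes:
    """Stand-in for a foreign-layout file: the page can never be filled."""
    raise OSError(errno.ENODATA, "no data available (foreign layout)")


def filemap_read_old(inode: Inode, pos: int, count: int) -> int:
    """Model of older kernels: the page is fetched before i_size is consulted."""
    page = inode.readpage(pos // PAGE_SIZE)  # raises OSError(ENODATA) here
    avail = max(inode.i_size - pos, 0)
    return min(count, avail, len(page))


def filemap_read_new(inode: Inode, pos: int, count: int) -> int:
    """Model of kernels with commit 8c8387ee3f55: i_size is checked first."""
    if pos >= inode.i_size:
        return 0                             # EOF, ->readpage is never called
    page = inode.readpage(pos // PAGE_SIZE)
    return min(count, inode.i_size - pos, len(page))
```

In this model, a foreign file with `i_size == 0` raises ENODATA under `filemap_read_old` but returns 0 bytes under `filemap_read_new`. Setting `i_size` to a large value makes `filemap_read_new` reach the `readpage` call again, which is the effect the "appear to be larger than 0" suggestion is after.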

Comment by Sarah Liu [ 21/Dec/22 ]

Similar failure in sanity-lfsck on 2.15.2-rc1:

https://testing.whamcloud.com/test_sets/2b0561f0-3a4a-4646-81af-8f3307966170

Comment by Peter Jones [ 11/Feb/23 ]

Can we add this test to the always_except list for SLES15 SP4 while we are working on the proper fix? It seems to be causing quite a bit of disruption...

Comment by Gerrit Updater [ 11/Feb/23 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49970
Subject: LU-16101 tests: add sanity/27J to always_except for SLES15 SP4
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 49ad83985add99bb341785289819c44bc5b37a2c

Comment by Gerrit Updater [ 11/Feb/23 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49971
Subject: LU-16101 tests: add sanity/27J to always_except for SLES15 SP4
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: f82674690637802bdab8dc38a2b14240579557b5

Comment by Andreas Dilger [ 12/Feb/23 ]

> In later kernels filemap_read() doesn't bother calling ->readpage because the size of the file is recorded in the inode as zero. As ->readpage is not called, -ENODATA is not reported - there is no error.
> To trigger a read error, we would need to make the file appear to be larger than 0.

Jian's patch will skip this subtest for SLES15sp4 so that it doesn't always fail, but it doesn't fix the problem. Presumably there is something that needs to be done in llite to update the inode with the actual file size, instead of it being zero?

Comment by Gerrit Updater [ 17/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49970/
Subject: LU-16101 tests: add sanity/27J to always_except
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 63dd644747f4eab20d640b4d87060e56c20bc37f

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49971/
Subject: LU-16101 tests: add sanity/27J to always_except
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 1fc8fdf9070bdaf21050717124f4d511cd863d15

Comment by Jian Yu [ 28/Apr/23 ]

> This test failure is due to upstream Commit 8c8387ee3f55 ("mm: stop filemap_read() from grabbing a superfluous page").

RHEL 9.2 Beta release with kernel 5.14.0-283.el9 also has this commit:

kernel.spec
* Mon Oct 24 2022 Frantisek Hrbata <fhrbata@redhat.com> [5.14.0-179.el9]
<~snip~>
- mm: stop filemap_read() from grabbing a superfluous page (Chris von Recklinghausen) [2120352]

Comment by Andreas Dilger [ 28/Apr/23 ]

I was wondering whether the test could be changed so that, rather than expecting the read to return an error, it checks that the read returns 0 bytes. However, it seems reasonable that the DAOS code (and user applications) should receive an error when reading from a layout that is not available. Otherwise, applications may assume the file is corrupted rather than simply not mapped in correctly.

Consider the behavior for files that are HSM released - we don't want the clients/applications to "successfully" return 0 bytes when reading such a file, but (normally) block until the file is restored, or in the worst case (some kind of client bug or copytool error) return an error because the file is inaccessible.
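
To make the distinction between "read returned an error" and "read successfully returned 0 bytes" concrete, here is a minimal Python sketch of how a checker could classify a single read attempt. The helper name `classify_read` is hypothetical, and this is not how sanity test_27J is implemented; it only shows the three outcomes being discussed.

```python
import errno
import os


def classify_read(path: str, nbytes: int = 1) -> str:
    """Return 'error:<ERRNO>', 'eof', or 'data' for one read attempt."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # Failure at open time is reported the same way as a read failure.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        data = os.read(fd, nbytes)
    except OSError as exc:
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
    # An empty result means the kernel reported end-of-file without an error.
    return "data" if data else "eof"
```

Under the argument above, an inaccessible layout should come back as an `error:` result (for example `error:ENODATA`), and an `eof` result on such a file is the outcome that applications could misread as an empty or corrupted file.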

Comment by Gerrit Updater [ 04/Jul/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51567
Subject: LU-16101 tests: skip sanity/27J for more kernels
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 754adcb735973b71aa51c97828d79287b8624d0f

Comment by Gerrit Updater [ 14/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51567/
Subject: LU-16101 tests: skip sanity/27J for more kernels
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b711af7d243f3773cec3a37f64c0e0aa8bbc363f

Comment by Gerrit Updater [ 17/Jul/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51696
Subject: LU-16101 tests: skip 27J for 5.14 kernels
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b445b885b490c72edde0fff4afeff6befe0fe46a

Comment by Patrick Farrell [ 17/Jul/23 ]

We forgot Jian's comment here that the problematic patch is also in the RHEL 5.14 kernels (and, I think, SLES as well, from earlier notes?), so this is still failing. I've pushed a patch.

No, we didn't. Sorry, sloppy reading on my part: 5.14 is included in the skip range.

Comment by Andreas Dilger [ 17/Jul/23 ]

The current patch already skips this subtest for all kernels in the 5.12.0-6.2.0 range, which runs from when this change was first introduced until Yingjin's fix landed, so I don't think anything else is needed here.
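
As an illustration of the range check described above, here is a minimal Python sketch. The function names are hypothetical, the real test suite is written in shell and uses its own helpers, and treating both ends of the range as inclusive is an assumption rather than something stated in this ticket.

```python
import re

SKIP_FROM = (5, 12, 0)   # lower bound quoted in the comment above
SKIP_UNTIL = (6, 2, 0)   # upper bound quoted in the comment above


def kernel_tuple(release: str) -> tuple:
    """Parse the leading 'X.Y[.Z]' of a release such as '5.14.0-283.el9'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level is treated as 0.
    return tuple(int(g) if g is not None else 0 for g in m.groups())


def should_skip_27j(release: str) -> bool:
    """True if the release lies in the skip range (inclusive ends assumed)."""
    return SKIP_FROM <= kernel_tuple(release) <= SKIP_UNTIL
```

For example, `should_skip_27j("5.14.0-283.el9")` evaluates to `True`, which matches the point that 5.14 kernels are already covered by the skip range.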

Comment by Bruno Faccini (Inactive) [ 26/Sep/23 ]

> Sarah Liu added a comment - 21/Dec/22 7:15 PM - edited
> similar in sanity-lfsck on 2.15.2-rc1
> https://testing.whamcloud.com/test_sets/2b0561f0-3a4a-4646-81af-8f3307966170

Right, sanity-lfsck test_38() needs the same change for the same reason.

I have opened LU-17146 to address this.

Comment by Guillaume Courrier [ 08/Feb/24 ]

This issue was hit in this patch: https://review.whamcloud.com/c/fs/lustre-release/+/49236
The problem is that I'm doing interop testing between a 2.15.4 client and master servers. Since the patch https://review.whamcloud.com/c/fs/lustre-release/+/51567 is not on b2_15, the client still runs this test against servers with a 5.14 kernel.

Generated at Sat Feb 10 03:24:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.