[LU-16637] ll_truncate_inode_pages_final() ASSERTION( nrpages == 0 ) failed: ... nrpages=1 Created: 13/Mar/23  Updated: 26/Jan/24  Resolved: 23/Jan/24

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Zhenyu Xu Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Attachments: Zip Archive 8f66968d.diff.zip     Zip Archive e08812ca.diff.zip    
Issue Links:
Related
is related to LU-16958 migrate vs regular ops deadlock Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

truncate_inode_pages() is required to be called under (and serialised by) inode lock, or else new page could be added in the inode's page tree even ll_truncate_inode_pages_final() leverages mapping->i_pages protection to prevent a page is still in the process of deletion (inside __delete_from_page_cache())

649346.462967] LustreError: 482172:0:(llite_lib.c:2235:ll_truncate_inode_pages_final()) ASSERTION( nrpages == 0 ) failed: exa5: inode=[0x5c004247c:0x9b:0x0](000000000c1d2c09) nrpages=1, see https://jira.whamcloud.com/browse/LU-118
[649346.483202] LustreError: 482172:0:(llite_lib.c:2235:ll_truncate_inode_pages_final()) LBUG
[649346.491472] Pid: 482172, comm: nsmb6.09.23.mpi 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022
[649346.501648] Call Trace TBD:
[649346.504644] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[649346.509880] [<0>] lbug_with_loc+0x43/0x80 [libcfs]
[649346.514891] [<0>] ll_truncate_inode_pages_final+0xa7/0xe0 [lustre]
[649346.521185] [<0>] vvp_prune+0x178/0x1d0 [lustre]
[649346.525932] [<0>] cl_object_prune+0x51/0x110 [obdclass]
[649346.531263] [<0>] lov_conf_set+0x5ad/0xa90 [lov]
[649346.536007] [<0>] cl_conf_set+0x59/0x110 [obdclass]
[649346.540995] [<0>] ll_layout_conf+0x13e/0x3d0 [lustre]
[649346.546160] [<0>] ll_layout_refresh+0x581/0x980 [lustre]
[649346.551588] [<0>] vvp_io_init+0x225/0x370 [lustre]
[649346.556502] [<0>] cl_io_init0.isra.15+0x84/0x130 [obdclass]
[649346.562186] [<0>] ll_fault_io_init+0x1cf/0x430 [lustre]
[649346.567527] [<0>] ll_fault+0x222/0x900 [lustre]
[649346.572158] [<0>] __do_fault+0x38/0xb0
[649346.576000] [<0>] do_fault+0x1a0/0x3c0
[649346.579851] [<0>] __handle_mm_fault+0x4d5/0x820
[649346.584473] [<0>] handle_mm_fault+0xbe/0x1e0
[649346.588836] [<0>] __do_page_fault+0x1ed/0x4c0
[649346.593289] [<0>] do_page_fault+0x37/0x130
[649346.597478] [<0>] page_fault+0x1e/0x30
[649346.601326] Kernel panic - not syncing: LBUG


 Comments   
Comment by Gerrit Updater [ 14/Mar/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50284
Subject: LU-16637 llite: call truncate_inode_pages() under inode lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8d7eba935a6117e3afa8dd951358172fb14f08b7

Comment by Gerrit Updater [ 21/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50284/
Subject: LU-16637 llite: call truncate_inode_pages() under inode lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ef9be34478036db0544753e33030fff7e32bfe44

Comment by Zhenyu Xu [ 22/Mar/23 ]

Noted that in a later kernel (v5.14) commit, and invalidate_lock was added to protect adding page to page cache, we might need to adapt that in new EL version.

commit 730633f0b7f951726e87f912a6323641f674ae34
Author:     Jan Kara <jack@suse.cz>
AuthorDate: Thu Jan 28 19:19:45 2021 +0100
Commit:     Jan Kara <jack@suse.cz>
CommitDate: Tue Jul 13 13:14:27 2021 +0200

    mm: Protect operations adding pages to page cache with invalidate_lock
    
    Currently, serializing operations such as page fault, read, or readahead
    against hole punching is rather difficult. The basic race scheme is
    like:
    
    fallocate(FALLOC_FL_PUNCH_HOLE)                 read / fault / ..
      truncate_inode_pages_range()
                                                      <create pages in page
                                                       cache here>
      <update fs block mapping and free blocks>
    
    Now the problem is in this way read / page fault / readahead can
    instantiate pages in page cache with potentially stale data (if blocks
    get quickly reused). Avoiding this race is not simple - page locks do
    not work because we want to make sure there are *no* pages in given
    range. inode->i_rwsem does not work because page fault happens under
    mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
    the performance for mixed read-write workloads suffer.
    
    So create a new rw_semaphore in the address_space - invalidate_lock -
    that protects adding of pages to page cache for page faults / reads /
    readahead.
    
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jan Kara <jack@suse.cz>

Comment by Peter Jones [ 24/Mar/23 ]

Bobijam

So does that mean that RHEL9.x clients would require a different fix?

Peter

Comment by Zhenyu Xu [ 28/Mar/23 ]

I guess so, truncate_inode_pages need to acquire invalidate_lock to protect from pages cache filling (fault, read, ...) in kernel > v5.14

Comment by Patrick Farrell [ 31/Mar/23 ]

qian_wc , do your patches with the invalidate_lock cover this case as well?

Comment by Qian Yingjin [ 14/Apr/23 ]

Hi,

I originally thought that we need invalidate_lock in the following case:
There are four place that will delete a page from Lustre:
1. extent lock blocking AST callback
2. truncate
3. punch hole
4. layout change will prune the pages

Then I wrote test scripts to verify it, but found that:
1. Extent lock blocking AST callback: we solve it in the ticket LU-16651 llite: hold invalidate_lock when invalidate cache pages;
2. Truncate: For user space truncate() operation, we hole [size, EOF] extent lock for the truncate() operations, thus it is same problem as Case 1.
3. punch hole: Lustre currently does not support it.
4. layout change will prune the pages: we can only change layout when there is no I/O activities (@&lov->lo_active_ios == 0), so this can also not happen.

So I think the patch under LU-16651 is enough.
It does not need to hold invalidate_lock in this case for the current Lustre, I think.

Comment by Gerrit Updater [ 04/May/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50857
Subject: LU-16637 llite: call truncage_inode_pages() in inode lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 94d493f7b99eaa681628f068e8d18979e99fdb26

Comment by Gerrit Updater [ 08/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50857/
Subject: LU-16637 llite: call truncate_inode_pages() in inode lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 51d62f2122fee14fbb3ff8333b5a830e1181e4e5

Comment by Peter Jones [ 09/Jul/23 ]

Landed for 2.16

Comment by Andrew Perepechko [ 01/Oct/23 ]

I'm able to reproduce the assertion failure with the fix in place. 

The reproducer is the following:

swap()
{
        cd /mnt/lustre

        while true; do
                lfs swap_layouts f1 f2
        done
}

writedata()
{
        while true; do
                echo -n a
        done > /mnt/lustre/f1
}

touch /mnt/lustre/f1
touch /mnt/lustre/f2

swap &

writedata &
writedata &
writedata &
writedata &
writedata &
writedata &
writedata &
writedata &

wait

The following change was made to Lustre hash c1ebcb0fa to speed up reproduction:

diff --git a/lustre/llite/llite_lib.c b/lustre/llite/llite_lib.c
index eba1cf4ff51..9542fb98e97 100644
--- a/lustre/llite/llite_lib.c
+++ b/lustre/llite/llite_lib.c
@@ -3011,6 +3011,8 @@ void ll_truncate_inode_pages_final(struct inode *inode)
 
        truncate_inode_pages_final(mapping);
 
+       udelay(10000);
+
        /* Workaround for LU-118: Note nrpages may not be totally updated when
         * truncate_inode_pages() returns, as there can be a page in the process
         * of deletion (inside __delete_from_page_cache()) in the specified
diff --git a/lustre/utils/lfs.c b/lustre/utils/lfs.c
index f3f8101dadc..8fd223a7b1a 100644
--- a/lustre/utils/lfs.c
+++ b/lustre/utils/lfs.c
@@ -10765,8 +10765,8 @@ static int lfs_swap_layouts(int argc, char **argv)
                return CMD_HELP;
 
        return llapi_swap_layouts(argv[1], argv[2], 0, 0,
-                                 SWAP_LAYOUTS_KEEP_MTIME |
-                                 SWAP_LAYOUTS_KEEP_ATIME);
+                                 /* SWAP_LAYOUTS_KEEP_MTIME |
+                                 SWAP_LAYOUTS_KEEP_ATIME*/ 0);
 }
 
 static const char *const ladvise_names[] = LU_LADVISE_NAMES;

After a few minutes Lustre panics with:

[ 2104.190146] LustreError: 24712:0:(llite_lib.c:3035:ll_truncate_inode_pages_final()) ASSERTION( nrpages == 0 ) failed: lustre: inode=[0x200000401:0x1:0x0](000000007d0affed) nrpages=1 state 0x5, lli_flags 0x15, see https://jira.whamcloud.com/browse/LU-118
[ 2104.458673] LustreError: 24712:0:(llite_lib.c:3035:ll_truncate_inode_pages_final()) LBUG
[ 2104.555802] Kernel panic - not syncing: LBUG
[ 2104.606854] CPU: 6 PID: 24712 Comm: bash Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-305.10.2.x6.1.010.21.x86_64 #1
[ 2104.751515] Hardware name: Gigabyte Technology Co., Ltd. P55-US3L/P55-US3L, BIOS FE 03/05/2010
[ 2104.854787] Call Trace:
[ 2104.884121]  dump_stack+0x5c/0x80
[ 2104.923848]  panic+0xe7/0x2a9
[ 2104.959422]  ? entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 2105.021930]  lbug_with_loc+0x85/0x90 [libcfs]
[ 2105.074052]  ll_truncate_inode_pages_final+0xd5/0x170 [lustre]
[ 2105.143927]  vvp_prune+0x1c6/0x2a0 [lustre]
[ 2105.193978]  cl_object_prune+0x72/0x140 [obdclass]
[ 2105.251457]  lov_conf_set+0x1c7/0xde0 [lov]
[ 2105.301772]  ? batch_send_update_req.constprop.8+0x381/0x8a0 [ptlrpc]
[ 2105.378870]  cl_conf_set+0x7a/0x150 [obdclass]
[ 2105.432215]  ll_layout_conf+0x8b/0x420 [lustre]
[ 2105.486467]  ? lock_res_and_lock+0x29/0x50 [ptlrpc]
[ 2105.544846]  ? ll_layout_refresh+0x79b/0xa20 [lustre]
[ 2105.605379]  ll_layout_refresh+0x79b/0xa20 [lustre]
[ 2105.663819]  vvp_io_init+0x22a/0x400 [lustre]
[ 2105.716041]  __cl_io_init.isra.13+0x9e/0x170 [obdclass]
[ 2105.778737]  ll_file_io_generic+0x317/0xdf0 [lustre]
[ 2105.838228]  ll_file_write_iter+0x5d4/0x8d0 [lustre]
[ 2105.897596]  ? __switch_to_asm+0x35/0x70
[ 2105.944499]  ? __switch_to_asm+0x41/0x70
[ 2105.991419]  new_sync_write+0x112/0x160
[ 2106.037378]  vfs_write+0xa5/0x1a0
[ 2106.077099]  ksys_write+0x4f/0xb0
[ 2106.116730]  do_syscall_64+0x5b/0x1a0
[ 2106.160514]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 2106.220935] RIP: 0033:0x7f25853f5ba0
[ 2106.263778] Code: 73 01 c3 48 8b 0d d0 72 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 1d d4 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 7e cc 01 00 48 89 04 24
[ 2106.489251] RSP: 002b:00007ffdb6797928 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2106.580144] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f25853f5ba0
[ 2106.665733] RDX: 0000000000000001 RSI: 00007f2585d19000 RDI: 0000000000000001
[ 2106.751223] RBP: 00007f2585d19000 R08: 00007f2585d06740 R09: 00007f25853532cd
[ 2106.836919] R10: 00007f2585d06740 R11: 0000000000000246 R12: 00007f25856ce400
[ 2106.922614] R13: 0000000000000001 R14: 0000000000000000 R15: 00000000026364c2

As I wrote in the gerrit review, non-zero page count after truncate is likely the result of
a read/write/fault running in parallel with truncate. Most obviously, in a write test tiny
writes add pages to the page cache even though they cannot initialize and run I/O on them.

It seems, one of the possible solutions is to follow the approach from 9c453ba6d9. Make sure truncates are wrapped with trunc_sem in exclusive mode and reads, writes and faults including their fast paths are wrapped with trunc sem in shared mode.

Comment by Zhenyu Xu [ 04/Dec/23 ]

Hi Andrew,

Is it possibly better to loose the assertion condition to (LASSERTF(nrpages == 0 || nrpages == 1, ...)? I don't know whether this window would allow more than 1 page be inserted in the page cache, but just observed that all these cases shows only 1 page is in the page cache after tripping the assertion failure.

Comment by Gerrit Updater [ 04/Dec/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53316
Subject: LU-16637 llite: page in cache after truncate_inode_pages()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e67690e9da79362787c7bf959823fcfe0604dcd8

Comment by Andrew Perepechko [ 26/Dec/23 ]

@bobijam,

 

I think there can be any number of pages, since you can have any number of parallel read requests, each of them can add a new cache page.

Comment by Gerrit Updater [ 26/Dec/23 ]

"Andrew Perepechko <andrew.perepechko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53554
Subject: LU-16637 llite: tolerate fresh page cache pages after truncate
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e9aade50eee736299d08405c1c9e3fee5b5fa8f7

Comment by Gerrit Updater [ 23/Jan/24 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53554/
Subject: LU-16637 llite: tolerate fresh page cache pages after truncate
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7bb1e211d217d5a82ac2d5e4edad5ae018090761

Comment by Peter Jones [ 23/Jan/24 ]

All patches merged for 2.16

Generated at Sat Feb 10 03:28:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.