[LU-16637] ll_truncate_inode_pages_final() ASSERTION( nrpages == 0 ) failed: ... nrpages=1 Created: 13/Mar/23 Updated: 26/Jan/24 Resolved: 23/Jan/24 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Zhenyu Xu | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
truncate_inode_pages() is required to be called under (and serialised by) the inode lock; otherwise a new page can be added to the inode's page tree even though ll_truncate_inode_pages_final() leverages the mapping->i_pages lock to cover a page that is still in the process of deletion (inside __delete_from_page_cache()).

[649346.462967] LustreError: 482172:0:(llite_lib.c:2235:ll_truncate_inode_pages_final()) ASSERTION( nrpages == 0 ) failed: exa5: inode=[0x5c004247c:0x9b:0x0](000000000c1d2c09) nrpages=1, see https://jira.whamcloud.com/browse/LU-118
[649346.483202] LustreError: 482172:0:(llite_lib.c:2235:ll_truncate_inode_pages_final()) LBUG
[649346.491472] Pid: 482172, comm: nsmb6.09.23.mpi 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022
[649346.501648] Call Trace TBD:
[649346.504644] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[649346.509880] [<0>] lbug_with_loc+0x43/0x80 [libcfs]
[649346.514891] [<0>] ll_truncate_inode_pages_final+0xa7/0xe0 [lustre]
[649346.521185] [<0>] vvp_prune+0x178/0x1d0 [lustre]
[649346.525932] [<0>] cl_object_prune+0x51/0x110 [obdclass]
[649346.531263] [<0>] lov_conf_set+0x5ad/0xa90 [lov]
[649346.536007] [<0>] cl_conf_set+0x59/0x110 [obdclass]
[649346.540995] [<0>] ll_layout_conf+0x13e/0x3d0 [lustre]
[649346.546160] [<0>] ll_layout_refresh+0x581/0x980 [lustre]
[649346.551588] [<0>] vvp_io_init+0x225/0x370 [lustre]
[649346.556502] [<0>] cl_io_init0.isra.15+0x84/0x130 [obdclass]
[649346.562186] [<0>] ll_fault_io_init+0x1cf/0x430 [lustre]
[649346.567527] [<0>] ll_fault+0x222/0x900 [lustre]
[649346.572158] [<0>] __do_fault+0x38/0xb0
[649346.576000] [<0>] do_fault+0x1a0/0x3c0
[649346.579851] [<0>] __handle_mm_fault+0x4d5/0x820
[649346.584473] [<0>] handle_mm_fault+0xbe/0x1e0
[649346.588836] [<0>] __do_page_fault+0x1ed/0x4c0
[649346.593289] [<0>] do_page_fault+0x37/0x130
[649346.597478] [<0>] page_fault+0x1e/0x30
[649346.601326] Kernel panic - not syncing: LBUG |
| Comments |
| Comment by Gerrit Updater [ 14/Mar/23 ] |
|
"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50284 |
| Comment by Gerrit Updater [ 21/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50284/ |
| Comment by Zhenyu Xu [ 22/Mar/23 ] |
|
Note that a later kernel (v5.14) commit added an invalidate_lock to protect adding pages to the page cache; we might need to adapt to that in newer EL versions. commit 730633f0b7f951726e87f912a6323641f674ae34
Author: Jan Kara <jack@suse.cz>
AuthorDate: Thu Jan 28 19:19:45 2021 +0100
Commit: Jan Kara <jack@suse.cz>
CommitDate: Tue Jul 13 13:14:27 2021 +0200
mm: Protect operations adding pages to page cache with invalidate_lock
Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is
like:
fallocate(FALLOC_FL_PUNCH_HOLE)            read / fault / ..
  truncate_inode_pages_range()
                                           <create pages in page
                                            cache here>
  <update fs block mapping and free blocks>
Now the problem is in this way read / page fault / readahead can
instantiate pages in page cache with potentially stale data (if blocks
get quickly reused). Avoiding this race is not simple - page locks do
not work because we want to make sure there are *no* pages in given
range. inode->i_rwsem does not work because page fault happens under
mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
the performance for mixed read-write workloads suffer.
So create a new rw_semaphore in the address_space - invalidate_lock -
that protects adding of pages to page cache for page faults / reads /
readahead.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
|
| Comment by Peter Jones [ 24/Mar/23 ] |
|
Bobijam So does that mean that RHEL9.x clients would require a different fix? Peter |
| Comment by Zhenyu Xu [ 28/Mar/23 ] |
|
I guess so; truncate_inode_pages() needs to acquire the invalidate_lock to protect against page cache filling (fault, read, ...) on kernels > v5.14 |
| Comment by Patrick Farrell [ 31/Mar/23 ] |
|
qian_wc , do your patches with the invalidate_lock cover this case as well? |
| Comment by Qian Yingjin [ 14/Apr/23 ] |
|
Hi, I originally thought that we need invalidate_lock in the following case: Then I wrote test scripts to verify it, but found that: So I think the patch under |
| Comment by Gerrit Updater [ 04/May/23 ] |
|
"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50857 |
| Comment by Gerrit Updater [ 08/Jul/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50857/ |
| Comment by Peter Jones [ 09/Jul/23 ] |
|
Landed for 2.16 |
| Comment by Andrew Perepechko [ 01/Oct/23 ] |
|
I'm able to reproduce the assertion failure with the fix in place. The reproducer is the following:
swap()
{
	cd /mnt/lustre
	while true; do
		lfs swap_layouts f1 f2
	done
}

writedata()
{
	while true; do
		echo -n a
	done > /mnt/lustre/f1
}
touch /mnt/lustre/f1
touch /mnt/lustre/f2
swap &
writedata &
writedata &
writedata &
writedata &
writedata &
writedata &
writedata &
writedata &
wait
The following change was made to Lustre hash c1ebcb0fa to speed up reproduction:

diff --git a/lustre/llite/llite_lib.c b/lustre/llite/llite_lib.c
index eba1cf4ff51..9542fb98e97 100644
--- a/lustre/llite/llite_lib.c
+++ b/lustre/llite/llite_lib.c
@@ -3011,6 +3011,8 @@ void ll_truncate_inode_pages_final(struct inode *inode)
 	truncate_inode_pages_final(mapping);
 
+	udelay(10000);
+
 	/* Workaround for LU-118: Note nrpages may not be totally updated when
 	 * truncate_inode_pages() returns, as there can be a page in the process
 	 * of deletion (inside __delete_from_page_cache()) in the specified

diff --git a/lustre/utils/lfs.c b/lustre/utils/lfs.c
index f3f8101dadc..8fd223a7b1a 100644
--- a/lustre/utils/lfs.c
+++ b/lustre/utils/lfs.c
@@ -10765,8 +10765,8 @@ static int lfs_swap_layouts(int argc, char **argv)
 		return CMD_HELP;
 
 	return llapi_swap_layouts(argv[1], argv[2], 0, 0,
-				  SWAP_LAYOUTS_KEEP_MTIME |
-				  SWAP_LAYOUTS_KEEP_ATIME);
+				  /* SWAP_LAYOUTS_KEEP_MTIME |
+				     SWAP_LAYOUTS_KEEP_ATIME*/ 0);
 }
 
 static const char *const ladvise_names[] = LU_LADVISE_NAMES;

After a few minutes Lustre panics with:
[ 2104.190146] LustreError: 24712:0:(llite_lib.c:3035:ll_truncate_inode_pages_final()) ASSERTION( nrpages == 0 ) failed: lustre: inode=[0x200000401:0x1:0x0](000000007d0affed) nrpages=1 state 0x5, lli_flags 0x15, see https://jira.whamcloud.com/browse/LU-118
[ 2104.458673] LustreError: 24712:0:(llite_lib.c:3035:ll_truncate_inode_pages_final()) LBUG
[ 2104.555802] Kernel panic - not syncing: LBUG
[ 2104.606854] CPU: 6 PID: 24712 Comm: bash Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.10.2.x6.1.010.21.x86_64 #1
[ 2104.751515] Hardware name: Gigabyte Technology Co., Ltd. P55-US3L/P55-US3L, BIOS FE 03/05/2010
[ 2104.854787] Call Trace:
[ 2104.884121] dump_stack+0x5c/0x80
[ 2104.923848] panic+0xe7/0x2a9
[ 2104.959422] ? entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 2105.021930] lbug_with_loc+0x85/0x90 [libcfs]
[ 2105.074052] ll_truncate_inode_pages_final+0xd5/0x170 [lustre]
[ 2105.143927] vvp_prune+0x1c6/0x2a0 [lustre]
[ 2105.193978] cl_object_prune+0x72/0x140 [obdclass]
[ 2105.251457] lov_conf_set+0x1c7/0xde0 [lov]
[ 2105.301772] ? batch_send_update_req.constprop.8+0x381/0x8a0 [ptlrpc]
[ 2105.378870] cl_conf_set+0x7a/0x150 [obdclass]
[ 2105.432215] ll_layout_conf+0x8b/0x420 [lustre]
[ 2105.486467] ? lock_res_and_lock+0x29/0x50 [ptlrpc]
[ 2105.544846] ? ll_layout_refresh+0x79b/0xa20 [lustre]
[ 2105.605379] ll_layout_refresh+0x79b/0xa20 [lustre]
[ 2105.663819] vvp_io_init+0x22a/0x400 [lustre]
[ 2105.716041] __cl_io_init.isra.13+0x9e/0x170 [obdclass]
[ 2105.778737] ll_file_io_generic+0x317/0xdf0 [lustre]
[ 2105.838228] ll_file_write_iter+0x5d4/0x8d0 [lustre]
[ 2105.897596] ? __switch_to_asm+0x35/0x70
[ 2105.944499] ? __switch_to_asm+0x41/0x70
[ 2105.991419] new_sync_write+0x112/0x160
[ 2106.037378] vfs_write+0xa5/0x1a0
[ 2106.077099] ksys_write+0x4f/0xb0
[ 2106.116730] do_syscall_64+0x5b/0x1a0
[ 2106.160514] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 2106.220935] RIP: 0033:0x7f25853f5ba0
[ 2106.263778] Code: 73 01 c3 48 8b 0d d0 72 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 1d d4 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 7e cc 01 00 48 89 04 24
[ 2106.489251] RSP: 002b:00007ffdb6797928 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2106.580144] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f25853f5ba0
[ 2106.665733] RDX: 0000000000000001 RSI: 00007f2585d19000 RDI: 0000000000000001
[ 2106.751223] RBP: 00007f2585d19000 R08: 00007f2585d06740 R09: 00007f25853532cd
[ 2106.836919] R10: 00007f2585d06740 R11: 0000000000000246 R12: 00007f25856ce400
[ 2106.922614] R13: 0000000000000001 R14: 0000000000000000 R15: 00000000026364c2
As I wrote in the gerrit review, a non-zero page count after truncate is likely the result of a racing read/fault repopulating the page cache. It seems one of the possible solutions is to follow the approach from 9c453ba6d9: make sure truncates are wrapped with trunc_sem in exclusive mode, and reads, writes and faults, including their fast paths, are wrapped with trunc_sem in shared mode. |
| Comment by Zhenyu Xu [ 04/Dec/23 ] |
|
Hi Andrew, would it possibly be better to loosen the assertion condition to LASSERTF(nrpages == 0 || nrpages == 1, ...)? I don't know whether this window would allow more than one page to be inserted into the page cache, but I have observed that all these cases show only one page in the page cache after tripping the assertion failure. |
| Comment by Gerrit Updater [ 04/Dec/23 ] |
|
"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53316 |
| Comment by Andrew Perepechko [ 26/Dec/23 ] |
|
@bobijam,
I think there can be any number of pages, since you can have any number of parallel read requests, each of which can add a new cache page. |
| Comment by Gerrit Updater [ 26/Dec/23 ] |
|
"Andrew Perepechko <andrew.perepechko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53554 |
| Comment by Gerrit Updater [ 23/Jan/24 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53554/ |
| Comment by Peter Jones [ 23/Jan/24 ] |
|
All patches merged for 2.16 |