[LU-16579] llite: Fix the wrong ending offset calculation Created: 20/Feb/23  Updated: 25/Apr/23  Resolved: 21/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Bug Priority: Major
Reporter: Qian Yingjin Assignee: Qian Yingjin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16412 check truncated page in ->readpage() Resolved
is related to LU-16338 read-ahead more than file size for a ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

For a single-stripe file, applying https://review.whamcloud.com/c/fs/lustre-release/+/49226
Subject: LU-16338 readahead: clip readahead with kms
causes a dead loop:
https://testing.whamcloud.com/test_logs/8d73c4e7-0e6e-482c-b93c-bf7159706890/show_text

00000080:00200000:1.0:1676614736.277040:0:21215:0:(rw.c:1981:ll_readpage()) pgno:260, cnt:1032192, pos:0
00000080:00000001:1.0:1676614736.277041:0:21215:0:(vvp_io.c:1654:vvp_io_read_ahead()) Process entered
00000080:00000001:1.0:1676614736.277041:0:21215:0:(vvp_io.c:1666:vvp_io_read_ahead()) Process leaving (rc=0 : 0 : 0)
00000008:00000001:1.0:1676614736.277042:0:21215:0:(osc_io.c:83:osc_io_read_ahead()) Process entered
00000008:00000001:1.0:1676614736.277042:0:21215:0:(osc_lock.c:1281:osc_obj_dlmlock_at_pgoff()) Process entered
00000008:00000001:1.0:1676614736.277043:0:21215:0:(osc_request.c:3137:osc_match_base()) Process entered
00000008:00000001:1.0:1676614736.277043:0:21215:0:(osc_request.c:3172:osc_match_base()) Process leaving (rc=4 : 4 : 4)
00000008:00000001:1.0:1676614736.277044:0:21215:0:(osc_lock.c:1315:osc_obj_dlmlock_at_pgoff()) Process leaving (rc=18446620395345229440 : -123678364322176 : ffff8f83e3073680)
00000008:00000001:1.0:1676614736.277045:0:21215:0:(osc_io.c:112:osc_io_read_ahead()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:1.0:1676614736.277045:0:21215:0:(rw.c:2012:ll_readpage()) Process leaving (rc=524289 : 524289 : 80001)
00000080:00000001:1.0:1676614736.277046:0:21215:0:(rw.c:1873:ll_readpage()) Process entered
00000080:00200000:1.0:1676614736.277046:0:21215:0:(rw.c:1981:ll_readpage()) pgno:260, cnt:1032192, pos:0
00000080:00000001:1.0:1676614736.277047:0:21215:0:(vvp_io.c:1654:vvp_io_read_ahead()) Process entered
00000080:00000001:1.0:1676614736.277047:0:21215:0:(vvp_io.c:1666:vvp_io_read_ahead()) Process leaving (rc=0 : 0 : 0)
00000008:00000001:1.0:1676614736.277048:0:21215:0:(osc_io.c:83:osc_io_read_ahead()) Process entered
00000008:00000001:1.0:1676614736.277048:0:21215:0:(osc_lock.c:1281:osc_obj_dlmlock_at_pgoff()) Process entered
00000008:00000001:1.0:1676614736.277049:0:21215:0:(osc_request.c:3137:osc_match_base()) Process entered
00000008:00000001:1.0:1676614736.277049:0:21215:0:(osc_request.c:3172:osc_match_base()) Process leaving (rc=4 : 4 : 4)
00000008:00000001:1.0:1676614736.277050:0:21215:0:(osc_lock.c:1315:osc_obj_dlmlock_at_pgoff()) Process leaving (rc=18446620395345229440 : -123678364322176 : ffff8f83e3073680)
00000008:00000001:1.0:1676614736.277051:0:21215:0:(osc_io.c:112:osc_io_read_ahead()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:1.0:1676614736.277051:0:21215:0:(rw.c:2012:ll_readpage()) Process leaving (rc=524289 : 524289 : 80001)

The code that causes the dead loop:

if (cl_offset(clob, vmpage->index) >= iter->count + iocb->ki_pos) {
        result = cl_io_read_ahead(env, io, vmpage->index, &ra);
        if (result < 0 || vmpage->index > ra.cra_end_idx) {
                cl_read_ahead_release(env, &ra);
                unlock_page(vmpage);
                RETURN(AOP_TRUNCATED_PAGE); /* AOP_TRUNCATED_PAGE = 0x80001 */
        }
}

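After investigating, I found that the calculation of the ending offset of the read is wrong: each time the read of a page finishes, @iter->count is advanced (iter->count - read_bytes), so the end offset computed as iter->count + iocb->ki_pos moves as the read progresses instead of staying at the end of the original request.
The wrong ending offset results in the dead loop described above.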
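
The arithmetic behind the log lines can be sketched in a few lines of standalone C. This is only an illustration, not the landed patch: struct read_state, page_offset(), beyond_end_live(), beyond_end_fixed() and consume() are hypothetical names, and the 4 KiB page size is an assumption. Under that assumption, the logged values (pgno:260, cnt:1032192, pos:0) are consistent with a 2 MiB read from offset 0 after 260 pages have been consumed, although the ticket does not state the original request size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the ticket does not state the page size. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Hypothetical stand-in for the two fields used by the check. */
struct read_state {
	uint64_t pos;   /* plays the role of iocb->ki_pos */
	uint64_t count; /* plays the role of iter->count (bytes remaining) */
};

/* Byte offset of a page, standing in for cl_offset(clob, index). */
static uint64_t page_offset(uint64_t index)
{
	return index * SKETCH_PAGE_SIZE;
}

/* Check as quoted in the ticket: end recomputed from the live count. */
static bool beyond_end_live(const struct read_state *rs, uint64_t index)
{
	return page_offset(index) >= rs->count + rs->pos;
}

/* One possible alternative: compare against an end captured at the start. */
static bool beyond_end_fixed(uint64_t read_end, uint64_t index)
{
	return page_offset(index) >= read_end;
}

/* Model a finished page read as the ticket describes: only count shrinks. */
static void consume(struct read_state *rs, uint64_t bytes)
{
	assert(bytes <= rs->count);
	rs->count -= bytes;
}
```

With pos = 0 and count = 2097152, consuming 260 pages leaves count = 1032192. Page 260 starts at byte 1064960, so the live check reports it as beyond the end of the read even though the original request extends to 2097152, while a check against the end captured at the start does not.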



 Comments   
Comment by Andreas Dilger [ 20/Feb/23 ]

Yingjin, since the LU-16338 patch has not landed yet, why not just fix this as part of that patch? Is this issue caused by this patch, or just an existing bug exposed by that change?

Comment by Gerrit Updater [ 20/Feb/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50065
Subject: LU-16579 llite: fix the wrong beyond read end calculation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c4afbee4c795d5305827f3ce533f3d04474a18eb

Comment by Qian Yingjin [ 20/Feb/23 ]

I think it is an existing bug exposed by that change.

Comment by Qian Yingjin [ 10/Mar/23 ]

Hi Patrick,

None of them has been applied to ES5.2.

Comment by Gerrit Updater [ 13/Mar/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50278
Subject: LU-16579 llite: fix the wrong beyond read end calculation
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 88e538a9c6f7444fde284960d56ed62d4f17cb3a

Comment by Gerrit Updater [ 21/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50065/
Subject: LU-16579 llite: fix the wrong beyond read end calculation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ae356dc325877bd130ad94acc5f3610898de8a8a

Comment by Peter Jones [ 21/Mar/23 ]

Landed for 2.16

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50278/
Subject: LU-16579 llite: fix the wrong beyond read end calculation
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 9c8a80bca738884e09affd66837b9e94508664d1

Generated at Sat Feb 10 03:28:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.