[LU-5263] ll_read_ahead_pages() ASSERTION( page_idx > ria->ria_stoff ) Created: 27/Jun/14  Updated: 18/Aug/14  Resolved: 13/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: Lustre 2.7.0, Lustre 2.5.3

Type: Bug
Priority: Minor
Reporter: Bruno Travouillon (Inactive)
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: None
Environment:

RHEL 6 w/ kernel 2.6.32_220.23.1


Attachments: File 140505-BS0250_NFS_Ganesha.tar.gz    
Issue Links:
Duplicate
duplicates LU-4192 NFS server crash while fsx runs on cl... Resolved
Severity: 3
Rank (Obsolete): 14689

 Description   

One Lustre client frequently crashes with an LBUG on ASSERTION( page_idx > ria->ria_stoff ) (13 crashes in the past 3 months).

This Lustre client acts as an NFS server and exports Lustre to a web server through nfs-ganesha.

----8< ----
[24937.600920] Lustre: DEBUG MARKER: Thu Jun 5 20:00:01 2014
[24937.600921]
[24950.667750] LustreError: 4126:0:(rw.c:698:ll_read_ahead_pages()) ASSERTION( page_idx > ria->ria_stoff ) failed: Invalid page_idx 234497rs 234497 re 300287 ro 234751 rl 256 rp 1
[24950.683642] LustreError: 4126:0:(rw.c:698:ll_read_ahead_pages()) LBUG
[24950.690154] Pid: 4126, comm: ganesha.nfsd
[24950.695572]
[24950.695573] Call Trace:
[24950.702337] [<ffffffffa05697f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[24950.710703] [<ffffffffa0569e07>] lbug_with_loc+0x47/0xb0 [libcfs]
[24950.718317] [<ffffffffa0bb511f>] ll_readahead+0x10cf/0x1100 [lustre]
[24950.726131] [<ffffffffa0bdc805>] vvp_io_read_page+0x305/0x360 [lustre]
[24950.734159] [<ffffffffa068eb4d>] cl_io_read_page+0x8d/0x170 [obdclass]
[24950.742108] [<ffffffffa0682c19>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
[24950.750166] [<ffffffffa0bb5746>] ll_readpage+0x96/0x200 [lustre]
[24950.757661] [<ffffffff810ff88c>] generic_file_aio_read+0x1fc/0x700
[24950.765360] [<ffffffff810816ff>] ? up+0x2f/0x50
[24950.771433] [<ffffffffa0bdce1b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
[24950.779575] [<ffffffffa068cb4a>] cl_io_start+0x6a/0x140 [obdclass]
[24950.787244] [<ffffffffa0690e2c>] cl_io_loop+0xcc/0x190 [obdclass]
[24950.794848] [<ffffffffa0b8d097>] ll_file_io_generic+0x3a7/0x560 [lustre]
[24950.803059] [<ffffffffa0b8d389>] ll_file_aio_read+0x139/0x2c0 [lustre]
[24950.811092] [<ffffffffa0b8d849>] ll_file_read+0x169/0x2a0 [lustre]
[24950.818784] [<ffffffff81164525>] vfs_read+0xb5/0x1a0
[24950.825275] [<ffffffff81164852>] sys_pread64+0x82/0xa0
[24950.831917] [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
[24950.839353]
[24950.842659] Kernel panic - not syncing: LBUG
[24950.847986] Pid: 4126, comm: ganesha.nfsd Not tainted 2.6.32-220.23.1.bl6.Bull.28.10.x86_64 #1
[24950.858073] Call Trace:
[24950.861907] [<ffffffff814851a0>] ? panic+0x78/0x143
[24950.868308] [<ffffffffa0569e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[24950.876096] [<ffffffffa0bb511f>] ? ll_readahead+0x10cf/0x1100 [lustre]
[24950.884141] [<ffffffffa0bdc805>] ? vvp_io_read_page+0x305/0x360 [lustre]
[24950.892371] [<ffffffffa068eb4d>] ? cl_io_read_page+0x8d/0x170 [obdclass]
[24950.900562] [<ffffffffa0682c19>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
[24950.908676] [<ffffffffa0bb5746>] ? ll_readpage+0x96/0x200 [lustre]
[24950.916357] [<ffffffff810ff88c>] ? generic_file_aio_read+0x1fc/0x700
[24950.924221] [<ffffffff810816ff>] ? up+0x2f/0x50
[24950.930283] [<ffffffffa0bdce1b>] ? vvp_io_read_start+0x13b/0x3e0 [lustre]
[24950.938586] [<ffffffffa068cb4a>] ? cl_io_start+0x6a/0x140 [obdclass]
[24950.946448] [<ffffffffa0690e2c>] ? cl_io_loop+0xcc/0x190 [obdclass]
[24950.954218] [<ffffffffa0b8d097>] ? ll_file_io_generic+0x3a7/0x560 [lustre]
[24950.962601] [<ffffffffa0b8d389>] ? ll_file_aio_read+0x139/0x2c0 [lustre]
[24950.970813] [<ffffffffa0b8d849>] ? ll_file_read+0x169/0x2a0 [lustre]
[24950.978649] [<ffffffff81164525>] ? vfs_read+0xb5/0x1a0
[24950.985288] [<ffffffff81164852>] ? sys_pread64+0x82/0xa0
[24950.992092] [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
----8< ----
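
The counters printed with the failed assertion can be pulled out and the asserted condition re-evaluated by hand. The following is only a minimal sketch: the helper name `parse_assertion` is hypothetical, and the mapping of the abbreviations rs/re/ro/rl/rp to ria_start, ria_end, ria_stoff, ria_length and ria_pages is an assumption, since the ticket itself only names ria_stoff.

```python
import re

# Assertion message copied from the console log above. The log prints no
# space between the page_idx value and "rs", so the pattern tolerates that.
LINE = ("ASSERTION( page_idx > ria->ria_stoff ) failed: Invalid page_idx 234497"
        "rs 234497 re 300287 ro 234751 rl 256 rp 1")

# Assumed meaning of the abbreviations (not stated in this ticket).
FIELDS = ("page_idx", "ria_start", "ria_end",
          "ria_stoff", "ria_length", "ria_pages")

PATTERN = re.compile(
    r"Invalid page_idx (\d+)\s*rs (\d+) re (\d+) ro (\d+) rl (\d+) rp (\d+)"
)


def parse_assertion(line):
    """Return the six counters from the message as a dict of ints, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return dict(zip(FIELDS, map(int, match.groups())))


values = parse_assertion(LINE)
# The condition the assertion checks, re-evaluated on the logged values.
condition_holds = values["page_idx"] > values["ria_stoff"]
print(values)
print("page_idx > ria_stoff:", condition_holds)
```

Under the assumed mapping, page_idx (234497) is equal to the start of the read-ahead window and lower than ria_stoff (234751), which is why the check fails.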

We asked the customer to enable read-ahead tracing in the debug log (lctl set_param debug=+reada).

The debug log is available in the attached support bundle (from crash 2014-06-05-20:01:29).

Read-ahead settings:
----8< ----
# lctl get_param llite.*.max_read_ahead*
llite.store1-ffff88120ee93c00.max_read_ahead_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_per_file_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_whole_mb=2
----8< ----
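
Since the assertion message reports page indices while the tunables are given in megabytes, it can help to express the limits in pages before comparing them. This is a rough sketch, assuming 4 KiB pages (typical for x86_64) and that the "_mb" values are MiB; the helper name `settings_in_pages` is hypothetical.

```python
# Parameter values copied from the lctl get_param output above.
OUTPUT = """\
llite.store1-ffff88120ee93c00.max_read_ahead_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_per_file_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_whole_mb=2
"""

PAGE_SIZE = 4096  # assumption: 4 KiB pages, not stated in the ticket


def settings_in_pages(text, page_size=PAGE_SIZE):
    """Map each parameter's short name to its value expressed in pages."""
    pages = {}
    for line in text.splitlines():
        name, _, value = line.strip().partition("=")
        if not value:
            continue  # skip blank or malformed lines
        short = name.rsplit(".", 1)[-1]  # drop the "llite.<instance>." prefix
        pages[short] = int(value) * 1024 * 1024 // page_size
    return pages


limits = settings_in_pages(OUTPUT)
print(limits)
```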

We asked the customer to check the web server logs to see which files were being accessed at the time of each crash. It is never the same file.

This looks like LU-4192.



 Comments   
Comment by Peter Jones [ 27/Jun/14 ]

Bobijam

Could you please advise on this issue?

Thanks

Peter

Comment by Zhenyu Xu [ 01/Jul/14 ]

Would you please try this patch: http://review.whamcloud.com/10914 ?

Comment by Bruno Travouillon (Inactive) [ 08/Jul/14 ]

Hi bobijam,

The patch has been in testing since this morning. We should be able to give you feedback within the next two weeks.

Thanks,

Bruno

Comment by Bruno Travouillon (Inactive) [ 22/Jul/14 ]

We have had no issues with the Lustre file system over the last two weeks.

bobijam, can we safely add this patch into our 2.1.6 branch?

Comment by Zhenyu Xu [ 22/Jul/14 ]

I think so, and I'll try to land it on the b2_1 branch as well.

Comment by Peter Jones [ 22/Jul/14 ]

Bobi

It's great to hear that this patch has held up so well in testing. Our usual practice before Bull deploy things into production is to ensure that we have at least two reviews and Oleg has signed off on it. We would only land it to the b2_1 branch if/when we create a 2.1.7 release.

Regards

Peter

Comment by Zhenyu Xu [ 25/Jul/14 ]

The patch for master is tracked at http://review.whamcloud.com/#/c/11181/

Comment by Peter Jones [ 13/Aug/14 ]

Landed for 2.7. Landing to maintenance releases will be tracked separately.

Comment by Jian Yu [ 14/Aug/14 ]

Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/11455
