Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.1.6
- None
- RHEL 6 w/ kernel 2.6.32_220.23.1
- 3
- 14689
Description
One Lustre client frequently crashes with an LBUG on ASSERTION( page_idx > ria->ria_stoff ) (13 crashes in the past 3 months).
This Lustre client acts as an NFS server and exports Lustre to a web server through nfs-ganesha.
----8< ----
[24937.600920] Lustre: DEBUG MARKER: Thu Jun 5 20:00:01 2014
[24937.600921]
[24950.667750] LustreError: 4126:0:(rw.c:698:ll_read_ahead_pages()) ASSERTION( page_idx > ria->ria_stoff ) failed: Invalid page_idx 234497rs 234497 re 300287 ro 234751 rl 256 rp 1
[24950.683642] LustreError: 4126:0:(rw.c:698:ll_read_ahead_pages()) LBUG
[24950.690154] Pid: 4126, comm: ganesha.nfsd
[24950.695572]
[24950.695573] Call Trace:
[24950.702337] [<ffffffffa05697f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[24950.710703] [<ffffffffa0569e07>] lbug_with_loc+0x47/0xb0 [libcfs]
[24950.718317] [<ffffffffa0bb511f>] ll_readahead+0x10cf/0x1100 [lustre]
[24950.726131] [<ffffffffa0bdc805>] vvp_io_read_page+0x305/0x360 [lustre]
[24950.734159] [<ffffffffa068eb4d>] cl_io_read_page+0x8d/0x170 [obdclass]
[24950.742108] [<ffffffffa0682c19>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
[24950.750166] [<ffffffffa0bb5746>] ll_readpage+0x96/0x200 [lustre]
[24950.757661] [<ffffffff810ff88c>] generic_file_aio_read+0x1fc/0x700
[24950.765360] [<ffffffff810816ff>] ? up+0x2f/0x50
[24950.771433] [<ffffffffa0bdce1b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
[24950.779575] [<ffffffffa068cb4a>] cl_io_start+0x6a/0x140 [obdclass]
[24950.787244] [<ffffffffa0690e2c>] cl_io_loop+0xcc/0x190 [obdclass]
[24950.794848] [<ffffffffa0b8d097>] ll_file_io_generic+0x3a7/0x560 [lustre]
[24950.803059] [<ffffffffa0b8d389>] ll_file_aio_read+0x139/0x2c0 [lustre]
[24950.811092] [<ffffffffa0b8d849>] ll_file_read+0x169/0x2a0 [lustre]
[24950.818784] [<ffffffff81164525>] vfs_read+0xb5/0x1a0
[24950.825275] [<ffffffff81164852>] sys_pread64+0x82/0xa0
[24950.831917] [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
[24950.839353]
[24950.842659] Kernel panic - not syncing: LBUG
[24950.847986] Pid: 4126, comm: ganesha.nfsd Not tainted 2.6.32-220.23.1.bl6.Bull.28.10.x86_64 #1
[24950.858073] Call Trace:
[24950.861907] [<ffffffff814851a0>] ? panic+0x78/0x143
[24950.868308] [<ffffffffa0569e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[24950.876096] [<ffffffffa0bb511f>] ? ll_readahead+0x10cf/0x1100 [lustre]
[24950.884141] [<ffffffffa0bdc805>] ? vvp_io_read_page+0x305/0x360 [lustre]
[24950.892371] [<ffffffffa068eb4d>] ? cl_io_read_page+0x8d/0x170 [obdclass]
[24950.900562] [<ffffffffa0682c19>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
[24950.908676] [<ffffffffa0bb5746>] ? ll_readpage+0x96/0x200 [lustre]
[24950.916357] [<ffffffff810ff88c>] ? generic_file_aio_read+0x1fc/0x700
[24950.924221] [<ffffffff810816ff>] ? up+0x2f/0x50
[24950.930283] [<ffffffffa0bdce1b>] ? vvp_io_read_start+0x13b/0x3e0 [lustre]
[24950.938586] [<ffffffffa068cb4a>] ? cl_io_start+0x6a/0x140 [obdclass]
[24950.946448] [<ffffffffa0690e2c>] ? cl_io_loop+0xcc/0x190 [obdclass]
[24950.954218] [<ffffffffa0b8d097>] ? ll_file_io_generic+0x3a7/0x560 [lustre]
[24950.962601] [<ffffffffa0b8d389>] ? ll_file_aio_read+0x139/0x2c0 [lustre]
[24950.970813] [<ffffffffa0b8d849>] ? ll_file_read+0x169/0x2a0 [lustre]
[24950.978649] [<ffffffff81164525>] ? vfs_read+0xb5/0x1a0
[24950.985288] [<ffffffff81164852>] ? sys_pread64+0x82/0xa0
[24950.992092] [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
----8< ----
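As a reading aid, the values in the assertion message can be pulled apart with a short script. This is a minimal sketch, not part of the original report. It assumes the abbreviations `rs`, `re`, `ro`, `rl`, `rp` stand for `ria_start`, `ria_end`, `ria_stoff`, `ria_length` and `ria_pages` respectively. That mapping is inferred from the assertion text and should be checked against rw.c before relying on it. The function name `parse_assertion` is hypothetical.

```python
import re

# Assertion line copied from the console log above.
LINE = (
    "LustreError: 4126:0:(rw.c:698:ll_read_ahead_pages()) "
    "ASSERTION( page_idx > ria->ria_stoff ) failed: "
    "Invalid page_idx 234497rs 234497 re 300287 ro 234751 rl 256 rp 1"
)

# The message prints no space between the page index and "rs",
# so the pattern must not require one.
PATTERN = re.compile(
    r"Invalid page_idx (?P<page_idx>\d+)\s*"
    r"rs (?P<rs>\d+) re (?P<re>\d+) ro (?P<ro>\d+) "
    r"rl (?P<rl>\d+) rp (?P<rp>\d+)"
)


def parse_assertion(line):
    """Return the numeric fields of the assertion message, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


fields = parse_assertion(LINE)

# Assumed mapping: "ro" is ria->ria_stoff, the right-hand side of the assertion.
assertion_holds = fields["page_idx"] > fields["ro"]

print(fields)
print("page_idx > ro:", assertion_holds)
```

For the line above the check evaluates to False: page_idx (234497) equals `rs` and is below `ro` (234751).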
We asked the customer to enable read-ahead tracing in the debug log (lctl set_param debug=+reada).
The debug log is available in the attached support bundle (from crash 2014-06-05-20:01:29).
Read-ahead settings:
----8< ----
# lctl get_param llite.*.max_read_ahead*
llite.store1-ffff88120ee93c00.max_read_ahead_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_per_file_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_whole_mb=2
----8< ----
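To compare these limits with the page counts printed in the assertion message, the settings can be converted from megabytes to pages. The sketch below is illustrative only. It assumes 4 KiB pages (the usual x86_64 page size) and that the `_mb` values are in units of 2^20 bytes. The helper names `parse_params` and `mb_to_pages` are hypothetical.

```python
# Settings copied from the lctl get_param output above.
SETTINGS = """\
llite.store1-ffff88120ee93c00.max_read_ahead_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_per_file_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_whole_mb=2
"""

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_params(text):
    """Map the last component of each parameter name to its integer value."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.rsplit(".", 1)[-1]] = int(value)
    return params


def mb_to_pages(mb, page_size=PAGE_SIZE):
    """Convert a size in MiB to a number of pages."""
    return mb * 1024 * 1024 // page_size


params = parse_params(SETTINGS)
pages = {name: mb_to_pages(mb) for name, mb in params.items()}

for name, count in pages.items():
    print(f"{name}: {params[name]} MB = {count} pages")
```

Under these assumptions, the 40 MB limits correspond to 10240 pages and the 2 MB limit to 512 pages.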
We asked the customer to check the web server logs to see which files were being accessed at the time of each crash. It is never the same file.
This looks like LU-4192.
Attachments
Issue Links
- duplicates: LU-4192 NFS server crash while fsx runs on clients (Resolved)
Bobi,
It's great to hear that this patch has held up so well in testing. Before Bull deploys anything into production, our usual practice is to ensure that it has at least two reviews and that Oleg has signed off on it. We would only land it on the b2_1 branch if and when we create a 2.1.7 release.
Regards
Peter