Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.4.3
- RHEL 6 w/ patched kernel for Lustre
- 3
- 15009
Description
We are hitting this LBUG on one of our production systems, which was recently updated to Lustre 2.4.3 (on 4th of June).
LustreError: 11020:0:(ost_handler.c:882:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed:
LustreError: 11020:0:(ost_handler.c:882:ost_brw_read()) LBUG
Pid: 11020, comm: ll_ost_io03_084
Call Trace:
[<ffffffffa07e6895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa07e6e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa04a0d44>] ost_brw_read+0x12d4/0x1340 [ost]
[<ffffffff81282c09>] ? cpumask_next_and+0x29/0x50
[<ffffffff8105bf64>] ? find_busiest_group+0x244/0x9f0
[<ffffffffa0abbf0c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
[<ffffffffa0abc068>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
[<ffffffffa04a8038>] ost_handle+0x2ac8/0x48e0 [ost]
[<ffffffffa0ac2c4b>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
[<ffffffffa0acb428>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa07e75de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa07f8d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa0ac2789>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffff81058bd3>] ? __wake_up+0x53/0x70
[<ffffffffa0acc7be>] ptlrpc_main+0xace/0x1700 [ptlrpc]
[<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffffa0acbcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Kernel panic - not syncing: LBUG
The assertion is here:
753 static int ost_brw_read(struct ptlrpc_request *req, struct obd_trans_info *oti)
754 {
...
879         if (page_rc != local_nb[i].len) { /* short read */
880                 /* All subsequent pages should be 0 */
881                 while(++i < npages)
882                         LASSERT(local_nb[i].rc == 0);
883                 break;
884         }
I was able to get the content of local_nb from the crash dump.
crash> struct ost_thread_local_cache 0xffff88105da00000
struct ost_thread_local_cache {
local = {{
lnb_file_offset = 6010437632,
lnb_page_offset = 0,
len = 4096,
flags = 0,
page = 0xffffea0037309c10,
dentry = 0xffff8808048c9a80,
lnb_grant_used = 0,
rc = 4096
}, {
...
}, {
lnb_file_offset = 6010757120,
lnb_page_offset = 0,
len = 4096,
flags = 0,
page = 0xffffea00372ccf80,
dentry = 0xffff8808048c9a80,
lnb_grant_used = 0,
rc = 512 <======== local_nb[i].rc != local_nb[i].len /* short read */
}, {
lnb_file_offset = 6010761216,
lnb_page_offset = 0,
len = 4096,
flags = 0,
page = 0xffffea0037176e98,
dentry = 0xffff8808048c9a80,
lnb_grant_used = 0,
rc = 4096
}, {
lnb_file_offset = 1411710976,
lnb_page_offset = 0,
len = 4096,
flags = 1120,
page = 0x0,
dentry = 0xffff8803690c2780,
lnb_grant_used = 0,
rc = 0
}, {
lnb_file_offset = 1411715072,
lnb_page_offset = 0,
len = 4096,
flags = 1120,
page = 0x0,
dentry = 0xffff8803690c2780,
lnb_grant_used = 0,
rc = 0
}, {
...
This LBUG has occurred 5 times since 06/06/14. Each time, a short read is followed by one or more non-empty pages (rc != 0). Attached is some output from the first 3 occurrences.
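To make the failing condition concrete, here is a minimal user-space sketch of the check that the assertion enforces. It is not Lustre code: the struct nb_sample type and the first_page_after_short_read() helper are hypothetical names introduced for illustration, only the len and rc fields are modelled, and the short-read test is simplified to rc != len. The sample array holds the short-read entry from the dump (rc = 512) and the entry that follows it (rc = 4096).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the len/rc fields shown in the dump. */
struct nb_sample {
    int len; /* bytes requested for this page */
    int rc;  /* bytes actually read into this page */
};

/*
 * Return the index of the first page that still carries data (rc != 0)
 * after a short read, or -1 if the sequence satisfies the invariant that
 * the LASSERT checks. Simplification: a short read is taken to be any
 * page whose rc differs from its len.
 */
static int first_page_after_short_read(const struct nb_sample *nb, int npages)
{
    int i;

    assert(nb != NULL || npages == 0);
    for (i = 0; i < npages; i++) {
        if (nb[i].rc != nb[i].len) {
            /* short read: every later page is expected to be empty */
            while (++i < npages)
                if (nb[i].rc != 0)
                    return i;
            break;
        }
    }
    return -1;
}

/* len/rc of the short-read entry and of the entry right after it. */
static const struct nb_sample dump_excerpt[] = {
    { 4096, 512 },  /* lnb_file_offset = 6010757120: short read */
    { 4096, 4096 }, /* lnb_file_offset = 6010761216: full page  */
};
```

Calling first_page_after_short_read(dump_excerpt, 2) returns 1, the index of the full page that follows the short read. In this simplified model, that is the entry on which LASSERT(local_nb[i].rc == 0) would fire.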
FYI, we have Lustre routers between the servers and clients as they are on a different LNET. Clients, routers and servers are running Lustre 2.4.3.
Issue Links
- is duplicated by: LU-7322 (ost_handler.c:896:ost_brw_read()) ASSERTION( local_nb[i].rc == 0 ) failed [Resolved]