Details
Bug
Resolution: Fixed
Critical
Lustre 2.4.0
ZFS OSDs
Description
Despite our parity-checking hardware RAID on Grove, we appear to have run into a case where ZFS is getting bad block data from disk. The root cause is still unclear and we're looking into it.
However, it clearly exposed that the ZFS OSD currently makes no attempt to handle I/O errors on reads from the DMU. Lustre hit the following assertion when ZFS returned the I/O error. We need to update osd_bufs_get_read() to handle the error and return it up the stack.
<ConMan> Console [grove250] log at 2013-03-17 23:00:00 PDT.
2013-03-17 23:50:10 LustreError: 7462:0:(osd_io.c:276:osd_bufs_get_read()) ASSERTION( rc == 0 ) failed:
2013-03-17 23:50:10 LustreError: 7462:0:(osd_io.c:276:osd_bufs_get_read()) LBUG
2013-03-17 23:50:10 Pid: 7462, comm: ll_ost_io00_060
2013-03-17 23:50:10
2013-03-17 23:50:10 Call Trace:
2013-03-17 23:50:10 [<ffffffffa0346965>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-03-17 23:50:10 [<ffffffffa0346f77>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-03-17 23:50:10 [<ffffffffa0d36796>] osd_bufs_get+0x996/0xa10 [osd_zfs]
2013-03-17 23:50:10 [<ffffffffa06cc386>] ? lu_object_find+0x16/0x20 [obdclass]
2013-03-17 23:50:10 [<ffffffffa0dd540f>] ofd_preprw_read+0x13f/0x850 [ofd]
2013-03-17 23:50:10 [<ffffffffa0dd6073>] ofd_preprw+0x553/0x12b0 [ofd]
2013-03-17 23:50:10 [<ffffffffa0d9030c>] obd_preprw+0x12c/0x3d0 [ost]
2013-03-17 23:50:10 [<ffffffffa0d95af4>] ost_brw_read+0xd14/0x12f0 [ost]
2013-03-17 23:50:10 [<ffffffff8126c489>] ? cpumask_next_and+0x29/0x50
2013-03-17 23:50:10 [<ffffffff810551d4>] ? find_busiest_group+0x244/0x9f0
2013-03-17 23:50:10 [<ffffffffa085d52c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
2013-03-17 23:50:10 [<ffffffffa085d688>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
2013-03-17 23:50:10 [<ffffffffa0d9c658>] ost_handle+0x2a68/0x46a0 [ost]
2013-03-17 23:50:10 [<ffffffffa0864c2b>] ? ptlrpc_update_export_timer+0x4b/0x470 [ptlrpc]
2013-03-17 23:50:10 [<ffffffffa086d08c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
2013-03-17 23:50:10 [<ffffffffa03476be>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-03-17 23:50:10 [<ffffffffa035914f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
2013-03-17 23:50:10 [<ffffffffa0864459>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-03-17 23:50:10 [<ffffffff81051ba3>] ? __wake_up+0x53/0x70
2013-03-17 23:50:10 [<ffffffffa086e625>] ptlrpc_main+0xbb5/0x1970 [ptlrpc]
2013-03-17 23:50:10 [<ffffffffa086da70>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
2013-03-17 23:50:10 [<ffffffff8100c14a>] child_rip+0xa/0x20
2013-03-17 23:50:10 [<ffffffffa086da70>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
2013-03-17 23:50:10 [<ffffffffa086da70>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
2013-03-17 23:50:10 [<ffffffff8100c140>] ? child_rip+0x0/0x20
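The change requested in the description (have osd_bufs_get_read() return the read error instead of asserting on it) could look roughly like the user-space sketch below. This is not the actual Lustre or ZFS code: sketch_object, sketch_dmu_read(), sketch_bufs_get_read() and SKETCH_BLOCK_SIZE are hypothetical names invented for illustration, and the sketch assumes the lower layer reports failures as a positive errno while the caller expects a negative one.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 4096  /* hypothetical buffer size for this sketch */

/* Hypothetical stand-in for the object being read. */
struct sketch_object {
    int simulate_io_error;  /* nonzero: pretend the pool returned bad data */
};

/*
 * Hypothetical stand-in for a DMU read.
 * Assumption: returns 0 on success or a positive errno on failure.
 */
static int sketch_dmu_read(const struct sketch_object *obj,
                           char *buf, size_t len)
{
    if (obj->simulate_io_error)
        return EIO;
    memset(buf, 0xab, len);  /* fill with dummy data */
    return 0;
}

/*
 * Sketch of a read path that propagates the error.
 * Returns the number of buffers filled, or a negative errno on failure.
 */
static int sketch_bufs_get_read(const struct sketch_object *obj,
                                char bufs[][SKETCH_BLOCK_SIZE], int nbufs)
{
    int i, rc;

    for (i = 0; i < nbufs; i++) {
        rc = sketch_dmu_read(obj, bufs[i], SKETCH_BLOCK_SIZE);
        if (rc != 0)
            return -rc;  /* hand the error to the caller; no assertion */
    }
    return nbufs;
}
```

With this shape, a failed read surfaces to the caller as -EIO rather than taking down the server thread. A real fix would presumably also need to release any buffers acquired before the failure, which this sketch does not model.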