Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.4.0
    • Lustre 2.4.0
    • ZFS OSDs
    • 1
    • 3
    • 7270

    Description

      Despite parity-checking hardware RAID on Grove, we appear to have run into a case where ZFS is getting bad block data from disk. The root cause still isn't clear and we're looking into it.

      However, it clearly exposed that the ZFS OSD currently makes no attempt to handle IO errors on reads from the DMU. Lustre hit the following assertion when ZFS returned the IO error. We need to update osd_bufs_get_read() to handle the error and return it up the stack.

      <ConMan> Console [grove250] log at 2013-03-17 23:00:00 PDT.
      2013-03-17 23:50:10 LustreError: 7462:0:(osd_io.c:276:osd_bufs_get_read()) ASSERTION( rc == 0 ) failed: 
      2013-03-17 23:50:10 LustreError: 7462:0:(osd_io.c:276:osd_bufs_get_read()) LBUG
      2013-03-17 23:50:10 Pid: 7462, comm: ll_ost_io00_060
      2013-03-17 23:50:10
      2013-03-17 23:50:10 Call Trace:
      2013-03-17 23:50:10  [<ffffffffa0346965>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-03-17 23:50:10  [<ffffffffa0346f77>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-03-17 23:50:10  [<ffffffffa0d36796>] osd_bufs_get+0x996/0xa10 [osd_zfs]
      2013-03-17 23:50:10  [<ffffffffa06cc386>] ? lu_object_find+0x16/0x20 [obdclass]
      2013-03-17 23:50:10  [<ffffffffa0dd540f>] ofd_preprw_read+0x13f/0x850 [ofd]
      2013-03-17 23:50:10  [<ffffffffa0dd6073>] ofd_preprw+0x553/0x12b0 [ofd]
      2013-03-17 23:50:10  [<ffffffffa0d9030c>] obd_preprw+0x12c/0x3d0 [ost]
      2013-03-17 23:50:10  [<ffffffffa0d95af4>] ost_brw_read+0xd14/0x12f0 [ost]
      2013-03-17 23:50:10  [<ffffffff8126c489>] ? cpumask_next_and+0x29/0x50
      2013-03-17 23:50:10  [<ffffffff810551d4>] ? find_busiest_group+0x244/0x9f0
      2013-03-17 23:50:10  [<ffffffffa085d52c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      2013-03-17 23:50:10  [<ffffffffa085d688>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
      2013-03-17 23:50:10  [<ffffffffa0d9c658>] ost_handle+0x2a68/0x46a0 [ost]
      2013-03-17 23:50:10  [<ffffffffa0864c2b>] ? ptlrpc_update_export_timer+0x4b/0x470 [ptlrpc]
      2013-03-17 23:50:10  [<ffffffffa086d08c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
      2013-03-17 23:50:10  [<ffffffffa03476be>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      2013-03-17 23:50:10  [<ffffffffa035914f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
      2013-03-17 23:50:10  [<ffffffffa0864459>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2013-03-17 23:50:10  [<ffffffff81051ba3>] ? __wake_up+0x53/0x70
      2013-03-17 23:50:10  [<ffffffffa086e625>] ptlrpc_main+0xbb5/0x1970 [ptlrpc]
      2013-03-17 23:50:10  [<ffffffffa086da70>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
      2013-03-17 23:50:10  [<ffffffff8100c14a>] child_rip+0xa/0x20
      2013-03-17 23:50:10  [<ffffffffa086da70>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
      2013-03-17 23:50:10  [<ffffffffa086da70>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
      2013-03-17 23:50:10  [<ffffffff8100c140>] ? child_rip+0x0/0x20
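
      The change requested above for osd_bufs_get_read() can be sketched in user space as follows. This is only an illustration of the pattern (propagate the read error instead of asserting rc == 0), not the patch that landed. The names mock_dmu_read() and bufs_get_read_sketch(), the fail_at parameter, and the held counter are all hypothetical and do not come from the Lustre or ZFS sources.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a DMU read of one buffer: returns 0 on
 * success or a negative errno. Index fail_at simulates a bad block. */
static int mock_dmu_read(int idx, int fail_at)
{
    return idx == fail_at ? -EIO : 0;
}

/* Hypothetical read path: returns the number of buffers obtained, or a
 * negative errno if any read fails. *held counts buffers still referenced. */
static int bufs_get_read_sketch(int nbufs, int fail_at, int *held)
{
    int i, rc;

    /* Assertions remain appropriate for caller bugs, not for IO errors. */
    assert(held != NULL && nbufs >= 0);

    *held = 0;
    for (i = 0; i < nbufs; i++) {
        rc = mock_dmu_read(i, fail_at);
        if (rc != 0) {
            /* Instead of asserting rc == 0: drop the buffers taken so
             * far and hand the error back to the caller. */
            *held = 0;
            return rc;
        }
        (*held)++;
    }
    return nbufs;
}
```

      With fail_at set to 2, a request for four buffers returns -EIO and leaves no buffers held, so a caller in the position of ofd_preprw_read() could fail the single request rather than take down the server thread.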
      

        Activity

          [LU-2983] ASSERTION in osd_bufs_get_read()

          bzzz Alex Zhuravlev added a comment - landed
          bzzz Alex Zhuravlev added a comment - http://review.whamcloud.com/5784

          behlendorf Brian Behlendorf added a comment - Related to this we're trying to map the ZFS object number from the OST (which has a bad checksum) back to the full Lustre path for the file. What's the right way to go about this these days?
          pjones Peter Jones added a comment - Alex will look into this one

          People

            bzzz Alex Zhuravlev
            behlendorf Brian Behlendorf