LU-3109: ZFS - very slow reads, OST watchdogs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: None
    • Labels:
    • Environment: Hyperion/LLNL
    • Severity: 3
    • Rank (Obsolete): 7557

      Description

      Running ior file-per-process, we observe that one or two of the OSTs have an excessive load compared to the others (load average of 112 vs. 0.1 on the other OSTs).
      The system log shows a large number of watchdog timeouts. IO is not failing, but rates are very, very slow.
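
      For reference, a file-per-process ior run of this shape looks roughly like the command below; the exact process count, block size, transfer size, and target path used on Hyperion are not recorded in this ticket, so the values shown are placeholders only:

      # placeholder parameters; -F selects file-per-process mode
      mpirun -np 128 ior -a POSIX -F -w -r -b 4g -t 1m -o /p/lustre/ior_test/testfile
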
      First watchdog (log attached)

      2013-04-04 11:26:05 LNet: Service thread pid 8074 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2013-04-04 11:26:05 Pid: 8074, comm: ll_ost_io00_018
      2013-04-04 11:26:05
      2013-04-04 11:26:05 Call Trace:
      2013-04-04 11:26:05  [<ffffffffa056cd40>] ? arc_read_nolock+0x530/0x810 [zfs]
      2013-04-04 11:26:05  [<ffffffffa04e45ac>] cv_wait_common+0x9c/0x1a0 [spl]
      2013-04-04 11:26:05  [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
      2013-04-04 11:26:05  [<ffffffffa04e46e3>] __cv_wait+0x13/0x20 [spl]
      2013-04-04 11:26:05  [<ffffffffa060633b>] zio_wait+0xeb/0x160 [zfs]
      2013-04-04 11:26:05  [<ffffffffa057106d>] dbuf_read+0x3fd/0x720 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0572c1b>] dbuf_prefetch+0x10b/0x2b0 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0586381>] dmu_zfetch_dofetch+0xf1/0x160 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0570280>] ? dbuf_read_done+0x0/0x110 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0587211>] dmu_zfetch+0xaa1/0xe40 [zfs]
      2013-04-04 11:26:06  [<ffffffffa05710fa>] dbuf_read+0x48a/0x720 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0578bc9>] dmu_buf_hold_array_by_dnode+0x179/0x570 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0579b28>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0d4c95d>] osd_bufs_get+0x49d/0x9a0 [osd_zfs]
      2013-04-04 11:26:06  [<ffffffff81270f7c>] ? put_dec+0x10c/0x110
      2013-04-04 11:26:06  [<ffffffffa0723736>] ? lu_object_find+0x16/0x20 [obdclass]
      2013-04-04 11:26:06  [<ffffffffa0ded49f>] ofd_preprw_read+0x13f/0x7e0 [ofd]
      2013-04-04 11:26:06  [<ffffffffa0dedec5>] ofd_preprw+0x385/0x1190 [ofd]
      2013-04-04 11:26:06  [<ffffffffa0da739c>] obd_preprw+0x12c/0x3d0 [ost]
      2013-04-04 11:26:06  [<ffffffffa0dace80>] ost_brw_read+0xd00/0x12e0 [ost]
      2013-04-04 11:26:06  [<ffffffff812739b6>] ? vsnprintf+0x2b6/0x5f0
      2013-04-04 11:26:06  [<ffffffffa035127b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
      2013-04-04 11:26:06  [<ffffffffa0361bdb>] ? libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs]
      2013-04-04 11:26:06  [<ffffffffa08a2f4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa08a30a8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa0db3a63>] ost_handle+0x2b53/0x46f0 [ost]
      2013-04-04 11:26:06  [<ffffffffa035e0e4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      2013-04-04 11:26:06  [<ffffffffa08b21ac>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa03525de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      2013-04-04 11:26:06  [<ffffffffa08a97e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffff81052223>] ? __wake_up+0x53/0x70
      2013-04-04 11:26:06  [<ffffffffa08b36f5>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffff8100c0ca>] child_rip+0xa/0x20
      2013-04-04 11:26:06  [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-04-04 11:26:06
      2013-04-04 11:26:06 LustreError: dumping log to /tmp/lustre-log.1365099965.8074
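
      The trace shows the ll_ost_io thread blocked in zio_wait() underneath dmu_zfetch()/dbuf_prefetch(), i.e. the OST read is waiting on I/O issued by the ZFS file-level prefetcher. As a possible triage step (not a confirmed fix for this issue), one could check whether turning zfetch off on the affected OSS changes the load and read rates:

      # on the OSS: 0 = prefetch enabled (default), 1 = disabled
      cat /sys/module/zfs/parameters/zfs_prefetch_disable
      # temporarily disable ZFS file-level prefetch for the duration of the test
      echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable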
      


    People

    • Assignee: wc-triage WC Triage
    • Reporter: cliffw Cliff White (Inactive)
    • Votes: 0
    • Watchers: 7

    Dates

    • Created:
    • Updated:
    • Resolved: