Lustre / LU-5278

ZFS - many OST watchdogs with IOR

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0
    • Environment: Hyperion/LLNL
    • Severity: 3
    • Rank (Obsolete): 14730

    Description

      Running IOR with 100 clients against ZFS OSTs. Performance is terrible. OST I/O service threads are wedging and the watchdog is dumping their stacks.
      First example:

      2014-07-01 08:22:47 LNet: Service thread pid 8308 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2014-07-01 08:22:47 Pid: 8308, comm: ll_ost_io00_014
      2014-07-01 08:22:47
      2014-07-01 08:22:47 Call Trace:
      2014-07-01 08:22:47  [<ffffffffa05b34ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
      2014-07-01 08:22:47  [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
      2014-07-01 08:22:47  [<ffffffff815287f3>] io_schedule+0x73/0xc0
      2014-07-01 08:22:47  [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
      2014-07-01 08:22:47  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
      2014-07-01 08:22:47  [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
      2014-07-01 08:22:47  [<ffffffffa062f0ab>] zio_wait+0xfb/0x1b0 [zfs]
      2014-07-01 08:22:47  [<ffffffffa05a503d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
      2014-07-01 08:22:47  [<ffffffffa05a5e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
      2014-07-01 08:22:47  [<ffffffffa0e3f1a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
      2014-07-01 08:22:47  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 08:22:47  [<ffffffffa0f2e00b>] ofd_preprw_read+0x15b/0x890 [ofd]
      2014-07-01 08:22:47  [<ffffffffa0f30709>] ofd_preprw+0x749/0x1650 [ofd]
      2014-07-01 08:22:47  [<ffffffffa09d71b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa09deb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 08:22:47  [<ffffffffa097ab36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa097ac4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa09dca7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa098c29a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa098b580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffff8109ab56>] kthread+0x96/0xa0
      2014-07-01 08:22:47  [<ffffffff8100c20a>] child_rip+0xa/0x20
      2014-07-01 08:22:47  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      2014-07-01 08:22:47  [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Lustre dump attached.

      Second example:

      2014-07-01 09:38:41 Pid: 9299, comm: ll_ost_io00_070
      2014-07-01 09:38:41
      2014-07-01 09:38:41 Call Trace:
      2014-07-01 09:38:41  [<ffffffffa05b02f7>] ? dmu_zfetch+0x357/0xd70 [zfs]
      2014-07-01 09:38:41  [<ffffffffa05957f2>] ? arc_read+0x572/0x8d0 [zfs]
      2014-07-01 09:38:41  [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
      2014-07-01 09:38:41  [<ffffffff815287f3>] io_schedule+0x73/0xc0
      2014-07-01 09:38:41  [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
      2014-07-01 09:38:41  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
      2014-07-01 09:38:41  [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
      2014-07-01 09:38:41  [<ffffffffa062c0ab>] zio_wait+0xfb/0x1b0 [zfs]
      2014-07-01 09:38:41  [<ffffffffa05a203d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
      2014-07-01 09:38:41  [<ffffffffa05a2e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
      2014-07-01 09:38:41  [<ffffffffa0e441a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
      2014-07-01 09:38:41  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 09:38:41  [<ffffffffa0f3700b>] ofd_preprw_read+0x15b/0x890 [ofd]
      2014-07-01 09:38:41  [<ffffffffa0f39709>] ofd_preprw+0x749/0x1650 [ofd]
      2014-07-01 09:38:41  [<ffffffffa09d41b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa09dbb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 09:38:41  [<ffffffffa0977b36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa0977c4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa09d9a7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa098929a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa0988580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffff8109ab56>] kthread+0x96/0xa0
      2014-07-01 09:38:41  [<ffffffff8100c20a>] child_rip+0xa/0x20
      2014-07-01 09:38:41  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      2014-07-01 09:38:41  [<ffffffff8100c200>] ? child_rip+0x0/0x20
      2014-07-01 09:38:41
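
      Both traces show the same reliable call chain: an ll_ost_io thread on the read path (tgt_brw_read, ofd_preprw_read, osd_bufs_get) blocked in zio_wait. Frames prefixed with "?" are addresses the kernel found on the stack but could not confirm as part of the active chain, so they are best ignored when comparing traces. A minimal sketch of filtering them out follows. The names FRAME_RE, reliable_frames, and SAMPLE are hypothetical helpers for illustration and are not part of Lustre or ZFS. The sample lines are copied from the first trace above.

```python
import re

# Matches one frame of a kernel call trace, for example:
#   [<ffffffffa062f0ab>] zio_wait+0xfb/0x1b0 [zfs]
#   [<ffffffffa05b34ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(log_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match and not match.group("unreliable"):
            # Frames without a module tag belong to the core kernel.
            frames.append(
                (match.group("symbol"), match.group("module") or "kernel")
            )
    return frames


# Excerpt of the first watchdog trace in this ticket.
SAMPLE = """\
2014-07-01 08:22:47  [<ffffffffa05b34ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
2014-07-01 08:22:47  [<ffffffff815287f3>] io_schedule+0x73/0xc0
2014-07-01 08:22:47  [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
2014-07-01 08:22:47  [<ffffffffa062f0ab>] zio_wait+0xfb/0x1b0 [zfs]
2014-07-01 08:22:47  [<ffffffffa0e3f1a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
2014-07-01 08:22:47  [<ffffffffa0f2e00b>] ofd_preprw_read+0x15b/0x890 [ofd]
"""

if __name__ == "__main__":
    # Print the confirmed call chain, innermost frame first.
    for symbol, module in reliable_frames(SAMPLE):
        print(f"{module:8s} {symbol}")
```

      Running the same filter over each dumped stack makes it easier to check whether all hung threads are waiting at the same point.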
      

      Attachments

        1. Hyperion Performance 17 Nov 2014.xlsx
          132 kB
        2. ior.iws28.txt.gz
          0.2 kB
        3. iws24.dump.txt.gz
          0.2 kB
        4. iws28.dump.txt.gz
          0.2 kB
        5. lustre-log.1429199475.64826.txt.gz
          0.3 kB
        6. proc_spl_MDS.tgz
          3.89 MB
        7. proc_spl.tgz
          3.99 MB

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 15