Lustre / LU-5278

ZFS - many OST watchdogs with IOR

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0
    • Environment: Hyperion/LLNL
    • Severity: 3
    • Rank: 14730
    Description

      Running IOR with 100 clients against ZFS-backed OSTs. Performance is terrible. OST I/O service threads are wedging and tripping the service thread watchdog, which dumps their stacks.
      Example:

      2014-07-01 08:22:47 LNet: Service thread pid 8308 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2014-07-01 08:22:47 Pid: 8308, comm: ll_ost_io00_014
      2014-07-01 08:22:47
      2014-07-01 08:22:47 Call Trace:
      2014-07-01 08:22:47  [<ffffffffa05b34ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
      2014-07-01 08:22:47  [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
      2014-07-01 08:22:47  [<ffffffff815287f3>] io_schedule+0x73/0xc0
      2014-07-01 08:22:47  [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
      2014-07-01 08:22:47  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
      2014-07-01 08:22:47  [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
      2014-07-01 08:22:47  [<ffffffffa062f0ab>] zio_wait+0xfb/0x1b0 [zfs]
      2014-07-01 08:22:47  [<ffffffffa05a503d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
      2014-07-01 08:22:47  [<ffffffffa05a5e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
      2014-07-01 08:22:47  [<ffffffffa0e3f1a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
      2014-07-01 08:22:47  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 08:22:47  [<ffffffffa0f2e00b>] ofd_preprw_read+0x15b/0x890 [ofd]
      2014-07-01 08:22:47  [<ffffffffa0f30709>] ofd_preprw+0x749/0x1650 [ofd]
      2014-07-01 08:22:47  [<ffffffffa09d71b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa09deb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 08:22:47  [<ffffffffa097ab36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa097ac4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa09dca7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa098c29a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa098b580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffff8109ab56>] kthread+0x96/0xa0
      2014-07-01 08:22:47  [<ffffffff8100c20a>] child_rip+0xa/0x20
      2014-07-01 08:22:47  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      2014-07-01 08:22:47  [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Lustre dump attached.

      Second example:

      2014-07-01 09:38:41 Pid: 9299, comm: ll_ost_io00_070
      2014-07-01 09:38:41
      2014-07-01 09:38:41 Call Trace:
      2014-07-01 09:38:41  [<ffffffffa05b02f7>] ? dmu_zfetch+0x357/0xd70 [zfs]
      2014-07-01 09:38:41  [<ffffffffa05957f2>] ? arc_read+0x572/0x8d0 [zfs]
      2014-07-01 09:38:41  [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
      2014-07-01 09:38:41  [<ffffffff815287f3>] io_schedule+0x73/0xc0
      2014-07-01 09:38:41  [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
      2014-07-01 09:38:41  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
      2014-07-01 09:38:41  [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
      2014-07-01 09:38:41  [<ffffffffa062c0ab>] zio_wait+0xfb/0x1b0 [zfs]
      2014-07-01 09:38:41  [<ffffffffa05a203d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
      2014-07-01 09:38:41  [<ffffffffa05a2e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
      2014-07-01 09:38:41  [<ffffffffa0e441a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
      2014-07-01 09:38:41  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 09:38:41  [<ffffffffa0f3700b>] ofd_preprw_read+0x15b/0x890 [ofd]
      2014-07-01 09:38:41  [<ffffffffa0f39709>] ofd_preprw+0x749/0x1650 [ofd]
      2014-07-01 09:38:41  [<ffffffffa09d41b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa09dbb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 09:38:41  [<ffffffffa0977b36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa0977c4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa09d9a7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa098929a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa0988580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffff8109ab56>] kthread+0x96/0xa0
      2014-07-01 09:38:41  [<ffffffff8100c20a>] child_rip+0xa/0x20
      2014-07-01 09:38:41  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      2014-07-01 09:38:41  [<ffffffff8100c200>] ? child_rip+0x0/0x20
      2014-07-01 09:38:41
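
When there are many watchdog dumps, it helps to group them by the frame the thread is actually blocked in. The following is a minimal sketch, not part of the original report. It assumes the console log format shown above, and the helper names (`parse_dumps`, `blocking_signature`, `summarize`) are hypothetical. Frames marked `?` are treated as unreliable and skipped, and generic wait code (core kernel and `spl`) is ignored when picking the blocking frame.

```python
import re
from collections import Counter

# One kernel stack frame, e.g.
#   "[<ffffffffa062f0ab>] zio_wait+0xfb/0x1b0 [zfs]"
# An optional "?" before the symbol marks a frame the unwinder could not confirm.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)
# Header of a dump, e.g. "Pid: 8308, comm: ll_ost_io00_014"
PID_RE = re.compile(r"Pid:\s+(\d+),\s+comm:\s+(\S+)")


def parse_dumps(log_text):
    """Return [(pid, comm, frames)] for every stack dump in log_text.

    frames holds (function, module) pairs for reliable frames only;
    symbols without a module suffix are reported as module "kernel".
    """
    dumps = []
    current = None
    for line in log_text.splitlines():
        m = PID_RE.search(line)
        if m:
            current = (int(m.group(1)), m.group(2), [])
            dumps.append(current)
            continue
        m = FRAME_RE.search(line)
        if m and current is not None and m.group(1) is None:
            current[2].append((m.group(2), m.group(3) or "kernel"))
    return dumps


def blocking_signature(frames, skip_modules=("kernel", "spl")):
    """First reliable frame outside the generic scheduler/SPL wait code."""
    for func, module in frames:
        if module not in skip_modules:
            return f"{func} [{module}]"
    return None


def summarize(log_text):
    """Count dumps per blocking frame."""
    return Counter(blocking_signature(f) for _, _, f in parse_dumps(log_text))


if __name__ == "__main__":
    import sys
    # Usage: python summarize_watchdogs.py < console.log
    for sig, count in summarize(sys.stdin.read()).most_common():
        print(count, sig)
```

Applied to the two dumps above, both threads resolve to the same blocking frame, `zio_wait` in the `zfs` module, reached from `osd_bufs_get` on the read path (`ofd_preprw_read`).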
      

      Attachments

        1. Hyperion Performance 17 Nov 2014.xlsx
          132 kB
          Cliff White
        2. ior.iws28.txt.gz
          0.2 kB
          Cliff White
        3. iws24.dump.txt.gz
          0.2 kB
          Cliff White
        4. iws28.dump.txt.gz
          0.2 kB
          Cliff White
        5. lustre-log.1429199475.64826.txt.gz
          0.3 kB
          Cliff White
        6. proc_spl_MDS.tgz
          3.89 MB
          Cliff White
        7. proc_spl.tgz
          3.99 MB
          Cliff White

            People

              Assignee: Alex Zhuravlev (bzzz)
              Reporter: Cliff White (cliffw, inactive)