Lustre / LU-3109

ZFS - very slow reads, OST watchdogs

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.4.0
    • Hyperion/LLNL
    • 3
    • 7557

    Description

      Running IOR file-per-process, we observe that one or two of the OSTs have a far higher load than the others (load average of 112 vs. 0.1).
      The system log shows a large number of watchdog messages. I/O is not failing, but rates are very slow.
      First watchdog (log attached):

      2013-04-04 11:26:05 LNet: Service thread pid 8074 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2013-04-04 11:26:05 Pid: 8074, comm: ll_ost_io00_018
      2013-04-04 11:26:05
      2013-04-04 11:26:05 Call Trace:
      2013-04-04 11:26:05  [<ffffffffa056cd40>] ? arc_read_nolock+0x530/0x810 [zfs]
      2013-04-04 11:26:05  [<ffffffffa04e45ac>] cv_wait_common+0x9c/0x1a0 [spl]
      2013-04-04 11:26:05  [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
      2013-04-04 11:26:05  [<ffffffffa04e46e3>] __cv_wait+0x13/0x20 [spl]
      2013-04-04 11:26:05  [<ffffffffa060633b>] zio_wait+0xeb/0x160 [zfs]
      2013-04-04 11:26:05  [<ffffffffa057106d>] dbuf_read+0x3fd/0x720 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0572c1b>] dbuf_prefetch+0x10b/0x2b0 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0586381>] dmu_zfetch_dofetch+0xf1/0x160 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0570280>] ? dbuf_read_done+0x0/0x110 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0587211>] dmu_zfetch+0xaa1/0xe40 [zfs]
      2013-04-04 11:26:06  [<ffffffffa05710fa>] dbuf_read+0x48a/0x720 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0578bc9>] dmu_buf_hold_array_by_dnode+0x179/0x570 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0579b28>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
      2013-04-04 11:26:06  [<ffffffffa0d4c95d>] osd_bufs_get+0x49d/0x9a0 [osd_zfs]
      2013-04-04 11:26:06  [<ffffffff81270f7c>] ? put_dec+0x10c/0x110
      2013-04-04 11:26:06  [<ffffffffa0723736>] ? lu_object_find+0x16/0x20 [obdclass]
      2013-04-04 11:26:06  [<ffffffffa0ded49f>] ofd_preprw_read+0x13f/0x7e0 [ofd]
      2013-04-04 11:26:06  [<ffffffffa0dedec5>] ofd_preprw+0x385/0x1190 [ofd]
      2013-04-04 11:26:06  [<ffffffffa0da739c>] obd_preprw+0x12c/0x3d0 [ost]
      2013-04-04 11:26:06  [<ffffffffa0dace80>] ost_brw_read+0xd00/0x12e0 [ost]
      2013-04-04 11:26:06  [<ffffffff812739b6>] ? vsnprintf+0x2b6/0x5f0
      2013-04-04 11:26:06  [<ffffffffa035127b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
      2013-04-04 11:26:06  [<ffffffffa0361bdb>] ? libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs]
      2013-04-04 11:26:06  [<ffffffffa08a2f4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa08a30a8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa0db3a63>] ost_handle+0x2b53/0x46f0 [ost]
      2013-04-04 11:26:06  [<ffffffffa035e0e4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      2013-04-04 11:26:06  [<ffffffffa08b21ac>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa03525de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      2013-04-04 11:26:06  [<ffffffffa08a97e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffff81052223>] ? __wake_up+0x53/0x70
      2013-04-04 11:26:06  [<ffffffffa08b36f5>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffff8100c0ca>] child_rip+0xa/0x20
      2013-04-04 11:26:06  [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
      2013-04-04 11:26:06  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-04-04 11:26:06
      2013-04-04 11:26:06 LustreError: dumping log to /tmp/lustre-log.1365099965.8074
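
In the trace above, entries prefixed with `?` are addresses the kernel found on the stack but could not confirm as part of the active call chain, so the blocking path is easier to read with them filtered out. The following is a minimal Python sketch, assuming the trace format shown in this log; `FRAME_RE` and `reliable_frames` are illustrative names, not part of any Lustre or ZFS tooling.

```python
import re

# Matches one frame of a Linux kernel call trace, e.g.
#   [<ffffffffa060633b>] zio_wait+0xeb/0x160 [zfs]
# The optional "?" marks a frame the unwinder could not verify.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(trace_text):
    """Return (function, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m and not m.group("unreliable"):
            # Frames without a [module] suffix belong to the core kernel.
            frames.append((m.group("func"), m.group("module") or "kernel"))
    return frames


# First few lines of the watchdog trace from this report.
SAMPLE = """\
2013-04-04 11:26:05  [<ffffffffa056cd40>] ? arc_read_nolock+0x530/0x810 [zfs]
2013-04-04 11:26:05  [<ffffffffa04e45ac>] cv_wait_common+0x9c/0x1a0 [spl]
2013-04-04 11:26:05  [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
2013-04-04 11:26:05  [<ffffffffa04e46e3>] __cv_wait+0x13/0x20 [spl]
2013-04-04 11:26:05  [<ffffffffa060633b>] zio_wait+0xeb/0x160 [zfs]
2013-04-04 11:26:05  [<ffffffffa057106d>] dbuf_read+0x3fd/0x720 [zfs]
"""

if __name__ == "__main__":
    for func, module in reliable_frames(SAMPLE):
        print(f"{module:8s} {func}")
```

Applied to the full trace, the remaining frames show the `ll_ost_io00_018` thread sleeping in `zio_wait`, reached through `dbuf_read` called from `dbuf_prefetch` and `dmu_zfetch` while `ost_brw_read` was servicing a read.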
      

          Activity

            adilger Andreas Dilger made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            adilger Andreas Dilger made changes -
            Link New: This issue duplicates LU-5278 [ LU-5278 ]
            jessica Jessica A. Popp (Inactive) made changes -
            Priority Original: Major [ 3 ] New: Blocker [ 1 ]
            utopiabound Nathaniel Clark made changes -
            Labels Original: prz zfs New: performance prz zfs
            pjones Peter Jones made changes -
            Labels Original: zfs New: prz zfs
            utopiabound Nathaniel Clark made changes -
            Link New: This issue is related to LU-2887 [ LU-2887 ]
            utopiabound Nathaniel Clark made changes -
            Labels New: zfs
            cliffw Cliff White (Inactive) made changes -
            Attachment New: agb13.lustre.log.gz [ 12480 ]
            cliffw Cliff White (Inactive) created issue -

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Cliff White (cliffw, Inactive)
              Votes: 0
              Watchers: 7
