[LU-7987] Lustre 2.8 OSS with zfs 0.6.5 backend hitting most schedule_timeout

Details

    • Type: Bug
    • Priority: Critical
    • Resolution: Low Priority

    Description

      While trying to test for the LU-7404 OST timeout, one of the OSS nodes became inaccessible via ssh and conman.

      This was running with 2 OSS nodes (one zfs zpool each), 1 MDS, and 4 clients running IOR. The first message was seen fairly soon after starting. About 2 hours into the run, Lustre stopped responding to the clients and the stats scripts running on the OSS stopped running.

      First message seen was:
      Apr 4 21:06:29 wolf-4 kernel: LNet: Service thread pid 12256 was inactive for 200.30s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

      Apr  4 21:06:29 wolf-4 kernel: Pid: 12256, comm: ll_ost02_000
      Apr  4 21:06:29 wolf-4 kernel: #012Call Trace:
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffff8163a9e9>] schedule+0x29/0x70
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffff816386d9>] schedule_timeout+0x209/0x2d0
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa012dc0c>] ? vdev_mirror_io_start+0xac/0x190 [zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa012d150>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffff8101c829>] ? read_tsc+0x9/0x10
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffff810d814c>] ? ktime_get_ts64+0x4c/0xf0
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffff8163a01e>] io_schedule_timeout+0xae/0x130
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffff8163a0b8>] io_schedule+0x18/0x20
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00116e0>] cv_wait_common+0xb0/0x150 [spl]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00117d8>] __cv_wait_io+0x18/0x20 [spl]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa016db83>] zio_wait+0x123/0x210 [zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00c44ad>] dbuf_read+0x30d/0x930 [zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00c60eb>] dmu_buf_will_dirty+0x4b/0xa0 [zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00deb46>] dnode_dirty_l1+0x36/0x50 [zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00e146a>] dnode_free_range+0x4da/0x6b0 [zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00cb829>] dmu_free_long_range+0x1d9/0x280 [zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa11fe468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa11fe7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa120112a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa12fef9f>] ofd_trans_stop+0x1f/0x60 [ofd]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa1302bb2>] ofd_object_destroy+0x2b2/0x890 [ofd]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa12fb8f7>] ofd_destroy_by_fid+0x307/0x510 [ofd]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa0fce170>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa0fc8730>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa12edc67>] ofd_destroy_hdl+0x267/0xa50 [ofd]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa105af65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa1007beb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffffa0cec578>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffffa1006498>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffffa100bc90>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffffa100b200>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffff81645998>] ret_from_fork+0x58/0x90
      Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
      Apr  4 21:06:30 wolf-4 kernel: 
      Apr  4 21:06:30 wolf-4 kernel: LustreError: dumping log to /tmp/lustre-log.1459803990.12256
      Apr  4 21:07:12 wolf-4 ntpd_intres[3829]: ntp_intres.request: permission denied
      Apr  4 21:07:12 wolf-4 ntpd_intres[3829]: ntp_intres.request: permission denied
      Apr  4 21:07:19 wolf-4 kernel: LNet: Service thread pid 12256 completed after 249.57s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
      

      But the system appeared to keep running past that point and then came to a grinding halt.

      Run began: Mon Apr  4 23:25:07 2016
      Command line used: /usr/bin/IOR -i 3 -v -b 50G -t 128k
      Machine: Linux wolf-6.wolf.hpdd.intel.com
      Start time skew across all tasks: 8.97 sec
      Path: /mnt/lustre
      FS: 37.9 TiB   Used FS: 0.6%   Inodes: 171.8 Mi   Used Inodes: 32.0%
      Participating tasks: 4
      
      Summary:
              api                = POSIX
              test filename      = testFile
              access             = single-shared-file
              pattern            = segmented (1 segment)
              ordering in a file = sequential offsets
              ordering inter file= no tasks offsets
              clients            = 4 (1 per node)
              repetitions        = 3
              xfersize           = 131072 bytes
              blocksize          = 50 GiB
              aggregate filesize = 200 GiB
      

      No messages appeared on wolf-4 (OST1), but the problem was noticed by the clients:

      Apr  4 23:37:12 wolf-6 kernel: Lustre: 2811:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1459813025/real 0]  req@ffff880d8570ad00 x1530711614167736/t0(0) o400->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 224/224 e 0 to 1 dl 1459813032 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Apr  4 23:37:12 wolf-6 kernel: Lustre: coral-OST0001-osc-ffff881030662800: Connection to coral-OST0001 (at 192.168.1.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Apr  4 23:37:16 wolf-6 kernel: Lustre: 2751:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1459812992/real 1459812992]  req@ffff880266e0bc00 x1530711614167672/t0(0) o3->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:6/4 lens 608/432 e 0 to 1 dl 1459813036 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      Apr  4 23:37:16 wolf-6 kernel: Lustre: 2751:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      Apr  4 23:37:18 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1459813032/real 0]  req@ffff880bbe376900 x1530711614167740/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813038 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Apr  4 23:37:37 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813057/real 1459813057]  req@ffff880bbe375d00 x1530711614167756/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813068 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Apr  4 23:38:02 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813082/real 1459813082]  req@ffff880bbe375100 x1530711614167772/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813098 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Apr  4 23:38:11 wolf-6 ntpd_intres[2008]: ntp_intres.request: permission denied
      Apr  4 23:38:11 wolf-6 ntpd_intres[2008]: ntp_intres.request: permission denied
      Apr  4 23:38:27 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813107/real 1459813107]  req@ffff880bbe374500 x1530711614167788/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813128 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Apr  4 23:38:52 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813132/real 1459813132]  req@ffff880bbe373900 x1530711614167804/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813158 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      

      From stats being captured on OST1:

          time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
      23:30:53     0     0      0     0    0     0    0     0    0   200M   62G  
      23:31:29     0     0      0     0    0     0    0     0    0   200M   62G  
      23:32:06     0     0      0     0    0     0    0     0    0   200M   62G  
      23:32:42     0     0      0     0    0     0    0     0    0   200M   62G  
      23:33:19     0     0      0     0    0     0    0     0    0   200M   62G  
      23:33:55     0     0      0     0    0     0    0     0    0    62G   62G  
      23:34:36     0     0      0     0    0     0    0     0    0    62G   62G  
      23:35:18    10    10    100    10  100     0    0     0    0    62G   62G  
      23:35:58    91    91    100    47  100    44  100     1  100    63G   62G  
      23:36:39     5     5    100     5  100     0    0     0    0    62G   62G  
      
      Full arcstats snapshot (/proc/spl/kstat/zfs/arcstats) from the same capture:

      6 1 0x01 91 4368 5368758848 527202119883608
      name                            type data
      hits                            4    91379399
      misses                          4    70802556
      demand_data_hits                4    62879518
      demand_data_misses              4    52414186
      demand_metadata_hits            4    23738570
      demand_metadata_misses          4    1105675
      prefetch_data_hits              4    4757199
      prefetch_data_misses            4    17281282
      prefetch_metadata_hits          4    4112
      prefetch_metadata_misses        4    1413
      mru_hits                        4    62351920
      mru_ghost_hits                  4    917736
      mfu_hits                        4    24266169
      mfu_ghost_hits                  4    632285
      deleted                         4    116650113
      mutex_miss                      4    447
      evict_skip                      4    184
      evict_not_enough                4    62
      evict_l2_cached                 4    0
      evict_l2_eligible               4    15019191377920
      evict_l2_ineligible             4    408751927296
      evict_l2_skip                   4    0
      hash_elements                   4    1059177
      hash_elements_max               4    1126648
      hash_collisions                 4    11087755
      hash_chains                     4    31686
      hash_chain_max                  4    4
      p                               4    63241464064
      c                               4    67457282048
      c_min                           4    33554432
      c_max                           4    67457282048
      size                            4    67456702536
      hdr_size                        4    396803920
      data_size                       4    66823651328
      metadata_size                   4    76845056
      other_size                      4    159402232
      anon_size                       4    39337984
      anon_evictable_data             4    0
      anon_evictable_metadata         4    0
      mru_size                        4    64780652544
      mru_evictable_data              4    64752189440
      mru_evictable_metadata          4    327680
      mru_ghost_size                  4    2679063040
      mru_ghost_evictable_data        4    2082865152
      mru_ghost_evictable_metadata    4    596197888
      mfu_size                        4    2080505856
      mfu_evictable_data              4    2024931328
      mfu_evictable_metadata          4    3112960
      mfu_ghost_size                  4    56092389888
      mfu_ghost_evictable_data        4    56057790464
      mfu_ghost_evictable_metadata    4    34599424
      l2_hits                         4    0
      l2_misses                       4    0
      l2_feeds                        4    0
      l2_rw_clash                     4    0
      l2_read_bytes                   4    0
      l2_write_bytes                  4    0
      l2_writes_sent                  4    0
      l2_writes_done                  4    0
      l2_writes_error                 4    0
      l2_writes_lock_retry            4    0
      l2_evict_lock_retry             4    0
      l2_evict_reading                4    0
      l2_evict_l1cached               4    0
      l2_free_on_write                4    0
      l2_cdata_free_on_write          4    0
      l2_abort_lowmem                 4    0
      l2_cksum_bad                    4    0
      l2_io_error                     4    0
      l2_size                         4    0
      l2_asize                        4    0
      l2_hdr_size                     4    0
      l2_compress_successes           4    0
      l2_compress_zeros               4    0
      l2_compress_failures            4    0
      memory_throttle_count           4    0
      duplicate_buffers               4    0
      duplicate_buffers_size          4    0
      duplicate_reads                 4    0
      memory_direct_count             4    0
      memory_indirect_count           4    0
      arc_no_grow                     4    0
      arc_tempreserve                 4    0
      arc_loaned_bytes                4    0
      arc_prune                       4    0
      arc_meta_used                   4    633051832
      arc_meta_limit                  4    50592961536
      arc_meta_max                    4    809615080
      arc_meta_min                    4    16777216
      arc_need_free                   4    0
      arc_sys_free                    4    2108039168 
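
      The timestamped "proc_arcstats.<date>:misses" lines below look like grep output over periodic copies of /proc/spl/kstat/zfs/arcstats taken on the OSS. A minimal sketch of such a sampler (the output naming and the roughly 40 s interval are assumptions, not the actual collection script) could be:

      #!/usr/bin/env python
      # Hypothetical arcstats sampler: copy the ZFS ARC kstat file at a fixed
      # interval so the counters can be compared over time afterwards.
      import shutil
      import time

      KSTAT = "/proc/spl/kstat/zfs/arcstats"   # standard ZFS-on-Linux kstat path
      INTERVAL = 40                            # seconds between samples (assumed)

      while True:
          stamp = time.strftime("%Y-%m-%d:%H:%M:%S")
          shutil.copyfile(KSTAT, "proc_arcstats." + stamp)
          time.sleep(INTERVAL)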
      
      
      proc_arcstats.2016-04-04:21:52:28:misses                          4    60302081
      proc_arcstats.2016-04-04:21:53:05:misses                          4    60302081
      proc_arcstats.2016-04-04:21:53:41:misses                          4    60302081
      proc_arcstats.2016-04-04:21:54:18:misses                          4    60302081
      proc_arcstats.2016-04-04:21:54:55:misses                          4    60302081
      proc_arcstats.2016-04-04:21:55:31:misses                          4    60302081
      proc_arcstats.2016-04-04:21:56:08:misses                          4    60302081
      proc_arcstats.2016-04-04:21:56:44:misses                          4    60302081
      proc_arcstats.2016-04-04:21:57:21:misses                          4    60302081
      proc_arcstats.2016-04-04:21:57:57:misses                          4    60302081
      proc_arcstats.2016-04-04:21:58:34:misses                          4    60302081
      proc_arcstats.2016-04-04:21:59:11:misses                          4    60302081
      proc_arcstats.2016-04-04:21:59:47:misses                          4    60302099
      proc_arcstats.2016-04-04:22:00:25:misses                          4    60302137
      proc_arcstats.2016-04-04:22:01:06:misses                          4    60374920
      proc_arcstats.2016-04-04:22:01:47:misses                          4    60513681
      proc_arcstats.2016-04-04:22:02:27:misses                          4    60644796
      proc_arcstats.2016-04-04:22:03:08:misses                          4    60782163
      proc_arcstats.2016-04-04:22:03:49:misses                          4    60921202
      proc_arcstats.2016-04-04:22:04:30:misses                          4    61049984
      proc_arcstats.2016-04-04:22:05:10:misses                          4    61131959
      proc_arcstats.2016-04-04:22:05:51:misses                          4    61133558
      proc_arcstats.2016-04-04:22:06:32:misses                          4    61203715
      proc_arcstats.2016-04-04:22:07:13:misses                          4    61317428
      proc_arcstats.2016-04-04:22:07:54:misses                          4    61406420
      proc_arcstats.2016-04-04:22:08:35:misses                          4    61520867
      proc_arcstats.2016-04-04:22:09:15:misses                          4    61659833
      proc_arcstats.2016-04-04:22:09:56:misses                          4    61796365
      proc_arcstats.2016-04-04:22:10:37:misses                          4    61905346
      proc_arcstats.2016-04-04:22:11:17:misses                          4    61977003
      proc_arcstats.2016-04-04:22:11:56:misses                          4    61977043
      proc_arcstats.2016-04-04:22:12:37:misses                          4    62083132
      proc_arcstats.2016-04-04:22:13:17:misses                          4    62215655
      proc_arcstats.2016-04-04:22:13:57:misses                          4    62348643
      proc_arcstats.2016-04-04:22:14:38:misses                          4    62494512
      proc_arcstats.2016-04-04:22:15:18:misses                          4    62630944
      proc_arcstats.2016-04-04:22:15:59:misses                          4    62727592
      proc_arcstats.2016-04-04:22:16:38:misses                          4    62806912
      proc_arcstats.2016-04-04:22:17:18:misses                          4    62809594
      proc_arcstats.2016-04-04:22:17:59:misses                          4    62832835
      proc_arcstats.2016-04-04:22:18:40:misses                          4    62874008
      proc_arcstats.2016-04-04:22:19:20:misses                          4    62915527
      proc_arcstats.2016-04-04:22:20:00:misses                          4    62956780
      proc_arcstats.2016-04-04:22:20:40:misses                          4    62998788
      proc_arcstats.2016-04-04:22:21:20:misses                          4    63041800
      proc_arcstats.2016-04-04:22:22:01:misses                          4    63083422
      proc_arcstats.2016-04-04:22:22:41:misses                          4    63125355
      proc_arcstats.2016-04-04:22:23:21:misses                          4    63167419
      proc_arcstats.2016-04-04:22:24:01:misses                          4    63210702
      proc_arcstats.2016-04-04:22:24:41:misses                          4    63252813
      proc_arcstats.2016-04-04:22:25:21:misses                          4    63295877
      proc_arcstats.2016-04-04:22:26:01:misses                          4    63339516
      proc_arcstats.2016-04-04:22:26:41:misses                          4    63383531
      proc_arcstats.2016-04-04:22:27:22:misses                          4    63417625
      proc_arcstats.2016-04-04:22:28:02:misses                          4    63418160
      proc_arcstats.2016-04-04:22:28:44:misses                          4    63418448
      proc_arcstats.2016-04-04:22:29:25:misses                          4    63446807
      proc_arcstats.2016-04-04:22:30:06:misses                          4    63605174
      proc_arcstats.2016-04-04:22:30:46:misses                          4    63787444
      proc_arcstats.2016-04-04:22:31:27:misses                          4    63991757
      proc_arcstats.2016-04-04:22:32:08:misses                          4    64186064
      proc_arcstats.2016-04-04:22:32:48:misses                          4    64370507
      proc_arcstats.2016-04-04:22:33:29:misses                          4    64544830
      proc_arcstats.2016-04-04:22:34:10:misses                          4    64705200
      proc_arcstats.2016-04-04:22:34:51:misses                          4    64890908
      proc_arcstats.2016-04-04:22:35:31:misses                          4    65051186
      proc_arcstats.2016-04-04:22:36:12:misses                          4    65093562
      proc_arcstats.2016-04-04:22:36:48:misses                          4    65093566
      proc_arcstats.2016-04-04:22:37:25:misses                          4    65093566
      proc_arcstats.2016-04-04:22:38:02:misses                          4    65093566
      proc_arcstats.2016-04-04:22:38:38:misses                          4    65093566
      proc_arcstats.2016-04-04:22:39:15:misses                          4    65093566
      proc_arcstats.2016-04-04:22:39:51:misses                          4    65093566
      proc_arcstats.2016-04-04:22:40:28:misses                          4    65093566
      proc_arcstats.2016-04-04:22:41:05:misses                          4    65093566
      proc_arcstats.2016-04-04:22:41:41:misses                          4    65093566
      proc_arcstats.2016-04-04:22:42:18:misses                          4    65093566
      proc_arcstats.2016-04-04:22:42:54:misses                          4    65093566
      proc_arcstats.2016-04-04:22:43:31:misses                          4    65093566
      proc_arcstats.2016-04-04:22:44:07:misses                          4    65093566
      proc_arcstats.2016-04-04:22:44:44:misses                          4    65093593
      proc_arcstats.2016-04-04:22:45:24:misses                          4    65093635
      proc_arcstats.2016-04-04:22:46:05:misses                          4    65093688
      proc_arcstats.2016-04-04:22:46:47:misses                          4    65204992
      proc_arcstats.2016-04-04:22:47:27:misses                          4    65369668
      proc_arcstats.2016-04-04:22:48:08:misses                          4    65543276
      proc_arcstats.2016-04-04:22:48:49:misses                          4    65700706
      proc_arcstats.2016-04-04:22:49:29:misses                          4    65835289
      proc_arcstats.2016-04-04:22:50:10:misses                          4    65978426
      proc_arcstats.2016-04-04:22:50:51:misses                          4    66102294
      proc_arcstats.2016-04-04:22:51:32:misses                          4    66237487
      proc_arcstats.2016-04-04:22:52:12:misses                          4    66372029
      proc_arcstats.2016-04-04:22:52:53:misses                          4    66491789
      proc_arcstats.2016-04-04:22:53:33:misses                          4    66606048
      proc_arcstats.2016-04-04:22:54:14:misses                          4    66728350
      proc_arcstats.2016-04-04:22:54:54:misses                          4    66759150
      proc_arcstats.2016-04-04:22:55:35:misses                          4    66783980
      proc_arcstats.2016-04-04:22:56:16:misses                          4    66882343
      proc_arcstats.2016-04-04:22:56:56:misses                          4    66986649
      proc_arcstats.2016-04-04:22:57:37:misses                          4    67099214
      proc_arcstats.2016-04-04:22:58:18:misses                          4    67236074
      proc_arcstats.2016-04-04:22:58:58:misses                          4    67359854
      proc_arcstats.2016-04-04:22:59:39:misses                          4    67479208
      proc_arcstats.2016-04-04:23:00:19:misses                          4    67603703
      proc_arcstats.2016-04-04:23:01:00:misses                          4    67605334
      proc_arcstats.2016-04-04:23:01:42:misses                          4    67687472
      proc_arcstats.2016-04-04:23:02:22:misses                          4    67823336
      proc_arcstats.2016-04-04:23:03:03:misses                          4    67950500
      proc_arcstats.2016-04-04:23:03:44:misses                          4    68083660
      proc_arcstats.2016-04-04:23:04:24:misses                          4    68221409
      proc_arcstats.2016-04-04:23:05:04:misses                          4    68335712
      proc_arcstats.2016-04-04:23:05:45:misses                          4    68441748
      proc_arcstats.2016-04-04:23:06:24:misses                          4    68443307
      proc_arcstats.2016-04-04:23:07:05:misses                          4    68454637
      proc_arcstats.2016-04-04:23:07:46:misses                          4    68578571
      proc_arcstats.2016-04-04:23:08:27:misses                          4    68690534
      proc_arcstats.2016-04-04:23:09:07:misses                          4    68808692
      proc_arcstats.2016-04-04:23:09:48:misses                          4    68930912
      proc_arcstats.2016-04-04:23:10:28:misses                          4    69068534
      proc_arcstats.2016-04-04:23:11:09:misses                          4    69192434
      proc_arcstats.2016-04-04:23:11:50:misses                          4    69283782
      proc_arcstats.2016-04-04:23:12:30:misses                          4    69284522
      proc_arcstats.2016-04-04:23:13:12:misses                          4    69315159
      proc_arcstats.2016-04-04:23:13:52:misses                          4    69373955
      proc_arcstats.2016-04-04:23:14:33:misses                          4    69449303
      proc_arcstats.2016-04-04:23:15:13:misses                          4    69523499
      proc_arcstats.2016-04-04:23:15:53:misses                          4    69599070
      proc_arcstats.2016-04-04:23:16:34:misses                          4    69674993
      proc_arcstats.2016-04-04:23:17:14:misses                          4    69750733
      proc_arcstats.2016-04-04:23:17:55:misses                          4    69827307
      proc_arcstats.2016-04-04:23:18:35:misses                          4    69847427
      proc_arcstats.2016-04-04:23:19:16:misses                          4    69848231
      proc_arcstats.2016-04-04:23:19:57:misses                          4    69884019
      proc_arcstats.2016-04-04:23:20:37:misses                          4    69946465
      proc_arcstats.2016-04-04:23:21:18:misses                          4    70016284
      proc_arcstats.2016-04-04:23:21:58:misses                          4    70081329
      proc_arcstats.2016-04-04:23:22:39:misses                          4    70148143
      proc_arcstats.2016-04-04:23:23:19:misses                          4    70213937
      proc_arcstats.2016-04-04:23:23:59:misses                          4    70283057
      proc_arcstats.2016-04-04:23:24:40:misses                          4    70353187
      proc_arcstats.2016-04-04:23:25:21:misses                          4    70415880
      proc_arcstats.2016-04-04:23:26:00:misses                          4    70415996
      proc_arcstats.2016-04-04:23:26:37:misses                          4    70415996
      proc_arcstats.2016-04-04:23:27:13:misses                          4    70415996
      proc_arcstats.2016-04-04:23:27:50:misses                          4    70415996
      proc_arcstats.2016-04-04:23:28:26:misses                          4    70415996
      proc_arcstats.2016-04-04:23:29:03:misses                          4    70415996
      proc_arcstats.2016-04-04:23:29:40:misses                          4    70415996
      proc_arcstats.2016-04-04:23:30:16:misses                          4    70415996
      proc_arcstats.2016-04-04:23:30:53:misses                          4    70415996
      proc_arcstats.2016-04-04:23:31:29:misses                          4    70415996
      proc_arcstats.2016-04-04:23:32:06:misses                          4    70415996
      proc_arcstats.2016-04-04:23:32:42:misses                          4    70415996
      proc_arcstats.2016-04-04:23:33:19:misses                          4    70415996
      proc_arcstats.2016-04-04:23:33:55:misses                          4    70416034
      proc_arcstats.2016-04-04:23:34:36:misses                          4    70416109
      proc_arcstats.2016-04-04:23:35:18:misses                          4    70417565
      proc_arcstats.2016-04-04:23:35:58:misses                          4    70597975
      proc_arcstats.2016-04-04:23:36:39:misses                          4    70802556
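
      A quick way to spot the stall windows in the series above is to compute per-interval deltas of the misses counter; the minutes-long stretches where the counter does not move line up with the periods when the OSS stopped doing IO. A small illustrative helper (the input file name is an assumption):

      #!/usr/bin/env python
      # Hypothetical helper: parse the grep-style lines above and print how far
      # the ARC "misses" counter advanced between consecutive samples.
      import re
      from datetime import datetime

      PAT = re.compile(r"proc_arcstats\.(\d{4}-\d{2}-\d{2}:\d{2}:\d{2}:\d{2}):misses\s+\d+\s+(\d+)")

      prev = None
      with open("arcstats_misses.txt") as f:      # file holding the lines above (assumed name)
          for line in f:
              m = PAT.search(line)
              if not m:
                  continue
              ts = datetime.strptime(m.group(1), "%Y-%m-%d:%H:%M:%S")
              misses = int(m.group(2))
              if prev is not None:
                  dt = (ts - prev[0]).total_seconds()
                  print("%s  +%d misses over %.0fs" % (ts, misses - prev[1], dt))
              prev = (ts, misses)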
      
      

      Attachments

        1. lustre-log.1459803990.12256.gz
          11.15 MB
        2. wolf-3-oss0-messages-20160403.gz
          256 kB
        3. wolf-4-oss1-messages.gz
          60 kB
        4. wolf-4-oss1-messages-20160403.gz
          141 kB
        5. wolf-4-ost1-stats.tgz
          3.91 MB
        6. wolf-5-mds-messages-20160404.gz
          92 kB
        7. wolf-6-client-messages.gz
          49 kB
        8. wolf-7-client-messages-20160403.gz
          157 kB

        Issue Links

          Activity

            [LU-7987] Lustre 2.8 OSS with zfs 0.6.5 backend hitting most schedule_timeout

            This is over a year old – I don't recall hitting this on 2.9.0 with ZFS 0.7.0 RC3.

            jsalians_intel John Salinas (Inactive) added a comment

            Re-did the configuration so we now have 4 OSTs on 2 OSS nodes, like this:
            raidz2: 11 data disks, 1 spare

            # zpool list
            NAME   SIZE  ALLOC   FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
            ost0    10T   123G  9.88T         -    0%   1%  1.00x  ONLINE  -
            ost1    10T   123G  9.88T         -    0%   1%  1.00x  ONLINE  -

            # zpool status -v ost0
              pool: ost0
             state: ONLINE
              scan: none requested
            config:

                    NAME         STATE     READ WRITE CKSUM
                    ost0         ONLINE       0     0     0
                      raidz2-0   ONLINE       0     0     0
                        mpathaj  ONLINE       0     0     0
                        mpathai  ONLINE       0     0     0
                        mpathah  ONLINE       0     0     0
                        mpathag  ONLINE       0     0     0
                        mpathaq  ONLINE       0     0     0
                        mpathap  ONLINE       0     0     0
                        mpathak  ONLINE       0     0     0
                        mpathz   ONLINE       0     0     0
                        mpatham  ONLINE       0     0     0
                        mpathal  ONLINE       0     0     0
                        mpathao  ONLINE       0     0     0
                    spares
                      mpathan    AVAIL

            After several hours we hit:
            [679742.303316] LNet: Service thread pid 53598 was inactive for 200.31s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [679742.324593] Pid: 53598, comm: ll_ost02_003
            [679742.331254]
            [679742.331254] Call Trace:
            [679742.340814] [<ffffffff8163ae49>] schedule+0x29/0x70
            [679742.348390] [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
            [679742.357295] [<ffffffffa0162b60>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
            [679742.366769] [<ffffffff8101c829>] ? read_tsc+0x9/0x10
            [679742.374128] [<ffffffff810d816c>] ? ktime_get_ts64+0x4c/0xf0
            [679742.382209] [<ffffffff8163a47e>] io_schedule_timeout+0xae/0x130
            [679742.390536] [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
            [679742.399626] [<ffffffff8163a518>] io_schedule+0x18/0x20
            [679742.407025] [<ffffffffa0014b34>] cv_wait_common+0xf4/0x2d0 [spl]
            [679742.415409] [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
            [679742.424171] [<ffffffffa0014d68>] __cv_wait_io+0x18/0x20 [spl]
            [679742.432274] [<ffffffffa01b3ccb>] zio_wait+0x1fb/0x3e0 [zfs]
            [679742.440120] [<ffffffffa00d8509>] dbuf_read+0x389/0xda0 [zfs]
            [679742.448070] [<ffffffffa00f8aa6>] dmu_tx_count_free+0x316/0x9d0 [zfs]
            [679742.456677] [<ffffffffa00f92ba>] dmu_tx_hold_free+0x15a/0x310 [zfs]
            [679742.465269] [<ffffffffa00e3a1b>] dmu_free_long_range+0x1eb/0x370 [zfs]
            [679742.473940] [<ffffffffa101b468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
            [679742.483415] [<ffffffffa101b7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
            [679742.492815] [<ffffffffa101e12a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
            [679742.501381] [<ffffffffa13a63bf>] ofd_trans_stop+0x1f/0x60 [ofd]
            [679742.509314] [<ffffffffa13a9fd2>] ofd_object_destroy+0x2b2/0x890 [ofd]
            [679742.517793] [<ffffffffa13a2d17>] ofd_destroy_by_fid+0x307/0x510 [ofd]
            [679742.526400] [<ffffffffa10921b0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
            [679742.535106] [<ffffffffa108c770>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
            [679742.543932] [<ffffffffa1394fc7>] ofd_destroy_hdl+0x267/0xa50 [ofd]
            [679742.552208] [<ffffffffa111fe65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [679742.561087] [<ffffffffa10cc54b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [679742.570871] [<ffffffffa0d5e488>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [679742.579628] [<ffffffffa10cadf8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [679742.588394] [<ffffffff810b8c42>] ? default_wake_function+0x12/0x20
            [679742.596477] [<ffffffff810af048>] ? __wake_up_common+0x58/0x90
            [679742.604093] [<ffffffffa10d05f0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [679742.612254] [<ffffffffa10cfb60>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
            [679742.620262] [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [679742.626677] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
            [679742.633165] [<ffffffff81645e18>] ret_from_fork+0x58/0x90
            [679742.640131] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
            [679742.646598]
            [679742.649149] LustreError: dumping log to /tmp/lustre-log.1462399201.53598
            [680136.855471] Lustre: 53889:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
            [680136.855471] req@ffff880fd8048c50 x1533409120101488/t4298772688(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/432 e 24 to 0 dl 1462399600 ref 2 fl Interpret:/0/0 rc 0/0
            [680143.700879] Lustre: lsdraid-OST0000: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
            [680143.713507] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)

            These appear to be the tail of a ZFS latency histogram kstat (buckets in ns):
            134217728 ns 4 819
            268435456 ns 4 331
            536870912 ns 4 1
            [root@wolf-3 proc]# cat spl/kstat/zfs/dmu_tx
            5 1 0x01 11 528 5351754919 689457420447816
            name type data
            dmu_tx_assigned 4 7598469
            dmu_tx_delay 4 0
            dmu_tx_error 4 0
            dmu_tx_suspended 4 0
            dmu_tx_group 4 2209
            dmu_tx_memory_reserve 4 0
            dmu_tx_memory_reclaim 4 0
            dmu_tx_dirty_throttle 4 0
            dmu_tx_dirty_delay 4 5878718
            dmu_tx_dirty_over_max 4 43
            dmu_tx_quota 4 0
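
            Comparing this dmu_tx snapshot with the later one in this comment, dmu_tx_assigned grows from 7598469 to 15535972 while dmu_tx_dirty_delay grows from 5878718 to 13023543, so the ZFS dirty-data throttle appears to be delaying the large majority of assigned transactions. A small sketch for diffing two saved copies of /proc/spl/kstat/zfs/dmu_tx (the snapshot file names are assumptions):

            #!/usr/bin/env python
            # Hypothetical helper: show which dmu_tx counters moved between two
            # snapshots of /proc/spl/kstat/zfs/dmu_tx saved at different times.
            def read_dmu_tx(path):
                stats = {}
                with open(path) as f:
                    for line in f:
                        parts = line.split()
                        # data lines look like: "dmu_tx_assigned 4 7598469"
                        if len(parts) == 3 and parts[0].startswith("dmu_tx"):
                            stats[parts[0]] = int(parts[2])
                return stats

            before = read_dmu_tx("dmu_tx.before")   # assumed snapshot file names
            after = read_dmu_tx("dmu_tx.after")
            for name in sorted(after):
                delta = after[name] - before.get(name, 0)
                if delta:
                    print("%-28s %+d" % (name, delta))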

            [335587.853762] LNet: Service thread pid 62899 was inactive for 200.63s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [335587.875541] Pid: 62899, comm: ll_ost02_003
            [335587.881750]
            [335587.881750] Call Trace:
            [335587.888632] [<ffffffff8163ae49>] schedule+0x29/0x70
            [335587.895317] [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
            [335587.902969] [<ffffffff81647a71>] ? ftrace_call+0x5/0x2f
            [335587.910044] [<ffffffff8163a47e>] io_schedule_timeout+0xae/0x130
            [335587.918461] [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
            [335587.927211] [<ffffffff8163a518>] io_schedule+0x18/0x20
            [335587.934223] [<ffffffffa0014b34>] cv_wait_common+0xf4/0x2d0 [spl]
            [335587.942193] [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
            [335587.950855] [<ffffffffa0014d68>] __cv_wait_io+0x18/0x20 [spl]
            [335587.958626] [<ffffffffa01b3ccb>] zio_wait+0x1fb/0x3e0 [zfs]
            [335587.966154] [<ffffffffa00d8509>] dbuf_read+0x389/0xda0 [zfs]
            [335587.973730] [<ffffffff811c0505>] ? kfree+0x5/0x140
            [335587.980383] [<ffffffffa00f8aa6>] dmu_tx_count_free+0x316/0x9d0 [zfs]
            [335587.989011] [<ffffffffa00f92ba>] dmu_tx_hold_free+0x15a/0x310 [zfs]
            [335587.997760] [<ffffffffa00e3a1b>] dmu_free_long_range+0x1eb/0x370 [zfs]
            [335588.006710] [<ffffffffa1267468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
            [335588.016378] [<ffffffffa12677d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
            [335588.026054] [<ffffffffa126a12a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
            [335588.034667] [<ffffffffa13683bf>] ofd_trans_stop+0x1f/0x60 [ofd]
            [335588.042533] [<ffffffffa136bfd2>] ofd_object_destroy+0x2b2/0x890 [ofd]
            [335588.051511] [<ffffffffa1364d17>] ofd_destroy_by_fid+0x307/0x510 [ofd]
            [335588.060174] [<ffffffffa10351b0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
            [335588.068843] [<ffffffffa102f770>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
            [335588.078141] [<ffffffffa1356fc7>] ofd_destroy_hdl+0x267/0xa50 [ofd]
            [335588.086420] [<ffffffffa10c2e65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [335588.095262] [<ffffffffa106f54b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [335588.104915] [<ffffffff81647a71>] ? ftrace_call+0x5/0x2f
            [335588.112281] [<ffffffffa106ddf8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [335588.121002] [<ffffffffa10735f0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [335588.129141] [<ffffffffa1072b60>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
            [335588.137394] [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [335588.143860] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
            [335588.150534] [<ffffffff81645e18>] ret_from_fork+0x58/0x90
            [335588.157549] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
            [335588.164131]
            [335588.166700] LustreError: dumping log to /tmp/lustre-log.1461715083.62899
            [335981.877892] Lustre: 62655:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
            [335981.877892] req@ffff880fd8420c50 x1532585040628720/t4313491535(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/432 e 19 to 0 dl 1461715482 ref 2 fl Interpret:/0/0 rc 0/0
            [335981.913494] Lustre: 62655:0:(service.c:1335:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
            [335988.734461] Lustre: lsdraid-OST0000: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
            [335988.746876] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
            [335988.759401] Lustre: lsdraid-OST0000: deleting orphan objects from 0x0:612 to 0x0:641
            [336371.937986] Lustre: 62899:0:(service.c:2096:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:385s); client may timeout. req@ffff880fd8420c50 x1532585040628720/t4313491535(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/400 e 19 to 0 dl 1461715482 ref 1 fl Complete:/0/0 rc 0/0
            [336371.973053] LNet: Service thread pid 62899 completed after 985.17s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

            # cat spl/kstat/zfs/dmu_tx
              5 1 0x01 11 528 5351754919 739071997976145
              name type data
              dmu_tx_assigned 4 15535972
              dmu_tx_delay 4 0
              dmu_tx_error 4 0
              dmu_tx_suspended 4 0
              dmu_tx_group 4 2322
              dmu_tx_memory_reserve 4 0
              dmu_tx_memory_reclaim 4 0
              dmu_tx_dirty_throttle 4 0
              dmu_tx_dirty_delay 4 13023543
              dmu_tx_dirty_over_max 4 133
              dmu_tx_quota 4 0
              [root@wolf-3 proc]#
              [739014.340130] Lustre: 53889:0:(service.c:2096:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:431s); client may timeout. req@ffff880fd804d450 x1533409120922528/t4298730344(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/400 e 0 to 0 dl 1462458074 ref 1 fl Complete:/0/0 rc 0/0
              [739028.651612] LNet: Service thread pid 53598 was inactive for 1200.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
              [739028.675865] Pid: 53598, comm: ll_ost02_003
              [739028.682914]
              [739028.682914] Call Trace:
              [739028.692109] [<ffffffff8163ae49>] schedule+0x29/0x70
              [739028.699880] [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
              [739028.709055] [<ffffffffa0162b60>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
              [739028.718690] [<ffffffff8101c829>] ? read_tsc+0x9/0x10
              [739028.726286] [<ffffffff810d816c>] ? ktime_get_ts64+0x4c/0xf0
              [739028.734553] [<ffffffff8163a47e>] io_schedule_timeout+0xae/0x130
              [739028.743102] [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
              [739028.752390] [<ffffffff8163a518>] io_schedule+0x18/0x20
              [739028.760058] [<ffffffffa0014b34>] cv_wait_common+0xf4/0x2d0 [spl]
              [739028.768673] [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
              [739028.777539] [<ffffffffa0014d68>] __cv_wait_io+0x18/0x20 [spl]
              [739028.785539] [<ffffffffa01b3ccb>] zio_wait+0x1fb/0x3e0 [zfs]
              [739028.793281] [<ffffffffa00d8509>] dbuf_read+0x389/0xda0 [zfs]
              [739028.801142] [<ffffffffa00f8aa6>] dmu_tx_count_free+0x316/0x9d0 [zfs]
              [739028.809709] [<ffffffffa00f92ba>] dmu_tx_hold_free+0x15a/0x310 [zfs]
              [739028.818152] [<ffffffffa00e3a1b>] dmu_free_long_range+0x1eb/0x370 [zfs]
              [739028.826887] [<ffffffffa101b468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
              [739028.836333] [<ffffffffa101b7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
              [739028.845744] [<ffffffffa101e12a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
              [739028.854301] [<ffffffffa13a63bf>] ofd_trans_stop+0x1f/0x60 [ofd]
              [739028.862236] [<ffffffffa13a9fd2>] ofd_object_destroy+0x2b2/0x890 [ofd]
              [739028.870718] [<ffffffffa13a2d17>] ofd_destroy_by_fid+0x307/0x510 [ofd]
              [739028.879272] [<ffffffffa10921b0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
              [739028.887981] [<ffffffffa108c770>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
              [739028.896822] [<ffffffffa1394fc7>] ofd_destroy_hdl+0x267/0xa50 [ofd]
              [739028.905006] [<ffffffffa111fe65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
              [739028.913852] [<ffffffffa10cc54b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
              [739028.923532] [<ffffffffa0d5e488>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
              [739028.932235] [<ffffffffa10cadf8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
              [739028.940874] [<ffffffff810b8c42>] ? default_wake_function+0x12/0x20
              [739028.948909] [<ffffffff810af048>] ? __wake_up_common+0x58/0x90
              [739028.956470] [<ffffffffa10d05f0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
              [739028.964477] [<ffffffffa10cfb60>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
              [739028.972438] [<ffffffff810a5aef>] kthread+0xcf/0xe0
              [739028.978828] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
              [739028.985305] [<ffffffff81645e18>] ret_from_fork+0x58/0x90
              [739028.992242] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
              [739028.998681]
              [739029.001223] LustreError: dumping log to /tmp/lustre-log.1462458520.53598
              [739136.811037] Lustre: 53598:0:(service.c:2096:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:554s); client may timeout. req@ffff880fd804d050 x1533409120922576/t4305999451(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/400 e 0 to 0 dl 1462458074 ref 1 fl Complete:/0/0 rc 0/0
              [739136.845944] LNet: Service thread pid 53598 completed after 1308.26s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
              [740277.248500] Lustre: 53889:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (3/-152), not sending early reply
              [740277.248500] req@ffff880fd8049050 x1533409120946656/t4299258359(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/432 e 0 to 0 dl 1462459772 ref 2 fl Interpret:/0/0 rc 0/0
              [740277.287603] Lustre: 53889:0:(service.c:1335:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
              [740281.780696] Lustre: lsdraid-OST0001: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
              [740281.793398] Lustre: lsdraid-OST0001: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
              [740281.806603] Lustre: Skipped 1 previous similar message
              [740281.814757] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:579 to 0x0:609
              [740727.562148] LNet: Service thread pid 53598 was inactive for 1202.40s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
              [740727.585245] Pid: 53598, comm: ll_ost02_003
              [740727.591721]
              [740727.591721] Call Trace:
              [740727.599850] [<ffffffff8163ae49>] schedule+0x29/0x70
              [740727.607143] [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
              [740727.615515] [<ffffffffa0162b60>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
              [740727.624759] [<ffffffff8101c829>] ? read_tsc+0x9/0x10
              [740727.632053] [<ffffffff810d816c>] ? ktime_get_ts64+0x4c/0xf0
              [740727.639920] [<ffffffff8163a47e>] io_schedule_timeout+0xae/0x130
              [740727.648166] [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
              [740727.657014] [<ffffffff8163a518>] io_schedule+0x18/0x20
              [740727.664332] [<ffffffffa0014b34>] cv_wait_common+0xf4/0x2d0 [spl]
              [740727.672529] [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
              [740727.681141] [<ffffffffa0014d68>] __cv_wait_io+0x18/0x20 [spl]
              [740727.689033] [<ffffffffa01b3ccb>] zio_wait+0x1fb/0x3e0 [zfs]
              [740727.696713] [<ffffffffa00d8509>] dbuf_read+0x389/0xda0 [zfs]
              [740727.704421] [<ffffffffa00dcad0>] dmu_buf_will_dirty+0x70/0x120 [zfs]
              [740727.712909] [<ffffffffa00fc5d6>] dnode_dirty_l1+0x36/0x50 [zfs]
              [740727.720855] [<ffffffffa010197a>] dnode_free_range+0x4fa/0xbb0 [zfs]
              [740727.729183] [<ffffffffa0153095>] ? txg_rele_to_quiesce+0x45/0x80 [zfs]
              [740727.737748] [<ffffffffa00f6fae>] ? dmu_tx_assign+0x50e/0x760 [zfs]
              [740727.745922] [<ffffffffa00e3a45>] dmu_free_long_range+0x215/0x370 [zfs]
              [740727.754422] [<ffffffffa101b468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
              [740727.763708] [<ffffffffa101b7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
              [740727.772977] [<ffffffffa101e12a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
              [740727.781391] [<ffffffffa13a63bf>] ofd_trans_stop+0x1f/0x60 [ofd]
              [740727.789208] [<ffffffffa13a9fd2>] ofd_object_destroy+0x2b2/0x890 [ofd]
              [740727.797612] [<ffffffffa13a2d17>] ofd_destroy_by_fid+0x307/0x510 [ofd]
              [740727.806069] [<ffffffffa10921b0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
              [740727.814704] [<ffffffffa108c770>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
              [740727.823508] [<ffffffffa1394fc7>] ofd_destroy_hdl+0x267/0xa50 [ofd]
              [740727.831667] [<ffffffffa111fe65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
              [740727.840467] [<ffffffffa10cc54b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
              [740727.850109] [<ffffffffa0d5e488>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
              [740727.858775] [<ffffffffa10cadf8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
              [740727.867381] [<ffffffff810b8c42>] ? default_wake_function+0x12/0x20
              [740727.875391] [<ffffffff810af048>] ? __wake_up_common+0x58/0x90
              [740727.882922] [<ffffffffa10d05f0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
              [740727.890901] [<ffffffffa10cfb60>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
              [740727.898821] [<ffffffff810a5aef>] kthread+0xcf/0xe0
              [740727.905168] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
              [740727.911593] [<ffffffff81645e18>] ret_from_fork+0x58/0x90
              [740727.918478] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
              [740727.924859]
              [740727.927331] LustreError: dumping log to /tmp/lustre-log.1462460220.53598
              [741038.149730] Lustre: lsdraid-OST0001: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
              [741038.162186] Lustre: lsdraid-OST0001: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
              [741038.175159] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:579 to 0x0:641
              [741793.608101] Lustre: lsdraid-OST0001: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
              [741793.620544] Lustre: lsdraid-OST0001: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
              [741793.633589] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:579 to 0x0:673
              [742172.088263] Lustre: 53598:0:(service.c:2096:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:1893s); client may timeout. req@ffff880fd8049050 x1533409120946656/t4299258359(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/400 e 0 to 0 dl 1462459772 ref 1 fl Complete:/0/0 rc 0/0
              [742172.123285] LNet: Service thread pid 53598 completed after 2647.75s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

            It appears that, from the client perspective, things have deadlocked and are not making progress anymore. Going to reboot OSS0/wolf-3.

            jsalians_intel John Salinas (Inactive) added a comment
Dumping the stack trace for debugging purposes: [740727.585245] Pid: 53598, comm: ll_ost02_003 [740727.591721] [740727.591721] Call Trace: [740727.599850] [<ffffffff8163ae49>] schedule+0x29/0x70 [740727.607143] [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0 [740727.615515] [<ffffffffa0162b60>] ? vdev_mirror_child_done+0x0/0x30 [zfs] [740727.624759] [<ffffffff8101c829>] ? read_tsc+0x9/0x10 [740727.632053] [<ffffffff810d816c>] ? ktime_get_ts64+0x4c/0xf0 [740727.639920] [<ffffffff8163a47e>] io_schedule_timeout+0xae/0x130 [740727.648166] [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90 [740727.657014] [<ffffffff8163a518>] io_schedule+0x18/0x20 [740727.664332] [<ffffffffa0014b34>] cv_wait_common+0xf4/0x2d0 [spl] [740727.672529] [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40 [740727.681141] [<ffffffffa0014d68>] __cv_wait_io+0x18/0x20 [spl] [740727.689033] [<ffffffffa01b3ccb>] zio_wait+0x1fb/0x3e0 [zfs] [740727.696713] [<ffffffffa00d8509>] dbuf_read+0x389/0xda0 [zfs] [740727.704421] [<ffffffffa00dcad0>] dmu_buf_will_dirty+0x70/0x120 [zfs] [740727.712909] [<ffffffffa00fc5d6>] dnode_dirty_l1+0x36/0x50 [zfs] [740727.720855] [<ffffffffa010197a>] dnode_free_range+0x4fa/0xbb0 [zfs] [740727.729183] [<ffffffffa0153095>] ? txg_rele_to_quiesce+0x45/0x80 [zfs] [740727.737748] [<ffffffffa00f6fae>] ? dmu_tx_assign+0x50e/0x760 [zfs] [740727.745922] [<ffffffffa00e3a45>] dmu_free_long_range+0x215/0x370 [zfs] [740727.754422] [<ffffffffa101b468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs] [740727.763708] [<ffffffffa101b7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs] [740727.772977] [<ffffffffa101e12a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs] [740727.781391] [<ffffffffa13a63bf>] ofd_trans_stop+0x1f/0x60 [ofd] [740727.789208] [<ffffffffa13a9fd2>] ofd_object_destroy+0x2b2/0x890 [ofd] [740727.797612] [<ffffffffa13a2d17>] ofd_destroy_by_fid+0x307/0x510 [ofd] [740727.806069] [<ffffffffa10921b0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc] [740727.814704] [<ffffffffa108c770>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc] [740727.823508] [<ffffffffa1394fc7>] ofd_destroy_hdl+0x267/0xa50 [ofd] [740727.831667] [<ffffffffa111fe65>] tgt_request_handle+0x915/0x1320 [ptlrpc] [740727.840467] [<ffffffffa10cc54b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [740727.850109] [<ffffffffa0d5e488>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [740727.858775] [<ffffffffa10cadf8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [740727.867381] [<ffffffff810b8c42>] ? default_wake_function+0x12/0x20 [740727.875391] [<ffffffff810af048>] ? __wake_up_common+0x58/0x90 [740727.882922] [<ffffffffa10d05f0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [740727.890901] [<ffffffffa10cfb60>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc] [740727.898821] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [740727.905168] [<ffffffff810a5a20>] ? kthread+0x0/0xe0 [740727.911593] [<ffffffff81645e18>] ret_from_fork+0x58/0x90 [740727.918478] [<ffffffff810a5a20>] ? 
kthread+0x0/0xe0 [740727.924859] [740727.927331] LustreError: dumping log to /tmp/lustre-log.1462460220.53598 [741038.149730] Lustre: lsdraid-OST0001: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting [741038.162186] Lustre: lsdraid-OST0001: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) [741038.175159] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:579 to 0x0:641 [741793.608101] Lustre: lsdraid-OST0001: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting [741793.620544] Lustre: lsdraid-OST0001: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) [741793.633589] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:579 to 0x0:673 [742172.088263] Lustre: 53598:0:(service.c:2096:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:1893s); client may timeout. req@ffff880fd8049050 x1533409120946656/t4299258359(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/400 e 0 to 0 dl 1462459772 ref 1 fl Complete:/0/0 rc 0/0 [742172.123285] LNet: Service thread pid 53598 completed after 2647.75s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). It appears things from the client perspective things have deadlocked and are not making progress anymore. Going to reboot OSS0/wolf-3.
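For reference, a pool with the layout shown in the zpool status output above would normally be created along the following lines. This is only a sketch reconstructed from that output; the actual create command, pool properties (ashift, compression, etc.) and dataset layout used on the wolf nodes are not recorded in this ticket:

zpool create ost0 \
    raidz2 /dev/mapper/mpathaj /dev/mapper/mpathai /dev/mapper/mpathah \
           /dev/mapper/mpathag /dev/mapper/mpathaq /dev/mapper/mpathap \
           /dev/mapper/mpathak /dev/mapper/mpathz  /dev/mapper/mpatham \
           /dev/mapper/mpathal /dev/mapper/mpathao \
    spare  /dev/mapper/mpathan

The same layout, with a different set of mpath devices, would apply to ost1; when the OST is formatted with mkfs.lustre --backfstype=zfs, the vdev list can equally be passed to mkfs.lustre rather than creating the pool by hand.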

            The message "lsdraid-OST0001: can't assign tx: rc = -5" seems serious. One thing I would to do is to check the status of zpool. I will log into the node tomorrow and take a look.

            jay Jinshan Xiong (Inactive) added a comment - The message "lsdraid-OST0001: can't assign tx: rc = -5" seems serious. One thing I would like to do is to check the status of the zpool. I will log into the node tomorrow and take a look.
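For reference, the rc in these messages is a negated Linux errno, so rc = -5 is EIO coming back when the OSD tries to assign a ZFS transaction. The usual pool checks here would be along these lines; none of this has been run on the node yet, so treat it as a sketch rather than captured output:

zpool status -v ost0 ost1
zpool events -v | tail -n 50
zpool get health,failmode ost0 ost1
dmesg | tail -n 100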

            wolf has been running today and the only error I have seen is this:

             
            [26193.993322] LustreError: 7841:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
            [26194.007089] LustreError: 7841:0:(osd_handler.c:222:osd_trans_start()) Skipped 17 previous similar messages
            [26195.013534] LustreError: 7812:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
            [26197.058425] LustreError: 7888:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
            [26197.072181] LustreError: 7888:0:(osd_handler.c:222:osd_trans_start()) Skipped 31 previous similar messages
            [26202.409535] LustreError: 7824:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x142:0x0] from accounting ZAP for usr 0: rc = -52
            [26202.473611] LustreError: 7824:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x142:0x0] from accounting ZAP for grp 0: rc = -52
            [26213.811138] LustreError: 7823:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x141:0x0] from accounting ZAP for usr 0: rc = -52
            [26213.834184] LustreError: 7823:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x141:0x0] from accounting ZAP for grp 0: rc = -52
            
            [root@wolf-4 stat]# [26390.320802] INFO: task ll_ost_io04_001:7405 blocked for more than 120 seconds.
            [26390.331510] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26390.342516] ll_ost_io04_001 D 0000000000000000     0  7405      2 0x00000080
            [26390.352479]  ffff881fe19f3820 0000000000000046 ffff881fe19d0b80 ffff881fe19f3fd8
            [26390.362848]  ffff881fe19f3fd8 ffff881fe19f3fd8 ffff881fe19d0b80 ffff882023826188
            [26390.373083]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26390.383196] Call Trace:
            [26390.387579]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26390.394805]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26390.403226]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26390.411237]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26390.418873]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26390.426784]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26390.434776]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26390.443327]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26390.451296]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26390.459814]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26390.467655]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26390.475793]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26390.483986]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
            [26390.491214]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
            [26390.500665]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
            [26390.507766]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
            [26390.515856]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26390.524919]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26390.533603]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26390.543120]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26390.551721]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26390.560263]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
            [26390.568210]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
            [26390.575685]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26390.583661]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26390.592883]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26390.599299]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26390.607549]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26390.614505]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26390.622721] INFO: task ll_ost_io06_001:7411 blocked for more than 120 seconds.
            [26390.631720] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26390.641427] ll_ost_io06_001 D 0000000000000000     0  7411      2 0x00000080
            [26390.650360]  ffff881fcfe83820 0000000000000046 ffff881fe19d5080 ffff881fcfe83fd8
            [26390.659664]  ffff881fcfe83fd8 ffff881fcfe83fd8 ffff881fe19d5080 ffff882023826188
            [26390.668954]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26390.678219] Call Trace:
            [26390.681832]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26390.688244]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26390.696005]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26390.703365]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26390.710421]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26390.717767]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26390.725265]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26390.733340]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26390.740903]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26390.749078]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26390.756745]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26390.764682]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26390.772773]  [<ffffffff810bb685>] ? sched_clock_cpu+0x85/0xc0
            [26390.780139]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26390.789160]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26390.797788]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26390.807286]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26390.815831]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26390.824360]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26390.832327]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26390.841543]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26390.847934]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26390.856187]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26390.863168]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26390.871421] INFO: task ll_ost_io06_002:7412 blocked for more than 120 seconds.
            [26390.880442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26390.890160] ll_ost_io06_002 D 0000000000000000     0  7412      2 0x00000080
            [26390.899063]  ffff881fcfea3820 0000000000000046 ffff881fe19d5c00 ffff881fcfea3fd8
            [26390.908403]  ffff881fcfea3fd8 ffff881fcfea3fd8 ffff881fe19d5c00 ffff882023826188
            [26390.917719]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26390.927047] Call Trace:
            [26390.930756]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26390.937302]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26390.945182]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26390.952678]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26390.959892]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26390.967411]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26390.975079]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26390.983333]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26390.991091]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26390.999426]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26391.007107]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26391.015062]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26391.023154]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
            [26391.030285]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
            [26391.039640]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
            [26391.046664]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
            [26391.054686]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26391.063693]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26391.072310]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26391.081784]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26391.090329]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26391.098827]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
            [26391.106759]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
            [26391.114233]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26391.122197]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26391.131389]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26391.137766]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26391.145954]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26391.152640]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26391.160581] INFO: task ll_ost_io06_003:7798 blocked for more than 120 seconds.
            [26391.169290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26391.178691] ll_ost_io06_003 D 0000000000000000     0  7798      2 0x00000080
            [26391.187282]  ffff881fcf707820 0000000000000046 ffff880fe632b980 ffff881fcf707fd8
            [26391.196297]  ffff881fcf707fd8 ffff881fcf707fd8 ffff880fe632b980 ffff882023826188
            [26391.205367]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26391.214356] Call Trace:
            [26391.217695]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26391.223932]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26391.231443]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26391.238673]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26391.245587]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26391.252750]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26391.260036]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26391.267960]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26391.275409]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26391.283418]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26391.290698]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26391.298359]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26391.306122]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
            [26391.312957]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
            [26391.321966]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
            [26391.328705]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
            [26391.336477]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26391.345249]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26391.353534]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26391.362639]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26391.370865]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26391.379030]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
            [26391.386624]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
            [26391.393808]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26391.401443]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26391.410245]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26391.416274]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26391.424248]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26391.430967]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26391.438826] INFO: task ll_ost_io04_003:7800 blocked for more than 120 seconds.
            [26391.447481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26391.456819] ll_ost_io04_003 D 0000000000000000     0  7800      2 0x00000080
            [26391.465395]  ffff881fceb03820 0000000000000046 ffff880fe632d080 ffff881fceb03fd8
            [26391.474374]  ffff881fceb03fd8 ffff881fceb03fd8 ffff880fe632d080 ffff882023826188
            [26391.483386]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26391.492473] Call Trace:
            [26391.495789]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26391.501972]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26391.509493]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26391.516677]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26391.523585]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26391.530798]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26391.538139]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26391.546029]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26391.553411]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26391.561413]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26391.568753]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26391.576388]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26391.584183]  [<ffffffffa0e91e38>] ? ldlm_bl_timeout+0x18/0x30 [ptlrpc]
            [26391.592094]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
            [26391.601111]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
            [26391.607880]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
            [26391.615618]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26391.624310]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26391.632656]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26391.641806]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26391.650048]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26391.658298]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
            [26391.665909]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
            [26391.673105]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26391.680724]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26391.689587]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26391.695667]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26391.703689]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26391.710360]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26391.718298] INFO: task ll_ost_io04_005:7805 blocked for more than 120 seconds.
            [26391.727050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26391.736388] ll_ost_io04_005 D 0000000000000000     0  7805      2 0x00000080
            [26391.744976]  ffff881fcd54f820 0000000000000046 ffff880fe6328b80 ffff881fcd54ffd8
            [26391.753892]  ffff881fcd54ffd8 ffff881fcd54ffd8 ffff880fe6328b80 ffff882023826188
            [26391.762929]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26391.771955] Call Trace:
            [26391.775349]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26391.781550]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26391.789027]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26391.796180]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26391.803088]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26391.810220]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26391.817518]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26391.825398]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26391.832827]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26391.840834]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26391.848185]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26391.855785]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26391.863569]  [<ffffffffa0e91e38>] ? ldlm_bl_timeout+0x18/0x30 [ptlrpc]
            [26391.871459]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
            [26391.880474]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
            [26391.887226]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
            [26391.894979]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26391.903711]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26391.912027]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26391.921212]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26391.929425]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26391.937700]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
            [26391.945284]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
            [26391.952467]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26391.960077]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26391.968982]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26391.975100]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26391.983005]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26391.989681]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26391.997598] INFO: task ll_ost_io06_005:7810 blocked for more than 120 seconds.
            [26392.006309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26392.015707] ll_ost_io06_005 D 0000000000000000     0  7810      2 0x00000080
            [26392.024252]  ffff881fc94fb820 0000000000000046 ffff882007258b80 ffff881fc94fbfd8
            [26392.033325]  ffff881fc94fbfd8 ffff881fc94fbfd8 ffff882007258b80 ffff882023826188
            [26392.042335]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26392.051388] Call Trace:
            [26392.054740]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26392.060967]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26392.068507]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26392.075715]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26392.082626]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26392.089842]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26392.097207]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26392.105081]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26392.112560]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26392.120610]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26392.127927]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26392.135570]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26392.143385]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
            [26392.150241]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
            [26392.159307]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
            [26392.166078]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
            [26392.173820]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26392.182548]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26392.190893]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26392.200052]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26392.208379]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26392.216617]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
            [26392.224260]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
            [26392.231463]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26392.239134]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26392.248015]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26392.254058]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26392.262001]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26392.268656]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26392.276534] INFO: task ll_ost_io06_006:7811 blocked for more than 120 seconds.
            [26392.285240] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26392.294631] ll_ost_io06_006 D 0000000000000000     0  7811      2 0x00000080
            [26392.303192]  ffff881fc4f4f820 0000000000000046 ffff88200725e780 ffff881fc4f4ffd8
            [26392.312202]  ffff881fc4f4ffd8 ffff881fc4f4ffd8 ffff88200725e780 ffff882023826188
            [26392.321131]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26392.330084] Call Trace:
            [26392.333495]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26392.339767]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26392.347297]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26392.354490]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26392.361328]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26392.368491]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26392.375831]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26392.383699]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26392.391100]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26392.399078]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26392.406398]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26392.414017]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26392.421798]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
            [26392.428654]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
            [26392.437717]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
            [26392.444446]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
            [26392.452174]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26392.460902]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26392.469160]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26392.478343]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26392.486615]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26392.494801]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
            [26392.502410]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
            [26392.509565]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26392.517186]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26392.526081]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26392.532152]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26392.540030]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26392.546706]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26392.554619] INFO: task ll_ost_io04_006:7812 blocked for more than 120 seconds.
            [26392.563275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26392.572639] ll_ost_io04_006 D 0000000000000000     0  7812      2 0x00000080
            [26392.581244]  ffff881fc2dbb820 0000000000000046 ffff88200725f300 ffff881fc2dbbfd8
            [26392.590196]  ffff881fc2dbbfd8 ffff881fc2dbbfd8 ffff88200725f300 ffff882023826188
            [26392.599200]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26392.608179] Call Trace:
            [26392.611461]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26392.617615]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26392.625146]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26392.632336]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26392.639157]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26392.646282]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26392.653554]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26392.661465]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26392.668859]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26392.676864]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26392.684204]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26392.691773]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26392.699584]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
            [26392.706366]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
            [26392.715432]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
            [26392.722155]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
            [26392.729943]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26392.738687]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26392.746981]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26392.756134]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26392.764374]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26392.772528]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
            [26392.780169]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
            [26392.787304]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26392.794941]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26392.803821]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26392.809884]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26392.817799]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26392.824441]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26392.832351] INFO: task ll_ost_io06_007:7814 blocked for more than 120 seconds.
            [26392.841032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [26392.850402] ll_ost_io06_007 D 0000000000000000     0  7814      2 0x00000080
            [26392.858975]  ffff882003b8f820 0000000000000046 ffff881ffb070b80 ffff882003b8ffd8
            [26392.868112]  ffff882003b8ffd8 ffff882003b8ffd8 ffff881ffb070b80 ffff882023826188
            [26392.877083]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
            [26392.886070] Call Trace:
            [26392.889475]  [<ffffffff8163aab9>] schedule+0x29/0x70
            [26392.895733]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
            [26392.903261]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [26392.910430]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
            [26392.917280]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
            [26392.924478]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
            [26392.931789]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            [26392.939703]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
            [26392.947081]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
            [26392.955051]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
            [26392.962394]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
            [26392.970013]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
            [26392.977795]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
            [26392.984610]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
            [26392.993681]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
            [26393.000376]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
            [26393.008111]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [26393.016799]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            [26393.025125]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            [26393.034281]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            [26393.042522]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [26393.050759]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
            [26393.058388]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
            [26393.065579]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            [26393.073196]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
            [26393.082080]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [26393.088140]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [26393.096053]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
            [26393.102718]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [27451.932819] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
            [27451.945718] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) Skipped 2 previous similar messages
            [27453.007869] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
            [27453.020784] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) Skipped 1 previous similar message
            [27455.047314] LustreError: 7817:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
            [27462.623135] LustreError: 7821:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
            [27462.624172] LustreError: 7823:0:(ofd_dev.c:1884:ofd_destroy_hdl()) lsdraid-OST0001: error destroying object [0x100010000:0x144:0x0]: -5
            [27462.652652] LustreError: 7821:0:(osd_handler.c:222:osd_trans_start()) Skipped 1 previous similar message
            [28445.940895] perf interrupt took too long (5003 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
            
            [root@wolf-4 stat]# [32239.225932] LustreError: 7821:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x145:0x0] from accounting ZAP for usr 0: rc = -52
            [32239.248290] LustreError: 7821:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x145:0x0] from accounting ZAP for grp 0: rc = -52
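All of the blocked ll_ost_io threads above are sitting in dmu_tx_wait()/dmu_tx_assign(), i.e. waiting for a ZFS transaction to be assigned, while the osd_trans_start() failures report rc = -5. If this reproduces, it would be worth sampling the DMU transaction kstats while the threads are stuck to see whether the dirty-data throttle or a suspended pool is responsible. A rough way to do that is sketched below; the per-pool txgs kstat may not exist on every 0.6.5 build (substitute the real pool name for <pool>), and the errno decode is only a convenience (-5 = EIO, -52 = EBADE):

cat /proc/spl/kstat/zfs/dmu_tx
cat /proc/spl/kstat/zfs/<pool>/txgs    # per-pool txg history, if present
zpool status -x
python -c 'import os; print(5, os.strerror(5)); print(52, os.strerror(52))'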
            
            
            jsalians_intel John Salinas (Inactive) added a comment - wolf has been running today and the only error I have seen is this: [26193.993322] LustreError: 7841:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5 [26194.007089] LustreError: 7841:0:(osd_handler.c:222:osd_trans_start()) Skipped 17 previous similar messages [26195.013534] LustreError: 7812:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5 [26197.058425] LustreError: 7888:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5 [26197.072181] LustreError: 7888:0:(osd_handler.c:222:osd_trans_start()) Skipped 31 previous similar messages [26202.409535] LustreError: 7824:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x142:0x0] from accounting ZAP for usr 0: rc = -52 [26202.473611] LustreError: 7824:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x142:0x0] from accounting ZAP for grp 0: rc = -52 [26213.811138] LustreError: 7823:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x141:0x0] from accounting ZAP for usr 0: rc = -52 [26213.834184] LustreError: 7823:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x141:0x0] from accounting ZAP for grp 0: rc = -52 [root@wolf-4 stat]# [26390.320802] INFO: task ll_ost_io04_001:7405 blocked for more than 120 seconds. [26390.331510] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26390.342516] ll_ost_io04_001 D 0000000000000000 0 7405 2 0x00000080 [26390.352479] ffff881fe19f3820 0000000000000046 ffff881fe19d0b80 ffff881fe19f3fd8 [26390.362848] ffff881fe19f3fd8 ffff881fe19f3fd8 ffff881fe19d0b80 ffff882023826188 [26390.373083] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26390.383196] Call Trace: [26390.387579] [<ffffffff8163aab9>] schedule+0x29/0x70 [26390.394805] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26390.403226] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26390.411237] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26390.418873] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26390.426784] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26390.434776] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26390.443327] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26390.451296] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26390.459814] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26390.467655] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26390.475793] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26390.483986] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26390.491214] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26390.500665] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26390.507766] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26390.515856] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26390.524919] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26390.533603] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26390.543120] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26390.551721] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26390.560263] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26390.568210] [<ffffffff810af028>] ? 
__wake_up_common+0x58/0x90 [26390.575685] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26390.583661] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26390.592883] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26390.599299] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26390.607549] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26390.614505] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26390.622721] INFO: task ll_ost_io06_001:7411 blocked for more than 120 seconds. [26390.631720] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26390.641427] ll_ost_io06_001 D 0000000000000000 0 7411 2 0x00000080 [26390.650360] ffff881fcfe83820 0000000000000046 ffff881fe19d5080 ffff881fcfe83fd8 [26390.659664] ffff881fcfe83fd8 ffff881fcfe83fd8 ffff881fe19d5080 ffff882023826188 [26390.668954] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26390.678219] Call Trace: [26390.681832] [<ffffffff8163aab9>] schedule+0x29/0x70 [26390.688244] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26390.696005] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26390.703365] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26390.710421] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26390.717767] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26390.725265] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26390.733340] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26390.740903] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26390.749078] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26390.756745] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26390.764682] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26390.772773] [<ffffffff810bb685>] ? sched_clock_cpu+0x85/0xc0 [26390.780139] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26390.789160] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26390.797788] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26390.807286] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26390.815831] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26390.824360] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26390.832327] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26390.841543] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26390.847934] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26390.856187] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26390.863168] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26390.871421] INFO: task ll_ost_io06_002:7412 blocked for more than 120 seconds. [26390.880442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26390.890160] ll_ost_io06_002 D 0000000000000000 0 7412 2 0x00000080 [26390.899063] ffff881fcfea3820 0000000000000046 ffff881fe19d5c00 ffff881fcfea3fd8 [26390.908403] ffff881fcfea3fd8 ffff881fcfea3fd8 ffff881fe19d5c00 ffff882023826188 [26390.917719] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26390.927047] Call Trace: [26390.930756] [<ffffffff8163aab9>] schedule+0x29/0x70 [26390.937302] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26390.945182] [<ffffffff810a6ae0>] ? 
wake_up_atomic_t+0x30/0x30 [26390.952678] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26390.959892] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26390.967411] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26390.975079] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26390.983333] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26390.991091] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26390.999426] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26391.007107] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26391.015062] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26391.023154] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26391.030285] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26391.039640] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26391.046664] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26391.054686] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26391.063693] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26391.072310] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26391.081784] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26391.090329] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26391.098827] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26391.106759] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26391.114233] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26391.122197] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26391.131389] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26391.137766] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.145954] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26391.152640] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.160581] INFO: task ll_ost_io06_003:7798 blocked for more than 120 seconds. [26391.169290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26391.178691] ll_ost_io06_003 D 0000000000000000 0 7798 2 0x00000080 [26391.187282] ffff881fcf707820 0000000000000046 ffff880fe632b980 ffff881fcf707fd8 [26391.196297] ffff881fcf707fd8 ffff881fcf707fd8 ffff880fe632b980 ffff882023826188 [26391.205367] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26391.214356] Call Trace: [26391.217695] [<ffffffff8163aab9>] schedule+0x29/0x70 [26391.223932] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26391.231443] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26391.238673] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26391.245587] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26391.252750] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26391.260036] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26391.267960] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26391.275409] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26391.283418] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26391.290698] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26391.298359] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26391.306122] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26391.312957] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26391.321966] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26391.328705] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26391.336477] [<ffffffffa0e797d0>] ? 
target_send_reply_msg+0x170/0x170 [ptlrpc] [26391.345249] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26391.353534] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26391.362639] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26391.370865] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26391.379030] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26391.386624] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26391.393808] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26391.401443] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26391.410245] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26391.416274] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.424248] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26391.430967] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.438826] INFO: task ll_ost_io04_003:7800 blocked for more than 120 seconds. [26391.447481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26391.456819] ll_ost_io04_003 D 0000000000000000 0 7800 2 0x00000080 [26391.465395] ffff881fceb03820 0000000000000046 ffff880fe632d080 ffff881fceb03fd8 [26391.474374] ffff881fceb03fd8 ffff881fceb03fd8 ffff880fe632d080 ffff882023826188 [26391.483386] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26391.492473] Call Trace: [26391.495789] [<ffffffff8163aab9>] schedule+0x29/0x70 [26391.501972] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26391.509493] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26391.516677] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26391.523585] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26391.530798] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26391.538139] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26391.546029] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26391.553411] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26391.561413] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26391.568753] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26391.576388] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26391.584183] [<ffffffffa0e91e38>] ? ldlm_bl_timeout+0x18/0x30 [ptlrpc] [26391.592094] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26391.601111] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26391.607880] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26391.615618] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26391.624310] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26391.632656] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26391.641806] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26391.650048] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26391.658298] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26391.665909] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26391.673105] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26391.680724] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26391.689587] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26391.695667] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.703689] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26391.710360] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.718298] INFO: task ll_ost_io04_005:7805 blocked for more than 120 seconds. 
[26391.727050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26391.736388] ll_ost_io04_005 D 0000000000000000 0 7805 2 0x00000080 [26391.744976] ffff881fcd54f820 0000000000000046 ffff880fe6328b80 ffff881fcd54ffd8 [26391.753892] ffff881fcd54ffd8 ffff881fcd54ffd8 ffff880fe6328b80 ffff882023826188 [26391.762929] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26391.771955] Call Trace: [26391.775349] [<ffffffff8163aab9>] schedule+0x29/0x70 [26391.781550] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26391.789027] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26391.796180] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26391.803088] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26391.810220] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26391.817518] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26391.825398] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26391.832827] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26391.840834] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26391.848185] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26391.855785] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26391.863569] [<ffffffffa0e91e38>] ? ldlm_bl_timeout+0x18/0x30 [ptlrpc] [26391.871459] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26391.880474] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26391.887226] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26391.894979] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26391.903711] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26391.912027] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26391.921212] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26391.929425] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26391.937700] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26391.945284] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26391.952467] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26391.960077] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26391.968982] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26391.975100] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.983005] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26391.989681] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.997598] INFO: task ll_ost_io06_005:7810 blocked for more than 120 seconds. [26392.006309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26392.015707] ll_ost_io06_005 D 0000000000000000 0 7810 2 0x00000080 [26392.024252] ffff881fc94fb820 0000000000000046 ffff882007258b80 ffff881fc94fbfd8 [26392.033325] ffff881fc94fbfd8 ffff881fc94fbfd8 ffff882007258b80 ffff882023826188 [26392.042335] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26392.051388] Call Trace: [26392.054740] [<ffffffff8163aab9>] schedule+0x29/0x70 [26392.060967] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26392.068507] [<ffffffff810a6ae0>] ? 
wake_up_atomic_t+0x30/0x30 [26392.075715] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26392.082626] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26392.089842] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26392.097207] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26392.105081] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26392.112560] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26392.120610] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26392.127927] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26392.135570] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26392.143385] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26392.150241] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26392.159307] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26392.166078] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26392.173820] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26392.182548] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26392.190893] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26392.200052] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26392.208379] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26392.216617] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26392.224260] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26392.231463] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26392.239134] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26392.248015] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26392.254058] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.262001] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26392.268656] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.276534] INFO: task ll_ost_io06_006:7811 blocked for more than 120 seconds. [26392.285240] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26392.294631] ll_ost_io06_006 D 0000000000000000 0 7811 2 0x00000080 [26392.303192] ffff881fc4f4f820 0000000000000046 ffff88200725e780 ffff881fc4f4ffd8 [26392.312202] ffff881fc4f4ffd8 ffff881fc4f4ffd8 ffff88200725e780 ffff882023826188 [26392.321131] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26392.330084] Call Trace: [26392.333495] [<ffffffff8163aab9>] schedule+0x29/0x70 [26392.339767] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26392.347297] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26392.354490] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26392.361328] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26392.368491] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26392.375831] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26392.383699] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26392.391100] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26392.399078] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26392.406398] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26392.414017] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26392.421798] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26392.428654] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26392.437717] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26392.444446] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26392.452174] [<ffffffffa0e797d0>] ? 
target_send_reply_msg+0x170/0x170 [ptlrpc] [26392.460902] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26392.469160] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26392.478343] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26392.486615] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26392.494801] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26392.502410] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26392.509565] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26392.517186] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26392.526081] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26392.532152] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.540030] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26392.546706] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.554619] INFO: task ll_ost_io04_006:7812 blocked for more than 120 seconds. [26392.563275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26392.572639] ll_ost_io04_006 D 0000000000000000 0 7812 2 0x00000080 [26392.581244] ffff881fc2dbb820 0000000000000046 ffff88200725f300 ffff881fc2dbbfd8 [26392.590196] ffff881fc2dbbfd8 ffff881fc2dbbfd8 ffff88200725f300 ffff882023826188 [26392.599200] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26392.608179] Call Trace: [26392.611461] [<ffffffff8163aab9>] schedule+0x29/0x70 [26392.617615] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26392.625146] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26392.632336] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26392.639157] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26392.646282] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26392.653554] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26392.661465] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26392.668859] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26392.676864] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26392.684204] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26392.691773] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26392.699584] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26392.706366] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26392.715432] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26392.722155] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26392.729943] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26392.738687] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26392.746981] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26392.756134] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26392.764374] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26392.772528] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26392.780169] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26392.787304] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26392.794941] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26392.803821] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26392.809884] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.817799] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26392.824441] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.832351] INFO: task ll_ost_io06_007:7814 blocked for more than 120 seconds. 
[26392.841032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26392.850402] ll_ost_io06_007 D 0000000000000000 0 7814 2 0x00000080 [26392.858975] ffff882003b8f820 0000000000000046 ffff881ffb070b80 ffff882003b8ffd8 [26392.868112] ffff882003b8ffd8 ffff882003b8ffd8 ffff881ffb070b80 ffff882023826188 [26392.877083] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26392.886070] Call Trace: [26392.889475] [<ffffffff8163aab9>] schedule+0x29/0x70 [26392.895733] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26392.903261] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26392.910430] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26392.917280] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26392.924478] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26392.931789] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26392.939703] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26392.947081] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26392.955051] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26392.962394] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26392.970013] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26392.977795] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26392.984610] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26392.993681] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26393.000376] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26393.008111] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26393.016799] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26393.025125] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26393.034281] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26393.042522] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26393.050759] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26393.058388] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26393.065579] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26393.073196] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26393.082080] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26393.088140] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26393.096053] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26393.102718] [<ffffffff810a5a20>] ? 
kthread_create_on_node+0x140/0x140
[27451.932819] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[27451.945718] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) Skipped 2 previous similar messages
[27453.007869] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[27453.020784] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) Skipped 1 previous similar message
[27455.047314] LustreError: 7817:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[27462.623135] LustreError: 7821:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[27462.624172] LustreError: 7823:0:(ofd_dev.c:1884:ofd_destroy_hdl()) lsdraid-OST0001: error destroying object [0x100010000:0x144:0x0]: -5
[27462.652652] LustreError: 7821:0:(osd_handler.c:222:osd_trans_start()) Skipped 1 previous similar message
[28445.940895] perf interrupt took too long (5003 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[root@wolf-4 stat]#
[32239.225932] LustreError: 7821:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x145:0x0] from accounting ZAP for usr 0: rc = -52
[32239.248290] LustreError: 7821:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x145:0x0] from accounting ZAP for grp 0: rc = -52

            jsalians_intel John Salinas (Inactive) added a comment

            Yes, in the second comment I mention that I set mem=2G in order to get a dump. However, I believe the dump is due to a separate issue; I could be wrong on this. Please see the one full dump I have in the /scratch NFS directory on wolf.
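            For reference, one way to apply that kind of memory cap on a RHEL 7 node is through the kernel command line; this is only a sketch of the general approach (the 2G value is the one mentioned above, everything else is assumed, not taken from this ticket):

                # cap usable RAM so the resulting vmcore stays small
                grubby --update-kernel=ALL --args="mem=2G"
                reboot
                # after collecting the dump, remove the cap again
                grubby --update-kernel=ALL --remove-args="mem"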

            jay Jinshan Xiong (Inactive) added a comment

            From the arcstats, p is very close to c, and c is equal to c_max. This means there is no memory pressure at all, which can be verified by the direct and indirect reclaim counts below. The mru ghost hit count is only 917736; I don't know what made p that large.

            John: if the crash dump is problematic due to its size, we can reduce the amount of memory by appending 'mem=?G' to the boot line.
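            For anyone re-checking this, the fields referenced above can be read straight from the arcstats kstat; a small sketch assuming the stock layout under /proc/spl/kstat/zfs (field names can vary between ZFS releases):

                # print p (MRU target), c (ARC target size), c_max (ARC limit),
                # plus the MRU ghost hits and direct/indirect reclaim counters
                awk '$1 ~ /^(p|c|c_max|mru_ghost_hits|memory_direct_count|memory_indirect_count)$/ {print $1, $3}' \
                    /proc/spl/kstat/zfs/arcstats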

            jsalians_intel John Salinas (Inactive) added a comment

            A full dump can be found at: /scratch/dumps/wolf-4.wolf.hpdd.intel.com/10.8.1.4-2016-04-09-16\:36\:15/

            For now I am disabling the running of /bin/dbufstat.py (part of zfs) as part of our stat collection.
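            For context, dbufstat.py is essentially a formatter over the dbuf kstat, so pausing it removes the reader that exercises the crash path shown in the oops in the comment below. A minimal equivalent read would be (kstat path assumed from a stock zfs 0.6.5 install):

                # reading the dbufs kstat walks the dbuf hash table in the kernel
                # (kstat_seq_show -> dbuf_stats_hash_table_data -> arc_buf_info)
                cat /proc/spl/kstat/zfs/dbufs > /dev/null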
            jsalians_intel John Salinas (Inactive) added a comment - - edited

            This seems to hit every 2-4 hours on my wolf nodes with IOR running. I had to reserve 2G of memory in order to get the dump to complete; otherwise it would fail with OOM errors while dumping and never finish. The good news is that this is solved; the bad news is that /var/crash is not big enough for a full dump, so we only have a partial dump file (one way to point kdump at a larger target is sketched after the trace below):

            [15001.472312] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
            [15001.481082] IP: [<ffffffffa00b675c>] arc_buf_info+0xc/0x190 [zfs]
            [15001.487935] PGD 19e8306067 PUD 1d5370b067 PMD 0
            [15001.493131] Oops: 0000 1 SMP
            [15001.496766] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_powerclamp coretemp intel_rapl kvm_intel dm_service_time kvm ipmi_devintf iTCO_wdt iTCO_vendor_support ses enclosure crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mei_me sb_edac mei lpc_ich edac_core shpchp ioatdma pcspkr sg i2c_i801 mfd_core ipmi_ssif ipmi_si ipmi_msghandler acpi_power_meter wmi acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath dm_mod
            [15001.576540] ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper ttm igb ahci crct10dif_pclmul ptp crct10dif_common crc32c_intel pps_core libahci drm mlx4_core mpt3sas dca libata raid_class i2c_algo_bit scsi_transport_sas i2c_core zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate
            [15001.620342] CPU: 55 PID: 27643 Comm: dbufstat.py Tainted: P IOE ------------ 3.10.0-327.10.1.el7.x86_64 #1
            [15001.632280] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [15001.643918] task: ffff880ff79cdc00 ti: ffff88039ed8c000 task.ti: ffff88039ed8c000
            [15001.652279] RIP: 0010:[<ffffffffa00b675c>] [<ffffffffa00b675c>] arc_buf_info+0xc/0x190 [zfs]
            [15001.661816] RSP: 0018:ffff88039ed8faf8 EFLAGS: 00010282
            [15001.667739] RAX: ffff8810276cc740 RBX: ffff88092eb80128 RCX: 0000000000000000
            [15001.675695] RDX: 0000000000000000 RSI: ffff88039ed8fd10 RDI: ffff880519a54de0
            [15001.683653] RBP: ffff88039ed8faf8 R08: 0000000000000000 R09: 0000000000000000
            [15001.691610] R10: 0000000000000000 R11: ffff88039ed8f9b6 R12: ffff880fb8f3e630
            [15001.699567] R13: 0000000000001000 R14: ffff88039ed8fcd8 R15: 0000000000001000
            [15001.707538] FS: 00007f2b0f887740(0000) GS:ffff881fbe860000(0000) knlGS:0000000000000000
            [15001.716552] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [15001.722959] CR2: 0000000000000028 CR3: 0000001dc7d84000 CR4: 00000000001407e0
            [15001.730917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [15001.738887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            [15001.746844] Stack:
            [15001.749084] ffff88039ed8fd98 ffffffffa00c7508 0000000000000000 00000000000469c0
            [15001.757424] 00000008d3800000 0000000000020000 ffffffff00000000 000000d000000004
            [15001.765717] 0000000000000000 00007f2b00000001 ffff881c00000000 ffffffff000200a0
            [15001.774044] Call Trace:
            [15001.776792] [<ffffffffa00c7508>] __dbuf_stats_hash_table_data+0x78/0x400 [zfs]
            [15001.784965] [<ffffffffa00c7a07>] dbuf_stats_hash_table_data+0x177/0x2d0 [zfs]
            [15001.793040] [<ffffffffa0010d63>] kstat_seq_show+0x123/0x3e0 [spl]
            [15001.799948] [<ffffffff81202428>] seq_read+0x238/0x3a0
            [15001.805689] [<ffffffff8124938d>] proc_reg_read+0x3d/0x80
            [15001.811721] [<ffffffff811de49c>] vfs_read+0x9c/0x170
            [15001.817353] [<ffffffff811defef>] SyS_read+0x7f/0xe0
            [15001.822899] [<ffffffff81645a49>] system_call_fastpath+0x16/0x1b
            [15001.829606] Code: 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 90 49 8b 74 24 50 e9 ed fe ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 4c 8b 0f 48 89 e5 <41> 8b 41 28 a9 00 00 02 00 0f 84 05 01 00 00 4d 8b 99 f8 00 00
            [15001.851548] RIP [<ffffffffa00b675c>] arc_buf_info+0xc/0x190 [zfs]
            [15001.858505] RSP <ffff88039ed8faf8>
            [15001.862395] CR2: 0000000000000028
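            One way to work around /var/crash being too small is to point kdump at a larger filesystem and let makedumpfile filter pages. This is only a sketch assuming a stock RHEL 7 kdump setup; the /scratch/crash target is an example path, not something taken from this ticket. Edit /etc/kdump.conf so it contains:

                path /scratch/crash
                core_collector makedumpfile -l --message-level 1 -d 31

            then restart the service so the new settings are picked up:

                systemctl restart kdump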


            jay Jinshan Xiong (Inactive) added a comment

            Hi John,

            Can you please reset the ZFS release to the commit zfs-0.6.4-92 and run it one more time? zfs-0.6.4-92 doesn't include the ARC change, so this will help us make sure that we have reproduced LU-7404.
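            Roughly, rolling the OSS back to that build could look like the following; this is only a sketch (the source path is assumed, and it assumes zfs-0.6.4-92 resolves to a tag or commit in the local zfs clone):

                cd /usr/src/zfs                       # wherever the zfs git tree lives on the OSS
                git fetch --tags
                git checkout zfs-0.6.4-92             # build without the ARC change mentioned above
                ./autogen.sh && ./configure && make -s -j"$(nproc)"
                make install && depmod -a             # then reload the modules / reboot and re-run IOR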

            People

              jay Jinshan Xiong (Inactive)
              jsalians_intel John Salinas (Inactive)
              Votes: 0
              Watchers: 3