[LU-7987] Lustre 2.8 OSS with zfs 0.6.5 backend hitting most schedule_timeout Created: 05/Apr/16  Updated: 14/Jun/18  Resolved: 13/Apr/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: John Salinas (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Low Priority Votes: 0
Labels: LS_RZ

Attachments: File lustre-log.1459803990.12256.gz     File wolf-3-oss0-messages-20160403.gz     File wolf-4-oss1-messages-20160403.gz     File wolf-4-oss1-messages.gz     File wolf-4-ost1-stats.tgz     File wolf-5-mds-messages-20160404.gz     File wolf-6-client-messages.gz     File wolf-7-client-messages-20160403.gz    
Issue Links:
Related
is related to LU-7404 ZFS OSS - Numerous timeouts - SWL Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing for the LU-7404 OST timeouts, one of the OSS nodes became inaccessible via ssh and conman.

The setup was 2 OSS nodes (one zfs zpool each), 1 MDS, and 4 clients running IOR. The first message appeared fairly soon after the run started. About 2 hours in, Lustre stopped responding to Lustre clients, and the stat-collection scripts running on the OSS stopped producing output.

First message seen was:
Apr 4 21:06:29 wolf-4 kernel: LNet: Service thread pid 12256 was inactive for 200.30s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

Apr  4 21:06:29 wolf-4 kernel: Pid: 12256, comm: ll_ost02_000
Apr  4 21:06:29 wolf-4 kernel: #012Call Trace:
Apr  4 21:06:29 wolf-4 kernel: [<ffffffff8163a9e9>] schedule+0x29/0x70
Apr  4 21:06:29 wolf-4 kernel: [<ffffffff816386d9>] schedule_timeout+0x209/0x2d0
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa012dc0c>] ? vdev_mirror_io_start+0xac/0x190 [zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa012d150>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffff8101c829>] ? read_tsc+0x9/0x10
Apr  4 21:06:29 wolf-4 kernel: [<ffffffff810d814c>] ? ktime_get_ts64+0x4c/0xf0
Apr  4 21:06:29 wolf-4 kernel: [<ffffffff8163a01e>] io_schedule_timeout+0xae/0x130
Apr  4 21:06:29 wolf-4 kernel: [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
Apr  4 21:06:29 wolf-4 kernel: [<ffffffff8163a0b8>] io_schedule+0x18/0x20
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00116e0>] cv_wait_common+0xb0/0x150 [spl]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00117d8>] __cv_wait_io+0x18/0x20 [spl]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa016db83>] zio_wait+0x123/0x210 [zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00c44ad>] dbuf_read+0x30d/0x930 [zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00c60eb>] dmu_buf_will_dirty+0x4b/0xa0 [zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00deb46>] dnode_dirty_l1+0x36/0x50 [zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00e146a>] dnode_free_range+0x4da/0x6b0 [zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa00cb829>] dmu_free_long_range+0x1d9/0x280 [zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa11fe468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa11fe7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa120112a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa12fef9f>] ofd_trans_stop+0x1f/0x60 [ofd]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa1302bb2>] ofd_object_destroy+0x2b2/0x890 [ofd]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa12fb8f7>] ofd_destroy_by_fid+0x307/0x510 [ofd]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa0fce170>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa0fc8730>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa12edc67>] ofd_destroy_hdl+0x267/0xa50 [ofd]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa105af65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
Apr  4 21:06:29 wolf-4 kernel: [<ffffffffa1007beb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Apr  4 21:06:30 wolf-4 kernel: [<ffffffffa0cec578>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Apr  4 21:06:30 wolf-4 kernel: [<ffffffffa1006498>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
Apr  4 21:06:30 wolf-4 kernel: [<ffffffffa100bc90>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
Apr  4 21:06:30 wolf-4 kernel: [<ffffffffa100b200>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Apr  4 21:06:30 wolf-4 kernel: [<ffffffff81645998>] ret_from_fork+0x58/0x90
Apr  4 21:06:30 wolf-4 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Apr  4 21:06:30 wolf-4 kernel: 
Apr  4 21:06:30 wolf-4 kernel: LustreError: dumping log to /tmp/lustre-log.1459803990.12256
Apr  4 21:07:12 wolf-4 ntpd_intres[3829]: ntp_intres.request: permission denied
Apr  4 21:07:12 wolf-4 ntpd_intres[3829]: ntp_intres.request: permission denied
Apr  4 21:07:19 wolf-4 kernel: LNet: Service thread pid 12256 completed after 249.57s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
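These two watchdog messages (inactive for 200.30s, completed after 249.57s) recur throughout the attached messages files. A small helper (not part of the ticket, just a convenience for scanning the attachments) can pull the events out:

```python
import re

INACTIVE = re.compile(r"Service thread pid (\d+) was inactive for ([\d.]+)s")
COMPLETED = re.compile(r"Service thread pid (\d+) completed after ([\d.]+)s")

def watchdog_events(lines):
    """Yield (pid, state, seconds) for LNet watchdog messages in a syslog stream."""
    for line in lines:
        m = INACTIVE.search(line)
        if m:
            yield int(m.group(1)), "inactive", float(m.group(2))
            continue
        m = COMPLETED.search(line)
        if m:
            yield int(m.group(1)), "completed", float(m.group(2))

log = [
    "Apr  4 21:06:29 wolf-4 kernel: LNet: Service thread pid 12256 was inactive for 200.30s.",
    "Apr  4 21:07:19 wolf-4 kernel: LNet: Service thread pid 12256 completed after 249.57s.",
]
for pid, state, secs in watchdog_events(log):
    print(pid, state, secs)
```

Pairing inactive/completed events by pid gives a quick picture of how long each service thread was stuck.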

But the system appeared to keep running past that point, and then came to a grinding halt.

Run began: Mon Apr  4 23:25:07 2016
Command line used: /usr/bin/IOR -i 3 -v -b 50G -t 128k
Machine: Linux wolf-6.wolf.hpdd.intel.com
Start time skew across all tasks: 8.97 sec
Path: /mnt/lustre
FS: 37.9 TiB   Used FS: 0.6%   Inodes: 171.8 Mi   Used Inodes: 32.0%
Participating tasks: 4

Summary:
        api                = POSIX
        test filename      = testFile
        access             = single-shared-file
        pattern            = segmented (1 segment)
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 4 (1 per node)
        repetitions        = 3
        xfersize           = 131072 bytes
        blocksize          = 50 GiB
        aggregate filesize = 200 GiB
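As a sanity check on the IOR summary above, the aggregate filesize follows directly from the run parameters (back-of-envelope arithmetic, not additional measurement):

```python
GIB = 1 << 30
KIB = 1 << 10

tasks = 4              # clients = 4 (1 per node)
blocksize = 50 * GIB   # -b 50G
xfersize = 128 * KIB   # -t 128k

aggregate = tasks * blocksize
print(aggregate // GIB)        # 200, matching "aggregate filesize = 200 GiB"
print(blocksize // xfersize)   # 409600 transfers per task per repetition
```

The small 128 KiB transfer size against a 50 GiB block means each task issues ~410k RPC-sized writes per repetition, which is what keeps the OSS under sustained load.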

No messages appeared on the wolf-4 OSS (OST1) at the time, but the outage was noticed by the clients:

Apr  4 23:37:12 wolf-6 kernel: Lustre: 2811:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1459813025/real 0]  req@ffff880d8570ad00 x1530711614167736/t0(0) o400->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 224/224 e 0 to 1 dl 1459813032 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Apr  4 23:37:12 wolf-6 kernel: Lustre: coral-OST0001-osc-ffff881030662800: Connection to coral-OST0001 (at 192.168.1.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Apr  4 23:37:16 wolf-6 kernel: Lustre: 2751:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1459812992/real 1459812992]  req@ffff880266e0bc00 x1530711614167672/t0(0) o3->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:6/4 lens 608/432 e 0 to 1 dl 1459813036 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Apr  4 23:37:16 wolf-6 kernel: Lustre: 2751:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Apr  4 23:37:18 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1459813032/real 0]  req@ffff880bbe376900 x1530711614167740/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813038 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Apr  4 23:37:37 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813057/real 1459813057]  req@ffff880bbe375d00 x1530711614167756/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813068 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Apr  4 23:38:02 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813082/real 1459813082]  req@ffff880bbe375100 x1530711614167772/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813098 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Apr  4 23:38:11 wolf-6 ntpd_intres[2008]: ntp_intres.request: permission denied
Apr  4 23:38:11 wolf-6 ntpd_intres[2008]: ntp_intres.request: permission denied
Apr  4 23:38:27 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813107/real 1459813107]  req@ffff880bbe374500 x1530711614167788/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813128 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Apr  4 23:38:52 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813132/real 1459813132]  req@ffff880bbe373900 x1530711614167804/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813158 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

From stats being captured on OST1:

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:30:53     0     0      0     0    0     0    0     0    0   200M   62G  
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:31:29     0     0      0     0    0     0    0     0    0   200M   62G  
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:32:06     0     0      0     0    0     0    0     0    0   200M   62G  
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:32:42     0     0      0     0    0     0    0     0    0   200M   62G  
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:33:19     0     0      0     0    0     0    0     0    0   200M   62G  
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:33:55     0     0      0     0    0     0    0     0    0    62G   62G  
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:34:36     0     0      0     0    0     0    0     0    0    62G   62G  
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:35:18    10    10    100    10  100     0    0     0    0    62G   62G  
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:35:58    91    91    100    47  100    44  100     1  100    63G   62G  
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
23:36:39     5     5    100     5  100     0    0     0    0    62G   62G  
6 1 0x01 91 4368 5368758848 527202119883608
name                            type data
hits                            4    91379399
misses                          4    70802556
demand_data_hits                4    62879518
demand_data_misses              4    52414186
demand_metadata_hits            4    23738570
demand_metadata_misses          4    1105675
prefetch_data_hits              4    4757199
prefetch_data_misses            4    17281282
prefetch_metadata_hits          4    4112
prefetch_metadata_misses        4    1413
mru_hits                        4    62351920
mru_ghost_hits                  4    917736
mfu_hits                        4    24266169
mfu_ghost_hits                  4    632285
deleted                         4    116650113
mutex_miss                      4    447
evict_skip                      4    184
evict_not_enough                4    62
evict_l2_cached                 4    0
evict_l2_eligible               4    15019191377920
evict_l2_ineligible             4    408751927296
evict_l2_skip                   4    0
hash_elements                   4    1059177
hash_elements_max               4    1126648
hash_collisions                 4    11087755
hash_chains                     4    31686
hash_chain_max                  4    4
p                               4    63241464064
c                               4    67457282048
c_min                           4    33554432
c_max                           4    67457282048
size                            4    67456702536
hdr_size                        4    396803920
data_size                       4    66823651328
metadata_size                   4    76845056
other_size                      4    159402232
anon_size                       4    39337984
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    64780652544
mru_evictable_data              4    64752189440
mru_evictable_metadata          4    327680
mru_ghost_size                  4    2679063040
mru_ghost_evictable_data        4    2082865152
mru_ghost_evictable_metadata    4    596197888
mfu_size                        4    2080505856
mfu_evictable_data              4    2024931328
mfu_evictable_metadata          4    3112960
mfu_ghost_size                  4    56092389888
mfu_ghost_evictable_data        4    56057790464
mfu_ghost_evictable_metadata    4    34599424
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    633051832
arc_meta_limit                  4    50592961536
arc_meta_max                    4    809615080
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    2108039168 
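The headline ratios can be derived from the dump above (plain arithmetic on the values shown, written out in Python for convenience):

```python
# Values copied from the arcstats dump above (wolf-4 OST1).
hits, misses = 91_379_399, 70_802_556
size, c_max  = 67_456_702_536, 67_457_282_048
p, c         = 63_241_464_064, 67_457_282_048

hit_rate = hits / (hits + misses)
print(f"hit rate      {hit_rate:.1%}")       # ~56.3%
print(f"size / c_max  {size / c_max:.4f}")   # ARC is pinned at its maximum
print(f"p / c         {p / c:.3f}")          # p sits close to c
```

The ARC is at its configured maximum and the MRU target p is close to c, which is the observation Jinshan makes below about the absence of memory pressure.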

proc_arcstats.2016-04-04:21:52:28:misses                          4    60302081
proc_arcstats.2016-04-04:21:53:05:misses                          4    60302081
proc_arcstats.2016-04-04:21:53:41:misses                          4    60302081
proc_arcstats.2016-04-04:21:54:18:misses                          4    60302081
proc_arcstats.2016-04-04:21:54:55:misses                          4    60302081
proc_arcstats.2016-04-04:21:55:31:misses                          4    60302081
proc_arcstats.2016-04-04:21:56:08:misses                          4    60302081
proc_arcstats.2016-04-04:21:56:44:misses                          4    60302081
proc_arcstats.2016-04-04:21:57:21:misses                          4    60302081
proc_arcstats.2016-04-04:21:57:57:misses                          4    60302081
proc_arcstats.2016-04-04:21:58:34:misses                          4    60302081
proc_arcstats.2016-04-04:21:59:11:misses                          4    60302081
proc_arcstats.2016-04-04:21:59:47:misses                          4    60302099
proc_arcstats.2016-04-04:22:00:25:misses                          4    60302137
proc_arcstats.2016-04-04:22:01:06:misses                          4    60374920
proc_arcstats.2016-04-04:22:01:47:misses                          4    60513681
proc_arcstats.2016-04-04:22:02:27:misses                          4    60644796
proc_arcstats.2016-04-04:22:03:08:misses                          4    60782163
proc_arcstats.2016-04-04:22:03:49:misses                          4    60921202
proc_arcstats.2016-04-04:22:04:30:misses                          4    61049984
proc_arcstats.2016-04-04:22:05:10:misses                          4    61131959
proc_arcstats.2016-04-04:22:05:51:misses                          4    61133558
proc_arcstats.2016-04-04:22:06:32:misses                          4    61203715
proc_arcstats.2016-04-04:22:07:13:misses                          4    61317428
proc_arcstats.2016-04-04:22:07:54:misses                          4    61406420
proc_arcstats.2016-04-04:22:08:35:misses                          4    61520867
proc_arcstats.2016-04-04:22:09:15:misses                          4    61659833
proc_arcstats.2016-04-04:22:09:56:misses                          4    61796365
proc_arcstats.2016-04-04:22:10:37:misses                          4    61905346
proc_arcstats.2016-04-04:22:11:17:misses                          4    61977003
proc_arcstats.2016-04-04:22:11:56:misses                          4    61977043
proc_arcstats.2016-04-04:22:12:37:misses                          4    62083132
proc_arcstats.2016-04-04:22:13:17:misses                          4    62215655
proc_arcstats.2016-04-04:22:13:57:misses                          4    62348643
proc_arcstats.2016-04-04:22:14:38:misses                          4    62494512
proc_arcstats.2016-04-04:22:15:18:misses                          4    62630944
proc_arcstats.2016-04-04:22:15:59:misses                          4    62727592
proc_arcstats.2016-04-04:22:16:38:misses                          4    62806912
proc_arcstats.2016-04-04:22:17:18:misses                          4    62809594
proc_arcstats.2016-04-04:22:17:59:misses                          4    62832835
proc_arcstats.2016-04-04:22:18:40:misses                          4    62874008
proc_arcstats.2016-04-04:22:19:20:misses                          4    62915527
proc_arcstats.2016-04-04:22:20:00:misses                          4    62956780
proc_arcstats.2016-04-04:22:20:40:misses                          4    62998788
proc_arcstats.2016-04-04:22:21:20:misses                          4    63041800
proc_arcstats.2016-04-04:22:22:01:misses                          4    63083422
proc_arcstats.2016-04-04:22:22:41:misses                          4    63125355
proc_arcstats.2016-04-04:22:23:21:misses                          4    63167419
proc_arcstats.2016-04-04:22:24:01:misses                          4    63210702
proc_arcstats.2016-04-04:22:24:41:misses                          4    63252813
proc_arcstats.2016-04-04:22:25:21:misses                          4    63295877
proc_arcstats.2016-04-04:22:26:01:misses                          4    63339516
proc_arcstats.2016-04-04:22:26:41:misses                          4    63383531
proc_arcstats.2016-04-04:22:27:22:misses                          4    63417625
proc_arcstats.2016-04-04:22:28:02:misses                          4    63418160
proc_arcstats.2016-04-04:22:28:44:misses                          4    63418448
proc_arcstats.2016-04-04:22:29:25:misses                          4    63446807
proc_arcstats.2016-04-04:22:30:06:misses                          4    63605174
proc_arcstats.2016-04-04:22:30:46:misses                          4    63787444
proc_arcstats.2016-04-04:22:31:27:misses                          4    63991757
proc_arcstats.2016-04-04:22:32:08:misses                          4    64186064
proc_arcstats.2016-04-04:22:32:48:misses                          4    64370507
proc_arcstats.2016-04-04:22:33:29:misses                          4    64544830
proc_arcstats.2016-04-04:22:34:10:misses                          4    64705200
proc_arcstats.2016-04-04:22:34:51:misses                          4    64890908
proc_arcstats.2016-04-04:22:35:31:misses                          4    65051186
proc_arcstats.2016-04-04:22:36:12:misses                          4    65093562
proc_arcstats.2016-04-04:22:36:48:misses                          4    65093566
proc_arcstats.2016-04-04:22:37:25:misses                          4    65093566
proc_arcstats.2016-04-04:22:38:02:misses                          4    65093566
proc_arcstats.2016-04-04:22:38:38:misses                          4    65093566
proc_arcstats.2016-04-04:22:39:15:misses                          4    65093566
proc_arcstats.2016-04-04:22:39:51:misses                          4    65093566
proc_arcstats.2016-04-04:22:40:28:misses                          4    65093566
proc_arcstats.2016-04-04:22:41:05:misses                          4    65093566
proc_arcstats.2016-04-04:22:41:41:misses                          4    65093566
proc_arcstats.2016-04-04:22:42:18:misses                          4    65093566
proc_arcstats.2016-04-04:22:42:54:misses                          4    65093566
proc_arcstats.2016-04-04:22:43:31:misses                          4    65093566
proc_arcstats.2016-04-04:22:44:07:misses                          4    65093566
proc_arcstats.2016-04-04:22:44:44:misses                          4    65093593
proc_arcstats.2016-04-04:22:45:24:misses                          4    65093635
proc_arcstats.2016-04-04:22:46:05:misses                          4    65093688
proc_arcstats.2016-04-04:22:46:47:misses                          4    65204992
proc_arcstats.2016-04-04:22:47:27:misses                          4    65369668
proc_arcstats.2016-04-04:22:48:08:misses                          4    65543276
proc_arcstats.2016-04-04:22:48:49:misses                          4    65700706
proc_arcstats.2016-04-04:22:49:29:misses                          4    65835289
proc_arcstats.2016-04-04:22:50:10:misses                          4    65978426
proc_arcstats.2016-04-04:22:50:51:misses                          4    66102294
proc_arcstats.2016-04-04:22:51:32:misses                          4    66237487
proc_arcstats.2016-04-04:22:52:12:misses                          4    66372029
proc_arcstats.2016-04-04:22:52:53:misses                          4    66491789
proc_arcstats.2016-04-04:22:53:33:misses                          4    66606048
proc_arcstats.2016-04-04:22:54:14:misses                          4    66728350
proc_arcstats.2016-04-04:22:54:54:misses                          4    66759150
proc_arcstats.2016-04-04:22:55:35:misses                          4    66783980
proc_arcstats.2016-04-04:22:56:16:misses                          4    66882343
proc_arcstats.2016-04-04:22:56:56:misses                          4    66986649
proc_arcstats.2016-04-04:22:57:37:misses                          4    67099214
proc_arcstats.2016-04-04:22:58:18:misses                          4    67236074
proc_arcstats.2016-04-04:22:58:58:misses                          4    67359854
proc_arcstats.2016-04-04:22:59:39:misses                          4    67479208
proc_arcstats.2016-04-04:23:00:19:misses                          4    67603703
proc_arcstats.2016-04-04:23:01:00:misses                          4    67605334
proc_arcstats.2016-04-04:23:01:42:misses                          4    67687472
proc_arcstats.2016-04-04:23:02:22:misses                          4    67823336
proc_arcstats.2016-04-04:23:03:03:misses                          4    67950500
proc_arcstats.2016-04-04:23:03:44:misses                          4    68083660
proc_arcstats.2016-04-04:23:04:24:misses                          4    68221409
proc_arcstats.2016-04-04:23:05:04:misses                          4    68335712
proc_arcstats.2016-04-04:23:05:45:misses                          4    68441748
proc_arcstats.2016-04-04:23:06:24:misses                          4    68443307
proc_arcstats.2016-04-04:23:07:05:misses                          4    68454637
proc_arcstats.2016-04-04:23:07:46:misses                          4    68578571
proc_arcstats.2016-04-04:23:08:27:misses                          4    68690534
proc_arcstats.2016-04-04:23:09:07:misses                          4    68808692
proc_arcstats.2016-04-04:23:09:48:misses                          4    68930912
proc_arcstats.2016-04-04:23:10:28:misses                          4    69068534
proc_arcstats.2016-04-04:23:11:09:misses                          4    69192434
proc_arcstats.2016-04-04:23:11:50:misses                          4    69283782
proc_arcstats.2016-04-04:23:12:30:misses                          4    69284522
proc_arcstats.2016-04-04:23:13:12:misses                          4    69315159
proc_arcstats.2016-04-04:23:13:52:misses                          4    69373955
proc_arcstats.2016-04-04:23:14:33:misses                          4    69449303
proc_arcstats.2016-04-04:23:15:13:misses                          4    69523499
proc_arcstats.2016-04-04:23:15:53:misses                          4    69599070
proc_arcstats.2016-04-04:23:16:34:misses                          4    69674993
proc_arcstats.2016-04-04:23:17:14:misses                          4    69750733
proc_arcstats.2016-04-04:23:17:55:misses                          4    69827307
proc_arcstats.2016-04-04:23:18:35:misses                          4    69847427
proc_arcstats.2016-04-04:23:19:16:misses                          4    69848231
proc_arcstats.2016-04-04:23:19:57:misses                          4    69884019
proc_arcstats.2016-04-04:23:20:37:misses                          4    69946465
proc_arcstats.2016-04-04:23:21:18:misses                          4    70016284
proc_arcstats.2016-04-04:23:21:58:misses                          4    70081329
proc_arcstats.2016-04-04:23:22:39:misses                          4    70148143
proc_arcstats.2016-04-04:23:23:19:misses                          4    70213937
proc_arcstats.2016-04-04:23:23:59:misses                          4    70283057
proc_arcstats.2016-04-04:23:24:40:misses                          4    70353187
proc_arcstats.2016-04-04:23:25:21:misses                          4    70415880
proc_arcstats.2016-04-04:23:26:00:misses                          4    70415996
proc_arcstats.2016-04-04:23:26:37:misses                          4    70415996
proc_arcstats.2016-04-04:23:27:13:misses                          4    70415996
proc_arcstats.2016-04-04:23:27:50:misses                          4    70415996
proc_arcstats.2016-04-04:23:28:26:misses                          4    70415996
proc_arcstats.2016-04-04:23:29:03:misses                          4    70415996
proc_arcstats.2016-04-04:23:29:40:misses                          4    70415996
proc_arcstats.2016-04-04:23:30:16:misses                          4    70415996
proc_arcstats.2016-04-04:23:30:53:misses                          4    70415996
proc_arcstats.2016-04-04:23:31:29:misses                          4    70415996
proc_arcstats.2016-04-04:23:32:06:misses                          4    70415996
proc_arcstats.2016-04-04:23:32:42:misses                          4    70415996
proc_arcstats.2016-04-04:23:33:19:misses                          4    70415996
proc_arcstats.2016-04-04:23:33:55:misses                          4    70416034
proc_arcstats.2016-04-04:23:34:36:misses                          4    70416109
proc_arcstats.2016-04-04:23:35:18:misses                          4    70417565
proc_arcstats.2016-04-04:23:35:58:misses                          4    70597975
proc_arcstats.2016-04-04:23:36:39:misses                          4    70802556
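The striking feature of this series is the flat stretches where the misses counter does not advance for minutes at a time (e.g. 22:36:48 through 22:44:07, and again from 23:26 to 23:33), i.e. the OSS stopped servicing reads entirely. A small helper (not from the ticket) can flag those windows automatically:

```python
def stall_windows(samples, min_len=3):
    """Given (timestamp, counter) pairs, return runs where the counter is flat."""
    runs, start = [], None
    for i in range(1, len(samples)):
        flat = samples[i][1] == samples[i - 1][1]
        if flat and start is None:
            start = i - 1
        elif not flat and start is not None:
            if i - start >= min_len:
                runs.append((samples[start][0], samples[i - 1][0]))
            start = None
    if start is not None and len(samples) - start >= min_len:
        runs.append((samples[start][0], samples[-1][0]))
    return runs

# Subset of the proc_arcstats misses series above.
samples = [("22:36:48", 65093566), ("22:37:25", 65093566),
           ("22:38:02", 65093566), ("22:44:07", 65093566),
           ("22:44:44", 65093593), ("22:45:24", 65093635)]
print(stall_windows(samples))   # [('22:36:48', '22:44:07')]
```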



 Comments   
Comment by Jinshan Xiong (Inactive) [ 05/Apr/16 ]

Hi John,

Can you please reset the ZFS release to commit zfs-0.6.4-92 and run it one more time? zfs-0.6.4-92 doesn't include the ARC change, so this will help us confirm that we have reproduced LU-7404.

Comment by John Salinas (Inactive) [ 06/Apr/16 ]

This seems to hit every 2-4 hours on my wolf nodes with IOR running. I had to reserve 2G of memory to get the dump to complete; otherwise it would fail with OOM errors mid-dump. The good news is that this is solved; the bad news is that /var/crash is not big enough for a dump, so we only have a partial dump file:

[15001.472312] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[15001.481082] IP: [<ffffffffa00b675c>] arc_buf_info+0xc/0x190 [zfs]
[15001.487935] PGD 19e8306067 PUD 1d5370b067 PMD 0
[15001.493131] Oops: 0000 [#1] SMP
[15001.496766] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_powerclamp coretemp intel_rapl kvm_intel dm_service_time kvm ipmi_devintf iTCO_wdt iTCO_vendor_support ses enclosure crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mei_me sb_edac mei lpc_ich edac_core shpchp ioatdma pcspkr sg i2c_i801 mfd_core ipmi_ssif ipmi_si ipmi_msghandler acpi_power_meter wmi acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath dm_mod
[15001.576540] ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mlx4_ib vxlan ib_sa ip6_udp_tunnel ib_mad udp_tunnel ib_core ib_addr mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper ttm igb ahci crct10dif_pclmul ptp crct10dif_common crc32c_intel pps_core libahci drm mlx4_core mpt3sas dca libata raid_class i2c_algo_bit scsi_transport_sas i2c_core zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate
[15001.620342] CPU: 55 PID: 27643 Comm: dbufstat.py Tainted: P IOE ------------ 3.10.0-327.10.1.el7.x86_64 #1
[15001.632280] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[15001.643918] task: ffff880ff79cdc00 ti: ffff88039ed8c000 task.ti: ffff88039ed8c000
[15001.652279] RIP: 0010:[<ffffffffa00b675c>] [<ffffffffa00b675c>] arc_buf_info+0xc/0x190 [zfs]
[15001.661816] RSP: 0018:ffff88039ed8faf8 EFLAGS: 00010282
[15001.667739] RAX: ffff8810276cc740 RBX: ffff88092eb80128 RCX: 0000000000000000
[15001.675695] RDX: 0000000000000000 RSI: ffff88039ed8fd10 RDI: ffff880519a54de0
[15001.683653] RBP: ffff88039ed8faf8 R08: 0000000000000000 R09: 0000000000000000
[15001.691610] R10: 0000000000000000 R11: ffff88039ed8f9b6 R12: ffff880fb8f3e630
[15001.699567] R13: 0000000000001000 R14: ffff88039ed8fcd8 R15: 0000000000001000
[15001.707538] FS: 00007f2b0f887740(0000) GS:ffff881fbe860000(0000) knlGS:0000000000000000
[15001.716552] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15001.722959] CR2: 0000000000000028 CR3: 0000001dc7d84000 CR4: 00000000001407e0
[15001.730917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15001.738887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[15001.746844] Stack:
[15001.749084] ffff88039ed8fd98 ffffffffa00c7508 0000000000000000 00000000000469c0
[15001.757424] 00000008d3800000 0000000000020000 ffffffff00000000 000000d000000004
[15001.765717] 0000000000000000 00007f2b00000001 ffff881c00000000 ffffffff000200a0
[15001.774044] Call Trace:
[15001.776792] [<ffffffffa00c7508>] __dbuf_stats_hash_table_data+0x78/0x400 [zfs]
[15001.784965] [<ffffffffa00c7a07>] dbuf_stats_hash_table_data+0x177/0x2d0 [zfs]
[15001.793040] [<ffffffffa0010d63>] kstat_seq_show+0x123/0x3e0 [spl]
[15001.799948] [<ffffffff81202428>] seq_read+0x238/0x3a0
[15001.805689] [<ffffffff8124938d>] proc_reg_read+0x3d/0x80
[15001.811721] [<ffffffff811de49c>] vfs_read+0x9c/0x170
[15001.817353] [<ffffffff811defef>] SyS_read+0x7f/0xe0
[15001.822899] [<ffffffff81645a49>] system_call_fastpath+0x16/0x1b
[15001.829606] Code: 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 90 49 8b 74 24 50 e9 ed fe ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 4c 8b 0f 48 89 e5 <41> 8b 41 28 a9 00 00 02 00 0f 84 05 01 00 00 4d 8b 99 f8 00 00
[15001.851548] RIP [<ffffffffa00b675c>] arc_buf_info+0xc/0x190 [zfs]
[15001.858505] RSP <ffff88039ed8faf8>
[15001.862395] CR2: 0000000000000028

Comment by John Salinas (Inactive) [ 09/Apr/16 ]

A full dump can be found at: /scratch/dumps/wolf-4.wolf.hpdd.intel.com/10.8.1.4-2016-04-09-16\:36\:15/

For now I am disabling the periodic run of /bin/dbufstat.py (part of zfs) in our stat collection.
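For context, dbufstat.py is only a reader of the dbufs kstat, and the oops above fired inside that read path (dbuf_stats_hash_table_data() calling arc_buf_info()), so any collector touching that file is suspect. A minimal sketch of the same access, assuming the standard ZoL kstat path:

```shell
# dbufstat.py walks /proc/spl/kstat/zfs/dbufs; a plain read exercises the
# same dbuf_stats_hash_table_data() -> arc_buf_info() path that oopsed.
# Guarded so this is a no-op on hosts without the zfs module loaded.
if [ -r /proc/spl/kstat/zfs/dbufs ]; then
    cat /proc/spl/kstat/zfs/dbufs > /dev/null
fi
```

Pausing the collector avoids the crash, but a plain read like the above should reproduce it on demand for debugging.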

Comment by Jinshan Xiong (Inactive) [ 11/Apr/16 ]

From the arcstats, p is very close to c, and c equals c_max. This means there is no memory pressure at all, which can be verified by the direct or indirect reclaim counts below. The MRU ghost hit count is only 917736; I don't know what made p that large.
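The p/c/c_max comparison can be reproduced with a small helper that reads the arcstats kstat. A sketch; `arc_check` is a name introduced here, and the kstat path is the standard ZoL one:

```shell
# Print the ARC target size (c), its hard cap (c_max) and the MRU target
# (p), plus the p/c ratio being reasoned about above. arcstats lines have
# the form "name type data", so the value is field 3.
arc_check() {
    LC_ALL=C awk '$1=="p"||$1=="c"||$1=="c_max" {v[$1]=$3}
         END {printf "p=%s c=%s c_max=%s p/c=%.2f\n",
                     v["p"], v["c"], v["c_max"], v["p"]/v["c"]}' "$1"
}
# Guarded so the sketch is a no-op where ZFS is not loaded.
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    arc_check /proc/spl/kstat/zfs/arcstats
fi
```

A p/c ratio near 1.00 with c pinned at c_max is the state described above: the ARC is at its maximum and almost entirely dedicated to the MRU side.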

John: if the crash dump is problematic due to its size, we can reduce the amount of memory by appending 'mem=?G' to the boot line.
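On an EL7 node the boot-line change can be made persistently with grubby. A sketch, using the 2G figure that was actually used later in this ticket (pick a size that still lets the OSS stack boot):

```shell
# Cap usable RAM so the subsequent crash dump stays small enough to copy
# off the node. "mem=2G" matches the value used in this ticket.
grubby --update-kernel=ALL --args="mem=2G"
# After reboot, confirm the cap took effect:
grep -oE 'mem=[0-9]+[GM]' /proc/cmdline
```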

Comment by John Salinas (Inactive) [ 14/Apr/16 ]

Yes, as I mention in the second comment, I set mem=2G in order to get a dump. However, I believe that dump is due to a separate issue; I could be wrong on this. Please see the one full dump I have in the /scratch NFS dir on wolf.

Comment by John Salinas (Inactive) [ 14/Apr/16 ]

wolf has been running today, and the only error I have seen is this:

[26193.993322] LustreError: 7841:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[26194.007089] LustreError: 7841:0:(osd_handler.c:222:osd_trans_start()) Skipped 17 previous similar messages
[26195.013534] LustreError: 7812:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[26197.058425] LustreError: 7888:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[26197.072181] LustreError: 7888:0:(osd_handler.c:222:osd_trans_start()) Skipped 31 previous similar messages
[26202.409535] LustreError: 7824:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x142:0x0] from accounting ZAP for usr 0: rc = -52
[26202.473611] LustreError: 7824:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x142:0x0] from accounting ZAP for grp 0: rc = -52
[26213.811138] LustreError: 7823:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x141:0x0] from accounting ZAP for usr 0: rc = -52
[26213.834184] LustreError: 7823:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x141:0x0] from accounting ZAP for grp 0: rc = -52

[root@wolf-4 stat]# [26390.320802] INFO: task ll_ost_io04_001:7405 blocked for more than 120 seconds.
[26390.331510] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26390.342516] ll_ost_io04_001 D 0000000000000000     0  7405      2 0x00000080
[26390.352479]  ffff881fe19f3820 0000000000000046 ffff881fe19d0b80 ffff881fe19f3fd8
[26390.362848]  ffff881fe19f3fd8 ffff881fe19f3fd8 ffff881fe19d0b80 ffff882023826188
[26390.373083]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26390.383196] Call Trace:
[26390.387579]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26390.394805]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26390.403226]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26390.411237]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26390.418873]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26390.426784]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26390.434776]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26390.443327]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26390.451296]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26390.459814]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26390.467655]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26390.475793]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26390.483986]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
[26390.491214]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
[26390.500665]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
[26390.507766]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
[26390.515856]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26390.524919]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26390.533603]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26390.543120]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26390.551721]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26390.560263]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
[26390.568210]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
[26390.575685]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26390.583661]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26390.592883]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26390.599299]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26390.607549]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26390.614505]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26390.622721] INFO: task ll_ost_io06_001:7411 blocked for more than 120 seconds.
[26390.631720] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26390.641427] ll_ost_io06_001 D 0000000000000000     0  7411      2 0x00000080
[26390.650360]  ffff881fcfe83820 0000000000000046 ffff881fe19d5080 ffff881fcfe83fd8
[26390.659664]  ffff881fcfe83fd8 ffff881fcfe83fd8 ffff881fe19d5080 ffff882023826188
[26390.668954]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26390.678219] Call Trace:
[26390.681832]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26390.688244]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26390.696005]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26390.703365]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26390.710421]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26390.717767]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26390.725265]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26390.733340]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26390.740903]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26390.749078]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26390.756745]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26390.764682]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26390.772773]  [<ffffffff810bb685>] ? sched_clock_cpu+0x85/0xc0
[26390.780139]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26390.789160]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26390.797788]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26390.807286]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26390.815831]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26390.824360]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26390.832327]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26390.841543]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26390.847934]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26390.856187]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26390.863168]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26390.871421] INFO: task ll_ost_io06_002:7412 blocked for more than 120 seconds.
[26390.880442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26390.890160] ll_ost_io06_002 D 0000000000000000     0  7412      2 0x00000080
[26390.899063]  ffff881fcfea3820 0000000000000046 ffff881fe19d5c00 ffff881fcfea3fd8
[26390.908403]  ffff881fcfea3fd8 ffff881fcfea3fd8 ffff881fe19d5c00 ffff882023826188
[26390.917719]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26390.927047] Call Trace:
[26390.930756]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26390.937302]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26390.945182]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26390.952678]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26390.959892]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26390.967411]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26390.975079]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26390.983333]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26390.991091]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26390.999426]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26391.007107]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26391.015062]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26391.023154]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
[26391.030285]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
[26391.039640]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
[26391.046664]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
[26391.054686]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26391.063693]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26391.072310]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26391.081784]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26391.090329]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26391.098827]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
[26391.106759]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
[26391.114233]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26391.122197]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26391.131389]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26391.137766]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26391.145954]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26391.152640]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26391.160581] INFO: task ll_ost_io06_003:7798 blocked for more than 120 seconds.
[26391.169290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26391.178691] ll_ost_io06_003 D 0000000000000000     0  7798      2 0x00000080
[26391.187282]  ffff881fcf707820 0000000000000046 ffff880fe632b980 ffff881fcf707fd8
[26391.196297]  ffff881fcf707fd8 ffff881fcf707fd8 ffff880fe632b980 ffff882023826188
[26391.205367]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26391.214356] Call Trace:
[26391.217695]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26391.223932]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26391.231443]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26391.238673]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26391.245587]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26391.252750]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26391.260036]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26391.267960]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26391.275409]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26391.283418]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26391.290698]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26391.298359]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26391.306122]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
[26391.312957]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
[26391.321966]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
[26391.328705]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
[26391.336477]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26391.345249]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26391.353534]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26391.362639]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26391.370865]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26391.379030]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
[26391.386624]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
[26391.393808]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26391.401443]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26391.410245]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26391.416274]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26391.424248]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26391.430967]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26391.438826] INFO: task ll_ost_io04_003:7800 blocked for more than 120 seconds.
[26391.447481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26391.456819] ll_ost_io04_003 D 0000000000000000     0  7800      2 0x00000080
[26391.465395]  ffff881fceb03820 0000000000000046 ffff880fe632d080 ffff881fceb03fd8
[26391.474374]  ffff881fceb03fd8 ffff881fceb03fd8 ffff880fe632d080 ffff882023826188
[26391.483386]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26391.492473] Call Trace:
[26391.495789]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26391.501972]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26391.509493]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26391.516677]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26391.523585]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26391.530798]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26391.538139]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26391.546029]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26391.553411]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26391.561413]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26391.568753]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26391.576388]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26391.584183]  [<ffffffffa0e91e38>] ? ldlm_bl_timeout+0x18/0x30 [ptlrpc]
[26391.592094]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
[26391.601111]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
[26391.607880]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
[26391.615618]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26391.624310]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26391.632656]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26391.641806]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26391.650048]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26391.658298]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
[26391.665909]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
[26391.673105]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26391.680724]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26391.689587]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26391.695667]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26391.703689]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26391.710360]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26391.718298] INFO: task ll_ost_io04_005:7805 blocked for more than 120 seconds.
[26391.727050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26391.736388] ll_ost_io04_005 D 0000000000000000     0  7805      2 0x00000080
[26391.744976]  ffff881fcd54f820 0000000000000046 ffff880fe6328b80 ffff881fcd54ffd8
[26391.753892]  ffff881fcd54ffd8 ffff881fcd54ffd8 ffff880fe6328b80 ffff882023826188
[26391.762929]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26391.771955] Call Trace:
[26391.775349]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26391.781550]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26391.789027]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26391.796180]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26391.803088]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26391.810220]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26391.817518]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26391.825398]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26391.832827]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26391.840834]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26391.848185]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26391.855785]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26391.863569]  [<ffffffffa0e91e38>] ? ldlm_bl_timeout+0x18/0x30 [ptlrpc]
[26391.871459]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
[26391.880474]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
[26391.887226]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
[26391.894979]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26391.903711]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26391.912027]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26391.921212]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26391.929425]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26391.937700]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
[26391.945284]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
[26391.952467]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26391.960077]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26391.968982]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26391.975100]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26391.983005]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26391.989681]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26391.997598] INFO: task ll_ost_io06_005:7810 blocked for more than 120 seconds.
[26392.006309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26392.015707] ll_ost_io06_005 D 0000000000000000     0  7810      2 0x00000080
[26392.024252]  ffff881fc94fb820 0000000000000046 ffff882007258b80 ffff881fc94fbfd8
[26392.033325]  ffff881fc94fbfd8 ffff881fc94fbfd8 ffff882007258b80 ffff882023826188
[26392.042335]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26392.051388] Call Trace:
[26392.054740]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26392.060967]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26392.068507]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26392.075715]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26392.082626]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26392.089842]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26392.097207]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26392.105081]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26392.112560]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26392.120610]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26392.127927]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26392.135570]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26392.143385]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
[26392.150241]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
[26392.159307]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
[26392.166078]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
[26392.173820]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26392.182548]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26392.190893]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26392.200052]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26392.208379]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26392.216617]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
[26392.224260]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
[26392.231463]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26392.239134]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26392.248015]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26392.254058]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26392.262001]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26392.268656]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26392.276534] INFO: task ll_ost_io06_006:7811 blocked for more than 120 seconds.
[26392.285240] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26392.294631] ll_ost_io06_006 D 0000000000000000     0  7811      2 0x00000080
[26392.303192]  ffff881fc4f4f820 0000000000000046 ffff88200725e780 ffff881fc4f4ffd8
[26392.312202]  ffff881fc4f4ffd8 ffff881fc4f4ffd8 ffff88200725e780 ffff882023826188
[26392.321131]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26392.330084] Call Trace:
[26392.333495]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26392.339767]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26392.347297]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26392.354490]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26392.361328]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26392.368491]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26392.375831]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26392.383699]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26392.391100]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26392.399078]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26392.406398]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26392.414017]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26392.421798]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
[26392.428654]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
[26392.437717]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
[26392.444446]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
[26392.452174]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26392.460902]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26392.469160]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26392.478343]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26392.486615]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26392.494801]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
[26392.502410]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
[26392.509565]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26392.517186]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26392.526081]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26392.532152]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26392.540030]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26392.546706]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26392.554619] INFO: task ll_ost_io04_006:7812 blocked for more than 120 seconds.
[26392.563275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26392.572639] ll_ost_io04_006 D 0000000000000000     0  7812      2 0x00000080
[26392.581244]  ffff881fc2dbb820 0000000000000046 ffff88200725f300 ffff881fc2dbbfd8
[26392.590196]  ffff881fc2dbbfd8 ffff881fc2dbbfd8 ffff88200725f300 ffff882023826188
[26392.599200]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26392.608179] Call Trace:
[26392.611461]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26392.617615]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26392.625146]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26392.632336]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26392.639157]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26392.646282]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26392.653554]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26392.661465]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26392.668859]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26392.676864]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26392.684204]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26392.691773]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26392.699584]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
[26392.706366]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
[26392.715432]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
[26392.722155]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
[26392.729943]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26392.738687]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26392.746981]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26392.756134]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26392.764374]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26392.772528]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
[26392.780169]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
[26392.787304]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26392.794941]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26392.803821]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26392.809884]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26392.817799]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26392.824441]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26392.832351] INFO: task ll_ost_io06_007:7814 blocked for more than 120 seconds.
[26392.841032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26392.850402] ll_ost_io06_007 D 0000000000000000     0  7814      2 0x00000080
[26392.858975]  ffff882003b8f820 0000000000000046 ffff881ffb070b80 ffff882003b8ffd8
[26392.868112]  ffff882003b8ffd8 ffff882003b8ffd8 ffff881ffb070b80 ffff882023826188
[26392.877083]  ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000
[26392.886070] Call Trace:
[26392.889475]  [<ffffffff8163aab9>] schedule+0x29/0x70
[26392.895733]  [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl]
[26392.903261]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[26392.910430]  [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl]
[26392.917280]  [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs]
[26392.924478]  [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs]
[26392.931789]  [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[26392.939703]  [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd]
[26392.947081]  [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd]
[26392.955051]  [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd]
[26392.962394]  [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc]
[26392.970013]  [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc]
[26392.977795]  [<ffffffff81631c85>] ? __slab_free+0x10e/0x277
[26392.984610]  [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd]
[26392.993681]  [<ffffffff810c14bc>] ? update_curr+0xcc/0x150
[26393.000376]  [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0
[26393.008111]  [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[26393.016799]  [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[26393.025125]  [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[26393.034281]  [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[26393.042522]  [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[26393.050759]  [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
[26393.058388]  [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
[26393.065579]  [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[26393.073196]  [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[26393.082080]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[26393.088140]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[26393.096053]  [<ffffffff81645a58>] ret_from_fork+0x58/0x90
[26393.102718]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[27451.932819] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[27451.945718] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) Skipped 2 previous similar messages
[27453.007869] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[27453.020784] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) Skipped 1 previous similar message
[27455.047314] LustreError: 7817:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[27462.623135] LustreError: 7821:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[27462.624172] LustreError: 7823:0:(ofd_dev.c:1884:ofd_destroy_hdl()) lsdraid-OST0001: error destroying object [0x100010000:0x144:0x0]: -5
[27462.652652] LustreError: 7821:0:(osd_handler.c:222:osd_trans_start()) Skipped 1 previous similar message
[28445.940895] perf interrupt took too long (5003 > 5000), lowering kernel.perf_event_max_sample_rate to 25000

[root@wolf-4 stat]# [32239.225932] LustreError: 7821:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x145:0x0] from accounting ZAP for usr 0: rc = -52
[32239.248290] LustreError: 7821:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x145:0x0] from accounting ZAP for grp 0: rc = -52

Comment by Jinshan Xiong (Inactive) [ 14/Apr/16 ]

The message "lsdraid-OST0001: can't assign tx: rc = -5" seems serious; rc = -5 is -EIO. One thing I would like to do is check the status of the zpool. I will log into the node tomorrow and take a look.
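Since -EIO here surfaces from dmu_tx_assign(), a device-level problem is the first thing to rule out. A triage sketch of commands to run on the OSS (these are standard ZoL 0.6.x subcommands, not something recorded in this ticket):

```shell
# Show only unhealthy pools ("all pools are healthy" otherwise),
# with per-device error detail.
zpool status -xv
# Recent ZFS error/ereport events, newest last.
zpool events -v | tail -n 40
# Any multipath/SCSI noise in the kernel log around the same time.
dmesg | grep -iE 'mpath|I/O error|blk_update' | tail -n 20
```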

Comment by John Salinas (Inactive) [ 05/May/16 ]

Re-did the configuration, so we now have 4 OSTs across the 2 OSS nodes, each zpool a raidz2 with 11 data disks and 1 spare:

# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ  FRAG   CAP  DEDUP  HEALTH  ALTROOT
ost0    10T   123G  9.88T         -    0%    1%  1.00x  ONLINE  -
ost1    10T   123G  9.88T         -    0%    1%  1.00x  ONLINE  -

# zpool status -v ost0
  pool: ost0
 state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        ost0         ONLINE       0     0     0
          raidz2-0   ONLINE       0     0     0
            mpathaj  ONLINE       0     0     0
            mpathai  ONLINE       0     0     0
            mpathah  ONLINE       0     0     0
            mpathag  ONLINE       0     0     0
            mpathaq  ONLINE       0     0     0
            mpathap  ONLINE       0     0     0
            mpathak  ONLINE       0     0     0
            mpathz   ONLINE       0     0     0
            mpatham  ONLINE       0     0     0
            mpathal  ONLINE       0     0     0
            mpathao  ONLINE       0     0     0
        spares
          mpathan    AVAIL
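For reference, the layout above could be re-created with something like the following. The device names are taken from the status output; the ashift value is an assumption (typical for 4K-sector disks) and is not recorded in this ticket:

```shell
# Hypothetical re-creation of ost0: one raidz2 vdev over 11 multipath
# devices plus one hot spare. ashift=12 assumes 4K-sector drives.
zpool create -o ashift=12 ost0 \
    raidz2 mpathaj mpathai mpathah mpathag mpathaq mpathap \
           mpathak mpathz mpatham mpathal mpathao \
    spare mpathan
```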

After several hours we hit:
[679742.303316] LNet: Service thread pid 53598 was inactive for 200.31s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[679742.324593] Pid: 53598, comm: ll_ost02_003
[679742.331254]
[679742.331254] Call Trace:
[679742.340814] [<ffffffff8163ae49>] schedule+0x29/0x70
[679742.348390] [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
[679742.357295] [<ffffffffa0162b60>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
[679742.366769] [<ffffffff8101c829>] ? read_tsc+0x9/0x10
[679742.374128] [<ffffffff810d816c>] ? ktime_get_ts64+0x4c/0xf0
[679742.382209] [<ffffffff8163a47e>] io_schedule_timeout+0xae/0x130
[679742.390536] [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
[679742.399626] [<ffffffff8163a518>] io_schedule+0x18/0x20
[679742.407025] [<ffffffffa0014b34>] cv_wait_common+0xf4/0x2d0 [spl]
[679742.415409] [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
[679742.424171] [<ffffffffa0014d68>] __cv_wait_io+0x18/0x20 [spl]
[679742.432274] [<ffffffffa01b3ccb>] zio_wait+0x1fb/0x3e0 [zfs]
[679742.440120] [<ffffffffa00d8509>] dbuf_read+0x389/0xda0 [zfs]
[679742.448070] [<ffffffffa00f8aa6>] dmu_tx_count_free+0x316/0x9d0 [zfs]
[679742.456677] [<ffffffffa00f92ba>] dmu_tx_hold_free+0x15a/0x310 [zfs]
[679742.465269] [<ffffffffa00e3a1b>] dmu_free_long_range+0x1eb/0x370 [zfs]
[679742.473940] [<ffffffffa101b468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
[679742.483415] [<ffffffffa101b7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
[679742.492815] [<ffffffffa101e12a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
[679742.501381] [<ffffffffa13a63bf>] ofd_trans_stop+0x1f/0x60 [ofd]
[679742.509314] [<ffffffffa13a9fd2>] ofd_object_destroy+0x2b2/0x890 [ofd]
[679742.517793] [<ffffffffa13a2d17>] ofd_destroy_by_fid+0x307/0x510 [ofd]
[679742.526400] [<ffffffffa10921b0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[679742.535106] [<ffffffffa108c770>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
[679742.543932] [<ffffffffa1394fc7>] ofd_destroy_hdl+0x267/0xa50 [ofd]
[679742.552208] [<ffffffffa111fe65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[679742.561087] [<ffffffffa10cc54b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[679742.570871] [<ffffffffa0d5e488>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[679742.579628] [<ffffffffa10cadf8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[679742.588394] [<ffffffff810b8c42>] ? default_wake_function+0x12/0x20
[679742.596477] [<ffffffff810af048>] ? __wake_up_common+0x58/0x90
[679742.604093] [<ffffffffa10d05f0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[679742.612254] [<ffffffffa10cfb60>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
[679742.620262] [<ffffffff810a5aef>] kthread+0xcf/0xe0
[679742.626677] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
[679742.633165] [<ffffffff81645e18>] ret_from_fork+0x58/0x90
[679742.640131] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
[679742.646598]
[679742.649149] LustreError: dumping log to /tmp/lustre-log.1462399201.53598
[680136.855471] Lustre: 53889:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
[680136.855471] req@ffff880fd8048c50 x1533409120101488/t4298772688(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/432 e 24 to 0 dl 1462399600 ref 2 fl Interpret:/0/0 rc 0/0
[680143.700879] Lustre: lsdraid-OST0000: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
[680143.713507] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)

134217728 ns 4 819
268435456 ns 4 331
536870912 ns 4 1
[root@wolf-3 proc]# cat spl/kstat/zfs/dmu_tx
5 1 0x01 11 528 5351754919 689457420447816
name type data
dmu_tx_assigned 4 7598469
dmu_tx_delay 4 0
dmu_tx_error 4 0
dmu_tx_suspended 4 0
dmu_tx_group 4 2209
dmu_tx_memory_reserve 4 0
dmu_tx_memory_reclaim 4 0
dmu_tx_dirty_throttle 4 0
dmu_tx_dirty_delay 4 5878718
dmu_tx_dirty_over_max 4 43
dmu_tx_quota 4 0

[335587.853762] LNet: Service thread pid 62899 was inactive for 200.63s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[335587.875541] Pid: 62899, comm: ll_ost02_003
[335587.881750]
[335587.881750] Call Trace:
[335587.888632] [<ffffffff8163ae49>] schedule+0x29/0x70
[335587.895317] [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
[335587.902969] [<ffffffff81647a71>] ? ftrace_call+0x5/0x2f
[335587.910044] [<ffffffff8163a47e>] io_schedule_timeout+0xae/0x130
[335587.918461] [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
[335587.927211] [<ffffffff8163a518>] io_schedule+0x18/0x20
[335587.934223] [<ffffffffa0014b34>] cv_wait_common+0xf4/0x2d0 [spl]
[335587.942193] [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
[335587.950855] [<ffffffffa0014d68>] __cv_wait_io+0x18/0x20 [spl]
[335587.958626] [<ffffffffa01b3ccb>] zio_wait+0x1fb/0x3e0 [zfs]
[335587.966154] [<ffffffffa00d8509>] dbuf_read+0x389/0xda0 [zfs]
[335587.973730] [<ffffffff811c0505>] ? kfree+0x5/0x140
[335587.980383] [<ffffffffa00f8aa6>] dmu_tx_count_free+0x316/0x9d0 [zfs]
[335587.989011] [<ffffffffa00f92ba>] dmu_tx_hold_free+0x15a/0x310 [zfs]
[335587.997760] [<ffffffffa00e3a1b>] dmu_free_long_range+0x1eb/0x370 [zfs]
[335588.006710] [<ffffffffa1267468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
[335588.016378] [<ffffffffa12677d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
[335588.026054] [<ffffffffa126a12a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
[335588.034667] [<ffffffffa13683bf>] ofd_trans_stop+0x1f/0x60 [ofd]
[335588.042533] [<ffffffffa136bfd2>] ofd_object_destroy+0x2b2/0x890 [ofd]
[335588.051511] [<ffffffffa1364d17>] ofd_destroy_by_fid+0x307/0x510 [ofd]
[335588.060174] [<ffffffffa10351b0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[335588.068843] [<ffffffffa102f770>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
[335588.078141] [<ffffffffa1356fc7>] ofd_destroy_hdl+0x267/0xa50 [ofd]
[335588.086420] [<ffffffffa10c2e65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[335588.095262] [<ffffffffa106f54b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[335588.104915] [<ffffffff81647a71>] ? ftrace_call+0x5/0x2f
[335588.112281] [<ffffffffa106ddf8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[335588.121002] [<ffffffffa10735f0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[335588.129141] [<ffffffffa1072b60>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
[335588.137394] [<ffffffff810a5aef>] kthread+0xcf/0xe0
[335588.143860] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
[335588.150534] [<ffffffff81645e18>] ret_from_fork+0x58/0x90
[335588.157549] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
[335588.164131]
[335588.166700] LustreError: dumping log to /tmp/lustre-log.1461715083.62899
[335981.877892] Lustre: 62655:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
[335981.877892] req@ffff880fd8420c50 x1532585040628720/t4313491535(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/432 e 19 to 0 dl 1461715482 ref 2 fl Interpret:/0/0 rc 0/0
[335981.913494] Lustre: 62655:0:(service.c:1335:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
[335988.734461] Lustre: lsdraid-OST0000: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
[335988.746876] Lustre: lsdraid-OST0000: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
[335988.759401] Lustre: lsdraid-OST0000: deleting orphan objects from 0x0:612 to 0x0:641
[336371.937986] Lustre: 62899:0:(service.c:2096:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:385s); client may timeout. req@ffff880fd8420c50 x1532585040628720/t4313491535(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/400 e 19 to 0 dl 1461715482 ref 1 fl Complete:/0/0 rc 0/0
[336371.973053] LNet: Service thread pid 62899 completed after 985.17s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

    [root@wolf-3 proc]# cat spl/kstat/zfs/dmu_tx
    5 1 0x01 11 528 5351754919 739071997976145
    name type data
    dmu_tx_assigned 4 15535972
    dmu_tx_delay 4 0
    dmu_tx_error 4 0
    dmu_tx_suspended 4 0
    dmu_tx_group 4 2322
    dmu_tx_memory_reserve 4 0
    dmu_tx_memory_reclaim 4 0
    dmu_tx_dirty_throttle 4 0
    dmu_tx_dirty_delay 4 13023543
    dmu_tx_dirty_over_max 4 133
    dmu_tx_quota 4 0
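
Comparing the two dmu_tx samples above shows which counters moved while the OST threads were stuck: dmu_tx_dirty_delay grew by roughly 7.1M while dmu_tx_delay stayed at 0. A minimal sketch of diffing two such snapshots (the values below are copied from the two samples in this ticket; on a live OSS the counters come from /proc/spl/kstat/zfs/dmu_tx):

```shell
#!/bin/sh
# Diff two saved snapshots of /proc/spl/kstat/zfs/dmu_tx to see which
# counters advanced during the hang. Values are the two samples above.
cat > /tmp/dmu_tx.1 <<'EOF'
dmu_tx_assigned 4 7598469
dmu_tx_delay 4 0
dmu_tx_dirty_delay 4 5878718
dmu_tx_dirty_over_max 4 43
EOF
cat > /tmp/dmu_tx.2 <<'EOF'
dmu_tx_assigned 4 15535972
dmu_tx_delay 4 0
dmu_tx_dirty_delay 4 13023543
dmu_tx_dirty_over_max 4 133
EOF
# Join the snapshots on counter name and print the per-counter delta.
# Fields after join: name type1 val1 type2 val2, so the delta is $5 - $3.
join /tmp/dmu_tx.1 /tmp/dmu_tx.2 | awk '{ printf "%s delta=%d\n", $1, $5 - $3 }'
```

The large dmu_tx_dirty_delay delta with zero dmu_tx_delay is consistent with writers being slowed by the dirty-data throttle rather than transactions failing to assign.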
    [root@wolf-3 proc]# [739014.340130] Lustre: 53889:0:(service.c:2096:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:431s); client may timeout. req@ffff880fd804d450 x1533409120922528/t4298730344(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/400 e 0 to 0 dl 1462458074 ref 1 fl Complete:/0/0 rc 0/0
    [739028.651612] LNet: Service thread pid 53598 was inactive for 1200.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
    [739028.675865] Pid: 53598, comm: ll_ost02_003
    [739028.682914]
    [739028.682914] Call Trace:
    [739028.692109] [<ffffffff8163ae49>] schedule+0x29/0x70
    [739028.699880] [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
    [739028.709055] [<ffffffffa0162b60>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
    [739028.718690] [<ffffffff8101c829>] ? read_tsc+0x9/0x10
    [739028.726286] [<ffffffff810d816c>] ? ktime_get_ts64+0x4c/0xf0
    [739028.734553] [<ffffffff8163a47e>] io_schedule_timeout+0xae/0x130
    [739028.743102] [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
    [739028.752390] [<ffffffff8163a518>] io_schedule+0x18/0x20
    [739028.760058] [<ffffffffa0014b34>] cv_wait_common+0xf4/0x2d0 [spl]
    [739028.768673] [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
    [739028.777539] [<ffffffffa0014d68>] __cv_wait_io+0x18/0x20 [spl]
    [739028.785539] [<ffffffffa01b3ccb>] zio_wait+0x1fb/0x3e0 [zfs]
    [739028.793281] [<ffffffffa00d8509>] dbuf_read+0x389/0xda0 [zfs]
    [739028.801142] [<ffffffffa00f8aa6>] dmu_tx_count_free+0x316/0x9d0 [zfs]
    [739028.809709] [<ffffffffa00f92ba>] dmu_tx_hold_free+0x15a/0x310 [zfs]
    [739028.818152] [<ffffffffa00e3a1b>] dmu_free_long_range+0x1eb/0x370 [zfs]
    [739028.826887] [<ffffffffa101b468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
    [739028.836333] [<ffffffffa101b7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
    [739028.845744] [<ffffffffa101e12a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
    [739028.854301] [<ffffffffa13a63bf>] ofd_trans_stop+0x1f/0x60 [ofd]
    [739028.862236] [<ffffffffa13a9fd2>] ofd_object_destroy+0x2b2/0x890 [ofd]
    [739028.870718] [<ffffffffa13a2d17>] ofd_destroy_by_fid+0x307/0x510 [ofd]
    [739028.879272] [<ffffffffa10921b0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
    [739028.887981] [<ffffffffa108c770>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
    [739028.896822] [<ffffffffa1394fc7>] ofd_destroy_hdl+0x267/0xa50 [ofd]
    [739028.905006] [<ffffffffa111fe65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
    [739028.913852] [<ffffffffa10cc54b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
    [739028.923532] [<ffffffffa0d5e488>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
    [739028.932235] [<ffffffffa10cadf8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
    [739028.940874] [<ffffffff810b8c42>] ? default_wake_function+0x12/0x20
    [739028.948909] [<ffffffff810af048>] ? __wake_up_common+0x58/0x90
    [739028.956470] [<ffffffffa10d05f0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
    [739028.964477] [<ffffffffa10cfb60>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
    [739028.972438] [<ffffffff810a5aef>] kthread+0xcf/0xe0
    [739028.978828] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
    [739028.985305] [<ffffffff81645e18>] ret_from_fork+0x58/0x90
    [739028.992242] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
    [739028.998681]
    [739029.001223] LustreError: dumping log to /tmp/lustre-log.1462458520.53598
    [739136.811037] Lustre: 53598:0:(service.c:2096:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:554s); client may timeout. req@ffff880fd804d050 x1533409120922576/t4305999451(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/400 e 0 to 0 dl 1462458074 ref 1 fl Complete:/0/0 rc 0/0
    [739136.845944] LNet: Service thread pid 53598 completed after 1308.26s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
    [740277.248500] Lustre: 53889:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (3/-152), not sending early reply
    [740277.248500] req@ffff880fd8049050 x1533409120946656/t4299258359(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/432 e 0 to 0 dl 1462459772 ref 2 fl Interpret:/0/0 rc 0/0
    [740277.287603] Lustre: 53889:0:(service.c:1335:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
    [740281.780696] Lustre: lsdraid-OST0001: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
    [740281.793398] Lustre: lsdraid-OST0001: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
    [740281.806603] Lustre: Skipped 1 previous similar message
    [740281.814757] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:579 to 0x0:609
    [740727.562148] LNet: Service thread pid 53598 was inactive for 1202.40s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
    [740727.585245] Pid: 53598, comm: ll_ost02_003
    [740727.591721]
    [740727.591721] Call Trace:
    [740727.599850] [<ffffffff8163ae49>] schedule+0x29/0x70
    [740727.607143] [<ffffffff81638b39>] schedule_timeout+0x209/0x2d0
    [740727.615515] [<ffffffffa0162b60>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
    [740727.624759] [<ffffffff8101c829>] ? read_tsc+0x9/0x10
    [740727.632053] [<ffffffff810d816c>] ? ktime_get_ts64+0x4c/0xf0
    [740727.639920] [<ffffffff8163a47e>] io_schedule_timeout+0xae/0x130
    [740727.648166] [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
    [740727.657014] [<ffffffff8163a518>] io_schedule+0x18/0x20
    [740727.664332] [<ffffffffa0014b34>] cv_wait_common+0xf4/0x2d0 [spl]
    [740727.672529] [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
    [740727.681141] [<ffffffffa0014d68>] __cv_wait_io+0x18/0x20 [spl]
    [740727.689033] [<ffffffffa01b3ccb>] zio_wait+0x1fb/0x3e0 [zfs]
    [740727.696713] [<ffffffffa00d8509>] dbuf_read+0x389/0xda0 [zfs]
    [740727.704421] [<ffffffffa00dcad0>] dmu_buf_will_dirty+0x70/0x120 [zfs]
    [740727.712909] [<ffffffffa00fc5d6>] dnode_dirty_l1+0x36/0x50 [zfs]
    [740727.720855] [<ffffffffa010197a>] dnode_free_range+0x4fa/0xbb0 [zfs]
    [740727.729183] [<ffffffffa0153095>] ? txg_rele_to_quiesce+0x45/0x80 [zfs]
    [740727.737748] [<ffffffffa00f6fae>] ? dmu_tx_assign+0x50e/0x760 [zfs]
    [740727.745922] [<ffffffffa00e3a45>] dmu_free_long_range+0x215/0x370 [zfs]
    [740727.754422] [<ffffffffa101b468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
    [740727.763708] [<ffffffffa101b7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
    [740727.772977] [<ffffffffa101e12a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
    [740727.781391] [<ffffffffa13a63bf>] ofd_trans_stop+0x1f/0x60 [ofd]
    [740727.789208] [<ffffffffa13a9fd2>] ofd_object_destroy+0x2b2/0x890 [ofd]
    [740727.797612] [<ffffffffa13a2d17>] ofd_destroy_by_fid+0x307/0x510 [ofd]
    [740727.806069] [<ffffffffa10921b0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
    [740727.814704] [<ffffffffa108c770>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
    [740727.823508] [<ffffffffa1394fc7>] ofd_destroy_hdl+0x267/0xa50 [ofd]
    [740727.831667] [<ffffffffa111fe65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
    [740727.840467] [<ffffffffa10cc54b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
    [740727.850109] [<ffffffffa0d5e488>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
    [740727.858775] [<ffffffffa10cadf8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
    [740727.867381] [<ffffffff810b8c42>] ? default_wake_function+0x12/0x20
    [740727.875391] [<ffffffff810af048>] ? __wake_up_common+0x58/0x90
    [740727.882922] [<ffffffffa10d05f0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
    [740727.890901] [<ffffffffa10cfb60>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
    [740727.898821] [<ffffffff810a5aef>] kthread+0xcf/0xe0
    [740727.905168] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
    [740727.911593] [<ffffffff81645e18>] ret_from_fork+0x58/0x90
    [740727.918478] [<ffffffff810a5a20>] ? kthread+0x0/0xe0
    [740727.924859]
    [740727.927331] LustreError: dumping log to /tmp/lustre-log.1462460220.53598
    [741038.149730] Lustre: lsdraid-OST0001: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
    [741038.162186] Lustre: lsdraid-OST0001: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
    [741038.175159] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:579 to 0x0:641
    [741793.608101] Lustre: lsdraid-OST0001: Client lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib) reconnecting
    [741793.620544] Lustre: lsdraid-OST0001: Connection restored to lsdraid-MDT0000-mdtlov_UUID (at 192.168.1.5@o2ib)
    [741793.633589] Lustre: lsdraid-OST0001: deleting orphan objects from 0x0:579 to 0x0:673
    [742172.088263] Lustre: 53598:0:(service.c:2096:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:1893s); client may timeout. req@ffff880fd8049050 x1533409120946656/t4299258359(0) o6->lsdraid-MDT0000-mdtlov_UUID@192.168.1.5@o2ib:-1/-1 lens 664/400 e 0 to 0 dl 1462459772 ref 1 fl Complete:/0/0 rc 0/0
    [742172.123285] LNet: Service thread pid 53598 completed after 2647.75s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

From the client perspective, things appear to have deadlocked and are no longer making progress. Going to reboot OSS0/wolf-3.

Comment by John Salinas (Inactive) [ 13/Apr/17 ]

This is over a year old; I don't recall hitting this on 2.9.0 with ZFS 0.7.0 RC3.

Generated at Sat Feb 10 07:53:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.