[LU-7987] Lustre 2.8 OSS with zfs 0.6.5 backend hitting most schedule_timeout Created: 05/Apr/16 Updated: 14/Jun/18 Resolved: 13/Apr/17 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | John Salinas (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Low Priority | Votes: | 0 |
| Labels: | LS_RZ |
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This was running with 2 OSS nodes (one zfs zpool on each), 1 MDS, and 4 clients running IOR. The first message was seen fairly soon after starting; about 2 hours into the run, Lustre stopped responding to the clients and the stat-collection scripts running on the OSS stopped. The first message seen was:

Apr 4 21:06:29 wolf-4 kernel: Pid: 12256, comm: ll_ost02_000
Apr 4 21:06:29 wolf-4 kernel: Call Trace:
Apr 4 21:06:29 wolf-4 kernel: [<ffffffff8163a9e9>] schedule+0x29/0x70
Apr 4 21:06:29 wolf-4 kernel: [<ffffffff816386d9>] schedule_timeout+0x209/0x2d0
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa012dc0c>] ? vdev_mirror_io_start+0xac/0x190 [zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa012d150>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffff8101c829>] ? read_tsc+0x9/0x10
Apr 4 21:06:29 wolf-4 kernel: [<ffffffff810d814c>] ? ktime_get_ts64+0x4c/0xf0
Apr 4 21:06:29 wolf-4 kernel: [<ffffffff8163a01e>] io_schedule_timeout+0xae/0x130
Apr 4 21:06:29 wolf-4 kernel: [<ffffffff810a6896>] ? prepare_to_wait_exclusive+0x56/0x90
Apr 4 21:06:29 wolf-4 kernel: [<ffffffff8163a0b8>] io_schedule+0x18/0x20
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa00116e0>] cv_wait_common+0xb0/0x150 [spl]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffff810a6ae0>] ? autoremove_wake_function+0x0/0x40
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa00117d8>] __cv_wait_io+0x18/0x20 [spl]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa016db83>] zio_wait+0x123/0x210 [zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa00c44ad>] dbuf_read+0x30d/0x930 [zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa00c60eb>] dmu_buf_will_dirty+0x4b/0xa0 [zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa00deb46>] dnode_dirty_l1+0x36/0x50 [zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa00e146a>] dnode_free_range+0x4da/0x6b0 [zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa00cb829>] dmu_free_long_range+0x1d9/0x280 [zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa11fe468>] osd_unlinked_object_free+0x28/0x280 [osd_zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa11fe7d3>] osd_unlinked_list_emptify+0x63/0xa0 [osd_zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa120112a>] osd_trans_stop+0x31a/0x5b0 [osd_zfs]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa12fef9f>] ofd_trans_stop+0x1f/0x60 [ofd]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa1302bb2>] ofd_object_destroy+0x2b2/0x890 [ofd]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa12fb8f7>] ofd_destroy_by_fid+0x307/0x510 [ofd]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa0fce170>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa0fc8730>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa12edc67>] ofd_destroy_hdl+0x267/0xa50 [ofd]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa105af65>] tgt_request_handle+0x915/0x1320 [ptlrpc]
Apr 4 21:06:29 wolf-4 kernel: [<ffffffffa1007beb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
Apr 4 21:06:30 wolf-4 kernel: [<ffffffffa0cec578>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
Apr 4 21:06:30 wolf-4 kernel: [<ffffffffa1006498>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Apr 4 21:06:30 wolf-4 kernel: [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20
Apr 4 21:06:30 wolf-4 kernel: [<ffffffff810af028>] ? __wake_up_common+0x58/0x90
Apr 4 21:06:30 wolf-4 kernel: [<ffffffffa100bc90>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
Apr 4 21:06:30 wolf-4 kernel: [<ffffffffa100b200>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
Apr 4 21:06:30 wolf-4 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Apr 4 21:06:30 wolf-4 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Apr 4 21:06:30 wolf-4 kernel: [<ffffffff81645998>] ret_from_fork+0x58/0x90
Apr 4 21:06:30 wolf-4 kernel: [<ffffffff810a5a20>] ? kthread+0x0/0xe0
Apr 4 21:06:30 wolf-4 kernel:
Apr 4 21:06:30 wolf-4 kernel: LustreError: dumping log to /tmp/lustre-log.1459803990.12256
Apr 4 21:07:12 wolf-4 ntpd_intres[3829]: ntp_intres.request: permission denied
Apr 4 21:07:12 wolf-4 ntpd_intres[3829]: ntp_intres.request: permission denied
Apr 4 21:07:19 wolf-4 kernel: LNet: Service thread pid 12256 completed after 249.57s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

But the system appeared to keep running past that point and then came to a grinding halt.

Run began: Mon Apr 4 23:25:07 2016
Command line used: /usr/bin/IOR -i 3 -v -b 50G -t 128k
Machine: Linux wolf-6.wolf.hpdd.intel.com
Start time skew across all tasks: 8.97 sec
Path: /mnt/lustre
FS: 37.9 TiB Used FS: 0.6% Inodes: 171.8 Mi Used Inodes: 32.0%
Participating tasks: 4
Summary:
api = POSIX
test filename = testFile
access = single-shared-file
pattern = segmented (1 segment)
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 4 (1 per node)
repetitions = 3
xfersize = 131072 bytes
blocksize = 50 GiB
aggregate filesize = 200 GiB
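As a quick sanity check (my own illustration, not part of the original report), the reported aggregate file size follows directly from the run geometry above:

```python
# Check the IOR run geometry reported above: 4 clients, -b 50G, -t 128k,
# segmented single-shared-file, so aggregate = clients * blocksize.
GIB = 1024 ** 3
clients = 4                 # "clients = 4 (1 per node)"
blocksize = 50 * GIB        # -b 50G
xfersize = 128 * 1024       # -t 128k

aggregate = clients * blocksize
assert aggregate == 200 * GIB           # matches "aggregate filesize = 200 GiB"

# Each task issues blocksize / xfersize transfers per repetition.
transfers_per_task = blocksize // xfersize
print(aggregate // GIB, transfers_per_task)
```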
No messages appeared on wolf-4 (OST1), but the clients noticed it:

Apr 4 23:37:12 wolf-6 kernel: Lustre: 2811:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1459813025/real 0] req@ffff880d8570ad00 x1530711614167736/t0(0) o400->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 224/224 e 0 to 1 dl 1459813032 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Apr 4 23:37:12 wolf-6 kernel: Lustre: coral-OST0001-osc-ffff881030662800: Connection to coral-OST0001 (at 192.168.1.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Apr 4 23:37:16 wolf-6 kernel: Lustre: 2751:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1459812992/real 1459812992] req@ffff880266e0bc00 x1530711614167672/t0(0) o3->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:6/4 lens 608/432 e 0 to 1 dl 1459813036 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Apr 4 23:37:16 wolf-6 kernel: Lustre: 2751:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Apr 4 23:37:18 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1459813032/real 0] req@ffff880bbe376900 x1530711614167740/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813038 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Apr 4 23:37:37 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813057/real 1459813057] req@ffff880bbe375d00 x1530711614167756/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813068 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Apr 4 23:38:02 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813082/real 1459813082] req@ffff880bbe375100 x1530711614167772/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813098 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Apr 4 23:38:11 wolf-6 ntpd_intres[2008]: ntp_intres.request: permission denied
Apr 4 23:38:11 wolf-6 ntpd_intres[2008]: ntp_intres.request: permission denied
Apr 4 23:38:27 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813107/real 1459813107] req@ffff880bbe374500 x1530711614167788/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813128 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Apr 4 23:38:52 wolf-6 kernel: Lustre: 2742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1459813132/real 1459813132] req@ffff880bbe373900 x1530711614167804/t0(0) o8->coral-OST0001-osc-ffff881030662800@192.168.1.4@o2ib:28/4 lens 520/544 e 0 to 1 dl 1459813158 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

From the stats being captured on OST1:

time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
23:30:53 0 0 0 0 0 0 0 0 0 200M 62G
23:31:29 0 0 0 0 0 0 0 0 0 200M 62G
23:32:06 0 0 0 0 0 0 0 0 0 200M 62G
23:32:42 0 0 0 0 0 0 0 0 0 200M 62G
23:33:19 0 0 0 0 0 0 0 0 0 200M 62G
23:33:55 0 0 0 0 0 0 0 0 0 62G 62G
23:34:36 0 0 0 0 0 0 0 0 0 62G 62G
23:35:18 10 10 100 10 100 0 0 0 0 62G 62G
23:35:58 91 91 100 47 100 44 100 1 100 63G 62G
23:36:39 5 5 100 5 100 0 0 0 0 62G 62G
6 1 0x01 91 4368 5368758848 527202119883608 name type data hits 4 91379399 misses 4 70802556 demand_data_hits 4 62879518 demand_data_misses 4 52414186 demand_metadata_hits 4 23738570 demand_metadata_misses 4 1105675 prefetch_data_hits 4 4757199 prefetch_data_misses 4 17281282 prefetch_metadata_hits 4 4112 prefetch_metadata_misses 4 1413 mru_hits 4 62351920 mru_ghost_hits 4 917736 mfu_hits 4 24266169 mfu_ghost_hits 4 632285 deleted 4 116650113 mutex_miss 4 447 evict_skip 4 184 evict_not_enough 4 62 evict_l2_cached 4 0 evict_l2_eligible 4 15019191377920 evict_l2_ineligible 4 408751927296 evict_l2_skip 4 0 hash_elements 4 1059177 hash_elements_max 4 1126648 hash_collisions 4 11087755 hash_chains 4 31686 hash_chain_max 4 4 p 4 63241464064 c 4 67457282048 c_min 4 33554432 c_max 4 67457282048 size 4 67456702536 hdr_size 4 396803920 data_size 4 66823651328 metadata_size 4 76845056 other_size 4 159402232 anon_size 4 39337984 anon_evictable_data 4 0 anon_evictable_metadata 4 0 mru_size 4 64780652544 mru_evictable_data 4 64752189440 mru_evictable_metadata 4 327680 mru_ghost_size 4 2679063040 mru_ghost_evictable_data 4 2082865152 mru_ghost_evictable_metadata 4 596197888 mfu_size 4 2080505856 mfu_evictable_data 4 2024931328 mfu_evictable_metadata 4 3112960 mfu_ghost_size 4 56092389888 mfu_ghost_evictable_data 4 56057790464 mfu_ghost_evictable_metadata 4 34599424 l2_hits 4 0 l2_misses 4 0 l2_feeds 4 0 l2_rw_clash 4 0 l2_read_bytes 4 0 l2_write_bytes 4 0 l2_writes_sent 4 0 l2_writes_done 4 0 l2_writes_error 4 0 l2_writes_lock_retry 4 0 l2_evict_lock_retry 4 0 l2_evict_reading 4 0 l2_evict_l1cached 4 0 l2_free_on_write 4 0 l2_cdata_free_on_write 4 0 l2_abort_lowmem 4 0 l2_cksum_bad 4 0 l2_io_error 4 0 l2_size 4 0 l2_asize 4 0 l2_hdr_size 4 0 l2_compress_successes 4 0 l2_compress_zeros 4 0 l2_compress_failures 4 0 memory_throttle_count 4 0 duplicate_buffers 4 0 duplicate_buffers_size 4 0 duplicate_reads 4 0 memory_direct_count 4 0 memory_indirect_count 4 0 arc_no_grow 4 0 
arc_tempreserve 4 0 arc_loaned_bytes 4 0 arc_prune 4 0 arc_meta_used 4 633051832 arc_meta_limit 4 50592961536 arc_meta_max 4 809615080 arc_meta_min 4 16777216 arc_need_free 4 0 arc_sys_free 4 2108039168 proc_arcstats.2016-04-04:21:52:28:misses 4 60302081 proc_arcstats.2016-04-04:21:53:05:misses 4 60302081 proc_arcstats.2016-04-04:21:53:41:misses 4 60302081 proc_arcstats.2016-04-04:21:54:18:misses 4 60302081 proc_arcstats.2016-04-04:21:54:55:misses 4 60302081 proc_arcstats.2016-04-04:21:55:31:misses 4 60302081 proc_arcstats.2016-04-04:21:56:08:misses 4 60302081 proc_arcstats.2016-04-04:21:56:44:misses 4 60302081 proc_arcstats.2016-04-04:21:57:21:misses 4 60302081 proc_arcstats.2016-04-04:21:57:57:misses 4 60302081 proc_arcstats.2016-04-04:21:58:34:misses 4 60302081 proc_arcstats.2016-04-04:21:59:11:misses 4 60302081 proc_arcstats.2016-04-04:21:59:47:misses 4 60302099 proc_arcstats.2016-04-04:22:00:25:misses 4 60302137 proc_arcstats.2016-04-04:22:01:06:misses 4 60374920 proc_arcstats.2016-04-04:22:01:47:misses 4 60513681 proc_arcstats.2016-04-04:22:02:27:misses 4 60644796 proc_arcstats.2016-04-04:22:03:08:misses 4 60782163 proc_arcstats.2016-04-04:22:03:49:misses 4 60921202 proc_arcstats.2016-04-04:22:04:30:misses 4 61049984 proc_arcstats.2016-04-04:22:05:10:misses 4 61131959 proc_arcstats.2016-04-04:22:05:51:misses 4 61133558 proc_arcstats.2016-04-04:22:06:32:misses 4 61203715 proc_arcstats.2016-04-04:22:07:13:misses 4 61317428 proc_arcstats.2016-04-04:22:07:54:misses 4 61406420 proc_arcstats.2016-04-04:22:08:35:misses 4 61520867 proc_arcstats.2016-04-04:22:09:15:misses 4 61659833 proc_arcstats.2016-04-04:22:09:56:misses 4 61796365 proc_arcstats.2016-04-04:22:10:37:misses 4 61905346 proc_arcstats.2016-04-04:22:11:17:misses 4 61977003 proc_arcstats.2016-04-04:22:11:56:misses 4 61977043 proc_arcstats.2016-04-04:22:12:37:misses 4 62083132 proc_arcstats.2016-04-04:22:13:17:misses 4 62215655 proc_arcstats.2016-04-04:22:13:57:misses 4 62348643 
proc_arcstats.2016-04-04:22:14:38:misses 4 62494512 proc_arcstats.2016-04-04:22:15:18:misses 4 62630944 proc_arcstats.2016-04-04:22:15:59:misses 4 62727592 proc_arcstats.2016-04-04:22:16:38:misses 4 62806912 proc_arcstats.2016-04-04:22:17:18:misses 4 62809594 proc_arcstats.2016-04-04:22:17:59:misses 4 62832835 proc_arcstats.2016-04-04:22:18:40:misses 4 62874008 proc_arcstats.2016-04-04:22:19:20:misses 4 62915527 proc_arcstats.2016-04-04:22:20:00:misses 4 62956780 proc_arcstats.2016-04-04:22:20:40:misses 4 62998788 proc_arcstats.2016-04-04:22:21:20:misses 4 63041800 proc_arcstats.2016-04-04:22:22:01:misses 4 63083422 proc_arcstats.2016-04-04:22:22:41:misses 4 63125355 proc_arcstats.2016-04-04:22:23:21:misses 4 63167419 proc_arcstats.2016-04-04:22:24:01:misses 4 63210702 proc_arcstats.2016-04-04:22:24:41:misses 4 63252813 proc_arcstats.2016-04-04:22:25:21:misses 4 63295877 proc_arcstats.2016-04-04:22:26:01:misses 4 63339516 proc_arcstats.2016-04-04:22:26:41:misses 4 63383531 proc_arcstats.2016-04-04:22:27:22:misses 4 63417625 proc_arcstats.2016-04-04:22:28:02:misses 4 63418160 proc_arcstats.2016-04-04:22:28:44:misses 4 63418448 proc_arcstats.2016-04-04:22:29:25:misses 4 63446807 proc_arcstats.2016-04-04:22:30:06:misses 4 63605174 proc_arcstats.2016-04-04:22:30:46:misses 4 63787444 proc_arcstats.2016-04-04:22:31:27:misses 4 63991757 proc_arcstats.2016-04-04:22:32:08:misses 4 64186064 proc_arcstats.2016-04-04:22:32:48:misses 4 64370507 proc_arcstats.2016-04-04:22:33:29:misses 4 64544830 proc_arcstats.2016-04-04:22:34:10:misses 4 64705200 proc_arcstats.2016-04-04:22:34:51:misses 4 64890908 proc_arcstats.2016-04-04:22:35:31:misses 4 65051186 proc_arcstats.2016-04-04:22:36:12:misses 4 65093562 proc_arcstats.2016-04-04:22:36:48:misses 4 65093566 proc_arcstats.2016-04-04:22:37:25:misses 4 65093566 proc_arcstats.2016-04-04:22:38:02:misses 4 65093566 proc_arcstats.2016-04-04:22:38:38:misses 4 65093566 proc_arcstats.2016-04-04:22:39:15:misses 4 65093566 
proc_arcstats.2016-04-04:22:39:51:misses 4 65093566 proc_arcstats.2016-04-04:22:40:28:misses 4 65093566 proc_arcstats.2016-04-04:22:41:05:misses 4 65093566 proc_arcstats.2016-04-04:22:41:41:misses 4 65093566 proc_arcstats.2016-04-04:22:42:18:misses 4 65093566 proc_arcstats.2016-04-04:22:42:54:misses 4 65093566 proc_arcstats.2016-04-04:22:43:31:misses 4 65093566 proc_arcstats.2016-04-04:22:44:07:misses 4 65093566 proc_arcstats.2016-04-04:22:44:44:misses 4 65093593 proc_arcstats.2016-04-04:22:45:24:misses 4 65093635 proc_arcstats.2016-04-04:22:46:05:misses 4 65093688 proc_arcstats.2016-04-04:22:46:47:misses 4 65204992 proc_arcstats.2016-04-04:22:47:27:misses 4 65369668 proc_arcstats.2016-04-04:22:48:08:misses 4 65543276 proc_arcstats.2016-04-04:22:48:49:misses 4 65700706 proc_arcstats.2016-04-04:22:49:29:misses 4 65835289 proc_arcstats.2016-04-04:22:50:10:misses 4 65978426 proc_arcstats.2016-04-04:22:50:51:misses 4 66102294 proc_arcstats.2016-04-04:22:51:32:misses 4 66237487 proc_arcstats.2016-04-04:22:52:12:misses 4 66372029 proc_arcstats.2016-04-04:22:52:53:misses 4 66491789 proc_arcstats.2016-04-04:22:53:33:misses 4 66606048 proc_arcstats.2016-04-04:22:54:14:misses 4 66728350 proc_arcstats.2016-04-04:22:54:54:misses 4 66759150 proc_arcstats.2016-04-04:22:55:35:misses 4 66783980 proc_arcstats.2016-04-04:22:56:16:misses 4 66882343 proc_arcstats.2016-04-04:22:56:56:misses 4 66986649 proc_arcstats.2016-04-04:22:57:37:misses 4 67099214 proc_arcstats.2016-04-04:22:58:18:misses 4 67236074 proc_arcstats.2016-04-04:22:58:58:misses 4 67359854 proc_arcstats.2016-04-04:22:59:39:misses 4 67479208 proc_arcstats.2016-04-04:23:00:19:misses 4 67603703 proc_arcstats.2016-04-04:23:01:00:misses 4 67605334 proc_arcstats.2016-04-04:23:01:42:misses 4 67687472 proc_arcstats.2016-04-04:23:02:22:misses 4 67823336 proc_arcstats.2016-04-04:23:03:03:misses 4 67950500 proc_arcstats.2016-04-04:23:03:44:misses 4 68083660 proc_arcstats.2016-04-04:23:04:24:misses 4 68221409 
proc_arcstats.2016-04-04:23:05:04:misses 4 68335712 proc_arcstats.2016-04-04:23:05:45:misses 4 68441748 proc_arcstats.2016-04-04:23:06:24:misses 4 68443307 proc_arcstats.2016-04-04:23:07:05:misses 4 68454637 proc_arcstats.2016-04-04:23:07:46:misses 4 68578571 proc_arcstats.2016-04-04:23:08:27:misses 4 68690534 proc_arcstats.2016-04-04:23:09:07:misses 4 68808692 proc_arcstats.2016-04-04:23:09:48:misses 4 68930912 proc_arcstats.2016-04-04:23:10:28:misses 4 69068534 proc_arcstats.2016-04-04:23:11:09:misses 4 69192434 proc_arcstats.2016-04-04:23:11:50:misses 4 69283782 proc_arcstats.2016-04-04:23:12:30:misses 4 69284522 proc_arcstats.2016-04-04:23:13:12:misses 4 69315159 proc_arcstats.2016-04-04:23:13:52:misses 4 69373955 proc_arcstats.2016-04-04:23:14:33:misses 4 69449303 proc_arcstats.2016-04-04:23:15:13:misses 4 69523499 proc_arcstats.2016-04-04:23:15:53:misses 4 69599070 proc_arcstats.2016-04-04:23:16:34:misses 4 69674993 proc_arcstats.2016-04-04:23:17:14:misses 4 69750733 proc_arcstats.2016-04-04:23:17:55:misses 4 69827307 proc_arcstats.2016-04-04:23:18:35:misses 4 69847427 proc_arcstats.2016-04-04:23:19:16:misses 4 69848231 proc_arcstats.2016-04-04:23:19:57:misses 4 69884019 proc_arcstats.2016-04-04:23:20:37:misses 4 69946465 proc_arcstats.2016-04-04:23:21:18:misses 4 70016284 proc_arcstats.2016-04-04:23:21:58:misses 4 70081329 proc_arcstats.2016-04-04:23:22:39:misses 4 70148143 proc_arcstats.2016-04-04:23:23:19:misses 4 70213937 proc_arcstats.2016-04-04:23:23:59:misses 4 70283057 proc_arcstats.2016-04-04:23:24:40:misses 4 70353187 proc_arcstats.2016-04-04:23:25:21:misses 4 70415880 proc_arcstats.2016-04-04:23:26:00:misses 4 70415996 proc_arcstats.2016-04-04:23:26:37:misses 4 70415996 proc_arcstats.2016-04-04:23:27:13:misses 4 70415996 proc_arcstats.2016-04-04:23:27:50:misses 4 70415996 proc_arcstats.2016-04-04:23:28:26:misses 4 70415996 proc_arcstats.2016-04-04:23:29:03:misses 4 70415996 proc_arcstats.2016-04-04:23:29:40:misses 4 70415996 
proc_arcstats.2016-04-04:23:30:16:misses 4 70415996 proc_arcstats.2016-04-04:23:30:53:misses 4 70415996 proc_arcstats.2016-04-04:23:31:29:misses 4 70415996 proc_arcstats.2016-04-04:23:32:06:misses 4 70415996 proc_arcstats.2016-04-04:23:32:42:misses 4 70415996 proc_arcstats.2016-04-04:23:33:19:misses 4 70415996 proc_arcstats.2016-04-04:23:33:55:misses 4 70416034 proc_arcstats.2016-04-04:23:34:36:misses 4 70416109 proc_arcstats.2016-04-04:23:35:18:misses 4 70417565 proc_arcstats.2016-04-04:23:35:58:misses 4 70597975 proc_arcstats.2016-04-04:23:36:39:misses 4 70802556 |
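The dump above is an arcstats-style "name type data" kstat snapshot. As an illustration (mine, not part of the ticket), its miss counters are internally consistent and imply an overall ARC miss rate of roughly 44%:

```python
# Cross-check the final arcstats snapshot above.
# Values copied verbatim from the dump.
hits = 91379399
misses = 70802556
demand_data_misses = 52414186
demand_metadata_misses = 1105675
prefetch_data_misses = 17281282
prefetch_metadata_misses = 1413

# The four miss categories should sum exactly to the total miss counter.
assert (demand_data_misses + demand_metadata_misses +
        prefetch_data_misses + prefetch_metadata_misses) == misses

# Overall ARC miss rate implied by the snapshot.
miss_pct = 100.0 * misses / (hits + misses)
print(f"{miss_pct:.1f}%")
```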
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 05/Apr/16 ] |
|
Hi John, can you please reset the ZFS release to commit zfs-0.6.4-92 and run it one more time? zfs-0.6.4-92 doesn't include the ARC change, so this will help us confirm whether that change is responsible for what we reproduced. |
| Comment by John Salinas (Inactive) [ 06/Apr/16 ] |
|
This seems to hit every 2-4 hours on my wolf nodes with IOR running. I had to reserve 2G of memory in order to get the dump to complete; otherwise it would fail with OOM errors while dumping and the dump would not complete. The good news is that this is solved; the bad news is that /var/crash is not big enough for a dump, so we only have a partial dump file: [15001.472312] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 |
| Comment by John Salinas (Inactive) [ 09/Apr/16 ] |
|
A full dump can be found at: /scratch/dumps/wolf-4.wolf.hpdd.intel.com/10.8.1.4-2016-04-09-16\:36\:15/ For now I am disabling the running of /bin/dbufstat.py (part of zfs) as part of our stat collection. |
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/16 ] |
|
From the arcstats, p is very close to c, and c equals c_max. This means there is no memory pressure at all, which can be verified by the direct and indirect reclaim counts in the stats. The mru ghost hits are only 917736, so I don't know what made p that large. John: if the crash dump is problematic due to size, we can reduce the amount of memory by appending 'mem=?G' to the boot line. |
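For illustration (mine, not part of the original comment), the fields referenced here can be pulled out of an arcstats dump of the "name type data" form captured in the description. A minimal sketch using the ticket's own values:

```python
# Parse /proc/spl/kstat/zfs/arcstats-style "name type data" triples and check
# the conditions discussed above: p close to c, c pinned at c_max, and no
# direct/indirect reclaim (i.e. no memory pressure).
sample = """\
p 4 63241464064
c 4 67457282048
c_min 4 33554432
c_max 4 67457282048
mru_ghost_hits 4 917736
mfu_ghost_hits 4 632285
memory_direct_count 4 0
memory_indirect_count 4 0
"""

def parse_arcstats(text):
    stats = {}
    for line in text.splitlines():
        name, _type, data = line.split()
        stats[name] = int(data)
    return stats

s = parse_arcstats(sample)
assert s["c"] == s["c_max"]            # ARC target size is pinned at its maximum
assert s["p"] / s["c"] > 0.9           # p is very close to c
assert s["memory_direct_count"] == 0   # no direct reclaim
assert s["memory_indirect_count"] == 0 # no indirect reclaim
print(round(s["p"] / s["c"], 3))
```

On a live OSS the same parser could read the real kstat file instead of the inline sample.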
| Comment by John Salinas (Inactive) [ 14/Apr/16 ] |
|
Yes, in the second comment I mention that I set mem=2G in order to get a dump. However, I believe the dump is due to a separate issue; I could be wrong on this. Please see the one full dump I have on the /scratch nfs dir on wolf. |
| Comment by John Salinas (Inactive) [ 14/Apr/16 ] |
|
wolf has been running today and the only error I have seen is this:

[26193.993322] LustreError: 7841:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[26194.007089] LustreError: 7841:0:(osd_handler.c:222:osd_trans_start()) Skipped 17 previous similar messages
[26195.013534] LustreError: 7812:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[26197.058425] LustreError: 7888:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5
[26197.072181] LustreError: 7888:0:(osd_handler.c:222:osd_trans_start()) Skipped 31 previous similar messages
[26202.409535] LustreError: 7824:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x142:0x0] from accounting ZAP for usr 0: rc = -52
[26202.473611] LustreError: 7824:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x142:0x0] from accounting ZAP for grp 0: rc = -52
[26213.811138] LustreError: 7823:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x141:0x0] from accounting ZAP for usr 0: rc = -52
[26213.834184] LustreError: 7823:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x141:0x0] from accounting ZAP for grp 0: rc = -52
[root@wolf-4 stat]#
[26390.320802] INFO: task ll_ost_io04_001:7405 blocked for more than 120 seconds.
[26390.331510] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26390.342516] ll_ost_io04_001 D 0000000000000000 0 7405 2 0x00000080 [26390.352479] ffff881fe19f3820 0000000000000046 ffff881fe19d0b80 ffff881fe19f3fd8 [26390.362848] ffff881fe19f3fd8 ffff881fe19f3fd8 ffff881fe19d0b80 ffff882023826188 [26390.373083] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26390.383196] Call Trace: [26390.387579] [<ffffffff8163aab9>] schedule+0x29/0x70 [26390.394805] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26390.403226] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26390.411237] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26390.418873] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26390.426784] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26390.434776] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26390.443327] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26390.451296] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26390.459814] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26390.467655] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26390.475793] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26390.483986] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26390.491214] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26390.500665] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26390.507766] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26390.515856] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26390.524919] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26390.533603] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26390.543120] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26390.551721] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26390.560263] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26390.568210] [<ffffffff810af028>] ? 
__wake_up_common+0x58/0x90 [26390.575685] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26390.583661] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26390.592883] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26390.599299] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26390.607549] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26390.614505] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26390.622721] INFO: task ll_ost_io06_001:7411 blocked for more than 120 seconds. [26390.631720] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26390.641427] ll_ost_io06_001 D 0000000000000000 0 7411 2 0x00000080 [26390.650360] ffff881fcfe83820 0000000000000046 ffff881fe19d5080 ffff881fcfe83fd8 [26390.659664] ffff881fcfe83fd8 ffff881fcfe83fd8 ffff881fe19d5080 ffff882023826188 [26390.668954] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26390.678219] Call Trace: [26390.681832] [<ffffffff8163aab9>] schedule+0x29/0x70 [26390.688244] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26390.696005] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26390.703365] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26390.710421] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26390.717767] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26390.725265] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26390.733340] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26390.740903] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26390.749078] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26390.756745] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26390.764682] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26390.772773] [<ffffffff810bb685>] ? sched_clock_cpu+0x85/0xc0 [26390.780139] [<ffffffffa0e797d0>] ? 
target_send_reply_msg+0x170/0x170 [ptlrpc] [26390.789160] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26390.797788] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26390.807286] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26390.815831] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26390.824360] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26390.832327] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26390.841543] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26390.847934] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26390.856187] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26390.863168] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26390.871421] INFO: task ll_ost_io06_002:7412 blocked for more than 120 seconds. [26390.880442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26390.890160] ll_ost_io06_002 D 0000000000000000 0 7412 2 0x00000080 [26390.899063] ffff881fcfea3820 0000000000000046 ffff881fe19d5c00 ffff881fcfea3fd8 [26390.908403] ffff881fcfea3fd8 ffff881fcfea3fd8 ffff881fe19d5c00 ffff882023826188 [26390.917719] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26390.927047] Call Trace: [26390.930756] [<ffffffff8163aab9>] schedule+0x29/0x70 [26390.937302] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26390.945182] [<ffffffff810a6ae0>] ? 
wake_up_atomic_t+0x30/0x30 [26390.952678] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26390.959892] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26390.967411] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26390.975079] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26390.983333] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26390.991091] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26390.999426] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26391.007107] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26391.015062] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26391.023154] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26391.030285] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26391.039640] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26391.046664] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26391.054686] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26391.063693] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26391.072310] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26391.081784] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26391.090329] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26391.098827] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26391.106759] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26391.114233] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26391.122197] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26391.131389] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26391.137766] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.145954] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26391.152640] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.160581] INFO: task ll_ost_io06_003:7798 blocked for more than 120 seconds. 
[26391.169290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26391.178691] ll_ost_io06_003 D 0000000000000000 0 7798 2 0x00000080 [26391.187282] ffff881fcf707820 0000000000000046 ffff880fe632b980 ffff881fcf707fd8 [26391.196297] ffff881fcf707fd8 ffff881fcf707fd8 ffff880fe632b980 ffff882023826188 [26391.205367] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26391.214356] Call Trace: [26391.217695] [<ffffffff8163aab9>] schedule+0x29/0x70 [26391.223932] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26391.231443] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26391.238673] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26391.245587] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26391.252750] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26391.260036] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26391.267960] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26391.275409] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26391.283418] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26391.290698] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26391.298359] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26391.306122] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26391.312957] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26391.321966] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26391.328705] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26391.336477] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26391.345249] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26391.353534] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26391.362639] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26391.370865] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26391.379030] [<ffffffff810b8c22>] ? 
default_wake_function+0x12/0x20 [26391.386624] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26391.393808] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26391.401443] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26391.410245] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26391.416274] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.424248] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26391.430967] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.438826] INFO: task ll_ost_io04_003:7800 blocked for more than 120 seconds. [26391.447481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26391.456819] ll_ost_io04_003 D 0000000000000000 0 7800 2 0x00000080 [26391.465395] ffff881fceb03820 0000000000000046 ffff880fe632d080 ffff881fceb03fd8 [26391.474374] ffff881fceb03fd8 ffff881fceb03fd8 ffff880fe632d080 ffff882023826188 [26391.483386] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26391.492473] Call Trace: [26391.495789] [<ffffffff8163aab9>] schedule+0x29/0x70 [26391.501972] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26391.509493] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26391.516677] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26391.523585] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26391.530798] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26391.538139] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26391.546029] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26391.553411] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26391.561413] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26391.568753] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26391.576388] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26391.584183] [<ffffffffa0e91e38>] ? ldlm_bl_timeout+0x18/0x30 [ptlrpc] [26391.592094] [<ffffffffa11a4745>] ? 
ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26391.601111] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26391.607880] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26391.615618] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26391.624310] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26391.632656] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26391.641806] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26391.650048] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26391.658298] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26391.665909] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26391.673105] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26391.680724] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26391.689587] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26391.695667] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.703689] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26391.710360] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.718298] INFO: task ll_ost_io04_005:7805 blocked for more than 120 seconds. [26391.727050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26391.736388] ll_ost_io04_005 D 0000000000000000 0 7805 2 0x00000080 [26391.744976] ffff881fcd54f820 0000000000000046 ffff880fe6328b80 ffff881fcd54ffd8 [26391.753892] ffff881fcd54ffd8 ffff881fcd54ffd8 ffff880fe6328b80 ffff882023826188 [26391.762929] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26391.771955] Call Trace: [26391.775349] [<ffffffff8163aab9>] schedule+0x29/0x70 [26391.781550] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26391.789027] [<ffffffff810a6ae0>] ? 
wake_up_atomic_t+0x30/0x30 [26391.796180] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26391.803088] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26391.810220] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26391.817518] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26391.825398] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26391.832827] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26391.840834] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26391.848185] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26391.855785] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26391.863569] [<ffffffffa0e91e38>] ? ldlm_bl_timeout+0x18/0x30 [ptlrpc] [26391.871459] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26391.880474] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26391.887226] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26391.894979] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26391.903711] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26391.912027] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26391.921212] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26391.929425] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26391.937700] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26391.945284] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26391.952467] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26391.960077] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26391.968982] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26391.975100] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.983005] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26391.989681] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26391.997598] INFO: task ll_ost_io06_005:7810 blocked for more than 120 seconds. 
[26392.006309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26392.015707] ll_ost_io06_005 D 0000000000000000 0 7810 2 0x00000080 [26392.024252] ffff881fc94fb820 0000000000000046 ffff882007258b80 ffff881fc94fbfd8 [26392.033325] ffff881fc94fbfd8 ffff881fc94fbfd8 ffff882007258b80 ffff882023826188 [26392.042335] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26392.051388] Call Trace: [26392.054740] [<ffffffff8163aab9>] schedule+0x29/0x70 [26392.060967] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26392.068507] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26392.075715] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26392.082626] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26392.089842] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26392.097207] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26392.105081] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26392.112560] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26392.120610] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26392.127927] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26392.135570] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26392.143385] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26392.150241] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26392.159307] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26392.166078] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26392.173820] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26392.182548] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26392.190893] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26392.200052] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26392.208379] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26392.216617] [<ffffffff810b8c22>] ? 
default_wake_function+0x12/0x20 [26392.224260] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26392.231463] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26392.239134] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26392.248015] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26392.254058] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.262001] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26392.268656] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.276534] INFO: task ll_ost_io06_006:7811 blocked for more than 120 seconds. [26392.285240] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26392.294631] ll_ost_io06_006 D 0000000000000000 0 7811 2 0x00000080 [26392.303192] ffff881fc4f4f820 0000000000000046 ffff88200725e780 ffff881fc4f4ffd8 [26392.312202] ffff881fc4f4ffd8 ffff881fc4f4ffd8 ffff88200725e780 ffff882023826188 [26392.321131] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26392.330084] Call Trace: [26392.333495] [<ffffffff8163aab9>] schedule+0x29/0x70 [26392.339767] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26392.347297] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26392.354490] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26392.361328] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26392.368491] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26392.375831] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26392.383699] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26392.391100] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26392.399078] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26392.406398] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26392.414017] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26392.421798] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26392.428654] [<ffffffffa11a4745>] ? 
ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26392.437717] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26392.444446] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26392.452174] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26392.460902] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26392.469160] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26392.478343] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26392.486615] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26392.494801] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26392.502410] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26392.509565] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26392.517186] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26392.526081] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26392.532152] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.540030] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26392.546706] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.554619] INFO: task ll_ost_io04_006:7812 blocked for more than 120 seconds. [26392.563275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26392.572639] ll_ost_io04_006 D 0000000000000000 0 7812 2 0x00000080 [26392.581244] ffff881fc2dbb820 0000000000000046 ffff88200725f300 ffff881fc2dbbfd8 [26392.590196] ffff881fc2dbbfd8 ffff881fc2dbbfd8 ffff88200725f300 ffff882023826188 [26392.599200] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26392.608179] Call Trace: [26392.611461] [<ffffffff8163aab9>] schedule+0x29/0x70 [26392.617615] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26392.625146] [<ffffffff810a6ae0>] ? 
wake_up_atomic_t+0x30/0x30 [26392.632336] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26392.639157] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26392.646282] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26392.653554] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26392.661465] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26392.668859] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26392.676864] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26392.684204] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26392.691773] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26392.699584] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26392.706366] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26392.715432] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26392.722155] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26392.729943] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26392.738687] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26392.746981] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26392.756134] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26392.764374] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26392.772528] [<ffffffff810b8c22>] ? default_wake_function+0x12/0x20 [26392.780169] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26392.787304] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26392.794941] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26392.803821] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26392.809884] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.817799] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26392.824441] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26392.832351] INFO: task ll_ost_io06_007:7814 blocked for more than 120 seconds. 
[26392.841032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [26392.850402] ll_ost_io06_007 D 0000000000000000 0 7814 2 0x00000080 [26392.858975] ffff882003b8f820 0000000000000046 ffff881ffb070b80 ffff882003b8ffd8 [26392.868112] ffff882003b8ffd8 ffff882003b8ffd8 ffff881ffb070b80 ffff882023826188 [26392.877083] ffff882023826150 ffff882023826190 ffff882023826178 0000000000000000 [26392.886070] Call Trace: [26392.889475] [<ffffffff8163aab9>] schedule+0x29/0x70 [26392.895733] [<ffffffffa0010fa5>] cv_wait_common+0x125/0x150 [spl] [26392.903261] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30 [26392.910430] [<ffffffffa0010fe5>] __cv_wait+0x15/0x20 [spl] [26392.917280] [<ffffffffa00d92f3>] dmu_tx_wait+0xb3/0x3c0 [zfs] [26392.924478] [<ffffffffa00d969a>] dmu_tx_assign+0x9a/0x510 [zfs] [26392.931789] [<ffffffffa10bf2ba>] osd_trans_start+0xaa/0x3c0 [osd_zfs] [26392.939703] [<ffffffffa11baeeb>] ofd_trans_start+0x6b/0xe0 [ofd] [26392.947081] [<ffffffffa11c0d43>] ofd_commitrw_write+0x7d3/0x1ad0 [ofd] [26392.955051] [<ffffffffa11c4c1d>] ofd_commitrw+0x51d/0xa40 [ofd] [26392.962394] [<ffffffffa0f41cee>] obd_commitrw+0x2ec/0x32f [ptlrpc] [26392.970013] [<ffffffffa0f219c1>] tgt_brw_write+0xea1/0x1650 [ptlrpc] [26392.977795] [<ffffffff81631c85>] ? __slab_free+0x10e/0x277 [26392.984610] [<ffffffffa11a4745>] ? ofd_prolong_one_lock.isra.36+0x115/0x1c0 [ofd] [26392.993681] [<ffffffff810c14bc>] ? update_curr+0xcc/0x150 [26393.000376] [<ffffffff810be2ee>] ? account_entity_dequeue+0xae/0xd0 [26393.008111] [<ffffffffa0e797d0>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [26393.016799] [<ffffffffa0f1d385>] tgt_request_handle+0x915/0x1320 [ptlrpc] [26393.025125] [<ffffffffa0eca02b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc] [26393.034281] [<ffffffffa0a5b608>] ? lc_watchdog_touch+0x68/0x180 [libcfs] [26393.042522] [<ffffffffa0ec88d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [26393.050759] [<ffffffff810b8c22>] ? 
default_wake_function+0x12/0x20 [26393.058388] [<ffffffff810af028>] ? __wake_up_common+0x58/0x90 [26393.065579] [<ffffffffa0ece0d0>] ptlrpc_main+0xa90/0x1db0 [ptlrpc] [26393.073196] [<ffffffffa0ecd640>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc] [26393.082080] [<ffffffff810a5aef>] kthread+0xcf/0xe0 [26393.088140] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [26393.096053] [<ffffffff81645a58>] ret_from_fork+0x58/0x90 [26393.102718] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 [27451.932819] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5 [27451.945718] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) Skipped 2 previous similar messages [27453.007869] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5 [27453.020784] LustreError: 7897:0:(osd_handler.c:222:osd_trans_start()) Skipped 1 previous similar message [27455.047314] LustreError: 7817:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5 [27462.623135] LustreError: 7821:0:(osd_handler.c:222:osd_trans_start()) lsdraid-OST0001: can't assign tx: rc = -5 [27462.624172] LustreError: 7823:0:(ofd_dev.c:1884:ofd_destroy_hdl()) lsdraid-OST0001: error destroying object [0x100010000:0x144:0x0]: -5 [27462.652652] LustreError: 7821:0:(osd_handler.c:222:osd_trans_start()) Skipped 1 previous similar message [28445.940895] perf interrupt took too long (5003 > 5000), lowering kernel.perf_event_max_sample_rate to 25000 [root@wolf-4 stat]# [32239.225932] LustreError: 7821:0:(osd_object.c:606:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x145:0x0] from accounting ZAP for usr 0: rc = -52 [32239.248290] LustreError: 7821:0:(osd_object.c:612:osd_object_destroy()) lsdraid-OST0001: failed to remove [0x100010000:0x145:0x0] from accounting ZAP for grp 0: rc = -52 |
| Comment by Jinshan Xiong (Inactive) [ 14/Apr/16 ] |
|
The message "lsdraid-OST0001: can't assign tx: rc = -5" seems serious. One thing I would do is check the status of the zpool. I will log into the node tomorrow and take a look. |
| Comment by John Salinas (Inactive) [ 05/May/16 ] |
|
Re-did the configuration so we now have 4 OSTs on 2 OSS nodes, like this:
NAME        STATE     READ WRITE CKSUM

After several hours we hit:

134217728 ns      4      819
[335587.853762] LNet: Service thread pid 62899 was inactive for 200.63s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
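For reference, pool health during a hang like this can be checked with standard ZFS on Linux tooling. This is only a sketch: the pool name `lsdraid` is an assumption inferred from the OST target names in the log messages, so substitute the actual zpool name on the OSS.

```shell
# Overall pool state, per-vdev error counters, and any files with errors
# (pool name "lsdraid" is assumed; replace with the real zpool name)
zpool status -v lsdraid

# Recent ZFS fault/error events, useful when dmu_tx_assign fails with -5 (EIO)
zpool events | tail -n 50

# Per-vdev I/O statistics, 3 samples at 5-second intervals
zpool iostat -v lsdraid 5 3
```

If `zpool status` shows the pool SUSPENDED or vdevs FAULTED, the -5 (EIO) from osd_trans_start would be expected, since the DMU cannot assign transactions against an unavailable pool.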
From the client perspective, things appear to have deadlocked and are not making progress anymore. Going to reboot OSS0/wolf-3. |
| Comment by John Salinas (Inactive) [ 13/Apr/17 ] |
|
This is over a year old – I don't recall hitting this on 2.9.0 with ZFS 0.7.0 RC3. |