Details
Type: Bug
Resolution: Fixed
Priority: Critical
Version: Lustre 2.9.0
Severity: 3
Description
This problem occurred during soak testing.
There are two suspicious call traces that lead me to suspect a deadlock in the code:
LNet: Service thread pid 30124 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
LNet: Skipped 3 previous similar messages
Pid: 30124, comm: ll_ost_io00_116
Call Trace:
 [<ffffffff81178260>] ? cache_alloc_refill+0x1c0/0x240
 [<ffffffff81291489>] ? cpumask_next_and+0x29/0x50
 [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8153c2e4>] ? down_read+0x24/0x30
 [<ffffffffa0f13917>] osd_attr_get+0x77/0x2e0 [osd_zfs]
 [<ffffffffa1022301>] ofd_attr_handle_ugid+0x161/0x430 [ofd]
 [<ffffffffa1027ebd>] ofd_write_attr_set+0xad/0x9e0 [ofd]
 [<ffffffffa102933f>] ofd_commitrw_write+0x23f/0x10a0 [ofd]
 [<ffffffffa102df1d>] ? ofd_fmd_find_nolock+0xad/0xd0 [ofd]
 [<ffffffffa102a75f>] ofd_commitrw+0x5bf/0xb10 [ofd]
 [<ffffffff81149461>] ? kzfree+0x31/0x40
 [<ffffffffa079d5c1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa0a17ad4>] obd_commitrw+0x114/0x380 [ptlrpc]
 [<ffffffffa0a20870>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
 [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
 [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffffa0974e20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
And then another one:
Pid: 29297, comm: ll_ost_io00_003
Call Trace:
 [<ffffffffa015208c>] ? mlx4_ib_post_send+0x4fc/0x1370 [mlx4_ib]
 [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8153c2e4>] ? down_read+0x24/0x30
 [<ffffffffa037e9b0>] dmu_buf_hold_array_by_dnode+0x50/0x4a0 [zfs]
 [<ffffffffa037f85b>] dmu_buf_hold_array_by_bonus+0x6b/0x90 [zfs]
 [<ffffffffa0f149af>] osd_bufs_get+0x4bf/0xb50 [osd_zfs]
 [<ffffffffa07c0af6>] ? lu_object_find+0x16/0x20 [obdclass]
 [<ffffffffa102893f>] ofd_preprw_read+0x14f/0x910 [ofd]
 [<ffffffffa102b008>] ofd_preprw+0x358/0x14f0 [ofd]
 [<ffffffffa0a1774f>] obd_preprw+0x10f/0x380 [ptlrpc]
 [<ffffffffa0a217c5>] tgt_brw_read+0x695/0x11d0 [ptlrpc]
 [<ffffffffa09e4194>] ? sptlrpc_svc_alloc_rs+0x74/0x360 [ptlrpc]
 [<ffffffffa09bab8b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
 [<ffffffffa09bacc6>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
 [<ffffffffa067ec8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff81539896>] ? schedule+0x176/0x3a0
 [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Two more:
ll_ost_io00_0 D 0000000000000003     0 29439      2 0x00000080
 ffff8803d6a1f6f0 0000000000000046 0000000000000000 ffff8803d6a1f6b4
 0000000000000000 ffff88043fe82400 00006a78551f8e00 ffff8800387f5a00
 0000000000002c62 0000000106f6922c ffff8803d6a1bad8 ffff8803d6a1ffd8
Call Trace:
 [<ffffffffa0314e5d>] cv_wait_common+0x11d/0x130 [spl]
 [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0314ec5>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa03757fb>] dbuf_read+0x3bb/0x850 [zfs]
 [<ffffffff8153afdb>] ? mutex_lock+0x2b/0x50
 [<ffffffffa0372a13>] ? dbuf_find+0x193/0x1a0 [zfs]
 [<ffffffffa0377c78>] dmu_buf_will_dirty+0x58/0xc0 [zfs]
 [<ffffffffa0377d37>] dbuf_new_size+0x57/0x1c0 [zfs]
 [<ffffffffa039259c>] dnode_set_blksz+0x30c/0x330 [zfs]
 [<ffffffffa037cbb8>] dmu_object_set_blocksize+0x48/0x70 [zfs]
 [<ffffffffa0f15e00>] osd_write_commit+0x770/0x970 [osd_zfs]
 [<ffffffffa102980b>] ofd_commitrw_write+0x70b/0x10a0 [ofd]
 [<ffffffffa102a75f>] ofd_commitrw+0x5bf/0xb10 [ofd]
 [<ffffffff81149461>] ? kzfree+0x31/0x40
 [<ffffffffa079d5c1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa0a17ad4>] obd_commitrw+0x114/0x380 [ptlrpc]
 [<ffffffffa0a20870>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
 [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
 [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffffa0974e20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff81539896>] ? schedule+0x176/0x3a0
 [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
ll_ost_io00_1 D 0000000000000005     0 29763      2 0x00000080
 ffff88021676b768 0000000000000046 0000000000000000 ffff88021676b72c
 ffff880400000000 ffff88043fe82800 00006a78551f88c2 ffff8800387f5a00
 0000000000002c58 0000000106f6922c ffff8803d27b9ad8 ffff88021676bfd8
Call Trace:
 [<ffffffffa030db8e>] ? spl_kmem_zalloc+0x8e/0x190 [spl]
 [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
 [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8153c2e4>] ? down_read+0x24/0x30
 [<ffffffffa0376bb1>] dbuf_dirty+0x381/0x980 [zfs]
 [<ffffffffa037728a>] dbuf_assign_arcbuf+0xda/0x2b0 [zfs]
 [<ffffffffa037f679>] dmu_assign_arcbuf+0x149/0x1f0 [zfs]
 [<ffffffffa0f15ac1>] osd_write_commit+0x431/0x970 [osd_zfs]
 [<ffffffffa102980b>] ofd_commitrw_write+0x70b/0x10a0 [ofd]
 [<ffffffffa102a75f>] ofd_commitrw+0x5bf/0xb10 [ofd]
 [<ffffffff81149461>] ? kzfree+0x31/0x40
 [<ffffffffa079d5c1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa0a17ad4>] obd_commitrw+0x114/0x380 [ptlrpc]
 [<ffffffffa0a20870>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
 [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
 [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffffa0974e20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Due to this issue, the other service threads cannot make progress, because this tx can never complete.
The node died while I was trying to collect more information. However, these call traces are enough to start an initial analysis.
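To make the suspected deadlock concrete, here is a minimal user-space sketch of the pattern the traces appear to show. This is hypothetical pthreads code, not the actual osd-zfs/ZFS locking; all names (resize_thread, io_thread, struct_lock, buf_ready) are made up for illustration. One thread takes a read/write lock as writer and then sleeps on a condition variable, while the threads that could deliver the wakeup are themselves blocked taking the same lock for read, so nothing ever wakes the sleeper and every later I/O thread piles up behind the held write lock:

/*
 * Hypothetical, minimal reproduction of the suspected pattern; NOT the
 * actual Lustre/ZFS code. struct_lock stands in for the rwsem the readers
 * are blocked on, buf_ready for the condition being waited on in
 * cv_wait_common().
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t struct_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  cv_lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   buf_ready   = PTHREAD_COND_INITIALIZER;
static bool             ready;

/* Stands in for the thread stuck in dnode_set_blksz() -> dbuf_read() -> cv_wait(). */
static void *resize_thread(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&struct_lock);	/* write lock held... */
	pthread_mutex_lock(&cv_lock);
	while (!ready)				/* ...while sleeping on a condition variable */
		pthread_cond_wait(&buf_ready, &cv_lock);
	pthread_mutex_unlock(&cv_lock);
	pthread_rwlock_unlock(&struct_lock);
	return NULL;
}

/* Stands in for the I/O threads stuck in rwsem_down_read_failed(). */
static void *io_thread(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&struct_lock);	/* blocks forever: the writer never releases */
	pthread_mutex_lock(&cv_lock);
	ready = true;				/* the wakeup that can never be delivered */
	pthread_cond_signal(&buf_ready);
	pthread_mutex_unlock(&cv_lock);
	pthread_rwlock_unlock(&struct_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, resize_thread, NULL);
	sleep(1);				/* let the "writer" win the race */
	pthread_create(&b, NULL, io_thread, NULL);
	printf("both threads started; the join below never returns\n");
	pthread_join(a, NULL);			/* deadlock: nobody can wake resize_thread */
	pthread_join(b, NULL);
	return 0;
}

If the real dependency has this shape, for example the thread in dnode_set_blksz() holding the dnode's rwsem as writer while dbuf_read() waits on something that can only be satisfied after one of the readers blocked in down_read() makes progress, that would explain both the 1200s watchdog dump and why none of the threads ever recover. This is an assumption to be verified against the osd-zfs and ZFS code, not a confirmed root cause.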