[LU-8791] Suspicious deadlock on OST stack Created: 02/Nov/16  Updated: 07/Nov/16  Resolved: 07/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Jinshan Xiong (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: soak

Attachments: File lola-10.lfsck.full-debug.txt.gz     File lola-10.lfsck.hang.txt.gz     File lola-5.mount.fulldebug.txt.gz     File lola-5.mount.hang.txt.gz     Text File lola-8.lfsck.hang.txt.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This problem occurred during soak testing.

There are two suspicious call traces that lead me to suspect a deadlock in the code:

LNet: Service thread pid 30124 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
LNet: Skipped 3 previous similar messages
Pid: 30124, comm: ll_ost_io00_116

Call Trace:
 [<ffffffff81178260>] ? cache_alloc_refill+0x1c0/0x240
 [<ffffffff81291489>] ? cpumask_next_and+0x29/0x50
 [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8153c2e4>] ? down_read+0x24/0x30
 [<ffffffffa0f13917>] osd_attr_get+0x77/0x2e0 [osd_zfs]
 [<ffffffffa1022301>] ofd_attr_handle_ugid+0x161/0x430 [ofd]
 [<ffffffffa1027ebd>] ofd_write_attr_set+0xad/0x9e0 [ofd]
 [<ffffffffa102933f>] ofd_commitrw_write+0x23f/0x10a0 [ofd]
 [<ffffffffa102df1d>] ? ofd_fmd_find_nolock+0xad/0xd0 [ofd]
 [<ffffffffa102a75f>] ofd_commitrw+0x5bf/0xb10 [ofd]
 [<ffffffff81149461>] ? kzfree+0x31/0x40
 [<ffffffffa079d5c1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa0a17ad4>] obd_commitrw+0x114/0x380 [ptlrpc]
 [<ffffffffa0a20870>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
 [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
 [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffffa0974e20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

And then another one, from pid 29297 (comm: ll_ost_io00_003):

Call Trace:
 [<ffffffffa015208c>] ? mlx4_ib_post_send+0x4fc/0x1370 [mlx4_ib]
 [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8153c2e4>] ? down_read+0x24/0x30
 [<ffffffffa037e9b0>] dmu_buf_hold_array_by_dnode+0x50/0x4a0 [zfs]
 [<ffffffffa037f85b>] dmu_buf_hold_array_by_bonus+0x6b/0x90 [zfs]
 [<ffffffffa0f149af>] osd_bufs_get+0x4bf/0xb50 [osd_zfs]
 [<ffffffffa07c0af6>] ? lu_object_find+0x16/0x20 [obdclass]
 [<ffffffffa102893f>] ofd_preprw_read+0x14f/0x910 [ofd]
 [<ffffffffa102b008>] ofd_preprw+0x358/0x14f0 [ofd]
 [<ffffffffa0a1774f>] obd_preprw+0x10f/0x380 [ptlrpc]
 [<ffffffffa0a217c5>] tgt_brw_read+0x695/0x11d0 [ptlrpc]
 [<ffffffffa09e4194>] ? sptlrpc_svc_alloc_rs+0x74/0x360 [ptlrpc]
 [<ffffffffa09bab8b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
 [<ffffffffa09bacc6>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
 [<ffffffffa067ec8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff81539896>] ? schedule+0x176/0x3a0
 [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

Two more:

ll_ost_io00_0 D 0000000000000003     0 29439      2 0x00000080
 ffff8803d6a1f6f0 0000000000000046 0000000000000000 ffff8803d6a1f6b4
 0000000000000000 ffff88043fe82400 00006a78551f8e00 ffff8800387f5a00
 0000000000002c62 0000000106f6922c ffff8803d6a1bad8 ffff8803d6a1ffd8
Call Trace:
 [<ffffffffa0314e5d>] cv_wait_common+0x11d/0x130 [spl]
 [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0314ec5>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa03757fb>] dbuf_read+0x3bb/0x850 [zfs]
 [<ffffffff8153afdb>] ? mutex_lock+0x2b/0x50
 [<ffffffffa0372a13>] ? dbuf_find+0x193/0x1a0 [zfs]
 [<ffffffffa0377c78>] dmu_buf_will_dirty+0x58/0xc0 [zfs]
 [<ffffffffa0377d37>] dbuf_new_size+0x57/0x1c0 [zfs]
 [<ffffffffa039259c>] dnode_set_blksz+0x30c/0x330 [zfs]
 [<ffffffffa037cbb8>] dmu_object_set_blocksize+0x48/0x70 [zfs]
 [<ffffffffa0f15e00>] osd_write_commit+0x770/0x970 [osd_zfs]
 [<ffffffffa102980b>] ofd_commitrw_write+0x70b/0x10a0 [ofd]
 [<ffffffffa102a75f>] ofd_commitrw+0x5bf/0xb10 [ofd]
 [<ffffffff81149461>] ? kzfree+0x31/0x40
 [<ffffffffa079d5c1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa0a17ad4>] obd_commitrw+0x114/0x380 [ptlrpc]
 [<ffffffffa0a20870>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
 [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
 [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffffa0974e20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff81539896>] ? schedule+0x176/0x3a0
 [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
ll_ost_io00_1 D 0000000000000005     0 29763      2 0x00000080
 ffff88021676b768 0000000000000046 0000000000000000 ffff88021676b72c
 ffff880400000000 ffff88043fe82800 00006a78551f88c2 ffff8800387f5a00
 0000000000002c58 0000000106f6922c ffff8803d27b9ad8 ffff88021676bfd8
Call Trace:
 [<ffffffffa030db8e>] ? spl_kmem_zalloc+0x8e/0x190 [spl]
 [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
 [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8153c2e4>] ? down_read+0x24/0x30
 [<ffffffffa0376bb1>] dbuf_dirty+0x381/0x980 [zfs]
 [<ffffffffa037728a>] dbuf_assign_arcbuf+0xda/0x2b0 [zfs]
 [<ffffffffa037f679>] dmu_assign_arcbuf+0x149/0x1f0 [zfs]
 [<ffffffffa0f15ac1>] osd_write_commit+0x431/0x970 [osd_zfs]
 [<ffffffffa102980b>] ofd_commitrw_write+0x70b/0x10a0 [ofd]
 [<ffffffffa102a75f>] ofd_commitrw+0x5bf/0xb10 [ofd]
 [<ffffffff81149461>] ? kzfree+0x31/0x40
 [<ffffffffa079d5c1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa0a17ad4>] obd_commitrw+0x114/0x380 [ptlrpc]
 [<ffffffffa0a20870>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
 [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
 [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffffa0974e20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

Due to this deadlock, other service threads cannot make progress either, because the stuck tx can never complete.

The node died while I was trying to collect more information. However, these call traces are enough to start an initial analysis.



 Comments   
Comment by Cliff White (Inactive) [ 02/Nov/16 ]

At the time the hang was discovered, I dumped debug logs on lola-8 (MGS), lola-5 (OSS) and lola-10 (MDS doing recovery).
I then set debug = -1, waited five minutes and re-dumped the OSS and MDS logs. Attached.

Comment by Cliff White (Inactive) [ 02/Nov/16 ]

Lustre version: 2.8.59_62_g165c308
commit 165c3082079d48f70b6e340ad23514e0e7524978

Comment by Cliff White (Inactive) [ 02/Nov/16 ]

Sequence of events:
Noticed excessive reconnection attempts from lola-10 to lola-5:
Nov 1 22:03:32 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
Nov 1 22:04:22 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
Nov 1 22:05:12 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
Nov 1 22:06:02 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
Nov 1 22:06:52 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
On MDT:
Nov 1 17:02:28 lola-10 kernel: LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation mds_connect to node 192.168.1.108@o2ib10 failed: rc = -114
Nov 1 17:03:18 lola-10 kernel: LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation mds_connect to node 192.168.1.108@o2ib10 failed: rc = -114
Nov 1 17:04:08 lola-10 kernel: LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation mds_connect to node 192.168.1.108@o2ib10 failed: rc = -114
Nov 1 17:04:58 lola-10 kernel: LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation mds_connect to node 192.168.1.108@o2ib10 failed: rc = -114
Examination revealed lola-5 in a hung state, apparently due to the deadlock shown above.

Comment by Jinshan Xiong (Inactive) [ 03/Nov/16 ]

It turned out that the root cause of the deadlock is a block size change and a write happening on the same object at the same time.

Thread 1:

osd_grow_blocksize() with osd_object::oo_guard held
  -> dmu_object_set_blocksize()
    -> dnode_set_blksz(), with dnode_t::dn_struct_rwlock write lock held
      -> dbuf_new_size()
        -> dmu_buf_will_dirty()
          -> dbuf_read()
            -> wait for the dbuf state to change to DB_CACHED

Thread 2:

osd_write_commit()
  -> dmu_assign_arcbuf()
    -> dbuf_assign_arcbuf(), set dbuf state to DB_FILL
      -> dbuf_dirty()
        -> try to hold the read lock of dnode_t::dn_struct_rwlock

In summary, thread 2 sets the dbuf state to DB_FILL and then tries to grab dn_struct_rwlock to complete the assign, while thread 1 holds dn_struct_rwlock and waits for the dbuf state to change to DB_CACHED. Deadlock.
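
To make the interleaving concrete, here is a simplified user-space sketch of the same lock-ordering pattern. The pthread primitives, state values and function names are stand-ins for dn_struct_rwlock, the dbuf mutex/condvar and the dbuf states; this illustrates the race, it is not the actual ZFS code:

/* Simplified illustration of the deadlock; all names are stand-ins. */
#include <pthread.h>
#include <unistd.h>

#define DB_CACHED 1
#define DB_FILL   2

static pthread_rwlock_t dn_struct_rwlock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  db_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   db_changed = PTHREAD_COND_INITIALIZER;
static int db_state = DB_CACHED;

/* Thread 1: osd_grow_blocksize() -> dnode_set_blksz() -> dbuf_read() */
static void *blocksize_change(void *arg)
{
	(void)arg;
	/* dnode_set_blksz() takes dn_struct_rwlock as a writer ... */
	pthread_rwlock_wrlock(&dn_struct_rwlock);
	/* ... and dbuf_read() then waits for the dbuf to reach DB_CACHED */
	pthread_mutex_lock(&db_mtx);
	while (db_state != DB_CACHED)
		pthread_cond_wait(&db_changed, &db_mtx);
	pthread_mutex_unlock(&db_mtx);
	pthread_rwlock_unlock(&dn_struct_rwlock);
	return NULL;
}

/* Thread 2: osd_write_commit() -> dbuf_assign_arcbuf() -> dbuf_dirty() */
static void *object_write(void *arg)
{
	(void)arg;
	/* dbuf_assign_arcbuf() moves the dbuf to DB_FILL ... */
	pthread_mutex_lock(&db_mtx);
	db_state = DB_FILL;
	pthread_mutex_unlock(&db_mtx);
	usleep(200000);		/* window in which thread 1 gets the wrlock */
	/* ... then dbuf_dirty() needs dn_struct_rwlock as a reader, but
	 * thread 1 holds it as a writer and will not release it until the
	 * state below is restored: neither thread can proceed. */
	pthread_rwlock_rdlock(&dn_struct_rwlock);
	pthread_mutex_lock(&db_mtx);
	db_state = DB_CACHED;
	pthread_cond_broadcast(&db_changed);
	pthread_mutex_unlock(&db_mtx);
	pthread_rwlock_unlock(&dn_struct_rwlock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t2, NULL, object_write, NULL);
	usleep(50000);		/* let thread 2 set DB_FILL first */
	pthread_create(&t1, NULL, blocksize_change, NULL);
	pthread_join(t1, NULL);	/* with this interleaving, never returns */
	pthread_join(t2, NULL);
	return 0;
}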

To fix this problem, we can prevent further writes to the object while its block size is being changed. I propose holding the read lock of osd_object::oo_guard in osd_write_commit().
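
In the same simplified pthread model as above, the proposed change amounts to an outer per-object lock, oo_guard, taken exclusively around the blocksize change (assuming the hold of oo_guard in osd_grow_blocksize() mentioned above is exclusive) and as a reader around the write path. This is only a sketch of the idea, not the actual patch:

#include <pthread.h>

static pthread_rwlock_t oo_guard = PTHREAD_RWLOCK_INITIALIZER;

/* osd_grow_blocksize() path: exclusive on the object for the resize. */
static void grow_blocksize_fixed(void)
{
	pthread_rwlock_wrlock(&oo_guard);
	/* dmu_object_set_blocksize() / dnode_set_blksz() happen here */
	pthread_rwlock_unlock(&oo_guard);
}

/* osd_write_commit() path: shared on the object, so writes can still
 * run concurrently with each other, just never with a resize. */
static void write_commit_fixed(void)
{
	pthread_rwlock_rdlock(&oo_guard);
	/* dmu_assign_arcbuf() / dbuf_dirty() happen here */
	pthread_rwlock_unlock(&oo_guard);
}

Taking oo_guard for read in the write path keeps write concurrency intact while serializing writes against the comparatively rare blocksize change.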

Patch is coming soon.

Comment by Gerrit Updater [ 03/Nov/16 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/23550
Subject: LU-8791 osd-zfs: hold oo_guard read lock for object write
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b8811a0a02c801cc85ee07cb7e68b8514a14b031

Comment by Gerrit Updater [ 07/Nov/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23550/
Subject: LU-8791 osd-zfs: hold oo_guard read lock for object write
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dde6d59ad5c7bb915900bbbe471104ab3c4d3629

Comment by Peter Jones [ 07/Nov/16 ]

Landed for 2.9
