Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.9.0
    • Fix Version/s: Lustre 2.9.0
    • 3
    • 9223372036854775807

    Description

      This problem occurred during soak testing.

      There are two suspicious call traces which lead me to suspect there is a deadlock in the code:

      LNet: Service thread pid 30124 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      LNet: Skipped 3 previous similar messages
      Pid: 30124, comm: ll_ost_io00_116
      
      Call Trace:
       [<ffffffff81178260>] ? cache_alloc_refill+0x1c0/0x240
       [<ffffffff81291489>] ? cpumask_next_and+0x29/0x50
       [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
       [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
       [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
       [<ffffffff8153c2e4>] ? down_read+0x24/0x30
       [<ffffffffa0f13917>] osd_attr_get+0x77/0x2e0 [osd_zfs]
       [<ffffffffa1022301>] ofd_attr_handle_ugid+0x161/0x430 [ofd]
       [<ffffffffa1027ebd>] ofd_write_attr_set+0xad/0x9e0 [ofd]
       [<ffffffffa102933f>] ofd_commitrw_write+0x23f/0x10a0 [ofd]
       [<ffffffffa102df1d>] ? ofd_fmd_find_nolock+0xad/0xd0 [ofd]
       [<ffffffffa102a75f>] ofd_commitrw+0x5bf/0xb10 [ofd]
       [<ffffffff81149461>] ? kzfree+0x31/0x40
       [<ffffffffa079d5c1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
       [<ffffffffa0a17ad4>] obd_commitrw+0x114/0x380 [ptlrpc]
       [<ffffffffa0a20870>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
       [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
       [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
       [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
       [<ffffffffa0974e20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
       [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
       [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
       [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
       [<ffffffff810a138e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff810a12f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      And then another one:

      Call Trace:
       [<ffffffffa015208c>] ? mlx4_ib_post_send+0x4fc/0x1370 [mlx4_ib]
       [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
       [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
       [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
       [<ffffffff8153c2e4>] ? down_read+0x24/0x30
       [<ffffffffa037e9b0>] dmu_buf_hold_array_by_dnode+0x50/0x4a0 [zfs]
       [<ffffffffa037f85b>] dmu_buf_hold_array_by_bonus+0x6b/0x90 [zfs]
       [<ffffffffa0f149af>] osd_bufs_get+0x4bf/0xb50 [osd_zfs]
       [<ffffffffa07c0af6>] ? lu_object_find+0x16/0x20 [obdclass]
       [<ffffffffa102893f>] ofd_preprw_read+0x14f/0x910 [ofd]
       [<ffffffffa102b008>] ofd_preprw+0x358/0x14f0 [ofd]
       [<ffffffffa0a1774f>] obd_preprw+0x10f/0x380 [ptlrpc]
       [<ffffffffa0a217c5>] tgt_brw_read+0x695/0x11d0 [ptlrpc]
       [<ffffffffa09e4194>] ? sptlrpc_svc_alloc_rs+0x74/0x360 [ptlrpc]
       [<ffffffffa09bab8b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
       [<ffffffffa09bacc6>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
       [<ffffffffa067ec8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
       [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
       [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
       [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
       [<ffffffff81539896>] ? schedule+0x176/0x3a0
       [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
       [<ffffffff810a138e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff810a12f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      
      Pid: 29297, comm: ll_ost_io00_003
      

      Two more:

      ll_ost_io00_0 D 0000000000000003     0 29439      2 0x00000080
       ffff8803d6a1f6f0 0000000000000046 0000000000000000 ffff8803d6a1f6b4
       0000000000000000 ffff88043fe82400 00006a78551f8e00 ffff8800387f5a00
       0000000000002c62 0000000106f6922c ffff8803d6a1bad8 ffff8803d6a1ffd8
      Call Trace:
       [<ffffffffa0314e5d>] cv_wait_common+0x11d/0x130 [spl]
       [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0314ec5>] __cv_wait+0x15/0x20 [spl]
       [<ffffffffa03757fb>] dbuf_read+0x3bb/0x850 [zfs]
       [<ffffffff8153afdb>] ? mutex_lock+0x2b/0x50
       [<ffffffffa0372a13>] ? dbuf_find+0x193/0x1a0 [zfs]
       [<ffffffffa0377c78>] dmu_buf_will_dirty+0x58/0xc0 [zfs]
       [<ffffffffa0377d37>] dbuf_new_size+0x57/0x1c0 [zfs]
       [<ffffffffa039259c>] dnode_set_blksz+0x30c/0x330 [zfs]
       [<ffffffffa037cbb8>] dmu_object_set_blocksize+0x48/0x70 [zfs]
       [<ffffffffa0f15e00>] osd_write_commit+0x770/0x970 [osd_zfs]
       [<ffffffffa102980b>] ofd_commitrw_write+0x70b/0x10a0 [ofd]
       [<ffffffffa102a75f>] ofd_commitrw+0x5bf/0xb10 [ofd]
       [<ffffffff81149461>] ? kzfree+0x31/0x40
       [<ffffffffa079d5c1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
       [<ffffffffa0a17ad4>] obd_commitrw+0x114/0x380 [ptlrpc]
       [<ffffffffa0a20870>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
       [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
       [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
       [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
       [<ffffffffa0974e20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
       [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
       [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
       [<ffffffff81539896>] ? schedule+0x176/0x3a0
       [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
       [<ffffffff810a138e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff810a12f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      
      ll_ost_io00_1 D 0000000000000005     0 29763      2 0x00000080
       ffff88021676b768 0000000000000046 0000000000000000 ffff88021676b72c
       ffff880400000000 ffff88043fe82800 00006a78551f88c2 ffff8800387f5a00
       0000000000002c58 0000000106f6922c ffff8803d27b9ad8 ffff88021676bfd8
      Call Trace:
       [<ffffffffa030db8e>] ? spl_kmem_zalloc+0x8e/0x190 [spl]
       [<ffffffff8153cc55>] rwsem_down_failed_common+0x95/0x1d0
       [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
       [<ffffffff8153cde6>] rwsem_down_read_failed+0x26/0x30
       [<ffffffff8129e8f4>] call_rwsem_down_read_failed+0x14/0x30
       [<ffffffff8153c2e4>] ? down_read+0x24/0x30
       [<ffffffffa0376bb1>] dbuf_dirty+0x381/0x980 [zfs]
       [<ffffffffa037728a>] dbuf_assign_arcbuf+0xda/0x2b0 [zfs]
       [<ffffffffa037f679>] dmu_assign_arcbuf+0x149/0x1f0 [zfs]
       [<ffffffffa0f15ac1>] osd_write_commit+0x431/0x970 [osd_zfs]
       [<ffffffffa102980b>] ofd_commitrw_write+0x70b/0x10a0 [ofd]
       [<ffffffffa102a75f>] ofd_commitrw+0x5bf/0xb10 [ofd]
       [<ffffffff81149461>] ? kzfree+0x31/0x40
       [<ffffffffa079d5c1>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
       [<ffffffffa0a17ad4>] obd_commitrw+0x114/0x380 [ptlrpc]
       [<ffffffffa0a20870>] tgt_brw_write+0xc70/0x1530 [ptlrpc]
       [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
       [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
       [<ffffffff810674be>] ? try_to_wake_up+0x24e/0x3e0
       [<ffffffffa0974e20>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa0a1f0ac>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
       [<ffffffffa09cb7b1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
       [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
       [<ffffffffa09caa80>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
       [<ffffffff810a138e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff810a12f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Due to this issue, other service threads cannot make progress because this tx can never complete.

      The node died while I was trying to collect more information. However, these call traces are enough to start an initial analysis.

      Attachments

        1. lola-10.lfsck.full-debug.txt.gz
          4.50 MB
        2. lola-10.lfsck.hang.txt.gz
          94 kB
        3. lola-5.mount.fulldebug.txt.gz
          4.49 MB
        4. lola-5.mount.hang.txt.gz
          2.52 MB
        5. lola-8.lfsck.hang.txt.gz
          0.2 kB

        Activity

          [LU-8791] Suspicious deadlock on OST stack
          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23550/
          Subject: LU-8791 osd-zfs: hold oo_guard read lock for object write
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: dde6d59ad5c7bb915900bbbe471104ab3c4d3629


          gerrit Gerrit Updater added a comment -

          Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/23550
          Subject: LU-8791 osd-zfs: hold oo_guard read lock for object write
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: b8811a0a02c801cc85ee07cb7e68b8514a14b031

          jay Jinshan Xiong (Inactive) added a comment - edited

          It turned out that the root cause of the deadlock is a block size change and an object write racing on the same object at the same time.

          Thread 1:

          osd_grow_blocksize() with osd_object::oo_guard held
            -> dmu_object_set_blocksize()
              -> dnode_set_blksz(), with dnode_t::dn_struct_rwlock write lock held
                -> dbuf_new_size()
                  -> dmu_buf_will_dirty()
                    -> dbuf_read()
                      -> wait for the dbuf state to change to DB_CACHED
          

          Thread 2:

          osd_write_commit()
            -> dmu_assign_arcbuf()
              -> dbuf_assign_arcbuf(), set dbuf state to DB_FILL
                -> dbuf_dirty()
                  -> try to hold the read lock of dnode_t::dn_struct_rwlock
          

          In summary, thread 2 sets the dbuf to DB_FILL and then tries to grab dn_struct_rwlock to complete the assignment, while thread 1 holds dn_struct_rwlock and waits for the dbuf state to change to DB_CACHED. Deadlocked.

          In order to fix this problem, we can prevent further writes from happening to the same object while its block size is being changed. I propose holding the read lock of osd_object::oo_guard in osd_write_commit().

          Patch is coming soon.
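
          For illustration only, here is a minimal userspace model of the pattern described above; it is not the actual Lustre or ZFS code, and every name in it is a hypothetical stand-in. A POSIX rwlock plays the role of dnode_t::dn_struct_rwlock, a mutex/condvar pair plays the role of the dbuf state machine, and an outer rwlock plays the role of osd_object::oo_guard as used in the proposed fix. If the four oo_guard calls are removed, the two threads can interleave into exactly the hang seen in the stack traces.

          /*
           * Minimal userspace model of the deadlock and the proposed fix.
           * All names are stand-ins, not the real Lustre/ZFS symbols:
           *   dn_struct_rwlock        ~ dnode_t::dn_struct_rwlock
           *   db_state, db_mtx, db_cv ~ the dbuf state machine (DB_FILL -> DB_CACHED)
           *   oo_guard                ~ osd_object::oo_guard (proposed serialization)
           * Build with: cc -pthread model.c
           */
          #include <pthread.h>
          #include <stdio.h>

          enum db_state { DB_UNCACHED, DB_FILL, DB_CACHED };

          static pthread_rwlock_t oo_guard         = PTHREAD_RWLOCK_INITIALIZER;
          static pthread_rwlock_t dn_struct_rwlock = PTHREAD_RWLOCK_INITIALIZER;
          static pthread_mutex_t  db_mtx           = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t   db_cv            = PTHREAD_COND_INITIALIZER;
          static enum db_state    db_state         = DB_UNCACHED;

          /* "Thread 1": block size change.  Takes dn_struct_rwlock for write,
           * then behaves like dbuf_read(): wait while someone else is mid-fill,
           * otherwise "read" the buffer itself. */
          static void *blocksize_change(void *arg)
          {
                  pthread_rwlock_wrlock(&oo_guard);              /* proposed fix */
                  pthread_rwlock_wrlock(&dn_struct_rwlock);
                  pthread_mutex_lock(&db_mtx);
                  while (db_state == DB_FILL)
                          pthread_cond_wait(&db_cv, &db_mtx);
                  if (db_state == DB_UNCACHED)
                          db_state = DB_CACHED;                  /* "read from disk" */
                  pthread_mutex_unlock(&db_mtx);
                  pthread_rwlock_unlock(&dn_struct_rwlock);
                  pthread_rwlock_unlock(&oo_guard);              /* proposed fix */
                  return NULL;
          }

          /* "Thread 2": object write.  Marks the buffer DB_FILL, then needs
           * dn_struct_rwlock for read to finish the fill and move it to
           * DB_CACHED, mirroring dbuf_assign_arcbuf() -> dbuf_dirty(). */
          static void *object_write(void *arg)
          {
                  pthread_rwlock_rdlock(&oo_guard);              /* proposed fix */
                  pthread_mutex_lock(&db_mtx);
                  db_state = DB_FILL;
                  pthread_mutex_unlock(&db_mtx);

                  pthread_rwlock_rdlock(&dn_struct_rwlock);
                  pthread_mutex_lock(&db_mtx);
                  db_state = DB_CACHED;
                  pthread_cond_broadcast(&db_cv);
                  pthread_mutex_unlock(&db_mtx);
                  pthread_rwlock_unlock(&dn_struct_rwlock);
                  pthread_rwlock_unlock(&oo_guard);              /* proposed fix */
                  return NULL;
          }

          int main(void)
          {
                  pthread_t t1, t2;

                  /* Without oo_guard: t2 sets DB_FILL, t1 grabs dn_struct_rwlock
                   * for write and waits for the state change, t2 then blocks on
                   * the read lock -- the cycle in the stack traces.  With the
                   * guard, the write holds it for read across the whole fill, so
                   * the block size change never sees a half-filled buffer. */
                  pthread_create(&t2, NULL, object_write, NULL);
                  pthread_create(&t1, NULL, blocksize_change, NULL);
                  pthread_join(t1, NULL);
                  pthread_join(t2, NULL);
                  puts("no deadlock");
                  return 0;
          }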


          cliffw Cliff White (Inactive) added a comment -

          Sequence of events:
          Noticed excessive reconnection attempts from lola-10 to lola-5:
          Nov 1 22:03:32 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
          Nov 1 22:04:22 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
          Nov 1 22:05:12 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
          Nov 1 22:06:02 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
          Nov 1 22:06:52 lola-5 kernel: Lustre: soaked-OST000f: Export ffff880310f5bc00 already connecting from 192.168.1.110@o2ib10
          On MDT:
          Nov 1 17:02:28 lola-10 kernel: LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation mds_connect to node 192.168.1.108@o2ib10 failed: rc = -114
          Nov 1 17:03:18 lola-10 kernel: LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation mds_connect to node 192.168.1.108@o2ib10 failed: rc = -114
          Nov 1 17:04:08 lola-10 kernel: LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation mds_connect to node 192.168.1.108@o2ib10 failed: rc = -114
          Nov 1 17:04:58 lola-10 kernel: LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation mds_connect to node 192.168.1.108@o2ib10 failed: rc = -114
          Examination revealed lola-5 in a hung state, apparently due to the deadlock described above.


          cliffw Cliff White (Inactive) added a comment -

          Lustre version lustre: 2.8.59_62_g165c308
          commit 165c3082079d48f70b6e340ad23514e0e7524978


          cliffw Cliff White (Inactive) added a comment -

          At the time of the hang discovery, I dumped debug logs on lola-8 (MGS), lola-5 (OSS), and lola-10 (the MDS doing recovery).
          I then set debug = -1, waited five minutes, and re-dumped the OSS and MDS logs. Attached.


          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: jay Jinshan Xiong (Inactive)
            Votes: 0
            Watchers: 5
