[LU-12450] MDT ZFS deadlock (task z_wr_iss blocked for more than 120 seconds) Created: 18/Jun/19  Updated: 16/Sep/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5, Lustre 2.12.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Aurelien Degremont (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We are able to consistently reproduce a deadlock on a ZFS MDT with a read-mostly heavy workload.

The error shows up in the logs as messages like the following:

[41567.393252] INFO: task z_wr_iss:9190 blocked for more than 120 seconds.
[41567.402043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[41567.408225] z_wr_iss D 0 9190 2 0x80000080
[41567.412000] Call Trace:
[41567.414576] ? __schedule+0x28e/0x890
[41567.417696] schedule+0x28/0x80
[41567.420528] rwsem_down_write_failed+0x19a/0x3c0
[41567.423927] ? call_rwsem_down_write_failed+0x13/0x20
[41567.427577] call_rwsem_down_write_failed+0x13/0x20
[41567.431161] ? metaslab_alloc+0x639/0x1120 [zfs]
[41567.434635] down_write+0x20/0x30
[41567.437621] dbuf_write_ready+0x1ba/0x2b0 [zfs]
[41567.441034] arc_write_ready+0x9b/0x2f0 [zfs]
[41567.444351] zio_ready+0x42/0x420 [zfs]
[41567.447507] ? tsd_get_by_thread+0x2a/0x40 [spl]
[41567.450959] zio_execute+0x90/0xe0 [zfs]
[41567.454134] taskq_thread+0x2d1/0x530 [spl]
[41567.457463] ? wake_up_q+0x70/0x70
[41567.460411] ? zio_reexecute+0x360/0x360 [zfs]
[41567.463760] ? taskq_thread_spawn+0x50/0x50 [spl]
[41567.467264] kthread+0x11a/0x130
[41567.470156] ? kthread_create_on_node+0x70/0x70
[41567.473569] ret_from_fork+0x35/0x40

 

 

The deadlock was detected several times, and we can reliably reproduce it with the following heavy workload:

  • The filesystem has 1 MDS and 30 OSSs, and is driven by 50 clients, each running 50 threads.
  • Each thread creates 20K files, for a total of 1M files per client and 50M files altogether. Each client uses a unique directory, so there are 50 directories with 1M files each, and each file is 1 MB in size.
  • Each thread then randomly chooses a file within the 20K that it previously created, opens it for RW access, chooses a random location on a 4 KB boundary, reads a random number of 4 KB blocks from that location, and then closes the file.
  • The deadlock occurs while the files are being read. It is perhaps significant that even though the threads only read the files, they open them with both READ and WRITE access.
    The deadlock does not occur when the volumes are mounted with noatime, so updating atime appears to be required.

The reproducer needs several hours to hit the bug, but it eventually does hit it. A minimal sketch of the per-thread access pattern is shown below.
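
For illustration only, here is a minimal single-threaded C sketch of the per-thread access pattern described above. The paths, constants and the endless loop are my assumptions, not the actual test harness, which runs 50 such threads on each of 50 clients against the pre-created 1 MB files:

/* Hypothetical sketch of one reproducer thread's read phase: pick a random
 * pre-created 1 MB file, open it O_RDWR (as in the workload description),
 * read a random number of 4 KB blocks from a random 4 KB-aligned offset,
 * then close it.  Error handling is kept minimal on purpose. */
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NFILES     20000              /* files created per thread          */
#define FILE_SIZE  (1024 * 1024)      /* each file is 1 MB                 */
#define BLOCK_SIZE 4096               /* reads are 4 KB-aligned            */

int main(void)
{
    char path[256], buf[BLOCK_SIZE];

    for (;;) {                        /* repeat until the hang shows up    */
        /* Path layout is invented; each client uses its own directory.    */
        snprintf(path, sizeof(path), "/lustre/client_dir/file_%d",
                 rand() % NFILES);

        int fd = open(path, O_RDWR);  /* opened RW although only read      */
        if (fd < 0)
            continue;

        off_t max_blk = FILE_SIZE / BLOCK_SIZE;
        off_t off     = (rand() % max_blk) * BLOCK_SIZE;
        int   nblocks = 1 + rand() % (int)(max_blk - off / BLOCK_SIZE);

        for (int i = 0; i < nblocks; i++)
            if (pread(fd, buf, BLOCK_SIZE, off + (off_t)i * BLOCK_SIZE) <= 0)
                break;

        close(fd);                    /* the reads update atime; the hang
                                       * does not reproduce with noatime   */
    }
    return 0;
}

The deadlock only triggers at scale, so this is just to make the access pattern concrete, not a standalone reproducer.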

The problem reproduces with Lustre 2.10 and 2.12, and with ZFS 0.7.9 and 0.7.13.

We identified 6 groups of processes with distinct stack traces. They are all waiting on a dn_struct_rwlock, either for a regular dnode or for the META_DNODE:

 

 rw_enter(&mdn->dn_struct_rwlock, RW_...);
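
As a rough userspace analogy (not the actual ZFS code; thread counts and sleeps are arbitrary), the pattern visible in the traces below is many MDT service threads taking the dnode's dn_struct_rwlock for read while the z_wr_iss write-issue path takes it for write:

/* Userspace analogy (NOT the ZFS code) of the acquisition pattern seen in
 * the stack traces below: groups 1, 3, 4 and 5 take dn_struct_rwlock as
 * readers (dbuf_dirty, dnode_hold_impl, dmu_prefetch), group 2 takes it
 * as a writer (dbuf_write_ready).  All names here are only illustrative. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t dn_struct_rwlock = PTHREAD_RWLOCK_INITIALIZER;

static void *mdt_service_thread(void *arg)    /* groups 1, 3, 4, 5 */
{
    pthread_rwlock_rdlock(&dn_struct_rwlock); /* rw_enter(..., RW_READER) */
    usleep(1000);                             /* touch dbuf/dnode state   */
    pthread_rwlock_unlock(&dn_struct_rwlock);
    return NULL;
}

static void *z_wr_iss_thread(void *arg)       /* group 2 */
{
    pthread_rwlock_wrlock(&dn_struct_rwlock); /* rw_enter(..., RW_WRITER) */
    usleep(1000);                             /* dbuf_write_ready() work  */
    pthread_rwlock_unlock(&dn_struct_rwlock);
    return NULL;
}

int main(void)
{
    pthread_t readers[8], writer;

    for (int i = 0; i < 8; i++)
        pthread_create(&readers[i], NULL, mdt_service_thread, NULL);
    pthread_create(&writer, NULL, z_wr_iss_thread, NULL);

    for (int i = 0; i < 8; i++)
        pthread_join(readers[i], NULL);
    pthread_join(writer, NULL);

    puts("done (with a healthy rwsem this never hangs)");
    return 0;
}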

 

Group 1 (mdt_rdpgxx_xxx *52)

[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc06a0103>] dbuf_dirty+0x303/0x8a0 [zfs]
[<ffffffffc06bb4aa>] dnode_setdirty+0x8a/0xe0 [zfs]
[<ffffffffc06a01fc>] dbuf_dirty+0x3fc/0x8a0 [zfs]
[<ffffffffc06e6e3d>] sa_attr_op+0x28d/0x3d0 [zfs]
[<ffffffffc06e70d7>] sa_bulk_update_impl+0x67/0x100 [zfs]
[<ffffffffc06e72c0>] sa_bulk_update+0x50/0x90 [zfs]
[<ffffffffc0cdbd4a>] osd_attr_set+0x52a/0xa90 [osd_zfs]
[<ffffffffc14d3c63>] lod_sub_attr_set+0x1f3/0x470 [lod]
[<ffffffffc14c3669>] lod_attr_set+0x109/0x850 [lod]
[<ffffffffc110631f>] mdd_attr_set_internal+0x9f/0x2b0 [mdd]
[<ffffffffc11073d2>] mdd_attr_set+0xb42/0xec0 [mdd]
[<ffffffffc141302e>] mdt_mfd_close+0x23e/0x620 [mdt]
[<ffffffffc1418a70>] mdt_close_internal+0x1e0/0x250 [mdt]
[<ffffffffc1418ec5>] mdt_close+0x3e5/0x840 [mdt]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

Group 2 (z_wr_iss *48)

[<ffffffffa65f7fc3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffc069b79a>] dbuf_write_ready+0x1ba/0x2b0 [zfs]
[<ffffffffc06961ab>] arc_write_ready+0x9b/0x2f0 [zfs]
[<ffffffffc0755cb2>] zio_ready+0x42/0x420 [zfs]
[<ffffffffc07508c0>] zio_execute+0x90/0xe0 [zfs]
[<ffffffffc03e37b1>] taskq_thread+0x2d1/0x530 [spl]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

Group 3 (mdt_rdpg01_003)

 

[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc069c1f1>] dbuf_read+0x371/0x920 [zfs]
[<ffffffffc06ba729>] dnode_hold_impl+0xe9/0xbf0 [zfs]
[<ffffffffc06b7370>] dmu_tx_hold_object_impl+0x50/0xc0 [zfs]
[<ffffffffc06b73fc>] dmu_tx_hold_bonus+0x1c/0x30 [zfs]
[<ffffffffc0cda823>] osd_declare_attr_set+0x223/0x7b0 [osd_zfs]
[<ffffffffc14d382c>] lod_sub_declare_attr_set+0xcc/0x310 [lod]
[<ffffffffc14c3eb4>] lod_declare_attr_set+0x104/0x550 [lod]
[<ffffffffc1106d1c>] mdd_attr_set+0x48c/0xec0 [mdd]
[<ffffffffc141302e>] mdt_mfd_close+0x23e/0x620 [mdt]
[<ffffffffc1418a70>] mdt_close_internal+0x1e0/0x250 [mdt]
[<ffffffffc1418ec5>] mdt_close+0x3e5/0x840 [mdt]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

 

Group 4 (mdt_rdpgxx_xxx *65)

 

[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc06ba6b1>] dnode_hold_impl+0x71/0xbf0 [zfs]
[<ffffffffc06b7370>] dmu_tx_hold_object_impl+0x50/0xc0 [zfs]
[<ffffffffc06b73fc>] dmu_tx_hold_bonus+0x1c/0x30 [zfs]
[<ffffffffc0cda823>] osd_declare_attr_set+0x223/0x7b0 [osd_zfs]
[<ffffffffc14d382c>] lod_sub_declare_attr_set+0xcc/0x310 [lod]
[<ffffffffc14c3eb4>] lod_declare_attr_set+0x104/0x550 [lod]
[<ffffffffc1106d1c>] mdd_attr_set+0x48c/0xec0 [mdd]
[<ffffffffc141302e>] mdt_mfd_close+0x23e/0x620 [mdt]
[<ffffffffc1418a70>] mdt_close_internal+0x1e0/0x250 [mdt]
[<ffffffffc1418ec5>] mdt_close+0x3e5/0x840 [mdt]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

 

Group 5 (mdtxx_xxx *71, 2 similar traces)

 

[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc06a5205>] dmu_prefetch+0x75/0x1d0 [zfs]
[<ffffffffc0ce32b5>] osd_fid_lookup+0x105/0x350 [osd_zfs]
[<ffffffffc0cdca8f>] osd_object_init+0xcf/0x10a0 [osd_zfs]
[<ffffffffc0eeb7a8>] lu_object_alloc+0xc8/0x2e0 [obdclass]
[<ffffffffc0eebb8a>] lu_object_find_at+0x17a/0x2a0 [obdclass]
[<ffffffffc13f16e9>] mdt_object_find+0x39/0x150 [mdt]
[<ffffffffc1416813>] mdt_reint_open+0x1d83/0x3020 [mdt]
[<ffffffffc140b8df>] mdt_reint_rec+0x7f/0x250 [mdt]
[<ffffffffc13eea9e>] mdt_reint_internal+0x5ee/0x680 [mdt]
[<ffffffffc13eeca3>] mdt_intent_reint+0x173/0x3e0 [mdt]
[<ffffffffc13f1b85>] mdt_intent_opc+0x385/0xa00 [mdt]
[<ffffffffc13f9675>] mdt_intent_policy+0x1c5/0x370 [mdt]
[<ffffffffc11a5ba4>] ldlm_lock_enqueue+0x234/0x950 [ptlrpc]
[<ffffffffc11cef7b>] ldlm_handle_enqueue0+0xa0b/0x1750 [ptlrpc]
[<ffffffffc1252b8b>] tgt_enqueue+0x5b/0x200 [ptlrpc]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

 

and

 

[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc06a5205>] dmu_prefetch+0x75/0x1d0 [zfs]
[<ffffffffc0ce32b5>] osd_fid_lookup+0x105/0x350 [osd_zfs]
[<ffffffffc0cdca8f>] osd_object_init+0xcf/0x10a0 [osd_zfs]
[<ffffffffc0eeb7a8>] lu_object_alloc+0xc8/0x2e0 [obdclass]
[<ffffffffc0eebb8a>] lu_object_find_at+0x17a/0x2a0 [obdclass]
[<ffffffffc13f16e9>] mdt_object_find+0x39/0x150 [mdt]
[<ffffffffc1415ab2>] mdt_reint_open+0x1022/0x3020 [mdt]
[<ffffffffc140b8df>] mdt_reint_rec+0x7f/0x250 [mdt]
[<ffffffffc13eea9e>] mdt_reint_internal+0x5ee/0x680 [mdt]
[<ffffffffc13eeca3>] mdt_intent_reint+0x173/0x3e0 [mdt]
[<ffffffffc13f1b85>] mdt_intent_opc+0x385/0xa00 [mdt]
[<ffffffffc13f9675>] mdt_intent_policy+0x1c5/0x370 [mdt]
[<ffffffffc11a5ba4>] ldlm_lock_enqueue+0x234/0x950 [ptlrpc]
[<ffffffffc11cef7b>] ldlm_handle_enqueue0+0xa0b/0x1750 [ptlrpc]
[<ffffffffc1252b8b>] tgt_enqueue+0x5b/0x200 [ptlrpc]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

 

Note that the second trace takes a different call path within mdt_reint_open().

 Comments   
Comment by Andreas Dilger [ 18/Jun/19 ]

I can't say for sure, but it looks like this bug is purely in the ZFS code? As such, it is probably better suited as an issue in github/zfsonlinux.

Comment by Aurelien Degremont (Inactive) [ 18/Jun/19 ]

I was curious to know your point of view on that. I've seen issues in the past that came from the way the OSD was using LDISKFS or ZFS rather than from the underlying filesystem itself.

But indeed, it looks like a ZFS issue.

Comment by James A Simmons [ 22/Jul/19 ]

Can you try patch https://review.whamcloud.com/#/c/35524/

Comment by Aurelien Degremont (Inactive) [ 16/Sep/19 ]

For the record, this was related to a kernel bug in the 4.14 kernel branch that was fixed by the following commit:

commit 8a5e7aeffc81ed2f1f5371eee7a0b019b37fb13a
Author: Waiman Long <longman@redhat.com>
Date:   Sun Apr 28 17:25:38 2019 -0400

    locking/rwsem: Prevent decrement of reader count before increment

    [ Upstream commit a9e9bcb45b1525ba7aea26ed9441e8632aeeda58 ]
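
Purely as an illustration of the class of problem named in the commit title (a simplified userspace sketch with invented names, not the kernel rwsem code): if the waker signals a blocked reader before accounting for it in the shared count, the woken reader can finish and decrement that count before the waker's increment lands, leaving the lock state transiently inconsistent:

/* Simplified userspace sketch (NOT the kernel rwsem implementation) of the
 * ordering hazard named in the commit title: wake the reader first, add it
 * to the shared count only afterwards. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long     reader_count = 0;  /* stand-in for the rwsem count  */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int             woken = 0;

static void *reader(void *arg)
{
    pthread_mutex_lock(&m);
    while (!woken)
        pthread_cond_wait(&c, &m);        /* the blocked reader            */
    pthread_mutex_unlock(&m);

    /* The reader finishes its read section and decrements the count.      */
    atomic_fetch_sub(&reader_count, 1);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);

    /* Buggy ordering: wake the reader first ...                           */
    pthread_mutex_lock(&m);
    woken = 1;
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);

    /* ... and account for it only afterwards.  If the reader already ran,
     * reader_count was negative for a while, i.e. the lock state was
     * briefly inconsistent; the upstream fix does the increment before
     * the wakeup.                                                          */
    atomic_fetch_add(&reader_count, 1);

    pthread_join(t, NULL);
    printf("final reader_count = %ld\n", (long)atomic_load(&reader_count));
    return 0;
}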

