Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Labels: None
- Affects Version/s: Lustre 2.10.5, Lustre 2.12.2
- Fix Version/s: None
- Severity: 3
Description
We are able to consistently reproduce a deadlock on a ZFS MDT with a read-mostly heavy workload.
The error shows up in the kernel log as hung-task warnings like the following:
[41567.393252] INFO: task z_wr_iss:9190 blocked for more than 120 seconds.
[41567.402043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[41567.408225] z_wr_iss D 0 9190 2 0x80000080
[41567.412000] Call Trace:
[41567.414576] ? __schedule+0x28e/0x890
[41567.417696] schedule+0x28/0x80
[41567.420528] rwsem_down_write_failed+0x19a/0x3c0
[41567.423927] ? call_rwsem_down_write_failed+0x13/0x20
[41567.427577] call_rwsem_down_write_failed+0x13/0x20
[41567.431161] ? metaslab_alloc+0x639/0x1120 [zfs]
[41567.434635] down_write+0x20/0x30
[41567.437621] dbuf_write_ready+0x1ba/0x2b0 [zfs]
[41567.441034] arc_write_ready+0x9b/0x2f0 [zfs]
[41567.444351] zio_ready+0x42/0x420 [zfs]
[41567.447507] ? tsd_get_by_thread+0x2a/0x40 [spl]
[41567.450959] zio_execute+0x90/0xe0 [zfs]
[41567.454134] taskq_thread+0x2d1/0x530 [spl]
[41567.457463] ? wake_up_q+0x70/0x70
[41567.460411] ? zio_reexecute+0x360/0x360 [zfs]
[41567.463760] ? taskq_thread_spawn+0x50/0x50 [spl]
[41567.467264] kthread+0x11a/0x130
[41567.470156] ? kthread_create_on_node+0x70/0x70
[41567.473569] ret_from_fork+0x35/0x40
The deadlock was detected several times, and we can reproduce it with the following heavy workload:
- A filesystem with 1 MDS and 30 OSSs, accessed by 50 clients, each running 50 threads.
- Each thread creates 20K files, total of 1M files per client, 50M files altogether. Each client uses a unique directory, so there are 50 directories with 1M files each. And each file is 1MB in size.
- Then each thread randomly chooses a file among the 20K files that thread previously created, opens it for RW access, chooses a random location on a 4KB boundary, reads a random number of 4KB blocks from that location, and subsequently closes the file (a minimal sketch of this loop is given after the list).
- The deadlock occurs in the process of reading the files. It is perhaps significant that even though the threads only read the file, they open it with both READ and WRITE access.
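For illustration only, here is a minimal C sketch of the per-thread read loop described above. The file-naming scheme, constants and use of rand() are assumptions made for this sketch; it is not the actual reproducer code.

/*
 * Hypothetical sketch of the per-thread read phase of the reproducer.
 * Path layout and RNG are assumptions, not the real test program.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FILES_PER_THREAD 20000
#define FILE_SIZE  (1024 * 1024)  /* each file is 1 MB */
#define BLOCK_SIZE 4096           /* reads are 4 KB aligned */

static void read_loop(const char *dir, int thread_id)
{
	char path[4096], buf[BLOCK_SIZE];

	for (;;) {
		/* pick a random file among the 20K this thread created */
		int idx = rand() % FILES_PER_THREAD;
		snprintf(path, sizeof(path), "%s/t%d-f%d", dir, thread_id, idx);

		/* the threads only read, but open with both READ and WRITE access */
		int fd = open(path, O_RDWR);
		if (fd < 0)
			continue;

		/* random 4 KB-aligned offset, random number of 4 KB blocks */
		off_t off = (rand() % (FILE_SIZE / BLOCK_SIZE)) * (off_t)BLOCK_SIZE;
		int nblk = 1 + rand() % ((FILE_SIZE - off) / BLOCK_SIZE);

		for (int i = 0; i < nblk; i++)
			if (pread(fd, buf, BLOCK_SIZE, off + (off_t)i * BLOCK_SIZE) <= 0)
				break;

		/* the close is what drives the MDT attr_set path seen in the traces below */
		close(fd);
	}
}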
The deadlock does not occur when the volumes are mounted with noatime, so atime updates appear to be required to trigger it.
The reproducer needs several hours, but it eventually hits the bug.
The problem is reproduced with Lustre 2.10 and 2.12, with ZFS 0.7.9 and 0.7.13.
We identified six groups of processes with distinct stack traces. They are all waiting on the dn_struct_rwlock of a DNODE or of the META_DNODE:
rw_enter(&mdn->dn_struct_rwlock, RW_...);
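As background, on Linux the ZFS krwlock_t is backed by a kernel rw_semaphore, which is why the blocked threads appear in call_rwsem_down_read_failed / call_rwsem_down_write_failed. The following is a simplified sketch (not the actual SPL source; the real krwlock_t also tracks the write owner) just to connect rw_enter() with the rwsem frames in the traces below:

#include <linux/rwsem.h>

/* Simplified model of the SPL krwlock_t, for reading the traces below. */
typedef enum { RW_READER, RW_WRITER } krw_t;

typedef struct krwlock {
	struct rw_semaphore rw_rwlock;
} krwlock_t;

static inline void rw_enter_sketch(krwlock_t *rwp, krw_t rw)
{
	if (rw == RW_READER)
		down_read(&rwp->rw_rwlock);	/* groups 1, 3, 4 and 5 block here */
	else
		down_write(&rwp->rw_rwlock);	/* group 2 (z_wr_iss) blocks here */
}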
Group 1 (mdt_rdpgxx_xxx *52)
[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc06a0103>] dbuf_dirty+0x303/0x8a0 [zfs]
[<ffffffffc06bb4aa>] dnode_setdirty+0x8a/0xe0 [zfs]
[<ffffffffc06a01fc>] dbuf_dirty+0x3fc/0x8a0 [zfs]
[<ffffffffc06e6e3d>] sa_attr_op+0x28d/0x3d0 [zfs]
[<ffffffffc06e70d7>] sa_bulk_update_impl+0x67/0x100 [zfs]
[<ffffffffc06e72c0>] sa_bulk_update+0x50/0x90 [zfs]
[<ffffffffc0cdbd4a>] osd_attr_set+0x52a/0xa90 [osd_zfs]
[<ffffffffc14d3c63>] lod_sub_attr_set+0x1f3/0x470 [lod]
[<ffffffffc14c3669>] lod_attr_set+0x109/0x850 [lod]
[<ffffffffc110631f>] mdd_attr_set_internal+0x9f/0x2b0 [mdd]
[<ffffffffc11073d2>] mdd_attr_set+0xb42/0xec0 [mdd]
[<ffffffffc141302e>] mdt_mfd_close+0x23e/0x620 [mdt]
[<ffffffffc1418a70>] mdt_close_internal+0x1e0/0x250 [mdt]
[<ffffffffc1418ec5>] mdt_close+0x3e5/0x840 [mdt]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
Group 2 (z_wr_iss *48)
[<ffffffffa65f7fc3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffc069b79a>] dbuf_write_ready+0x1ba/0x2b0 [zfs]
[<ffffffffc06961ab>] arc_write_ready+0x9b/0x2f0 [zfs]
[<ffffffffc0755cb2>] zio_ready+0x42/0x420 [zfs]
[<ffffffffc07508c0>] zio_execute+0x90/0xe0 [zfs]
[<ffffffffc03e37b1>] taskq_thread+0x2d1/0x530 [spl]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
Group 3 (mdt_rdpg01_003)
[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc069c1f1>] dbuf_read+0x371/0x920 [zfs]
[<ffffffffc06ba729>] dnode_hold_impl+0xe9/0xbf0 [zfs]
[<ffffffffc06b7370>] dmu_tx_hold_object_impl+0x50/0xc0 [zfs]
[<ffffffffc06b73fc>] dmu_tx_hold_bonus+0x1c/0x30 [zfs]
[<ffffffffc0cda823>] osd_declare_attr_set+0x223/0x7b0 [osd_zfs]
[<ffffffffc14d382c>] lod_sub_declare_attr_set+0xcc/0x310 [lod]
[<ffffffffc14c3eb4>] lod_declare_attr_set+0x104/0x550 [lod]
[<ffffffffc1106d1c>] mdd_attr_set+0x48c/0xec0 [mdd]
[<ffffffffc141302e>] mdt_mfd_close+0x23e/0x620 [mdt]
[<ffffffffc1418a70>] mdt_close_internal+0x1e0/0x250 [mdt]
[<ffffffffc1418ec5>] mdt_close+0x3e5/0x840 [mdt]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
Group 4 (mdt_rdpgxx_xxx *65)
[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc06ba6b1>] dnode_hold_impl+0x71/0xbf0 [zfs]
[<ffffffffc06b7370>] dmu_tx_hold_object_impl+0x50/0xc0 [zfs]
[<ffffffffc06b73fc>] dmu_tx_hold_bonus+0x1c/0x30 [zfs]
[<ffffffffc0cda823>] osd_declare_attr_set+0x223/0x7b0 [osd_zfs]
[<ffffffffc14d382c>] lod_sub_declare_attr_set+0xcc/0x310 [lod]
[<ffffffffc14c3eb4>] lod_declare_attr_set+0x104/0x550 [lod]
[<ffffffffc1106d1c>] mdd_attr_set+0x48c/0xec0 [mdd]
[<ffffffffc141302e>] mdt_mfd_close+0x23e/0x620 [mdt]
[<ffffffffc1418a70>] mdt_close_internal+0x1e0/0x250 [mdt]
[<ffffffffc1418ec5>] mdt_close+0x3e5/0x840 [mdt]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
Group 5 (mdtxx_xxx *71, 2 similar traces)
[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc06a5205>] dmu_prefetch+0x75/0x1d0 [zfs]
[<ffffffffc0ce32b5>] osd_fid_lookup+0x105/0x350 [osd_zfs]
[<ffffffffc0cdca8f>] osd_object_init+0xcf/0x10a0 [osd_zfs]
[<ffffffffc0eeb7a8>] lu_object_alloc+0xc8/0x2e0 [obdclass]
[<ffffffffc0eebb8a>] lu_object_find_at+0x17a/0x2a0 [obdclass]
[<ffffffffc13f16e9>] mdt_object_find+0x39/0x150 [mdt]
[<ffffffffc1416813>] mdt_reint_open+0x1d83/0x3020 [mdt]
[<ffffffffc140b8df>] mdt_reint_rec+0x7f/0x250 [mdt]
[<ffffffffc13eea9e>] mdt_reint_internal+0x5ee/0x680 [mdt]
[<ffffffffc13eeca3>] mdt_intent_reint+0x173/0x3e0 [mdt]
[<ffffffffc13f1b85>] mdt_intent_opc+0x385/0xa00 [mdt]
[<ffffffffc13f9675>] mdt_intent_policy+0x1c5/0x370 [mdt]
[<ffffffffc11a5ba4>] ldlm_lock_enqueue+0x234/0x950 [ptlrpc]
[<ffffffffc11cef7b>] ldlm_handle_enqueue0+0xa0b/0x1750 [ptlrpc]
[<ffffffffc1252b8b>] tgt_enqueue+0x5b/0x200 [ptlrpc]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
and
[<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc06a5205>] dmu_prefetch+0x75/0x1d0 [zfs]
[<ffffffffc0ce32b5>] osd_fid_lookup+0x105/0x350 [osd_zfs]
[<ffffffffc0cdca8f>] osd_object_init+0xcf/0x10a0 [osd_zfs]
[<ffffffffc0eeb7a8>] lu_object_alloc+0xc8/0x2e0 [obdclass]
[<ffffffffc0eebb8a>] lu_object_find_at+0x17a/0x2a0 [obdclass]
[<ffffffffc13f16e9>] mdt_object_find+0x39/0x150 [mdt]
[<ffffffffc1415ab2>] mdt_reint_open+0x1022/0x3020 [mdt]
[<ffffffffc140b8df>] mdt_reint_rec+0x7f/0x250 [mdt]
[<ffffffffc13eea9e>] mdt_reint_internal+0x5ee/0x680 [mdt]
[<ffffffffc13eeca3>] mdt_intent_reint+0x173/0x3e0 [mdt]
[<ffffffffc13f1b85>] mdt_intent_opc+0x385/0xa00 [mdt]
[<ffffffffc13f9675>] mdt_intent_policy+0x1c5/0x370 [mdt]
[<ffffffffc11a5ba4>] ldlm_lock_enqueue+0x234/0x950 [ptlrpc]
[<ffffffffc11cef7b>] ldlm_handle_enqueue0+0xa0b/0x1750 [ptlrpc]
[<ffffffffc1252b8b>] tgt_enqueue+0x5b/0x200 [ptlrpc]
[<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
[<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
[<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
[<ffffffffa60a87ba>] kthread+0x11a/0x130
[<ffffffffa6800205>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
The two traces differ only in the call path taken from mdt_reint_open() (note the different offsets).