Lustre / LU-12450

MDT ZFS deadlock (task z_wr_iss blocked for more than 120 seconds)


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.5, Lustre 2.12.2
    • Severity: 3
    Description

      We are able to consistently reproduce a deadlock on a ZFS MDT with a read-mostly heavy workload.

      The error can be spotted in the logs via messages like the following:

       

      [41567.393252] INFO: task z_wr_iss:9190 blocked for more than 120 seconds.
      [41567.402043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [41567.408225] z_wr_iss D 0 9190 2 0x80000080
      [41567.412000] Call Trace:
      [41567.414576] ? __schedule+0x28e/0x890
      [41567.417696] schedule+0x28/0x80
      [41567.420528] rwsem_down_write_failed+0x19a/0x3c0
      [41567.423927] ? call_rwsem_down_write_failed+0x13/0x20
      [41567.427577] call_rwsem_down_write_failed+0x13/0x20
      [41567.431161] ? metaslab_alloc+0x639/0x1120 [zfs]
      [41567.434635] down_write+0x20/0x30
      [41567.437621] dbuf_write_ready+0x1ba/0x2b0 [zfs]
      [41567.441034] arc_write_ready+0x9b/0x2f0 [zfs]
      [41567.444351] zio_ready+0x42/0x420 [zfs]
      [41567.447507] ? tsd_get_by_thread+0x2a/0x40 [spl]
      [41567.450959] zio_execute+0x90/0xe0 [zfs]
      [41567.454134] taskq_thread+0x2d1/0x530 [spl]
      [41567.457463] ? wake_up_q+0x70/0x70
      [41567.460411] ? zio_reexecute+0x360/0x360 [zfs]
      [41567.463760] ? taskq_thread_spawn+0x50/0x50 [spl]
      [41567.467264] kthread+0x11a/0x130
      [41567.470156] ? kthread_create_on_node+0x70/0x70
      [41567.473569] ret_from_fork+0x35/0x40

       

       

      The deadlock was detected several times, and we are able to reproduce it with the following heavy workload:

      • A filesystem with 1 MDS and 30 OSSs, using 50 clients, each running 50 threads.
      • Each thread creates 20K files, for a total of 1M files per client and 50M files altogether. Each client uses a unique directory, so there are 50 directories with 1M files each, and each file is 1 MB in size.
      • Each thread then randomly chooses a file among the 20K it previously created, opens it for RW access, chooses a random location on a 4 KB boundary, reads a random number of 4 KB blocks from that location, and then closes the file.
      • The deadlock occurs while the files are being read. It is perhaps significant that even though the threads only read the files, they open them with both READ and WRITE access.
        The deadlock does not occur when the volumes are mounted noatime, so updating atime appears to be required.

      The reproducer needs several hours before hitting the bug, but it eventually does.
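
      For reference, here is a minimal sketch of one reproducer thread's read phase. The file count, file size, 4 KB block size and the O_RDWR open come from the description above; the path layout, file names and the rest of the code are assumptions, not the actual reproducer.

      /* Hypothetical sketch of one reproducer thread's read phase.
       * Assumes the thread already created file.0 .. file.19999,
       * each 1 MB in size, in its own per-client directory. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/types.h>
      #include <unistd.h>

      #define NFILES     20000
      #define FILE_SIZE  (1024 * 1024)   /* each file is 1 MB */
      #define BLK        4096            /* reads aligned on 4 KB boundaries */

      static void read_phase(const char *dir)
      {
              char path[4096], buf[BLK];
              int nblk = FILE_SIZE / BLK;

              for (;;) {
                      snprintf(path, sizeof(path), "%s/file.%d", dir, rand() % NFILES);

                      /* Opened RW even though only reads are issued, as in the
                       * workload description; atime updates appear to be
                       * required to hit the deadlock. */
                      int fd = open(path, O_RDWR);
                      if (fd < 0)
                              continue;

                      off_t start = (off_t)(rand() % nblk) * BLK;
                      int count = 1 + rand() % nblk;

                      for (int i = 0; i < count && start + (off_t)i * BLK < FILE_SIZE; i++)
                              pread(fd, buf, BLK, start + (off_t)i * BLK);

                      close(fd);
              }
      }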

      The problem is reproduced with Lustre 2.10 and 2.12, and with ZFS 0.7.9 and 0.7.13.

       

       

      We identified 6 groups of processes with distinct stack traces. They are all waiting on the dn_struct_rwlock of a DNODE or of the META_DNODE:

       

       rw_enter(&mdn->dn_struct_rwlock, RW_...);
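
      For orientation, the lock mode each group is blocked in can be read directly off the traces below: call_rwsem_down_read_failed means the thread is waiting to take the rwsem as a reader, call_rwsem_down_write_failed as a writer. The lines below paraphrase that pattern; they are not literal ZFS source.

       /* Paraphrase of the contention pattern inferred from the traces;
        * dn stands for the dnode (or META_DNODE) being contended on. */

       /* Groups 1, 3, 4 and 5 block here in read mode, from
        * dbuf_dirty() (group 1), dnode_hold_impl() (groups 3 and 4)
        * and dmu_prefetch() (group 5): */
       rw_enter(&dn->dn_struct_rwlock, RW_READER);

       /* Group 2 (the z_wr_iss ZIO taskq threads) blocks here in write
        * mode, from dbuf_write_ready(): */
       rw_enter(&dn->dn_struct_rwlock, RW_WRITER);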

       

      Group 1 (mdt_rdpgxx_xxx  *52)

      [<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffffc06a0103>] dbuf_dirty+0x303/0x8a0 [zfs]
      [<ffffffffc06bb4aa>] dnode_setdirty+0x8a/0xe0 [zfs]
      [<ffffffffc06a01fc>] dbuf_dirty+0x3fc/0x8a0 [zfs]
      [<ffffffffc06e6e3d>] sa_attr_op+0x28d/0x3d0 [zfs]
      [<ffffffffc06e70d7>] sa_bulk_update_impl+0x67/0x100 [zfs]
      [<ffffffffc06e72c0>] sa_bulk_update+0x50/0x90 [zfs]
      [<ffffffffc0cdbd4a>] osd_attr_set+0x52a/0xa90 [osd_zfs]
      [<ffffffffc14d3c63>] lod_sub_attr_set+0x1f3/0x470 [lod]
      [<ffffffffc14c3669>] lod_attr_set+0x109/0x850 [lod]
      [<ffffffffc110631f>] mdd_attr_set_internal+0x9f/0x2b0 [mdd]
      [<ffffffffc11073d2>] mdd_attr_set+0xb42/0xec0 [mdd]
      [<ffffffffc141302e>] mdt_mfd_close+0x23e/0x620 [mdt]
      [<ffffffffc1418a70>] mdt_close_internal+0x1e0/0x250 [mdt]
      [<ffffffffc1418ec5>] mdt_close+0x3e5/0x840 [mdt]
      [<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
      [<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
      [<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
      [<ffffffffa60a87ba>] kthread+0x11a/0x130
      [<ffffffffa6800205>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff

      Group 2 (z_wr_iss *48)

      [<ffffffffa65f7fc3>] call_rwsem_down_write_failed+0x13/0x20
      [<ffffffffc069b79a>] dbuf_write_ready+0x1ba/0x2b0 [zfs]
      [<ffffffffc06961ab>] arc_write_ready+0x9b/0x2f0 [zfs]
      [<ffffffffc0755cb2>] zio_ready+0x42/0x420 [zfs]
      [<ffffffffc07508c0>] zio_execute+0x90/0xe0 [zfs]
      [<ffffffffc03e37b1>] taskq_thread+0x2d1/0x530 [spl]
      [<ffffffffa60a87ba>] kthread+0x11a/0x130
      [<ffffffffa6800205>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff

      Group 3 (mdt_rdpg01_003)

       

      [<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffffc069c1f1>] dbuf_read+0x371/0x920 [zfs]
      [<ffffffffc06ba729>] dnode_hold_impl+0xe9/0xbf0 [zfs]
      [<ffffffffc06b7370>] dmu_tx_hold_object_impl+0x50/0xc0 [zfs]
      [<ffffffffc06b73fc>] dmu_tx_hold_bonus+0x1c/0x30 [zfs]
      [<ffffffffc0cda823>] osd_declare_attr_set+0x223/0x7b0 [osd_zfs]
      [<ffffffffc14d382c>] lod_sub_declare_attr_set+0xcc/0x310 [lod]
      [<ffffffffc14c3eb4>] lod_declare_attr_set+0x104/0x550 [lod]
      [<ffffffffc1106d1c>] mdd_attr_set+0x48c/0xec0 [mdd]
      [<ffffffffc141302e>] mdt_mfd_close+0x23e/0x620 [mdt]
      [<ffffffffc1418a70>] mdt_close_internal+0x1e0/0x250 [mdt]
      [<ffffffffc1418ec5>] mdt_close+0x3e5/0x840 [mdt]
      [<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
      [<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
      [<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
      [<ffffffffa60a87ba>] kthread+0x11a/0x130
      [<ffffffffa6800205>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff

       

      Group 4 (mdt_rdpgxx_xxx *65)

       

      [<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffffc06ba6b1>] dnode_hold_impl+0x71/0xbf0 [zfs]
      [<ffffffffc06b7370>] dmu_tx_hold_object_impl+0x50/0xc0 [zfs]
      [<ffffffffc06b73fc>] dmu_tx_hold_bonus+0x1c/0x30 [zfs]
      [<ffffffffc0cda823>] osd_declare_attr_set+0x223/0x7b0 [osd_zfs]
      [<ffffffffc14d382c>] lod_sub_declare_attr_set+0xcc/0x310 [lod]
      [<ffffffffc14c3eb4>] lod_declare_attr_set+0x104/0x550 [lod]
      [<ffffffffc1106d1c>] mdd_attr_set+0x48c/0xec0 [mdd]
      [<ffffffffc141302e>] mdt_mfd_close+0x23e/0x620 [mdt]
      [<ffffffffc1418a70>] mdt_close_internal+0x1e0/0x250 [mdt]
      [<ffffffffc1418ec5>] mdt_close+0x3e5/0x840 [mdt]
      [<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
      [<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
      [<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
      [<ffffffffa60a87ba>] kthread+0x11a/0x130
      [<ffffffffa6800205>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff

       

      Group 5 (mdtxx_xxx *71, 2 similar traces)

       

      [<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffffc06a5205>] dmu_prefetch+0x75/0x1d0 [zfs]
      [<ffffffffc0ce32b5>] osd_fid_lookup+0x105/0x350 [osd_zfs]
      [<ffffffffc0cdca8f>] osd_object_init+0xcf/0x10a0 [osd_zfs]
      [<ffffffffc0eeb7a8>] lu_object_alloc+0xc8/0x2e0 [obdclass]
      [<ffffffffc0eebb8a>] lu_object_find_at+0x17a/0x2a0 [obdclass]
      [<ffffffffc13f16e9>] mdt_object_find+0x39/0x150 [mdt]
      [<ffffffffc1416813>] mdt_reint_open+0x1d83/0x3020 [mdt]
      [<ffffffffc140b8df>] mdt_reint_rec+0x7f/0x250 [mdt]
      [<ffffffffc13eea9e>] mdt_reint_internal+0x5ee/0x680 [mdt]
      [<ffffffffc13eeca3>] mdt_intent_reint+0x173/0x3e0 [mdt]
      [<ffffffffc13f1b85>] mdt_intent_opc+0x385/0xa00 [mdt]
      [<ffffffffc13f9675>] mdt_intent_policy+0x1c5/0x370 [mdt]
      [<ffffffffc11a5ba4>] ldlm_lock_enqueue+0x234/0x950 [ptlrpc]
      [<ffffffffc11cef7b>] ldlm_handle_enqueue0+0xa0b/0x1750 [ptlrpc]
      [<ffffffffc1252b8b>] tgt_enqueue+0x5b/0x200 [ptlrpc]
      [<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
      [<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
      [<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
      [<ffffffffa60a87ba>] kthread+0x11a/0x130
      [<ffffffffa6800205>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff

       

      and

       

      [<ffffffffa65f7f94>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffffc06a5205>] dmu_prefetch+0x75/0x1d0 [zfs]
      [<ffffffffc0ce32b5>] osd_fid_lookup+0x105/0x350 [osd_zfs]
      [<ffffffffc0cdca8f>] osd_object_init+0xcf/0x10a0 [osd_zfs]
      [<ffffffffc0eeb7a8>] lu_object_alloc+0xc8/0x2e0 [obdclass]
      [<ffffffffc0eebb8a>] lu_object_find_at+0x17a/0x2a0 [obdclass]
      [<ffffffffc13f16e9>] mdt_object_find+0x39/0x150 [mdt]
      [<ffffffffc1415ab2>] mdt_reint_open+0x1022/0x3020 [mdt]
      [<ffffffffc140b8df>] mdt_reint_rec+0x7f/0x250 [mdt]
      [<ffffffffc13eea9e>] mdt_reint_internal+0x5ee/0x680 [mdt]
      [<ffffffffc13eeca3>] mdt_intent_reint+0x173/0x3e0 [mdt]
      [<ffffffffc13f1b85>] mdt_intent_opc+0x385/0xa00 [mdt]
      [<ffffffffc13f9675>] mdt_intent_policy+0x1c5/0x370 [mdt]
      [<ffffffffc11a5ba4>] ldlm_lock_enqueue+0x234/0x950 [ptlrpc]
      [<ffffffffc11cef7b>] ldlm_handle_enqueue0+0xa0b/0x1750 [ptlrpc]
      [<ffffffffc1252b8b>] tgt_enqueue+0x5b/0x200 [ptlrpc]
      [<ffffffffc1256a84>] tgt_request_handle+0x814/0x14b0 [ptlrpc]
      [<ffffffffc11ffbb7>] ptlrpc_server_handle_request+0x2c7/0xb70 [ptlrpc]
      [<ffffffffc1203ea0>] ptlrpc_main+0xb20/0x1f00 [ptlrpc]
      [<ffffffffa60a87ba>] kthread+0x11a/0x130
      [<ffffffffa6800205>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff

       

      Note that the call path from mdt_reint_open() is not the same in the two traces.

       

       

      People

          Assignee: wc-triage (WC Triage)
          Reporter: degremoa (Aurelien Degremont, Inactive)