Lustre / LU-12544

mds marked unhealthy after txg_quiesce thread hanging


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Environment: RHEL7.6, Lustre 2.11.0, ZFS 0.7.13

    Description

      While waiting for a resolution to LU-12510, we rolled back to our previous production image, which runs Lustre 2.11.0 with ZFS 0.7.13. We had run into issues on that image before, but it was deemed more stable than what we were seeing. Since then we have repeatedly been hitting this issue, which causes our MDS hosts to be marked unhealthy by Lustre.

      [12692.736688] INFO: task txg_quiesce:37482 blocked for more than 120 seconds.
      [12692.744369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12692.752901] txg_quiesce     D ffff8afcb464c100     0 37482      2 0x00000000
      [12692.760699] Call Trace:
      [12692.763852]  [<ffffffff84b67c49>] schedule+0x29/0x70
      [12692.769527]  [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
      [12692.776399]  [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
      [12692.782916]  [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
      [12692.789191]  [<ffffffffc129fc6b>] txg_quiesce_thread+0x2fb/0x410 [zfs]
      [12692.796401]  [<ffffffffc129f970>] ? txg_init+0x2b0/0x2b0 [zfs]
      [12692.802893]  [<ffffffffc0a0b063>] thread_generic_wrapper+0x73/0x80 [spl]
      [12692.810243]  [<ffffffffc0a0aff0>] ? __thread_exit+0x20/0x20 [spl]
      [12692.816976]  [<ffffffff844c1c71>] kthread+0xd1/0xe0
      [12692.822482]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      [12692.829195]  [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [12692.836264]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      [12692.842995] INFO: task mdt01_001:38593 blocked for more than 120 seconds.
      [12692.850408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12692.858865] mdt01_001       D ffff8afc5f579040     0 38593      2 0x00000000
      [12692.866600] Call Trace:
      [12692.869719]  [<ffffffffc19d4a19>] ? lod_sub_declare_xattr_set+0xf9/0x300 [lod]
      [12692.877592]  [<ffffffff84b67c49>] schedule+0x29/0x70
      [12692.883203]  [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
      [12692.890015]  [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
      [12692.896475]  [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
      [12692.902677]  [<ffffffffc1254c33>] dmu_tx_wait+0x213/0x3c0 [zfs]
      [12692.909203]  [<ffffffffc1254e72>] dmu_tx_assign+0x92/0x490 [zfs]
      [12692.915805]  [<ffffffffc0d35f57>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
      [12692.922968]  [<ffffffffc1841fa2>] top_trans_start+0x702/0x940 [ptlrpc]
      [12692.930062]  [<ffffffffc1a2b173>] ? mdd_declare_create+0x5a3/0xdb0 [mdd]
      [12692.937324]  [<ffffffffc199a3f1>] lod_trans_start+0x31/0x40 [lod]
      [12692.943964]  [<ffffffffc1a4980a>] mdd_trans_start+0x1a/0x20 [mdd]
      [12692.950591]  [<ffffffffc1a2f507>] mdd_create+0xb77/0x13a0 [mdd]
      [12692.957049]  [<ffffffffc11146c8>] mdt_reint_open+0x2218/0x3270 [mdt]
      [12692.963948]  [<ffffffffc15f4241>] ? upcall_cache_get_entry+0x211/0x8d0 [obdclass]
      [12692.971950]  [<ffffffffc1108883>] mdt_reint_rec+0x83/0x210 [mdt]
      [12692.978467]  [<ffffffffc10e81ab>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      [12692.985486]  [<ffffffffc10f4737>] mdt_intent_reint+0x157/0x420 [mdt]
      [12692.992317]  [<ffffffffc10eb315>] mdt_intent_opc+0x455/0xae0 [mdt]
      [12692.999000]  [<ffffffffc17ca710>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
      [12693.007136]  [<ffffffffc10f2f63>] mdt_intent_policy+0x1a3/0x360 [mdt]
      [12693.014050]  [<ffffffffc177a235>] ldlm_lock_enqueue+0x385/0x8f0 [ptlrpc]
      [12693.021223]  [<ffffffffc17a2913>] ldlm_handle_enqueue0+0x8f3/0x13e0 [ptlrpc]
      [12693.028726]  [<ffffffffc17ca790>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      [12693.036662]  [<ffffffffc1828bf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [12693.043205]  [<ffffffffc182f05a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
      [12693.050507]  [<ffffffffc17d4843>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
      [12693.058580]  [<ffffffffc17d16f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [12693.065766]  [<ffffffff844d6802>] ? default_wake_function+0x12/0x20
      [12693.072431]  [<ffffffff844cbadb>] ? __wake_up_common+0x5b/0x90
      [12693.078674]  [<ffffffffc17d7ff2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [12693.085351]  [<ffffffffc17d7560>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
      [12693.093130]  [<ffffffff844c1c71>] kthread+0xd1/0xe0
      [12693.098403]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      [12693.104890]  [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [12693.111718]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
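      The two traces are consistent with a single stall: txg_quiesce is parked in cv_wait_common, so the open transaction group never finishes quiescing, and mdt01_001 (like any thread reaching dmu_tx_assign) backs up in dmu_tx_wait behind it. A minimal sketch of how the txg pipeline state can be read from the SPL kstats when this happens; the sample lines and values below are illustrative stand-ins, not output from this host:

```shell
# On a live MDS you would read the real kstat:
#   cat /proc/spl/kstat/zfs/<pool>/txgs
# The here-doc below stands in for that file so the sketch is
# self-contained. Column 3 is the per-txg state letter (e.g. O=open,
# Q=quiesced, S=syncing, C=committed); a txg pinned in one state shows
# where the pipeline stopped advancing. The first two lines are the
# kstat header and column header, so we skip them.
awk 'NR > 2 { print $1, $3 }' <<'EOF'
11 1 0x01 12 1248 14954397054
txg       birth            state ndirty  nread  nwritten
1234567   14954397054000   C     0       0      0
1234568   14954397055000   S     8192    0      0
1234569   14954397056000   O     4096    0      0
EOF
```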
      

      I am not sure whether this is related to the same issue we were seeing with 2.12.2 and ZFS 0.8.1; the dnodestats counters did not look like they were backing up on lock retries, though.
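      For reference, the dnodestats mentioned above come from the SPL kstat file exposed by newer ZFS releases. A minimal sketch of the check, assuming the standard dnode-hold counter names; the sample values are illustrative stand-ins, not output from this host:

```shell
# On a live host you would run:
#   grep lock_retry /proc/spl/kstat/zfs/dnodestats
# The here-doc stands in for that file so the sketch is self-contained.
# Steadily climbing *_lock_retry counters would point at dnode hold
# contention, which did not appear to be the case here.
grep -E 'dnode_hold_(alloc|free)_lock_retry' <<'EOF'
dnode_hold_alloc_lock_retry     4    0
dnode_hold_alloc_lock_misses    4    0
dnode_hold_free_lock_retry      4    0
dnode_hold_free_lock_misses     4    0
EOF
```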

      Attachments

        1. f2-mds2_lustre_unhealthy_20190713.tgz
          18.53 MB
          Philip B Curtis
        2. f2-mds2_lustre_unhealthy_20190715.tgz
          16.44 MB
          Philip B Curtis


            People

              Assignee: Alex Zhuravlev (bzzz)
              Reporter: Philip B Curtis (curtispb)
              Votes: 0
              Watchers: 7
