Lustre / LU-12544

mds marked unhealthy after txg_quiesce thread hanging


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Environment: RHEL7.6, Lustre 2.11.0, ZFS 0.7.13

    Description

      While waiting for a resolution to LU-12510, we rolled back to our previous production image, which runs Lustre 2.11.0 with ZFS 0.7.13. We had run into issues on that image before, but it was deemed more stable than what we were seeing. Since then we have repeatedly been hitting this issue, which causes our MDS hosts to be marked unhealthy by Lustre.

      [12692.736688] INFO: task txg_quiesce:37482 blocked for more than 120 seconds.
      [12692.744369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12692.752901] txg_quiesce     D ffff8afcb464c100     0 37482      2 0x00000000
      [12692.760699] Call Trace:
      [12692.763852]  [<ffffffff84b67c49>] schedule+0x29/0x70
      [12692.769527]  [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
      [12692.776399]  [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
      [12692.782916]  [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
      [12692.789191]  [<ffffffffc129fc6b>] txg_quiesce_thread+0x2fb/0x410 [zfs]
      [12692.796401]  [<ffffffffc129f970>] ? txg_init+0x2b0/0x2b0 [zfs]
      [12692.802893]  [<ffffffffc0a0b063>] thread_generic_wrapper+0x73/0x80 [spl]
      [12692.810243]  [<ffffffffc0a0aff0>] ? __thread_exit+0x20/0x20 [spl]
      [12692.816976]  [<ffffffff844c1c71>] kthread+0xd1/0xe0
      [12692.822482]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      [12692.829195]  [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [12692.836264]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      [12692.842995] INFO: task mdt01_001:38593 blocked for more than 120 seconds.
      [12692.850408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12692.858865] mdt01_001       D ffff8afc5f579040     0 38593      2 0x00000000
      [12692.866600] Call Trace:
      [12692.869719]  [<ffffffffc19d4a19>] ? lod_sub_declare_xattr_set+0xf9/0x300 [lod]
      [12692.877592]  [<ffffffff84b67c49>] schedule+0x29/0x70
      [12692.883203]  [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
      [12692.890015]  [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
      [12692.896475]  [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
      [12692.902677]  [<ffffffffc1254c33>] dmu_tx_wait+0x213/0x3c0 [zfs]
      [12692.909203]  [<ffffffffc1254e72>] dmu_tx_assign+0x92/0x490 [zfs]
      [12692.915805]  [<ffffffffc0d35f57>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
      [12692.922968]  [<ffffffffc1841fa2>] top_trans_start+0x702/0x940 [ptlrpc]
      [12692.930062]  [<ffffffffc1a2b173>] ? mdd_declare_create+0x5a3/0xdb0 [mdd]
      [12692.937324]  [<ffffffffc199a3f1>] lod_trans_start+0x31/0x40 [lod]
      [12692.943964]  [<ffffffffc1a4980a>] mdd_trans_start+0x1a/0x20 [mdd]
      [12692.950591]  [<ffffffffc1a2f507>] mdd_create+0xb77/0x13a0 [mdd]
      [12692.957049]  [<ffffffffc11146c8>] mdt_reint_open+0x2218/0x3270 [mdt]
      [12692.963948]  [<ffffffffc15f4241>] ? upcall_cache_get_entry+0x211/0x8d0 [obdclass]
      [12692.971950]  [<ffffffffc1108883>] mdt_reint_rec+0x83/0x210 [mdt]
      [12692.978467]  [<ffffffffc10e81ab>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      [12692.985486]  [<ffffffffc10f4737>] mdt_intent_reint+0x157/0x420 [mdt]
      [12692.992317]  [<ffffffffc10eb315>] mdt_intent_opc+0x455/0xae0 [mdt]
      [12692.999000]  [<ffffffffc17ca710>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
      [12693.007136]  [<ffffffffc10f2f63>] mdt_intent_policy+0x1a3/0x360 [mdt]
      [12693.014050]  [<ffffffffc177a235>] ldlm_lock_enqueue+0x385/0x8f0 [ptlrpc]
      [12693.021223]  [<ffffffffc17a2913>] ldlm_handle_enqueue0+0x8f3/0x13e0 [ptlrpc]
      [12693.028726]  [<ffffffffc17ca790>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      [12693.036662]  [<ffffffffc1828bf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [12693.043205]  [<ffffffffc182f05a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
      [12693.050507]  [<ffffffffc17d4843>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
      [12693.058580]  [<ffffffffc17d16f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [12693.065766]  [<ffffffff844d6802>] ? default_wake_function+0x12/0x20
      [12693.072431]  [<ffffffff844cbadb>] ? __wake_up_common+0x5b/0x90
      [12693.078674]  [<ffffffffc17d7ff2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [12693.085351]  [<ffffffffc17d7560>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
      [12693.093130]  [<ffffffff844c1c71>] kthread+0xd1/0xe0
      [12693.098403]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      [12693.104890]  [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [12693.111718]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
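      The two traces are consistent with a single stall: txg_quiesce is parked in cv_wait_common, so the open transaction group never finishes quiescing, and mdt01_001 (like any thread reaching dmu_tx_assign) backs up in dmu_tx_wait behind it. A minimal sketch of how the txg pipeline state can be read from the SPL kstats when this happens; the sample lines and values below are illustrative stand-ins, not output from this host:

```shell
# On a live MDS you would read the real kstat:
#   cat /proc/spl/kstat/zfs/<pool>/txgs
# The here-doc below stands in for that file so the sketch is
# self-contained. Column 3 is the per-txg state letter (e.g. O=open,
# Q=quiesced, S=syncing, C=committed); a txg pinned in one state shows
# where the pipeline stopped advancing. The first two lines are the
# kstat header and column header, so we skip them.
awk 'NR > 2 { print $1, $3 }' <<'EOF'
11 1 0x01 12 1248 14954397054
txg       birth            state ndirty  nread  nwritten
1234567   14954397054000   C     0       0      0
1234568   14954397055000   S     8192    0      0
1234569   14954397056000   O     4096    0      0
EOF
```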
      

      I am not sure whether this is related to the same issue we were seeing with 2.12.2 and ZFS 0.8.1; the dnodestats counters did not look like they were backing up on lock retries, though.
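      For reference, the dnodestats mentioned above come from the SPL kstat file exposed by newer ZFS releases. A minimal sketch of the check, assuming the standard dnode-hold counter names; the sample values are illustrative stand-ins, not output from this host:

```shell
# On a live host you would run:
#   grep lock_retry /proc/spl/kstat/zfs/dnodestats
# The here-doc stands in for that file so the sketch is self-contained.
# Steadily climbing *_lock_retry counters would point at dnode hold
# contention, which did not appear to be the case here.
grep -E 'dnode_hold_(alloc|free)_lock_retry' <<'EOF'
dnode_hold_alloc_lock_retry     4    0
dnode_hold_alloc_lock_misses    4    0
dnode_hold_free_lock_retry      4    0
dnode_hold_free_lock_misses     4    0
EOF
```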

      Attachments

        1. f2-mds2_lustre_unhealthy_20190713.tgz
          18.53 MB
          Philip B Curtis
        2. f2-mds2_lustre_unhealthy_20190715.tgz
          16.44 MB
          Philip B Curtis


            People

              Assignee: Alex Zhuravlev (bzzz)
              Reporter: Philip B Curtis (curtispb)
              Votes: 0
              Watchers: 7
