[LU-12544] mds marked unhealthy after txg_quiesce thread hanging Created: 14/Jul/19  Updated: 22/Aug/19  Resolved: 22/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Philip B Curtis Assignee: Alex Zhuravlev
Resolution: Not a Bug Votes: 0
Labels: ORNL
Environment:

RHEL7.6, Lustre 2.11.0, ZFS 0.7.13


Attachments: File f2-mds2_lustre_unhealthy_20190713.tgz, File f2-mds2_lustre_unhealthy_20190715.tgz
Issue Links:
Duplicate
duplicates LU-12510 mds server hangs cv_wait_common Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

While waiting for a resolution to LU-12510, we rolled back to our previous production image, which runs Lustre 2.11.0 with ZFS 0.7.13. We had run into issues on that image before, but it was deemed more stable than what we were seeing. Since then, we have repeatedly been hitting the issue below, which causes Lustre to mark our MDS hosts unhealthy.

[12692.736688] INFO: task txg_quiesce:37482 blocked for more than 120 seconds.
[12692.744369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12692.752901] txg_quiesce     D ffff8afcb464c100     0 37482      2 0x00000000
[12692.760699] Call Trace:
[12692.763852]  [<ffffffff84b67c49>] schedule+0x29/0x70
[12692.769527]  [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
[12692.776399]  [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
[12692.782916]  [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
[12692.789191]  [<ffffffffc129fc6b>] txg_quiesce_thread+0x2fb/0x410 [zfs]
[12692.796401]  [<ffffffffc129f970>] ? txg_init+0x2b0/0x2b0 [zfs]
[12692.802893]  [<ffffffffc0a0b063>] thread_generic_wrapper+0x73/0x80 [spl]
[12692.810243]  [<ffffffffc0a0aff0>] ? __thread_exit+0x20/0x20 [spl]
[12692.816976]  [<ffffffff844c1c71>] kthread+0xd1/0xe0
[12692.822482]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
[12692.829195]  [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
[12692.836264]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
[12692.842995] INFO: task mdt01_001:38593 blocked for more than 120 seconds.
[12692.850408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12692.858865] mdt01_001       D ffff8afc5f579040     0 38593      2 0x00000000
[12692.866600] Call Trace:
[12692.869719]  [<ffffffffc19d4a19>] ? lod_sub_declare_xattr_set+0xf9/0x300 [lod]
[12692.877592]  [<ffffffff84b67c49>] schedule+0x29/0x70
[12692.883203]  [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
[12692.890015]  [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
[12692.896475]  [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
[12692.902677]  [<ffffffffc1254c33>] dmu_tx_wait+0x213/0x3c0 [zfs]
[12692.909203]  [<ffffffffc1254e72>] dmu_tx_assign+0x92/0x490 [zfs]
[12692.915805]  [<ffffffffc0d35f57>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
[12692.922968]  [<ffffffffc1841fa2>] top_trans_start+0x702/0x940 [ptlrpc]
[12692.930062]  [<ffffffffc1a2b173>] ? mdd_declare_create+0x5a3/0xdb0 [mdd]
[12692.937324]  [<ffffffffc199a3f1>] lod_trans_start+0x31/0x40 [lod]
[12692.943964]  [<ffffffffc1a4980a>] mdd_trans_start+0x1a/0x20 [mdd]
[12692.950591]  [<ffffffffc1a2f507>] mdd_create+0xb77/0x13a0 [mdd]
[12692.957049]  [<ffffffffc11146c8>] mdt_reint_open+0x2218/0x3270 [mdt]
[12692.963948]  [<ffffffffc15f4241>] ? upcall_cache_get_entry+0x211/0x8d0 [obdclass]
[12692.971950]  [<ffffffffc1108883>] mdt_reint_rec+0x83/0x210 [mdt]
[12692.978467]  [<ffffffffc10e81ab>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[12692.985486]  [<ffffffffc10f4737>] mdt_intent_reint+0x157/0x420 [mdt]
[12692.992317]  [<ffffffffc10eb315>] mdt_intent_opc+0x455/0xae0 [mdt]
[12692.999000]  [<ffffffffc17ca710>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
[12693.007136]  [<ffffffffc10f2f63>] mdt_intent_policy+0x1a3/0x360 [mdt]
[12693.014050]  [<ffffffffc177a235>] ldlm_lock_enqueue+0x385/0x8f0 [ptlrpc]
[12693.021223]  [<ffffffffc17a2913>] ldlm_handle_enqueue0+0x8f3/0x13e0 [ptlrpc]
[12693.028726]  [<ffffffffc17ca790>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[12693.036662]  [<ffffffffc1828bf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[12693.043205]  [<ffffffffc182f05a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[12693.050507]  [<ffffffffc17d4843>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[12693.058580]  [<ffffffffc17d16f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[12693.065766]  [<ffffffff844d6802>] ? default_wake_function+0x12/0x20
[12693.072431]  [<ffffffff844cbadb>] ? __wake_up_common+0x5b/0x90
[12693.078674]  [<ffffffffc17d7ff2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[12693.085351]  [<ffffffffc17d7560>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[12693.093130]  [<ffffffff844c1c71>] kthread+0xd1/0xe0
[12693.098403]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
[12693.104890]  [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
[12693.111718]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40

I am not sure whether this is related to the same issue we were seeing with 2.12.2 and ZFS 0.8.1. The dnodestats kstats did not look like they were backing up on lock retries, though.
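
For reference, this is roughly how the dnodestats counters were checked; a minimal sketch assuming the standard ZFS 0.7.x kstat path under /proc/spl/kstat/zfs (the grep pattern is illustrative):

# Look for dnode hold lock-retry activity on the MDS (ZFS 0.7.x kstat path).
grep -i lock_retry /proc/spl/kstat/zfs/dnodestats

# Snapshot the full kstat a few times to see whether the counters are growing.
for i in 1 2 3; do
    date
    cat /proc/spl/kstat/zfs/dnodestats
    sleep 10
done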



 Comments   
Comment by Philip B Curtis [ 14/Jul/19 ]

Attaching the dmesg output from the host before I ran a crashdump as well as a copy of the zfs kstats.
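
A rough sketch of how the attached data was gathered, assuming the ZFS 0.7.x kstat location; file names are illustrative:

# Capture dmesg and a copy of the ZFS kstats for attachment to the ticket.
dmesg > dmesg_$(hostname)_$(date +%Y%m%d).log
mkdir -p zfs_kstats
for f in /proc/spl/kstat/zfs/*; do
    [ -f "$f" ] && cat "$f" > "zfs_kstats/${f##*/}"
done
tar czf $(hostname)_lustre_unhealthy_$(date +%Y%m%d).tgz zfs_kstats dmesg_*.log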

Comment by Peter Jones [ 14/Jul/19 ]

Alex

Any advice here?

Peter

Comment by Alex Zhuravlev [ 15/Jul/19 ]

It would be great to see all backtraces for this case. It looks like one (or a few) OSTs were down?
Also, the trace in the description can't be found in the logs you attached; they look like different cases?
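
One way to capture backtraces for all tasks, as a sketch (this assumes magic sysrq is available on the MDS; the output file name is illustrative):

# Dump the state and stack of every task to the kernel log, then save it.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > all_backtraces_$(date +%Y%m%d_%H%M%S).log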

Comment by Philip B Curtis [ 15/Jul/19 ]

We just hit this issue again. I am attaching the new data, which is hopefully more useful than what I previously attached.

f2-mds2_lustre_unhealthy_20190715.tgz

Comment by Matt Ezell [ 15/Jul/19 ]

Is this something that https://github.com/zfsonlinux/zfs/issues/8426 might help with?

Comment by James A Simmons [ 15/Jul/19 ]

I also patched our ZFS 0.7.13 build with the patch referenced in 8426.
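
For anyone applying the same backport, a rough sketch against a source build (the patch file name is purely illustrative, not the actual change from 8426):

# Apply the backported patch to a ZFS 0.7.13 source tree and rebuild.
cd zfs-0.7.13
patch -p1 < backport-from-issue-8426.patch   # hypothetical file name
./configure
make -j$(nproc)
make install
depmod -a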

Comment by Peter Jones [ 21/Aug/19 ]

@James, is it too early to tell whether the ZFS patch helped?

Comment by James A Simmons [ 21/Aug/19 ]

We have since moved to ZFS 0.8.1, which has fewer problems. Matt, are you okay with closing this ticket?

Comment by Matt Ezell [ 22/Aug/19 ]

Yes, I think it's fine to close this one.

Comment by Peter Jones [ 22/Aug/19 ]

ok thanks!
