Lustre / LU-12544

mds marked unhealthy after txg_quiesce thread hanging

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Environment: RHEL7.6, Lustre 2.11.0, ZFS 0.7.13
    • Severity: 2

    Description

      While waiting for a resolution to LU-12510, we rolled back to our previous production image, which runs Lustre 2.11.0 with ZFS 0.7.13. We had run into issues with that image before, but it was deemed more stable than what we were seeing. Since then, we have repeatedly been hitting this issue, which causes our MDS hosts to be marked unhealthy by Lustre.

      [12692.736688] INFO: task txg_quiesce:37482 blocked for more than 120 seconds.
      [12692.744369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12692.752901] txg_quiesce     D ffff8afcb464c100     0 37482      2 0x00000000
      [12692.760699] Call Trace:
      [12692.763852]  [<ffffffff84b67c49>] schedule+0x29/0x70
      [12692.769527]  [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
      [12692.776399]  [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
      [12692.782916]  [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
      [12692.789191]  [<ffffffffc129fc6b>] txg_quiesce_thread+0x2fb/0x410 [zfs]
      [12692.796401]  [<ffffffffc129f970>] ? txg_init+0x2b0/0x2b0 [zfs]
      [12692.802893]  [<ffffffffc0a0b063>] thread_generic_wrapper+0x73/0x80 [spl]
      [12692.810243]  [<ffffffffc0a0aff0>] ? __thread_exit+0x20/0x20 [spl]
      [12692.816976]  [<ffffffff844c1c71>] kthread+0xd1/0xe0
      [12692.822482]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      [12692.829195]  [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [12692.836264]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      [12692.842995] INFO: task mdt01_001:38593 blocked for more than 120 seconds.
      [12692.850408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12692.858865] mdt01_001       D ffff8afc5f579040     0 38593      2 0x00000000
      [12692.866600] Call Trace:
      [12692.869719]  [<ffffffffc19d4a19>] ? lod_sub_declare_xattr_set+0xf9/0x300 [lod]
      [12692.877592]  [<ffffffff84b67c49>] schedule+0x29/0x70
      [12692.883203]  [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
      [12692.890015]  [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
      [12692.896475]  [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
      [12692.902677]  [<ffffffffc1254c33>] dmu_tx_wait+0x213/0x3c0 [zfs]
      [12692.909203]  [<ffffffffc1254e72>] dmu_tx_assign+0x92/0x490 [zfs]
      [12692.915805]  [<ffffffffc0d35f57>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
      [12692.922968]  [<ffffffffc1841fa2>] top_trans_start+0x702/0x940 [ptlrpc]
      [12692.930062]  [<ffffffffc1a2b173>] ? mdd_declare_create+0x5a3/0xdb0 [mdd]
      [12692.937324]  [<ffffffffc199a3f1>] lod_trans_start+0x31/0x40 [lod]
      [12692.943964]  [<ffffffffc1a4980a>] mdd_trans_start+0x1a/0x20 [mdd]
      [12692.950591]  [<ffffffffc1a2f507>] mdd_create+0xb77/0x13a0 [mdd]
      [12692.957049]  [<ffffffffc11146c8>] mdt_reint_open+0x2218/0x3270 [mdt]
      [12692.963948]  [<ffffffffc15f4241>] ? upcall_cache_get_entry+0x211/0x8d0 [obdclass]
      [12692.971950]  [<ffffffffc1108883>] mdt_reint_rec+0x83/0x210 [mdt]
      [12692.978467]  [<ffffffffc10e81ab>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      [12692.985486]  [<ffffffffc10f4737>] mdt_intent_reint+0x157/0x420 [mdt]
      [12692.992317]  [<ffffffffc10eb315>] mdt_intent_opc+0x455/0xae0 [mdt]
      [12692.999000]  [<ffffffffc17ca710>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
      [12693.007136]  [<ffffffffc10f2f63>] mdt_intent_policy+0x1a3/0x360 [mdt]
      [12693.014050]  [<ffffffffc177a235>] ldlm_lock_enqueue+0x385/0x8f0 [ptlrpc]
      [12693.021223]  [<ffffffffc17a2913>] ldlm_handle_enqueue0+0x8f3/0x13e0 [ptlrpc]
      [12693.028726]  [<ffffffffc17ca790>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      [12693.036662]  [<ffffffffc1828bf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [12693.043205]  [<ffffffffc182f05a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
      [12693.050507]  [<ffffffffc17d4843>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
      [12693.058580]  [<ffffffffc17d16f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [12693.065766]  [<ffffffff844d6802>] ? default_wake_function+0x12/0x20
      [12693.072431]  [<ffffffff844cbadb>] ? __wake_up_common+0x5b/0x90
      [12693.078674]  [<ffffffffc17d7ff2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [12693.085351]  [<ffffffffc17d7560>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
      [12693.093130]  [<ffffffff844c1c71>] kthread+0xd1/0xe0
      [12693.098403]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      [12693.104890]  [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
      [12693.111718]  [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
      

      I wasn't sure if this is related to the same issue we were seeing with 2.12.2 and ZFS 0.8.1. The dnode stats did not look like they were getting backed up on lock retries, though.
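      For reference, the counters mentioned above (per-txg state and the dnode stats) can be read from the kstat files under /proc/spl. A minimal sketch, assuming a ZFS-on-Linux 0.7.x MDS and a placeholder pool name (run as root; exact kstat names are an assumption for illustration):

```shell
# Placeholder pool name; substitute the actual MDT pool.
POOL=mdt0pool

# Per-txg history: shows whether a txg is stuck in the quiescing state,
# which is what the hung txg_quiesce thread above suggests.
cat /proc/spl/kstat/zfs/$POOL/txgs

# Dnode stats, including the lock-retry counters referred to above.
cat /proc/spl/kstat/zfs/dnodestats
```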

      Attachments

      Issue Links

      Activity

            pjones Peter Jones added a comment -

            ok thanks!

            ezell Matt Ezell added a comment -

            Yes, I think it's fine to close this one.


            simmonsja James A Simmons added a comment -

            We have since moved to ZFS 0.8.1, which has fewer problems. Matt, are you okay with closing this ticket?
            pjones Peter Jones added a comment -

            @James is it too early to tell whether the ZFS patch helped?


            simmonsja James A Simmons added a comment -

            I also patched our ZFS 0.7.13 with the patch referenced in issue 8426.
            ezell Matt Ezell added a comment -

            Is this something that https://github.com/zfsonlinux/zfs/issues/8426 might help with?


            curtispb Philip B Curtis added a comment -

            We hit this issue again just now. I am attaching new data that is hopefully more useful than what I previously attached.

            f2-mds2_lustre_unhealthy_20190715.tgz

            bzzz Alex Zhuravlev added a comment -

            It would be great to see all backtraces for this case. It looks like one (or a few) OSTs were down? Also, the trace in the description can't be found in the logs you attached; they look like different cases.
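            For completeness, one common way to capture backtraces for every task on the affected MDS is the sysrq task dump. A sketch, assuming root access and that sysrq is permitted (this is a generic procedure, not necessarily the one used on this system):

```shell
# Enable all sysrq functions (may already be enabled via kernel.sysrq).
echo 1 > /proc/sys/kernel/sysrq

# 't' dumps the stack trace of every task to the kernel ring buffer.
echo t > /proc/sysrq-trigger

# Save the ring buffer, which now contains all task backtraces.
dmesg > /tmp/mds-all-backtraces.txt
```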
            pjones Peter Jones added a comment -

            Alex

            Any advice here?

            Peter


            curtispb Philip B Curtis added a comment -

            Attaching the dmesg output from the host before I ran a crashdump, as well as a copy of the ZFS kstats.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: curtispb Philip B Curtis
              Votes: 0
              Watchers: 7
