[LU-12544] mds marked unhealthy after txg_quiesce thread hanging Created: 14/Jul/19 Updated: 22/Aug/19 Resolved: 22/Aug/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Philip B Curtis | Assignee: | Alex Zhuravlev |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | ORNL |
| Environment: |
RHEL7.6, Lustre 2.11.0, ZFS 0.7.13 |
| Attachments: | |
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While waiting for a resolution to […], the MDS was marked unhealthy after the following hung-task warnings:

[12692.736688] INFO: task txg_quiesce:37482 blocked for more than 120 seconds.
[12692.744369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12692.752901] txg_quiesce     D ffff8afcb464c100     0 37482      2 0x00000000
[12692.760699] Call Trace:
[12692.763852] [<ffffffff84b67c49>] schedule+0x29/0x70
[12692.769527] [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
[12692.776399] [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
[12692.782916] [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
[12692.789191] [<ffffffffc129fc6b>] txg_quiesce_thread+0x2fb/0x410 [zfs]
[12692.796401] [<ffffffffc129f970>] ? txg_init+0x2b0/0x2b0 [zfs]
[12692.802893] [<ffffffffc0a0b063>] thread_generic_wrapper+0x73/0x80 [spl]
[12692.810243] [<ffffffffc0a0aff0>] ? __thread_exit+0x20/0x20 [spl]
[12692.816976] [<ffffffff844c1c71>] kthread+0xd1/0xe0
[12692.822482] [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
[12692.829195] [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
[12692.836264] [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
[12692.842995] INFO: task mdt01_001:38593 blocked for more than 120 seconds.
[12692.850408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12692.858865] mdt01_001       D ffff8afc5f579040     0 38593      2 0x00000000
[12692.866600] Call Trace:
[12692.869719] [<ffffffffc19d4a19>] ? lod_sub_declare_xattr_set+0xf9/0x300 [lod]
[12692.877592] [<ffffffff84b67c49>] schedule+0x29/0x70
[12692.883203] [<ffffffffc0a102d5>] cv_wait_common+0x125/0x150 [spl]
[12692.890015] [<ffffffff844c2d40>] ? wake_up_atomic_t+0x30/0x30
[12692.896475] [<ffffffffc0a10315>] __cv_wait+0x15/0x20 [spl]
[12692.902677] [<ffffffffc1254c33>] dmu_tx_wait+0x213/0x3c0 [zfs]
[12692.909203] [<ffffffffc1254e72>] dmu_tx_assign+0x92/0x490 [zfs]
[12692.915805] [<ffffffffc0d35f57>] osd_trans_start+0xa7/0x3c0 [osd_zfs]
[12692.922968] [<ffffffffc1841fa2>] top_trans_start+0x702/0x940 [ptlrpc]
[12692.930062] [<ffffffffc1a2b173>] ? mdd_declare_create+0x5a3/0xdb0 [mdd]
[12692.937324] [<ffffffffc199a3f1>] lod_trans_start+0x31/0x40 [lod]
[12692.943964] [<ffffffffc1a4980a>] mdd_trans_start+0x1a/0x20 [mdd]
[12692.950591] [<ffffffffc1a2f507>] mdd_create+0xb77/0x13a0 [mdd]
[12692.957049] [<ffffffffc11146c8>] mdt_reint_open+0x2218/0x3270 [mdt]
[12692.963948] [<ffffffffc15f4241>] ? upcall_cache_get_entry+0x211/0x8d0 [obdclass]
[12692.971950] [<ffffffffc1108883>] mdt_reint_rec+0x83/0x210 [mdt]
[12692.978467] [<ffffffffc10e81ab>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[12692.985486] [<ffffffffc10f4737>] mdt_intent_reint+0x157/0x420 [mdt]
[12692.992317] [<ffffffffc10eb315>] mdt_intent_opc+0x455/0xae0 [mdt]
[12692.999000] [<ffffffffc17ca710>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
[12693.007136] [<ffffffffc10f2f63>] mdt_intent_policy+0x1a3/0x360 [mdt]
[12693.014050] [<ffffffffc177a235>] ldlm_lock_enqueue+0x385/0x8f0 [ptlrpc]
[12693.021223] [<ffffffffc17a2913>] ldlm_handle_enqueue0+0x8f3/0x13e0 [ptlrpc]
[12693.028726] [<ffffffffc17ca790>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[12693.036662] [<ffffffffc1828bf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[12693.043205] [<ffffffffc182f05a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[12693.050507] [<ffffffffc17d4843>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[12693.058580] [<ffffffffc17d16f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[12693.065766] [<ffffffff844d6802>] ? default_wake_function+0x12/0x20
[12693.072431] [<ffffffff844cbadb>] ? __wake_up_common+0x5b/0x90
[12693.078674] [<ffffffffc17d7ff2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[12693.085351] [<ffffffffc17d7560>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[12693.093130] [<ffffffff844c1c71>] kthread+0xd1/0xe0
[12693.098403] [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40
[12693.104890] [<ffffffff84b74c1d>] ret_from_fork_nospec_begin+0x7/0x21
[12693.111718] [<ffffffff844c1ba0>] ? insert_kthread_work+0x40/0x40

I wasn't sure whether this is related to the same issue we were seeing with 2.12.2 and ZFS 0.8.1. The dnodestats did not look like they were backing up on lock retries, though. |
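The dnode lock-retry counters the reporter checked are exposed through the ZFS `dnodestats` kstat. A minimal sketch of pulling them (the `/proc` path is as shipped in ZFS 0.7.x; the `dnode_retries` helper name is ours, and counter names can vary between ZFS versions):

```shell
# Sketch, assuming the ZFS 0.7.x kstat layout: list the dnode-hold
# retry counters whose growth would indicate dnode lock contention.
# An explicit file argument lets a saved kstat snapshot be inspected
# instead of the live one.
dnode_retries() {
    # kstat output is two header lines followed by "name  type  data"
    # rows; keep only the retry counters.
    awk 'NR > 2 && /retry/ { printf "%s %s\n", $1, $3 }' \
        "${1:-/proc/spl/kstat/zfs/dnodestats}"
}
```

Running `dnode_retries` twice on the MDS (as root) a few minutes apart shows whether counters such as `dnode_hold_alloc_lock_retry` are actually climbing, which is the check referred to above.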
| Comments |
| Comment by Philip B Curtis [ 14/Jul/19 ] |
|
Attaching the dmesg output from the host before I ran a crashdump as well as a copy of the zfs kstats. |
| Comment by Peter Jones [ 14/Jul/19 ] |
|
Alex, any advice here? Peter |
| Comment by Alex Zhuravlev [ 15/Jul/19 ] |
|
It would be great to see all backtraces for the case. It looks like one (or a few) OSTs were down? |
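Collecting the backtraces Alex asks for can be done without a crashdump. A sketch (the `all_backtraces` helper name and output path are ours; `/proc/<pid>/stack` needs root and a kernel built with `CONFIG_STACKTRACE`, and `echo t > /proc/sysrq-trigger` is the heavier alternative that dumps every task to dmesg):

```shell
# Sketch: write one "== pid (comm) ==" header plus the kernel stack
# for every task into a report file.
all_backtraces() {
    out=${1:-/tmp/backtraces.txt}
    : > "$out"
    for stack in /proc/[0-9]*/stack; do
        [ -e "$stack" ] || continue
        pid=${stack#/proc/}; pid=${pid%/stack}
        comm=$(cat "/proc/$pid/comm" 2>/dev/null) || continue
        printf '== %s (%s) ==\n' "$pid" "$comm" >> "$out"
        # Unreadable stacks (non-root) are skipped; headers still land.
        cat "$stack" >> "$out" 2>/dev/null || true
    done
    echo "$out"
}
```

Attaching the resulting file would show what every mdt and ZFS thread was blocked on, not just the two tasks the hung-task watchdog happened to flag.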
| Comment by Philip B Curtis [ 15/Jul/19 ] |
|
We hit this issue again, just now. I am attaching the new data which is hopefully more useful than what I previously attached. |
| Comment by Matt Ezell [ 15/Jul/19 ] |
|
Is this something that https://github.com/zfsonlinux/zfs/issues/8426 might help with? |
| Comment by James A Simmons [ 15/Jul/19 ] |
|
I patched our 0.7.13 ZFS version with the patch that 8426 referenced, too. |
| Comment by Peter Jones [ 21/Aug/19 ] |
|
@James is it too early to tell whether the ZFS patch helped? |
| Comment by James A Simmons [ 21/Aug/19 ] |
|
We have since moved to ZFS 0.8.1, which has fewer problems. Matt, are you okay with closing this ticket? |
| Comment by Matt Ezell [ 22/Aug/19 ] |
|
Yes, I think it's fine to close this one. |
| Comment by Peter Jones [ 22/Aug/19 ] |
|
ok thanks! |