[LU-12510] mds server hangs cv_wait_common Created: 05/Jul/19 Updated: 08/May/21 Resolved: 24/Jul/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.2 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Dustin Leverman | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LLNL, ORNL |
| Environment: | RHEL7.6, lustre-2.12.2, ZFS-0.8.1-1 |
| Severity: | 2 |
| Epic: | server |
| Description |
|
We are hitting an issue after the upgrade to lustre-2.12.2 last weekend where the MDS servers will start to hang, and the lustre_health check will report NOT HEALTHY because of long-waiting requests. We think this is related to LU-10250, but wanted to create a separate ticket to track this and push the priority up. It is taking an MDS down about once a day, causing production outages.

2019-07-05T02:16:36.434782-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection to f2-OST0035 (at 10.10.33.50@o2ib2) was lost; in progress operations using this service will wait for recovery to complete
2019-07-05T02:17:26.605024-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection restored to 10.10.33.50@o2ib2 (at 10.10.33.50@o2ib2)
2019-07-05T02:28:56.360456-04:00 f2-mds2.ncrc.gov kernel: INFO: task txg_quiesce:40218 blocked for more than 120 seconds.
2019-07-05T02:28:56.360500-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.371558-04:00 f2-mds2.ncrc.gov kernel: txg_quiesce D ffff99aec932a080 0 40218 2 0x00000000
2019-07-05T02:28:56.371580-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.371599-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70
2019-07-05T02:28:56.384261-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.384280-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.397218-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.397238-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc863b>] txg_quiesce_thread+0x2cb/0x3c0 [zfs]
2019-07-05T02:28:56.411080-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc8370>] ? txg_init+0x2b0/0x2b0 [zfs]
2019-07-05T02:28:56.411101-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb93>] thread_generic_wrapper+0x73/0x80 [spl]
2019-07-05T02:28:56.425331-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb20>] ? __thread_exit+0x20/0x20 [spl]
2019-07-05T02:28:56.425350-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1c71>] kthread+0xd1/0xe0
2019-07-05T02:28:56.430937-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.437770-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83974c1d>] ret_from_fork_nospec_begin+0x7/0x21
2019-07-05T02:28:56.444951-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.459337-04:00 f2-mds2.ncrc.gov kernel: INFO: task mdt04_000:42947 blocked for more than 120 seconds.
2019-07-05T02:28:56.459357-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.478994-04:00 f2-mds2.ncrc.gov kernel: mdt04_000 D ffff99aecc7ae180 0 42947 2 0x00000000
2019-07-05T02:28:56.479015-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.479037-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70
2019-07-05T02:28:56.491665-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.491685-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.504560-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.504580-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b675fb>] dmu_tx_wait+0x20b/0x3b0 [zfs]
2019-07-05T02:28:56.517986-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b67831>] dmu_tx_assign+0x91/0x490 [zfs]
2019-07-05T02:28:56.518007-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc176d049>] osd_trans_start+0x199/0x440 [osd_zfs]
2019-07-05T02:28:56.532397-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18ce837>] mdt_empty_transno+0xf7/0x850 [mdt]
2019-07-05T02:28:56.532416-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d1eee>] mdt_mfd_open+0x8de/0xe70 [mdt]
2019-07-05T02:28:56.546398-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18a4ba2>] ? mdt_pack_acl2body+0x1c2/0x9f0 [mdt]
2019-07-05T02:28:56.546419-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d2acb>] mdt_finish_open+0x64b/0x760 [mdt] |
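For reference, the "NOT HEALTHY" state mentioned above is the server's own health report; a minimal sketch of checking it on an MDS (this is the standard Lustre health_check parameter, nothing site-specific is assumed):

# Prints "healthy", or "NOT HEALTHY" when a device is reported unhealthy
lctl get_param -n health_check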
| Comments |
| Comment by Dustin Leverman [ 05/Jul/19 ] |
|
We are working on fixing an issue with kdump, and when this crashes again tonight, we plan on capturing a kdump.
Thanks, Dustin |
| Comment by James A Simmons [ 05/Jul/19 ] |
|
Opened https://github.com/zfsonlinux/zfs/issues/8994 for the ZFS guys as well. |
| Comment by Peter Jones [ 05/Jul/19 ] |
|
Alex Could you please investigate? Thanks Peter |
| Comment by Alex Zhuravlev [ 05/Jul/19 ] |
|
I'm downloading the archive; it will take some time. There must be another thread stuck somewhere else while holding the TXG open (and preventing a new TXG). It would be great to have all backtraces and/or the crashdump. |
| Comment by Dustin Leverman [ 05/Jul/19 ] |
|
Thanks Alex!
I plan on gathering the crashdump next time this happens along with some perf top tracing of the txg_quiesce process.
-Dustin |
| Comment by Andreas Dilger [ 05/Jul/19 ] |
|
Dustin, grabbing the output of "echo t > /proc/sysrq-trigger" would be useful when the threads are stuck, before you trigger the crashdump, since that will be easier/possible to attach to the ticket and may provide some information about why they are stuck. |
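A minimal sketch of that capture on the affected MDS, saving the result from the kernel ring buffer (the output file path is a placeholder, not from the ticket):

# Ask the kernel to dump every task's stack into the ring buffer
echo t > /proc/sysrq-trigger
# Save the ring buffer so the stacks can be attached to the ticket
dmesg -T > /tmp/sysrq-task-stacks.$(date +%Y%m%d-%H%M%S).log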
| Comment by Dustin Leverman [ 05/Jul/19 ] |
|
I'll make sure to grab that as well. Thanks! |
| Comment by Peter Jones [ 05/Jul/19 ] |
|
BTW do you have any patches applied to 2.12.2 or are you just running the vanilla release? |
| Comment by James A Simmons [ 07/Jul/19 ] |
|
We are running vanilla 2.12 at git commit v2_12_2-30-g989217d, with the ZFS 0.8.1 release. |
| Comment by Dustin Leverman [ 08/Jul/19 ] |
|
We had another crash last night, and I gathered some debug data: kernel traces from sysrq, kernel logs, perf record data, and a couple of other things. There is still an outstanding issue with kdump, so I wasn't able to get the crashdump. Please let us know if there are additional things you would like for us to gather during the next crash.
Thanks, Dustin |
| Comment by Philip B Curtis [ 09/Jul/19 ] |
|
We encountered this issue again on the same MDS. We were able to resolve it, with a crashdump captured in between, and I have gathered the other recommended debug info.
Thanks, Philip |
| Comment by Alex Zhuravlev [ 09/Jul/19 ] |
[ 3857.825288] [<ffffffffc12937a3>] dnode_hold_impl+0x873/0xcb0 [zfs]
[ 3857.832059] [<ffffffffc127d256>] dmu_object_alloc_impl+0x216/0x400 [zfs]
[ 3857.839339] [<ffffffffc127d4f9>] dmu_object_alloc_dnsize+0x39/0x40 [zfs]
[ 3857.846591] [<ffffffffc18b9662>] __osd_object_create+0x82/0x170 [osd_zfs]
[ 3857.853934] [<ffffffffc18b99cd>] osd_mkreg+0x7d/0x210 [osd_zfs]
[ 3857.860402] [<ffffffffc18b5cd6>] osd_create+0x346/0xb20 [osd_zfs]
[ 3857.867040] [<ffffffffc1acd7e5>] lod_sub_create+0x1f5/0x480 [lod]
[ 3857.873688] [<ffffffffc1abe049>] lod_create+0x69/0x340 [lod]
[ 3857.879895] [<ffffffffc1b37ef8>] mdd_create_object_internal+0xb8/0x280 [mdd]
[ 3857.887486] [<ffffffffc1b217e5>] mdd_create_object+0x75/0x820 [mdd]
[ 3857.894292] [<ffffffffc1b2bd90>] mdd_create+0xe00/0x14b0 [mdd]
[ 3857.900663] [<ffffffffc19faf60>] mdt_reint_open+0x19d0/0x27d0 [mdt]
[ 3857.907475] [<ffffffffc19edfa3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 3857.913933] [<ffffffffc19cc1b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[ 3857.920902] [<ffffffffc19d8972>] mdt_intent_open+0x82/0x3a0 [mdt]
[ 3857.927524] [<ffffffffc19d6a18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[ 3857.934406] [<ffffffffc167dd26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[ 3857.941582] [<ffffffffc16a6587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[ 3857.949097] [<ffffffffc172e882>] tgt_enqueue+0x62/0x210 [ptlrpc]

These seem to be related to https://github.com/zfsonlinux/zfs/issues/8994 |
| Comment by Dustin Leverman [ 09/Jul/19 ] |
|
Alex, James created a ticket with the ZFS developers as well to get their eyes on it in addition to yours.
-Dustin |
| Comment by James A Simmons [ 09/Jul/19 ] |
|
Yes, it's the same issue. In an earlier comment I noted that I opened that ticket on the zfsonlinux issue tracker. |
| Comment by James A Simmons [ 10/Jul/19 ] |
|
So a potential workaround is setting dnodesize=legacy. Any opinions on this? Any known issues if this is done? |
| Comment by Dustin Leverman [ 10/Jul/19 ] |
|
To add to what James is asking in the previous message, we are considering testing the recommendation from the ZFS folks (Brian) to set dnodesize=legacy to reduce lock contention. We wanted to run this by Whamcloud first since you hold our support contract.
-Dustin |
| Comment by Philip B Curtis [ 10/Jul/19 ] |
|
As we have just hit this issue again, it would be a good time to test this change, if it is safe to make, and see if it keeps the problem from happening. -Philip |
| Comment by Brian Behlendorf [ 10/Jul/19 ] |
|
> So a potential workaround is setting dnodesize=legacy. Any opinions on this? Any known issues if this is done?

From the ZFS side the consequences of changing this would be that we don't store Lustre's xattrs along with the dnode on disk for newly allocated objects. These objects (for as long as they exist) will incur an additional I/O to access the xattrs (even Lustre's internal ones). For an OST that's going to be less of a performance limitation than on an MDT. The additional tiny blocks to store the xattrs can contribute to fragmenting free space too. There may be additional considerations at the Lustre level. I suggested this as a possible stopgap which doesn't require any code change until we have a proper fix. I'm working on this now.
|
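For reference, the property change under discussion would look roughly like the following; the pool/dataset names are placeholders, not taken from this ticket. As Brian notes, the setting only applies to newly allocated objects, so existing dnodes keep the size they were created with:

# Check the current setting on the MDT dataset (names are illustrative)
zfs get dnodesize mdt0pool/mdt0
# Have new allocations use fixed 512-byte legacy dnodes
zfs set dnodesize=legacy mdt0pool/mdt0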
| Comment by Philip B Curtis [ 11/Jul/19 ] |
|
I updated the zfs issue, but I will update here as well. It looks like setting dnodesize=legacy on the MDS hosts that were hitting this issue may not be working. We just hit it again on the f2-mds4 host.
One thought I had: is dnodesize something that can be set on the fly, or is it something that would require a re-import to take effect? I did not do that last night when I set it. |
| Comment by Brian Behlendorf [ 11/Jul/19 ] |
|
Alex, I believe I have spotted the problem. Can you take a look at this to make sure I correctly understand the design and the analysis makes sense?

As I understand it, the Lustre osd-zfs code always acquires the zrl lock for a dnode handle when it's created, via osd_create(), or when an existing object needs to be initialized, via osd_object_init(). It then doesn't release the zrl lock until osd_object_delete() is called to release the resources, which means the zrl lock in the dnode handle can, and likely will, be held for a long time. Normally, this doesn't cause a problem since DB_DNODE_ENTER (which takes the lock) is called after you have a pointer to the dnode and Lustre won't try and hold it again. Doing so would result in exactly this deadlock if two processes concurrently tried to create or initialize the same dnode. This is the situation described by the comment in the DNODE_MUST_BE_ALLOCATED case of dnode_hold_impl(). However, that logic depends on the unwritten assumption that the dnode handle's zrl lock will only ever be held long enough to access the dnh_dnode field, as is the case in the rest of the ZFS code.

I'd suggest that the best way to resolve this issue is to update osd-zfs so that it never directly manipulates the zrl locks. The dnode_hold and dnode_rele functions are the intended interfaces to ensure that a pointer to a dnode remains valid. It looks like perhaps DB_DNODE_ENTER and DB_DNODE_EXIT were used originally since the dnode_ functions were not exported. In fact, dnode_rele() still doesn't appear to be exported, but it definitely can be, along with any other symbols we need. Putting that change together will take some time; if a quicker fix is needed we could perhaps consider using the existing obj->oo_guard to prevent this. Though, I'd expect that concurrent allocations of new objects would still be vulnerable.

Additionally, I did find one code path where osd_dnode_rele() is not called for an error condition in osd_mkreg(), in which case the zrl lock won't be released. That would also explain the symptoms we're seeing and should be fixed, but I didn't find the matching "can't change blocksize:" CERROR in the console log, so I'm pretty sure we didn't hit this. |
| Comment by Alex Zhuravlev [ 12/Jul/19 ] |
|
thanks, Brian. working on the patch... |
| Comment by Brian Behlendorf [ 12/Jul/19 ] |
|
Alex I've opened PR https://github.com/zfsonlinux/zfs/pull/9027 which exports the dnode symbols you'll need, plus a few additional ones I thought you might find useful. Let me know if there are additional symbols I should add to that PR. A previous commit has already exported dmu_object_alloc_hold() and zap_create_hold() which return the held dnode as part of the create. |
| Comment by James A Simmons [ 12/Jul/19 ] |
|
We are trying to reproduce this in our test bed and currently we can't seem to reproduce it easily. Can you recommend a setup that would expose this issue easily? I think this is mostly due to the small scale we are running in the test bed. |
| Comment by Alex Zhuravlev [ 16/Jul/19 ] |
|
Please try https://review.whamcloud.com/#/c/35524/ |
| Comment by Brian Behlendorf [ 16/Jul/19 ] |
|
James, I'd expect that a large parallel create and unlink workload on an aged mdt would have the best chance of hitting this. |
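A minimal sketch of such a workload from a Lustre client mount; the path, stream count, and file counts are illustrative only (mdtest or a similar metadata benchmark would serve the same purpose):

# Run 16 parallel create/unlink streams to stress MDT object allocation
for i in $(seq 1 16); do
  (
    dir=/mnt/lustre/md-stress.$i
    mkdir -p "$dir"
    for round in $(seq 1 100); do
      for f in $(seq 1 10000); do : > "$dir/file.$f"; done
      rm -f "$dir"/file.*
    done
  ) &
done
wait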
| Comment by James A Simmons [ 19/Jul/19 ] |
|
So far patch 35524 is holding up well with our production file system. The current patch is a workaround until a proper solution is done, so don't close this ticket once the patch lands. |
| Comment by Patrick Farrell (Inactive) [ 22/Jul/19 ] |
|
bzzz, I'm wondering if you think the failure here is related. It is due to the client being evicted from OST0, which happens because OST0 is stuck like this:

[ 9758.094925] Lustre: ll_ost00_016: service thread pid 2789 was inactive for 40.093 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 9767.054545] Lustre: ll_ost00_073: service thread pid 3361 was inactive for 40.069 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 9767.054551] Pid: 3233, comm: ll_ost00_055 3.10.0-957.21.3.el7_lustre.x86_64 #1 SMP Tue Jul 16 17:30:03 UTC 2019
[ 9767.054551] Call Trace:
[ 9767.054645] [<ffffffffc02ee2d5>] cv_wait_common+0x125/0x150 [spl]
[ 9767.054650] [<ffffffffc02ee315>] __cv_wait+0x15/0x20 [spl]
[ 9767.054685] [<ffffffffc04542bf>] txg_wait_synced+0xef/0x140 [zfs]
[ 9767.054736] [<ffffffffc112c9ab>] osd_trans_stop+0x53b/0x5e0 [osd_zfs]
[ 9767.054771] [<ffffffffc1270cd5>] ofd_trans_stop+0x25/0x60 [ofd]
[ 9767.054776] [<ffffffffc12752c5>] ofd_destroy+0x2c5/0x960 [ofd]
[ 9767.054780] [<ffffffffc126d534>] ofd_destroy_by_fid+0x1f4/0x4a0 [ofd]
[ 9767.054784] [<ffffffffc1262057>] ofd_destroy_hdl+0x267/0x970 [ofd]
[ 9767.055141] [<ffffffffc0f5e51a>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
[ 9767.055179] [<ffffffffc0f03646>] ptlrpc_server_handle_request+0x256/0xb00 [ptlrpc]
[ 9767.055212] [<ffffffffc0f0717c>] ptlrpc_main+0xbac/0x1560 [ptlrpc]
[ 9767.055252] [<ffffffffa9cc1da1>] kthread+0xd1/0xe0
[ 9767.055280] [<ffffffffaa375c37>] ret_from_fork_nospec_end+0x0/0x39
[ 9767.055288] [<ffffffffffffffff>] 0xffffffffffffffff

We've seen similar failures in a few tests recently, where the force_sync command apparently takes forever. I'm guessing they have the same detailed signature as this. |
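For context, the force_sync mentioned above is the OSD tunable the Lustre test framework uses to push outstanding transactions to disk; a sketch of timing it, where the parameter glob is an assumption based on the test scripts rather than something quoted in this ticket:

# Force the OSD(s) to sync pending transactions and see how long it takes
time lctl set_param -n 'osd*.*.force_sync=1'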
| Comment by Alex Zhuravlev [ 23/Jul/19 ] |
|
pfarrell, hard to say, but I'd say it's not very likely. A few of the reported processes weren't running, but got stuck waiting on IO:

[16250.115937] Pid: 2829, comm: ll_ost00_029 3.10.0-957.21.3.el7_lustre.x86_64 #1 SMP Tue Jul 16 17:30:03 UTC 2019
[16250.117636] Call Trace:
[16250.118155] [<ffffffffc02ee262>] cv_wait_common+0xb2/0x150 [spl]
[16250.119447] [<ffffffffc02ee338>] __cv_wait_io+0x18/0x20 [spl]
[16250.120491] [<ffffffffc04aca3b>] zio_wait+0x11b/0x1c0 [zfs]
[16250.121621] [<ffffffffc03ebfce>] dbuf_read+0x67e/0x9e0 [zfs]
[16250.122689] [<ffffffffc03f8060>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs]
[16250.123910] [<ffffffffc046b6c2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs]
[16250.125193] [<ffffffffc046b978>] zap_deref_leaf+0xe8/0x100 [zfs]
[16250.126336] [<ffffffffc046c9b2>] fzap_lookup+0x62/0x180 [zfs]
[16250.127455] [<ffffffffc04712b1>] zap_lookup_impl+0xd1/0x200 [zfs]
[16250.128629] [<ffffffffc0471c2e>] zap_lookup_norm+0x8e/0xd0 [zfs]
[16250.129773] [<ffffffffc0471ca3>] zap_lookup+0x33/0x40 [zfs]
[16250.130823] [<ffffffffc113d4c1>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
[16250.132040] [<ffffffffc1135669>] osd_object_init+0x389/0xca0 [osd_zfs]
[16250.133221] [<ffffffffc0c17b25>] lu_object_alloc+0xe5/0x320 [obdclass] |
| Comment by Gerrit Updater [ 24/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35524/ |
| Comment by Gerrit Updater [ 24/Jul/19 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35600 |
| Comment by Peter Jones [ 24/Jul/19 ] |
|
Landed for 2.13 |
| Comment by James A Simmons [ 25/Jul/19 ] |
|
The above patch is only a workaround. A more complete solution is needed with ZFS 0.8.2. We should either open another ticket or reopen this one. |
| Comment by Peter Jones [ 25/Jul/19 ] |
|
James Do you mean that further Lustre changes will be needed to adjust to the ZFS changes in 0.8.2? Peter |
| Comment by Alex Zhuravlev [ 25/Jul/19 ] |
|
I'll be working on a patch re-implementing this using the new DMU API, but this can wait a bit, IMO. |
| Comment by Peter Jones [ 25/Jul/19 ] |
|
In that case, let's create another JIRA ticket to track that work, leave this one as resolved, and get the fix onto the LTS branch too (I think this is already in flight). |
| Comment by Gerrit Updater [ 30/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35600/ |
| Comment by Gerrit Updater [ 06/Aug/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35707 |