[LU-12510] mds server hangs cv_wait_common Created: 05/Jul/19  Updated: 08/May/21  Resolved: 24/Jul/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.2
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Major
Reporter: Dustin Leverman Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: LLNL, ORNL
Environment:

RHEL7.6, lustre-2.12.2, ZFS-0.8.1-1


Attachments: File LU12510.tar.gz     File f2-mds4_20190709.tgz     File f2-mds4_lustre_unhealthy_20190707.tar.gz    
Issue Links:
Duplicate
is duplicated by LU-12544 mds marked unhealthy after txg_quiesc... Resolved
Related
is related to LU-10250 replay-single test_74: hang and time... Open
Severity: 2
Epic: server
Rank (Obsolete): 9223372036854775807

 Description   

We are hitting an issue after the upgrade to lustre-2.12.2 last weekend where the MDS servers will start to hang, and the lustre_health check will report NOT HEALTHY because of long waiting requests. We think this is related to LU-10250, but wanted to create a separate ticket to track this and push the priority up. It is taking an MDS down about once a day, causing production outages.

2019-07-05T02:16:36.434782-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection to f2-OST0035 (at 10.10.33.50@o2ib2) was lost; in progress operations using this service will wait for recovery to complete
2019-07-05T02:17:26.605024-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection restored to 10.10.33.50@o2ib2 (at 10.10.33.50@o2ib2)
2019-07-05T02:28:56.360456-04:00 f2-mds2.ncrc.gov kernel: INFO: task txg_quiesce:40218 blocked for more than 120 seconds.
2019-07-05T02:28:56.360500-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.371558-04:00 f2-mds2.ncrc.gov kernel: txg_quiesce     D ffff99aec932a080     0 40218      2 0x00000000
2019-07-05T02:28:56.371580-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.371599-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70
2019-07-05T02:28:56.384261-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.384280-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.397218-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.397238-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc863b>] txg_quiesce_thread+0x2cb/0x3c0 [zfs]
2019-07-05T02:28:56.411080-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0bc8370>] ? txg_init+0x2b0/0x2b0 [zfs]
2019-07-05T02:28:56.411101-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb93>] thread_generic_wrapper+0x73/0x80 [spl]
2019-07-05T02:28:56.425331-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc099cb20>] ? __thread_exit+0x20/0x20 [spl]
2019-07-05T02:28:56.425350-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1c71>] kthread+0xd1/0xe0
2019-07-05T02:28:56.430937-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.437770-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83974c1d>] ret_from_fork_nospec_begin+0x7/0x21
2019-07-05T02:28:56.444951-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c1ba0>] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.459337-04:00 f2-mds2.ncrc.gov kernel: INFO: task mdt04_000:42947 blocked for more than 120 seconds.
2019-07-05T02:28:56.459357-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.478994-04:00 f2-mds2.ncrc.gov kernel: mdt04_000       D ffff99aecc7ae180     0 42947      2 0x00000000
2019-07-05T02:28:56.479015-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.479037-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff83967c49>] schedule+0x29/0x70
2019-07-05T02:28:56.491665-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995325>] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.491685-04:00 f2-mds2.ncrc.gov kernel: [<ffffffff832c2d40>] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.504560-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0995365>] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.504580-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b675fb>] dmu_tx_wait+0x20b/0x3b0 [zfs]
2019-07-05T02:28:56.517986-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc0b67831>] dmu_tx_assign+0x91/0x490 [zfs]
2019-07-05T02:28:56.518007-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc176d049>] osd_trans_start+0x199/0x440 [osd_zfs]
2019-07-05T02:28:56.532397-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18ce837>] mdt_empty_transno+0xf7/0x850 [mdt]
2019-07-05T02:28:56.532416-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d1eee>] mdt_mfd_open+0x8de/0xe70 [mdt]
2019-07-05T02:28:56.546398-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18a4ba2>] ? mdt_pack_acl2body+0x1c2/0x9f0 [mdt]
2019-07-05T02:28:56.546419-04:00 f2-mds2.ncrc.gov kernel: [<ffffffffc18d2acb>] mdt_finish_open+0x64b/0x760 [mdt]



 Comments   
Comment by Dustin Leverman [ 05/Jul/19 ]

We are working on fixing an issue with kdump, and when this crashes again tonight, we plan on capturing a kdump.

 

Thanks,

Dustin

Comment by James A Simmons [ 05/Jul/19 ]

Opened https://github.com/zfsonlinux/zfs/issues/8994 for the ZFS guys as well.

Comment by Peter Jones [ 05/Jul/19 ]

Alex

Could you please investigate?

Thanks

Peter

Comment by Alex Zhuravlev [ 05/Jul/19 ]

I'm downloading the archive; that will take some time. There must be another thread stuck somewhere else while holding a TXG open (and preventing a new TXG). It would be great to have all backtraces and/or the crashdump.
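
For context, the dependency described here can be sketched roughly as follows. This is a heavily condensed illustration of the ZFS txg pipeline, not the actual implementation; the names follow txg_impl.h but the body is simplified:

/*
 * Illustrative sketch only -- condensed from the ZFS txg design, not the
 * real code.  dmu_tx_assign() takes a per-CPU hold (tc_count) on the
 * currently open txg and dmu_tx_commit() drops it via txg_rele_to_sync().
 * The quiesce thread cannot advance the open txg until every hold is gone,
 * and new dmu_tx_assign() callers then block in dmu_tx_wait() waiting for
 * the next open txg -- which matches the two stacks in the description.
 */
static void
txg_quiesce_sketch(tx_state_t *tx, uint64_t txg)
{
	int c, g = txg & TXG_MASK;

	for (c = 0; c < max_ncpus; c++) {
		tx_cpu_t *tc = &tx->tx_cpu[c];

		mutex_enter(&tc->tc_lock);
		while (tc->tc_count[g] != 0)	/* holds from dmu_tx_assign() */
			cv_wait(&tc->tc_cv[g], &tc->tc_lock);	/* <-- stuck here */
		mutex_exit(&tc->tc_lock);
	}
}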

Comment by Dustin Leverman [ 05/Jul/19 ]

Thanks Alex!

 

I plan on gathering the crashdump next time this happens along with some perf top tracing of the txg_quiesce process.

 

-Dustin

Comment by Andreas Dilger [ 05/Jul/19 ]

Dustin, grabbing the output of "echo t > /proc/sysrq-trigger" would be useful when the threads are stuck, before you trigger the crashdump, since that will be easier/possible to attach to the ticket and may provide some information about why they are stuck.

Comment by Dustin Leverman [ 05/Jul/19 ]

I'll make sure to grab that as well. Thanks!

Comment by Peter Jones [ 05/Jul/19 ]

BTW do you have any patches applied to 2.12.2 or are you just running the vanilla release?

Comment by James A Simmons [ 07/Jul/19 ]

We are running vanilla 2.12 at git commit v2_12_2-30-g989217d, with the ZFS 0.8.1 release.

Comment by Dustin Leverman [ 08/Jul/19 ]

We had another crash last night, and I gathered some debug data: kernel traces from sysrq, kernel logs, perf record data, and a couple of other things. There is still an outstanding issue with kdump, so I wasn't able to get the crashdump.

Please let us know if there are additional things you would like for us to gather during the next crash.

 

Thanks,

Dustin

f2-mds4_lustre_unhealthy_20190707.tar.gz

Comment by Philip B Curtis [ 09/Jul/19 ]

We encountered this issue again on the same MDS. We were able to resolve the issue with the crashdump in between, and I have gathered some other recommended debug info.

 

Thanks,

Philip

f2-mds4_20190709.tgz

Comment by Alex Zhuravlev [ 09/Jul/19 ]
[ 3857.825288]  [<ffffffffc12937a3>] dnode_hold_impl+0x873/0xcb0 [zfs]
[ 3857.832059]  [<ffffffffc127d256>] dmu_object_alloc_impl+0x216/0x400 [zfs]
[ 3857.839339]  [<ffffffffc127d4f9>] dmu_object_alloc_dnsize+0x39/0x40 [zfs]
[ 3857.846591]  [<ffffffffc18b9662>] __osd_object_create+0x82/0x170 [osd_zfs]
[ 3857.853934]  [<ffffffffc18b99cd>] osd_mkreg+0x7d/0x210 [osd_zfs]
[ 3857.860402]  [<ffffffffc18b5cd6>] osd_create+0x346/0xb20 [osd_zfs]
[ 3857.867040]  [<ffffffffc1acd7e5>] lod_sub_create+0x1f5/0x480 [lod]
[ 3857.873688]  [<ffffffffc1abe049>] lod_create+0x69/0x340 [lod]
[ 3857.879895]  [<ffffffffc1b37ef8>] mdd_create_object_internal+0xb8/0x280 [mdd]
[ 3857.887486]  [<ffffffffc1b217e5>] mdd_create_object+0x75/0x820 [mdd]
[ 3857.894292]  [<ffffffffc1b2bd90>] mdd_create+0xe00/0x14b0 [mdd]
[ 3857.900663]  [<ffffffffc19faf60>] mdt_reint_open+0x19d0/0x27d0 [mdt]
[ 3857.907475]  [<ffffffffc19edfa3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 3857.913933]  [<ffffffffc19cc1b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[ 3857.920902]  [<ffffffffc19d8972>] mdt_intent_open+0x82/0x3a0 [mdt]
[ 3857.927524]  [<ffffffffc19d6a18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[ 3857.934406]  [<ffffffffc167dd26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[ 3857.941582]  [<ffffffffc16a6587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[ 3857.949097]  [<ffffffffc172e882>] tgt_enqueue+0x62/0x210 [ptlrpc]

These stacks seem to be related to https://github.com/zfsonlinux/zfs/issues/8994

Comment by Dustin Leverman [ 09/Jul/19 ]

Alex, 

James created a ticket with the ZFS developers as well, to get their eyes on it in addition to yours.

 

-Dustin

Comment by James A Simmons [ 09/Jul/19 ]

Yes, it's the same issue. I have an earlier comment that states I opened that ticket on the zfsonlinux issue tracker.

Comment by James A Simmons [ 10/Jul/19 ]

So a potential workaround is setting dnodesize=legacy. Any opinions on this? Any known issues if this is done?

Comment by Dustin Leverman [ 10/Jul/19 ]

To add to what James is asking in the previous message, we are considering testing the recommendation from the ZFS folks (Brian) to set dnodesize=legacy to reduce lock contention. We wanted to run this by Whamcloud first since you hold our support contract.

 

-Dustin

Comment by Philip B Curtis [ 10/Jul/19 ]

As we have just hit this issue again, it would be a good time to test this change, if it is safe to make, and see whether it keeps the problem from happening.

-Philip

Comment by Brian Behlendorf [ 10/Jul/19 ]

> So a potential workaround is setting dnodesize=legacy. Any opinions on this? Any known issues if this is done?

From the ZFS side the consequences of changing this would be that we don't store Lustre's xattrs along with the dnode on disk for newly allocated objects.  These objects (for as long as they exist) will incur an additional I/O to access the xattrs (even Lustre's internal ones).  For an OST that's going to be less of a performance limitation than on an MDT.  The additional tiny blocks to store the xattrs can contribute to fragmenting free space too.  There may be additional considerations at the Lustre level.

I suggested this as a possible stop-gap which doesn't require any code change until we have a proper fix.  I'm working on this now.

 

Comment by Philip B Curtis [ 11/Jul/19 ]

I updated the ZFS issue, but I will update here as well. It looks like setting dnodesize=legacy on the MDS hosts that were hitting this issue may not be working. We just hit it again on the f2-mds4 host.

 

One thought I had: is dnodesize something that can be set on the fly, or is it something that requires a pool reimport to take effect? I did not do that last night when I set it.

Comment by Brian Behlendorf [ 11/Jul/19 ]

Alex, I believe I have spotted the problem.  Can you take a look at this to make sure I correctly understand the design and that the analysis makes sense?

As I understand it, the Lustre osd-zfs code always acquires the zrl lock for a dnode handle when it's created, via osd_create(), or when an existing object needs to be initialized, via osd_object_init().  It then doesn't release the zrl lock until osd_object_delete() is called to release the resources, which means the zrl lock in the dnode handle can, and likely will, be held for a long time.

Normally, this doesn't cause a problem since DB_DNODE_ENTER (which takes the lock) is called after you have a pointer to the dnode and Lustre won't try to hold it again.  Doing so would result in exactly this deadlock if two processes concurrently tried to create or initialize the same dnode.  This is the situation described by the comment in the DNODE_MUST_BE_ALLOCATED case of dnode_hold_impl().  However, that logic depends on the unwritten assumption that the dnode handle's zrl lock will only ever be held long enough to access the dnh_dnode field, as is the case in the rest of the ZFS code.
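
A minimal sketch of that pattern, using illustrative names rather than the actual osd-zfs source:

/*
 * Illustrative only -- not the actual osd-zfs code.  DB_DNODE_ENTER()
 * does a zrl_add() on the zrlock embedded in the dnode handle; keeping it
 * "entered" for the life of the in-memory object means the handle's
 * zrlock stays held indefinitely.
 */
static void
osd_obj2dnode_sketch(struct osd_object *obj, dmu_buf_t *db)
{
	dmu_buf_impl_t *dbi = (dmu_buf_impl_t *)db;

	DB_DNODE_ENTER(dbi);		/* zrlock held from here ... */
	obj->oo_dn = DB_DNODE(dbi);	/* cached dnode pointer */
	/* ... until osd_object_delete() finally does DB_DNODE_EXIT(dbi) */
}

/*
 * Meanwhile dnode_hold_impl() (called from dmu_object_alloc_impl() in the
 * trace Alex posted above) must wait for that same handle's zrlock before
 * it can look at dnh_dnode, so it never gets to proceed.
 */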

I'd suggest that the best way to resolve this issue is to update the osd-zfs so that it never directly manipulates the zrl locks.  The dnode_hold and dnode_rele functions are the intended interfaces to ensure that a pointer to a dnode remains valid.  It looks like perhaps DB_DNODE_ENTER and DB_DNODE_EXIT were used originally since the dnode_ functions were not exported.  In fact, dnode_rele() still doesn't appear to be exported but it definitely can be along with any other symbols we need.

Putting that change together will take some time; if a quicker fix is needed we could perhaps consider using the existing obj->oo_guard to prevent this.  Though I'd expect that concurrent allocations of new objects would still be vulnerable.

Additionally, I did find one code path where osd_dnode_rele() is not called for an error condition in osd_mkreg(), in which case the zrl lock won't be released.  That would also explain the symptoms we're seeing and should be fixed, but I didn't find the matching "can't change blocksize:" CERROR in the console log, so I'm pretty sure we didn't hit this.
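
A sketch of the dnode_hold()/dnode_rele() approach suggested above (again illustrative only, with hypothetical wrapper names, and assuming the dnode_* symbols are exported to modules):

/*
 * Illustrative only.  dnode_hold()/dnode_rele() pin the dnode for as long
 * as the osd object needs it, while the handle's zrlock is only taken
 * briefly inside dnode_hold() itself, so concurrent dnode_hold_impl()
 * callers are never blocked for long.
 */
static int
osd_obj_init_sketch(struct osd_object *obj, objset_t *os, uint64_t oid)
{
	return (dnode_hold(os, oid, obj, &obj->oo_dn));	/* tag = obj */
}

static void
osd_obj_delete_sketch(struct osd_object *obj)
{
	if (obj->oo_dn != NULL) {
		dnode_rele(obj->oo_dn, obj);	/* drop the hold, incl. error paths */
		obj->oo_dn = NULL;
	}
}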

Comment by Alex Zhuravlev [ 12/Jul/19 ]

Thanks, Brian. Working on the patch...

Comment by Brian Behlendorf [ 12/Jul/19 ]

Alex, I've opened PR https://github.com/zfsonlinux/zfs/pull/9027 which exports the dnode symbols you'll need, plus a few additional ones I thought you might find useful.  Let me know if there are additional symbols I should add to that PR.  A previous commit has already exported dmu_object_alloc_hold() and zap_create_hold(), which return the held dnode as part of the create.

Comment by James A Simmons [ 12/Jul/19 ]

We are trying to reproduce this in our test bed, and currently we can't seem to reproduce it easily; I think this is mostly due to the small scale we are running at in the test bed. Can you recommend a setup that would expose this issue more readily?

Comment by Alex Zhuravlev [ 16/Jul/19 ]

Please try https://review.whamcloud.com/#/c/35524/

Comment by Brian Behlendorf [ 16/Jul/19 ]

James, I'd expect that a large parallel create and unlink workload on an aged MDT would have the best chance of hitting this.

Comment by James A Simmons [ 19/Jul/19 ]

So far patch 35524 is holding up well on our production file system. The current patch is a workaround until a proper solution is done, so don't close this ticket once the patch lands.

Comment by Patrick Farrell (Inactive) [ 22/Jul/19 ]

bzzz,

I'm wondering if you think the failure here is related to this issue:
https://testing.whamcloud.com/test_sets/612acaa2-ac97-11e9-a0be-52540065bddc

It is due to the client being evicted from OST0, which happens because OST0 is stuck like this:

[ 9758.094925] Lustre: ll_ost00_016: service thread pid 2789 was inactive for 40.093 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 9767.054545] Lustre: ll_ost00_073: service thread pid 3361 was inactive for 40.069 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 9767.054551] Pid: 3233, comm: ll_ost00_055 3.10.0-957.21.3.el7_lustre.x86_64 #1 SMP Tue Jul 16 17:30:03 UTC 2019
[ 9767.054551] Call Trace:
[ 9767.054645]  [<ffffffffc02ee2d5>] cv_wait_common+0x125/0x150 [spl]
[ 9767.054650]  [<ffffffffc02ee315>] __cv_wait+0x15/0x20 [spl]
[ 9767.054685]  [<ffffffffc04542bf>] txg_wait_synced+0xef/0x140 [zfs]
[ 9767.054736]  [<ffffffffc112c9ab>] osd_trans_stop+0x53b/0x5e0 [osd_zfs]
[ 9767.054771]  [<ffffffffc1270cd5>] ofd_trans_stop+0x25/0x60 [ofd]
[ 9767.054776]  [<ffffffffc12752c5>] ofd_destroy+0x2c5/0x960 [ofd]
[ 9767.054780]  [<ffffffffc126d534>] ofd_destroy_by_fid+0x1f4/0x4a0 [ofd]
[ 9767.054784]  [<ffffffffc1262057>] ofd_destroy_hdl+0x267/0x970 [ofd]
[ 9767.055141]  [<ffffffffc0f5e51a>] tgt_request_handle+0x91a/0x15c0 [ptlrpc]
[ 9767.055179]  [<ffffffffc0f03646>] ptlrpc_server_handle_request+0x256/0xb00 [ptlrpc]
[ 9767.055212]  [<ffffffffc0f0717c>] ptlrpc_main+0xbac/0x1560 [ptlrpc]
[ 9767.055252]  [<ffffffffa9cc1da1>] kthread+0xd1/0xe0
[ 9767.055280]  [<ffffffffaa375c37>] ret_from_fork_nospec_end+0x0/0x39
[ 9767.055288]  [<ffffffffffffffff>] 0xffffffffffffffff 

We've seen similar failures in a few tests recently, where the force_sync command apparently takes forever.  I'm guessing they have the same detailed signature as this.

Comment by Alex Zhuravlev [ 23/Jul/19 ]

pfarrell, hard to say, but I'd say it's not very likely; a few of the reported processes weren't running, but got stuck on IO:

[16250.115937] Pid: 2829, comm: ll_ost00_029 3.10.0-957.21.3.el7_lustre.x86_64 #1 SMP Tue Jul 16 17:30:03 UTC 2019
[16250.117636] Call Trace:
[16250.118155]  [<ffffffffc02ee262>] cv_wait_common+0xb2/0x150 [spl]
[16250.119447]  [<ffffffffc02ee338>] __cv_wait_io+0x18/0x20 [spl]
[16250.120491]  [<ffffffffc04aca3b>] zio_wait+0x11b/0x1c0 [zfs]
[16250.121621]  [<ffffffffc03ebfce>] dbuf_read+0x67e/0x9e0 [zfs]
[16250.122689]  [<ffffffffc03f8060>] dmu_buf_hold_by_dnode+0x50/0x80 [zfs]
[16250.123910]  [<ffffffffc046b6c2>] zap_get_leaf_byblk.isra.11+0x92/0x260 [zfs]
[16250.125193]  [<ffffffffc046b978>] zap_deref_leaf+0xe8/0x100 [zfs]
[16250.126336]  [<ffffffffc046c9b2>] fzap_lookup+0x62/0x180 [zfs]
[16250.127455]  [<ffffffffc04712b1>] zap_lookup_impl+0xd1/0x200 [zfs]
[16250.128629]  [<ffffffffc0471c2e>] zap_lookup_norm+0x8e/0xd0 [zfs]
[16250.129773]  [<ffffffffc0471ca3>] zap_lookup+0x33/0x40 [zfs]
[16250.130823]  [<ffffffffc113d4c1>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
[16250.132040]  [<ffffffffc1135669>] osd_object_init+0x389/0xca0 [osd_zfs]
[16250.133221]  [<ffffffffc0c17b25>] lu_object_alloc+0xe5/0x320 [obdclass]

Comment by Gerrit Updater [ 24/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35524/
Subject: LU-12510 osd: osd-zfs to release zrlock quickly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 88b329ac2ab568a25251f3f7c3a7e0c7367cb36f

Comment by Gerrit Updater [ 24/Jul/19 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/35600
Subject: LU-12510 osd: osd-zfs to release zrlock quickly
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d844b66bf8d58708237547598d016b356a07a01d

Comment by Peter Jones [ 24/Jul/19 ]

Landed for 2.13

Comment by James A Simmons [ 25/Jul/19 ]

The above patch is only a workaround. A more complete solution is needed with ZFS 0.8.2. We should either open another ticket or reopen this one.

Comment by Peter Jones [ 25/Jul/19 ]

James

Do you mean that further Lustre changes will be needed to adjust to the ZFS changes in 0.8.2?

Peter

Comment by Alex Zhuravlev [ 25/Jul/19 ]

I'll be working on a patch re-implementing this using the new DMU API, but this can wait a bit, IMO.
The landed patch is fully functional; it just keeps using a less-than-ideal approach to access DMU functionality.
It was made this way as part of earlier performance improvements, due to missing bits in the DMU API at the time.
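
For reference, a re-implementation along those lines would presumably look something like the sketch below, using the dmu_object_alloc_hold() interface Brian mentioned earlier, which returns the new object's dnode already held so osd-zfs never has to touch the handle's zrlock at all. The argument list is approximate and the wrapper is hypothetical:

/*
 * Hypothetical sketch -- the dmu_object_alloc_hold() argument list is
 * reproduced from memory of the upstream ZFS tree and may not match
 * exactly; the surrounding osd-zfs names are illustrative.
 */
static uint64_t
osd_create_with_hold_sketch(objset_t *os, struct osd_object *obj,
			    dmu_tx_t *tx)
{
	/* allocate the object and receive its dnode already held (tag = obj) */
	return (dmu_object_alloc_hold(os, DMU_OT_PLAIN_FILE_CONTENTS, 0, 0,
	    DMU_OT_SA, DN_BONUS_SIZE(DNODE_MIN_SIZE), DNODE_MIN_SIZE,
	    &obj->oo_dn, obj, tx));
	/* later: dnode_rele(obj->oo_dn, obj) in osd_object_delete() */
}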

Comment by Peter Jones [ 25/Jul/19 ]

In that case, let's create another JIRA ticket to track that work, leave this one as resolved, and get the fix onto the LTS branch too (I think this is already in flight).

Comment by Gerrit Updater [ 30/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35600/
Subject: LU-12510 osd: osd-zfs to release zrlock quickly
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: ec6b9a60c87767236804865da51a5c1c425db01c

Comment by Gerrit Updater [ 06/Aug/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35707
Subject: LU-12510 osd: osd-zfs to release zrlock quickly
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 678e24e01ea4427b84d6c4b2a4d679edf077d3cc
