[LU-16316] ZFS OSS locks Created: 16/Nov/22  Updated: 23/Dec/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Dominika Wanat Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Lustre: 2.15.0_RC3, zfs 2.0.7 (both self-compiled)
OS: Centos 8.5, kernel 4.18.0-348.7.1.el8_5.x86_64


Attachments: File mds01_20221016.log     File oss03_20221113.log     File oss06_20221016.log    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Over the past few weeks we have experienced lockups on OSS nodes running ZFS 2.0.7. An affected node becomes unresponsive from the Lustre point of view (the OSS goes unhealthy) and its load rises sharply (>400). In some cases the load on the MDS also increases directly afterwards, but this appears to be a consequence of lost communication between the MDS and the affected OSS. We cannot tie the problem to a specific I/O pattern or type of operation. We are reporting it here first, but we cannot rule out that it belongs with the ZFS developers; if you think so, please let us know. We attach two sets of logs: the first from 16 October, when both the MDS and an OSS were affected, and the second from 13 November, when only an OSS was stuck. If you need more information, please don't hesitate to ask.

 

Regards

 

Dominika Wanat 



 Comments   
Comment by Alex Zhuravlev [ 25/Nov/22 ]

In some situations, directly after that, the load on MDS also increases, but it seems like a consequence of lost communication between MDS and affected OSS.

Correct, this is because the MDS gets stuck waiting for new objects from the problem OST.

I'm not 100% sure, but I found a number of OST threads trying to prefetch data. You can try disabling prefetching to see whether it's related:

echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

– on OSTs
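
The echo above only changes the running module, so the setting is lost on reboot or module reload. A minimal sketch of how the same toggle could be wrapped and made persistent, assuming the usual ZFS on Linux paths (/sys/module/zfs/parameters and /etc/modprobe.d/zfs.conf). The helper function names are hypothetical and are not part of ZFS or Lustre:

```shell
#!/bin/sh
# Sketch only: toggle zfs_prefetch_disable at runtime and persist it.
# PARAM_FILE and MODPROBE_CONF default to the usual ZFS on Linux locations
# (an assumption; adjust for your installation). Function names are hypothetical.

PARAM_FILE="${PARAM_FILE:-/sys/module/zfs/parameters/zfs_prefetch_disable}"
MODPROBE_CONF="${MODPROBE_CONF:-/etc/modprobe.d/zfs.conf}"

# Print the current setting (1 = prefetch disabled, 0 = prefetch enabled).
show_prefetch_disable() {
    cat "$PARAM_FILE"
}

# Apply the setting to the running module; it takes effect immediately.
set_prefetch_disable() {
    case "$1" in
        0|1) echo "$1" > "$PARAM_FILE" ;;
        *)   echo "usage: set_prefetch_disable 0|1" >&2; return 1 ;;
    esac
}

# Record the setting as a module option so it is applied when zfs is loaded.
# Assumes one option per "options zfs ..." line: any existing line that
# mentions zfs_prefetch_disable= is dropped and replaced.
persist_prefetch_disable() {
    case "$1" in
        0|1) ;;
        *)   echo "usage: persist_prefetch_disable 0|1" >&2; return 1 ;;
    esac
    touch "$MODPROBE_CONF"
    grep -v 'zfs_prefetch_disable=' "$MODPROBE_CONF" > "$MODPROBE_CONF.tmp" || true
    echo "options zfs zfs_prefetch_disable=$1" >> "$MODPROBE_CONF.tmp"
    mv "$MODPROBE_CONF.tmp" "$MODPROBE_CONF"
}
```

Run as root on each OSS, this would be used as `set_prefetch_disable 1 && persist_prefetch_disable 1`, and reverted by passing 0 to both functions.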

Comment by Dominika Wanat [ 28/Nov/22 ]

Thanks for the hint. We are investigating the nodes with prefetch disabled. 

Comment by Dominika Wanat [ 20/Dec/22 ]

It looks like it helps - nodes have not hung since then. Do you consider fixing this behaviour of Lustre with ZFS prefetch enabled?

Comment by Alex Zhuravlev [ 23/Dec/22 ]

It looks like it helps - nodes have not hung since then. Do you consider fixing this behaviour of Lustre with ZFS prefetch enabled?

This is very workload-specific. We've seen a number of reports where prefetching does improve performance.
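
Because the benefit depends on the workload, one rough way to judge it on a given OSS is to look at the ZFS prefetch counters while prefetch is enabled. The sketch below assumes the kstat file /proc/spl/kstat/zfs/zfetchstats is present and exposes `hits` and `misses` rows in the usual "name type data" layout. The function name is hypothetical, and the ratio is only an indicator of how often reads match a prefetch stream, not a measure of performance:

```shell
#!/bin/sh
# Sketch only: summarise ZFS prefetch stream hits and misses.
# ZFETCHSTATS defaults to the usual ZFS on Linux kstat path (an assumption).
# prefetch_hit_ratio is a hypothetical helper, not a ZFS or Lustre tool.

ZFETCHSTATS="${ZFETCHSTATS:-/proc/spl/kstat/zfs/zfetchstats}"

# Print hits, misses and the hit ratio from a zfetchstats-style file.
# $1 (optional): path to the stats file; defaults to $ZFETCHSTATS.
prefetch_hit_ratio() {
    awk '
        BEGIN { hits = 0; misses = 0 }
        $1 == "hits"   { hits = $3 }     # third column holds the counter value
        $1 == "misses" { misses = $3 }
        END {
            total = hits + misses
            if (total == 0) { print "no prefetch activity recorded"; exit }
            printf "hits=%s misses=%s hit_ratio=%.1f%%\n", hits, misses, 100 * hits / total
        }
    ' "${1:-$ZFETCHSTATS}"
}
```

The counters are cumulative since module load, so comparing two readings taken before and after a representative job gives a more useful picture than a single reading.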

Generated at Sat Feb 10 03:25:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.