Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
None
-
None
-
Lustre: 2.15.0_RC3, ZFS 2.0.7 (both self-compiled)
OS: CentOS 8.5, kernel 4.18.0-348.7.1.el8_5.x86_64
-
3
Description
Over the past few weeks we have experienced lockups on OSS nodes running ZFS 2.0.7. An affected node becomes unresponsive to Lustre (the OSS is reported as unhealthy) and its load climbs very high (>400). In some cases the load on the MDS also increases immediately afterwards, but this appears to be a consequence of lost communication between the MDS and the affected OSS. We have not been able to associate the problem with a specific I/O pattern or type of operation. We are reporting it here first, but we cannot rule out that it should be raised with the ZFS developers instead; if you think so, please let us know. We attach two sets of logs: the first from 16 October, when both the MDS and the OSS were affected, and the second from 13 November, when only the OSS was stuck. If you need more information, please don't hesitate to ask.
Regards
Dominika Wanat