>> The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors),
>Where? On what branch?
The site is ECMWF, and the revert was done on top of 2.12.6. Unfortunately I don't have the full history, as I was not involved at the time and this was handled directly between the local team and DDN.
>> but we recently found that it was causing the corruptions on mmap'ed pages.
>Do you mean that LU-14541 causes corruption? Or that reverting it causes corruption? Did you open an issue?
Reverting it caused the corruption, or at least we see the corruption with it reverted. Since we only recently understood that the crashes and hangs we observed were related to in-memory data corruption, it is really only an assumption that the revert caused it.
There is a DDN support case open for this.
>> So, yes please can we find another solution ?
>There seems to be a lot of hopeful use of "we" in this ticket.
Obviously, this last "we" was really more of a "you". As much as I would like to provide solutions, my understanding of the memory management subsystem is too limited for that.
Regards,
Sebastien.
Resolved by revert of LU-14541. See LU-14541 for further discussion.