>> The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors),
>Where? On what branch?
The site is ECMWF, and the revert was applied on top of 2.12.6. Unfortunately, I don't have the whole history, as I was not involved at the time and this was handled directly between the local team and DDN.
>> but we recently found that it was causing the corruptions on mmap'ed pages.
>Do you mean that LU-14541 causes corruption? Or that reverting it causes corruption? Did you open an issue?
Reverting it caused the corruption, or at least we see the corruption with it reverted. Since we only recently understood that the crash/hang issues we observed were related to in-memory data corruption, it is really only an assumption that the revert caused it.
There is a DDN support case open for this.
>> So, yes please can we find another solution ?
>There seems to be a lot of hopeful use of "we" in this ticket.
Obviously, this last "we" was probably more a "you". As much as I would like to provide solutions, my understanding of the memory management subsystem is too limited for this.
Regards,
Sebastien.
So, we landed a patch from John here to resolve the SIGBUS issue by removing the clearpageuptodate() call in vvp_page_delete, i.e., reverting LU-14541. I have more comments on the SIGBUS issue, etc., which I'll put there, but basically, I think this is correct: it's clear from the page fault code in the kernel that we can't unset pageuptodate() without causing problems. (I was wrong about this previously.)
So, that means the SIGBUS issue is resolved, but we have to find another way to solve LU-14541.
So, I'm going to close this ticket as resolved, and let's move the discussion to LU-14541. panda, I've added you as a watcher there. I have a few thoughts on it but I'm still hoping you and/or Shadow have a good idea...