Description
Running the reproducer from LU-14541 (rw_seq_cst_vs_drop_caches.c) fails about 50% of the time with Lustre 2.15.1 (both client and servers).
[root@mutt21:toss-5803-sigbus]# ./run_test /p/olaf{a,b}/faaland1/test/sigbustest
++ ./rw_seq_cst_vs_drop_caches /p/olafa/faaland1/test/sigbustest /p/olafb/faaland1/test/sigbustest
u = 60, v = { 60, 59 }
./run_test: line 11: 120055 Aborted (core dumped) ./rw_seq_cst_vs_drop_caches $1 $2
++ status=134
++ signum=6
++ case $signum in
++ echo FAIL with SIGBUS
FAIL with SIGBUS
Although it's not yet confirmed to be the same issue, two users have reported jobs dying intermittently with a bus error when using Lustre for I/O, which is what prompted me to run this against Lustre 2.15.1.
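For anyone reading the trace above: the shell reports a process killed by a signal as exit status 128 + signal number, so status=134 corresponds to signal 6, which on Linux is SIGABRT (matching the "Aborted (core dumped)" line). The run_test script itself is not included in this ticket, so the following is only a minimal sketch of that decoding, with hypothetical helper names (decode_signum, describe_status), not the script's actual logic.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; NOT the run_test script from this ticket.

# Print the signal number encoded in a shell exit status,
# or 0 if the status does not indicate death by signal.
decode_signum() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0
    fi
}

# Print a human-readable description of an exit status.
describe_status() {
    local signum
    signum=$(decode_signum "$1")
    if [ "$signum" -eq 0 ]; then
        echo "exit code $1"
    else
        # kill -l N prints the signal name without the SIG prefix
        echo "killed by SIG$(kill -l "$signum")"
    fi
}

describe_status 134
```

On Linux, describe_status 134 should report SIGABRT, while a process that actually died with a bus error would show status 135 (signal 7).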
Olaf,
Please try https://review.whamcloud.com/c/fs/lustre-release/+/49647/ and https://review.whamcloud.com/c/fs/lustre-release/+/49653/ (both, applied in that order). The first is a similar but simpler fix for this issue; the second is a fix for a possible data inconsistency exposed (though not caused) by the first. We will probably be taking this in preference to Bobi's patch. (The Maloo failures on those are unrelated to the patches; I'm just waiting for review, etc., before retriggering testing.)
By the way, I've been able to reproduce this issue locally using the same sanity test Olaf referenced. For me it only reproduces reliably under memory pressure, but that appears to be a matter of timing, since memory pressure shouldn't be a hard requirement for the bug to occur (though it might be involved in the real applications hitting the issue).