[LU-16224] rw_seq_cst_vs_drop_caches dies with SIGBUS Created: 07/Oct/22 Updated: 26/May/23 Resolved: 20/May/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | hxr, llnl |
| Environment: | lustre-2.15.1_5.llnl |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Running the reproducer from the linked ticket:

[root@mutt21:toss-5803-sigbus]# ./run_test /p/olaf{a,b}/faaland1/test/sigbustest
++ ./rw_seq_cst_vs_drop_caches /p/olafa/faaland1/test/sigbustest /p/olafb/faaland1/test/sigbustest
u = 60, v = { 60, 59 }
./run_test: line 11: 120055 Aborted (core dumped) ./rw_seq_cst_vs_drop_caches $1 $2
++ status=134
++ signum=6
++ case $signum in
++ echo FAIL with SIGBUS
FAIL with SIGBUS
Although it's not yet confirmed to be the same issue, we have two users reporting jobs intermittently dying with a bus error when using Lustre for I/O, which is what prompted me to run this test against Lustre 2.15.1. |
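For context, here is a minimal sketch of the kind of test the reproducer's name and output suggest — this is not the actual rw_seq_cst_vs_drop_caches.c (its source is not in this ticket), and the file layout, the read/write ordering, and the consistency check are all assumptions. A writer stores an increasing counter u to two files in a fixed order while repeatedly dropping the page cache; a reader mmap()s both files and loads them back in the opposite order, aborting on a sequential-consistency violation. The SIGBUS would arise when a read fault on a mapped page races with drop_caches.

/*
 * Illustrative sketch only, NOT the actual rw_seq_cst_vs_drop_caches.c.
 * Run as root (drop_caches needs it) under an external timeout, e.g.
 * "timeout 60 ./sketch <fileA> <fileB>": an abort indicates a
 * consistency violation, a bus error indicates the fault race.
 */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void drop_caches(void)
{
	/* Equivalent of "echo 3 > /proc/sys/vm/drop_caches". */
	int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

	if (fd >= 0) {
		(void)!write(fd, "3", 1);
		close(fd);
	}
}

int main(int argc, char **argv)
{
	unsigned long zero = 0;
	int fda, fdb;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <fileA> <fileB>\n", argv[0]);
		return 1;
	}
	fda = open(argv[1], O_RDWR | O_CREAT, 0644);
	fdb = open(argv[2], O_RDWR | O_CREAT, 0644);
	assert(fda >= 0 && fdb >= 0);
	pwrite(fda, &zero, sizeof(zero), 0);
	pwrite(fdb, &zero, sizeof(zero), 0);

	if (fork() == 0) {
		/* Reader: mmap both files and load B then A.  A read fault
		 * on a page dropped from the cache is where SIGBUS can
		 * appear on a buggy client. */
		volatile unsigned long *a = mmap(NULL, sizeof(*a), PROT_READ,
						 MAP_SHARED, fda, 0);
		volatile unsigned long *b = mmap(NULL, sizeof(*b), PROT_READ,
						 MAP_SHARED, fdb, 0);

		assert(a != MAP_FAILED && b != MAP_FAILED);
		for (;;) {
			unsigned long v[2] = { *b, *a };

			/* The writer stores to A before B, so a reader that
			 * loads B first must observe v[0] <= v[1]. */
			if (v[0] > v[1]) {
				printf("v = { %lu, %lu }\n", v[0], v[1]);
				abort();
			}
		}
	}

	/* Writer: bump u, store it to A then B, and drop the page cache. */
	for (unsigned long u = 1;; u++) {
		pwrite(fda, &u, sizeof(u), 0);
		pwrite(fdb, &u, sizeof(u), 0);
		drop_caches();
	}
}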
| Comments |
| Comment by Olaf Faaland [ 07/Oct/22 ] |
|
For our reference, our local ticket is TOSS5803 |
| Comment by Peter Jones [ 07/Oct/22 ] |
|
Bobijam Could you please investigate? Thanks Peter |
| Comment by Zhenyu Xu [ 08/Oct/22 ] |
|
Can you apply this patch https://review.whamcloud.com/#/c/fs/lustre-release/+/48607/ and try the reproducer? It contains the fix for the SIGBUS issue. |
| Comment by Peter Jones [ 08/Oct/22 ] |
|
Bobijam Why don't you port the patch to b2_15 to make testing easier? Peter |
| Comment by Zhenyu Xu [ 08/Oct/22 ] |
|
Here are the ports of the patches for the SIGBUS issue: https://review.whamcloud.com/c/fs/lustre-release/+/48804 and https://review.whamcloud.com/c/fs/lustre-release/+/48805 |
| Comment by Olaf Faaland [ 12/Oct/22 ] |
|
Hi Bobijam, I pulled those changes (48804 and 48805) into our 2.15.1-based patch stack and confirmed that rw_seq_cst_vs_drop_caches now runs successfully. I gave both changes my +1 to reflect that. rw_seq_cst_vs_drop_caches also fails on our 2.12.9-based clients. It looks to me like the two […] Are those three patches the right ones for 2.12 to address the issue there? thanks |
| Comment by Zhenyu Xu [ 13/Oct/22 ] |
|
Thank you for the confirmation. Yes, besides […] |
| Comment by Olaf Faaland [ 19/Oct/22 ] |
|
Hi Bobijam, What is the status of the two backports? I saw that there was a review question for one, and the other seems to have a build issue. thanks, |
| Comment by Zhenyu Xu [ 20/Oct/22 ] |
|
A revised patch that addresses the review question is under review; once it passes, I'll update the backports. |
| Comment by Xing Huang [ 12/Nov/22 ] |
|
2022-11-12: The b2_15 patch of […] |
| Comment by Olaf Faaland [ 14/Nov/22 ] |
|
Thank you for the update |
| Comment by Xing Huang [ 03/Dec/22 ] |
|
2022-12-03: The b2_15 patch of […] |
| Comment by Olaf Faaland [ 13/Dec/22 ] |
|
Hi, do you have an update for this issue? It is creating problems for at least two users. Thanks |
| Comment by Peter Jones [ 03/Jan/23 ] |
|
Bobijam Do I understand correctly that you intend to port https://review.whamcloud.com/c/fs/lustre-release/+/49534 to b2_15 and then ask LLNL to use that in their reproducer? Peter |
| Comment by Zhenyu Xu [ 04/Jan/23 ] |
|
Yes, I'll port it to b2_15 at https://review.whamcloud.com/c/fs/lustre-release/+/49553 |
| Comment by Peter Jones [ 16/Jan/23 ] |
|
Olaf Do you have a reliable reproducer for this issue? Are you able to test the effectiveness of the patch? Peter |
| Comment by Olaf Faaland [ 23/Jan/23 ] |
|
Peter, My two users hitting what I suspect to be the same issue are running against older Lustre versions: 2.12 clients and 2.14 servers. When Bobijam's backport allows rw_seq_cst_vs_drop_caches.c to succeed reliably and passes the usual automated tests, I'll pull it into our 2.15 branch and work on getting my users to run on our 2.15 machines. Right now the backport has a -1 from Maloo. thanks |
| Comment by Patrick Farrell [ 23/Jan/23 ] |
|
Olaf, Please try https://review.whamcloud.com/c/fs/lustre-release/+/49647/ and https://review.whamcloud.com/c/fs/lustre-release/+/49653/ (both, applied in that order). The first is a similar but simpler fix for this issue; the second is a fix for a possible data inconsistency exposed (though not caused) by the first. We will probably be taking these in preference to Bobi's patch. (The Maloo failures on those are unrelated to the patches; I'm just waiting for review, etc., before retriggering testing.) By the way, I've been able to reproduce this issue locally using the same sanity test Olaf referenced. For me it's only reliably hittable in the presence of memory pressure, but that seems to be a timing effect, since memory pressure shouldn't be a hard requirement for the bug to occur. (It might, though, be involved in the real applications hitting the issue.) |
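To illustrate the memory-pressure angle, something like the following hog could be run alongside the reproducer. This is a hypothetical sketch, not a tool from the ticket; the 256 MiB chunk size and the back-off interval are arbitrary choices:

/* Hypothetical memory hog: repeatedly fault in a large anonymous
 * allocation so the kernel reclaims page cache, widening the window
 * for the drop_caches race.  Not from the ticket; sizes are arbitrary. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t chunk = 256UL << 20;	/* 256 MiB per pass */

	for (;;) {
		char *p = malloc(chunk);

		if (!p) {
			usleep(100 * 1000);	/* back off when allocation fails */
			continue;
		}
		memset(p, 1, chunk);		/* touch every page to force faults */
		usleep(100 * 1000);
		free(p);
	}
}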
| Comment by Xing Huang [ 01/Apr/23 ] |
|
2023-04-01: Two patches provided to LLNL for test; one (#49647) has landed to master, the other (#49653) is being reviewed. |
| Comment by Xing Huang [ 08/Apr/23 ] |
|
2023-04-08: Two patches provided to LLNL for test; one (#49647) has landed to master, the other (#49653) is ready to land to master (in the master-next branch). |
| Comment by Xing Huang [ 08/May/23 ] |
|
2023-05-08: Both patches provided to LLNL for test (#49647, #49653) have landed to master. |
| Comment by Peter Jones [ 13/May/23 ] |
|
Both patches are ready to land for b2_15. |
| Comment by Xing Huang [ 20/May/23 ] |
|
2023-05-20: Both patches landed to b2_15. |
| Comment by Peter Jones [ 20/May/23 ] |
|
Fix provided in the upcoming 2.15.3 release. Marking as a duplicate, but the topllnl label will only be removed once LLNL have confirmed the effectiveness of the fixes. |