[LU-16224] rw_seq_cst_vs_drop_caches dies with SIGBUS Created: 07/Oct/22  Updated: 26/May/23  Resolved: 20/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: hxr, llnl
Environment:

lustre-2.15.1_5.llnl
4.18.0-372.26.1.1toss.t4.x86_64


Issue Links:
Related
is related to LU-16064 RPC from evicted client can corrupt data In Progress
is related to LU-16160 take ldlm lock when queue sync pages Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running the reproducer from LU-14541 (rw_seq_cst_vs_drop_caches.c) fails about 50% of the time with Lustre 2.15.1 (on both the client and the servers).

[root@mutt21:toss-5803-sigbus]# ./run_test /p/olaf{a,b}/faaland1/test/sigbustest
++ ./rw_seq_cst_vs_drop_caches /p/olafa/faaland1/test/sigbustest /p/olafb/faaland1/test/sigbustest
u = 60, v = { 60, 59 }
./run_test: line 11: 120055 Aborted                 (core dumped) ./rw_seq_cst_vs_drop_caches $1 $2
++ status=134
++ signum=6
++ case $signum in
++ echo FAIL with SIGBUS
FAIL with SIGBUS

Although it's not yet confirmed to be the same issue, we have two users reporting jobs intermittently dying with a bus error when using Lustre for I/O, which is what prompted me to run this reproducer against Lustre 2.15.1.
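
For reference, the consistency check that the output above reflects can be sketched roughly as below. This is only a minimal illustration, not the actual LU-14541 source: it assumes a separate writer keeps overwriting a monotonically increasing counter (modeled here as a 64-bit integer) at offset 0 of the shared file, and that the two arguments name that same file through two different mount points, so the second read should never observe an older value than the first.

/* Illustrative sketch only -- not the actual LU-14541 reproducer. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd[2];
    uint64_t v[2];

    if (argc != 3) {
        fprintf(stderr, "usage: %s <file-on-mount-a> <file-on-mount-b>\n", argv[0]);
        return 1;
    }

    fd[0] = open(argv[1], O_RDONLY);
    fd[1] = open(argv[2], O_RDONLY);
    if (fd[0] < 0 || fd[1] < 0) {
        perror("open");
        return 1;
    }

    for (;;) {
        /* Read the counter through mount A, then through mount B. */
        if (pread(fd[0], &v[0], sizeof(v[0]), 0) != sizeof(v[0]) ||
            pread(fd[1], &v[1], sizeof(v[1]), 0) != sizeof(v[1])) {
            perror("pread");
            return 1;
        }

        /* The writer only increases the counter, so the later read must
         * not go backwards; "v = { 60, 59 }" above is exactly this kind
         * of violation. */
        if (v[1] < v[0]) {
            fprintf(stderr, "v = { %lu, %lu }\n",
                    (unsigned long)v[0], (unsigned long)v[1]);
            abort();
        }
    }
}

The abort() on a failed check matches the "Aborted (core dumped)" and exit status 134 (128 + SIGABRT) seen in the run above; this sketch does not attempt to model the SIGBUS failure mode named in the ticket title.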



 Comments   
Comment by Olaf Faaland [ 07/Oct/22 ]

For our reference, our local ticket is TOSS5803

Comment by Peter Jones [ 07/Oct/22 ]

Bobijam

Could you please investigate?

Thanks

Peter

Comment by Zhenyu Xu [ 08/Oct/22 ]

Can you apply this patch https://review.whamcloud.com/#/c/fs/lustre-release/+/48607/ and try the reproducer? It contains the fix for the SIGBUS issue.

Comment by Peter Jones [ 08/Oct/22 ]

Bobijam

Why don't you port the patch to b2_15 to make it easier to test?

Peter

Comment by Zhenyu Xu [ 08/Oct/22 ]

Here are the ports of the patches for the SIGBUS issue.

https://review.whamcloud.com/c/fs/lustre-release/+/48804 LU-16160 llite: clear stale page's uptodate bit
https://review.whamcloud.com/c/fs/lustre-release/+/48805 LU-16160 llite: clear page uptodate bit on cache drop

Comment by Olaf Faaland [ 12/Oct/22 ]

Hi Bobijam,

I pulled those changes (48804 and 48805) into our 2.15.1-based patch stack and confirmed that rw_seq_cst_vs_drop_caches runs successfully now. I gave both changes my +1 to reflect that.

rw_seq_cst_vs_drop_caches fails on our 2.12.9-based clients as well. It looks to me like the two LU-16160 patches depend on a third patch (LU-14541 llite: Check vmpage in releasepage) which Etienne backported to b2_12 in change https://review.whamcloud.com/48311 but which was never reviewed or landed.

Are those three patches the right ones for 2.12 to address the issue there?

thanks

Comment by Zhenyu Xu [ 13/Oct/22 ]

Thank you for the confirmation. Yes, besides the LU-16160 patches, I think LU-16064 is another ticket addressing the read inconsistency issue.

Comment by Olaf Faaland [ 19/Oct/22 ]

Hi Bobijam,

What is the status of the two backports? I saw that there was a review question for one, and the other seems to have a build issue.

thanks,
Olaf

Comment by Zhenyu Xu [ 20/Oct/22 ]

A revised patch addressing the review question is under review; once it passes, I'll update the backports.

Comment by Xing Huang [ 12/Nov/22 ]

2022-11-12: The b2_15 patch of LU-16160 is being updated to match the master one. The master patch of LU-16064 needs to be rebased.

Comment by Olaf Faaland [ 14/Nov/22 ]

Thank you for the update

Comment by Xing Huang [ 03/Dec/22 ]

2022-12-03: The b2_15 patch of LU-16160 is being worked on. The master patch of LU-16064 is being reviewed.

Comment by Olaf Faaland [ 13/Dec/22 ]

Hi, do you have an update for this issue?  It is creating problems for at least two users.  Thanks

Comment by Peter Jones [ 03/Jan/23 ]

Bobijam

Do I understand correctly that you intend to port https://review.whamcloud.com/c/fs/lustre-release/+/49534 to b2_15 and then ask LLNL to test it with their reproducer?

Peter

Comment by Zhenyu Xu [ 04/Jan/23 ]

Yes, the b2_15 port is at https://review.whamcloud.com/c/fs/lustre-release/+/49553

Comment by Peter Jones [ 16/Jan/23 ]

Olaf

Do you have a reliable reproducer for this issue? Are you able to test the effectiveness of the patch?

Peter

Comment by Olaf Faaland [ 23/Jan/23 ]

Peter,

My two users hitting what I suspect is the same issue are running older Lustre versions: 2.12 clients and 2.14 servers.

When Bobijam's backport allows rw_seq_cst_vs_drop_caches.c to succeed reliably and passes the usual automated tests, I'll pull it into our 2.15 branch and work on getting my users to run on our 2.15 machines. Right now the backport has a -1 from Maloo.

thanks

Comment by Patrick Farrell [ 23/Jan/23 ]

Olaf,

Please try https://review.whamcloud.com/c/fs/lustre-release/+/49647/ and https://review.whamcloud.com/c/fs/lustre-release/+/49653/ (both, applied in that order). The first is a similar but simpler fix for this issue; the second fixes a possible data inconsistency exposed (though not caused) by the first. We will probably take these in preference to Bobijam's patch. (The Maloo failures on those are unrelated to the patches; I'm just waiting for reviews, etc., before retriggering testing.)

By the way, I've been able to reproduce this issue locally using the same sanity test Olaf referenced. For me it's only reliably hittable in the presence of memory pressure, but that seems to be a timing effect, since memory pressure shouldn't be a hard requirement for the bug to occur. (Though it might be involved in the real applications hitting the issue.)
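
One hypothetical way to generate that kind of memory pressure alongside the reproducer (not taken from this ticket, just an illustration) is to keep touching a large anonymous buffer in a separate process while the test loops:

/* Hypothetical memory-pressure helper, not part of this ticket:
 * repeatedly touch a large allocation so the kernel has to reclaim
 * page cache while the reproducer runs in parallel. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* Size in MiB, e.g. most of the free RAM on the node. */
    size_t mib = argc > 1 ? strtoul(argv[1], NULL, 0) : 1024;
    size_t len = mib << 20;
    char *buf = malloc(len);

    if (buf == NULL) {
        perror("malloc");
        return 1;
    }

    for (;;)
        memset(buf, 0x5a, len); /* keep the pages resident and dirty */
}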

Comment by Xing Huang [ 01/Apr/23 ]

2023-04-01: Two patches provided to LLNL for test; one patch (#49647) has landed to master, the other (#49653) is being reviewed.

Comment by Xing Huang [ 08/Apr/23 ]

2023-04-08: Two patches provided to LLNL for test; one patch (#49647) has landed to master, the other (#49653) is ready to land to master (in the master-next branch).

Comment by Xing Huang [ 08/May/23 ]

2023-05-08: Both patches provided to LLNL for test (#49647, #49653) have landed to master.

Comment by Peter Jones [ 13/May/23 ]

Both patches ready to land for b2_15.

Comment by Xing Huang [ 20/May/23 ]

2023-05-20: Both patches landed to b2_15.

Comment by Peter Jones [ 20/May/23 ]

Fix provided in the upcoming 2.15.3 release. Marking as a duplicate, but the topllnl label will only be removed once LLNL have confirmed the effectiveness of the fixes.
