rw_seq_cst_vs_drop_caches dies with SIGBUS

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • None
    • None
    • Environment: lustre-2.15.1_5.llnl
      4.18.0-372.26.1.1toss.t4.x86_64
    • 3
    • 9223372036854775807

    Description

      The reproducer from LU-14541 (rw_seq_cst_vs_drop_caches.c) fails about 50% of the time with Lustre 2.15.1 on both client and servers.

      [root@mutt21:toss-5803-sigbus]# ./run_test /p/olaf{a,b}/faaland1/test/sigbustest
      ++ ./rw_seq_cst_vs_drop_caches /p/olafa/faaland1/test/sigbustest /p/olafb/faaland1/test/sigbustest
      u = 60, v = { 60, 59 }
      ./run_test: line 11: 120055 Aborted                 (core dumped) ./rw_seq_cst_vs_drop_caches $1 $2
      ++ status=134
      ++ signum=6
      ++ case $signum in
      ++ echo FAIL with SIGBUS
      FAIL with SIGBUS
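
For readers decoding the trace above: a POSIX shell reports a child killed by signal N as exit status 128 + N, so status=134 corresponds to signum=6. The sketch below (Python, with a hypothetical helper name `decode_shell_status`) performs that arithmetic and looks up the signal name on the machine where it runs. On Linux x86_64, signal 6 is SIGABRT and SIGBUS is 7; the run_test script is not shown in this ticket, so how it maps signal numbers to the "FAIL with SIGBUS" label is not known here.

```python
import signal


def decode_shell_status(status: int) -> str:
    """Describe a POSIX shell exit status (the value of $?).

    Hypothetical helper for illustration; it is not part of the
    reproducer or of the run_test script.
    """
    if status > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number is not a named signal on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


if __name__ == "__main__":
    # 134 is the status shown in the trace above.
    print(decode_shell_status(134))
```

Running the same lookup with 128 + 7 shows what a genuine SIGBUS termination would look like on Linux x86_64.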
      

      Although it is not yet confirmed to be the same issue, two users have reported jobs intermittently dying with a bus error when using Lustre for I/O, which is what prompted me to run this against Lustre 2.15.1.

          Activity

            [LU-16224] rw_seq_cst_vs_drop_caches dies with SIGBUS
            pjones Peter Jones added a comment -

            Fix provided in upcoming 2.15.3 release. Marking as duplicate but will only remove topllnl label when LLNL have confirmed effectiveness of fixes.

            hxing Xing Huang added a comment -

            2023-05-20: Both patches landed to b2_15.

            pjones Peter Jones added a comment -

            Both patches ready to land for b2_15.

            hxing Xing Huang added a comment -

            2023-05-08: Both patches provided to LLNL for testing (#49647, #49653) have landed to master.

            hxing Xing Huang added a comment -

            2023-04-08: Two patches were provided to LLNL for testing. One (#49647) has landed to master; the other (#49653) is ready to land to master (it is in the master-next branch).

            hxing Xing Huang added a comment -

            2023-04-01: Two patches were provided to LLNL for testing. One (#49647) has landed to master; the other (#49653) is being reviewed.


            paf0186 Patrick Farrell added a comment -

            Olaf,

            Please try https://review.whamcloud.com/c/fs/lustre-release/+/49647/ and https://review.whamcloud.com/c/fs/lustre-release/+/49653/ (both, applied in that order). The first is a similar but simpler fix for this issue; the second is a fix for a possible data inconsistency exposed (though not caused) by the first. We will probably be taking this in preference to Bobi's patch. (The Maloo failures on those are unrelated to the patches; I'm just waiting for review, etc., before retriggering testing.)

            By the way, I've been able to reproduce this issue locally using the same sanity test Olaf referenced. For me it is only reliably hittable under memory pressure, but that seems to be a matter of timing, since memory pressure shouldn't be a hard requirement for the bug to occur. (Though it might be involved in the real applications hitting the issue.)
            ofaaland Olaf Faaland added a comment -

            Peter,

            My 2 users hitting what I suspect to be the same issue are running against older Lustre versions, 2.12 clients and 2.14 servers.

            When Bobijam's backport allows rw_seq_cst_vs_drop_caches.c to succeed reliably, and passes the usual automated tests, then I'll pull it into our 2.15 branch and work on getting my users to run on our 2.15 machines. Right now the backport has a -1 from Maloo.

            thanks
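
One way to quantify "succeeds reliably" is to run the reproducer many times and tally the outcomes. This is a minimal sketch, not part of the ticket: `tally_runs` is a hypothetical helper, and the command line and run count passed to it are placeholders. It relies on Python's `subprocess` convention that a negative return code -N means the child was killed by signal N.

```python
import subprocess
import sys
from collections import Counter


def tally_runs(cmd, runs):
    """Run cmd `runs` times and count how each run ended.

    Hypothetical helper for illustration. Keys of the returned Counter:
    "pass" (exit 0), "exit N" (nonzero exit code), or "signal N"
    (terminated by signal N).
    """
    outcomes = Counter()
    for _ in range(runs):
        rc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode
        if rc == 0:
            outcomes["pass"] += 1
        elif rc < 0:
            # subprocess reports death by signal N as return code -N.
            outcomes[f"signal {-rc}"] += 1
        else:
            outcomes[f"exit {rc}"] += 1
    return outcomes


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # reproducer binary and its two file arguments on a Lustre client.
    demo_cmd = [sys.executable, "-c", "pass"]
    print(dict(tally_runs(demo_cmd, 5)))
```

Comparing the tallies before and after applying a patch gives a rough failure rate to set against the roughly 50% reported in the description.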

            pjones Peter Jones added a comment -

            Olaf

            Do you have a reliable reproducer for this issue? Are you able to test the effectiveness of the patch?

            Peter

            bobijam Zhenyu Xu added a comment - edited

            Yes, I've ported it to b2_15 at https://review.whamcloud.com/c/fs/lustre-release/+/49553

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: ofaaland Olaf Faaland