Lustre / LU-16224

rw_seq_cst_vs_drop_caches dies with SIGBUS

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • None
    • lustre-2.15.1_5.llnl
      4.18.0-372.26.1.1toss.t4.x86_64

    Description

      Running the reproducer from LU-14541 (rw_seq_cst_vs_drop_caches.c) fails about 50% of the time with Lustre 2.15.1 (both client and servers).

      [root@mutt21:toss-5803-sigbus]# ./run_test /p/olaf{a,b}/faaland1/test/sigbustest
      ++ ./rw_seq_cst_vs_drop_caches /p/olafa/faaland1/test/sigbustest /p/olafb/faaland1/test/sigbustest
      u = 60, v = { 60, 59 }
      ./run_test: line 11: 120055 Aborted                 (core dumped) ./rw_seq_cst_vs_drop_caches $1 $2
      ++ status=134
      ++ signum=6
      ++ case $signum in
      ++ echo FAIL with SIGBUS
      FAIL with SIGBUS
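
      The run_test wrapper itself is not included in this ticket. As a minimal sketch of the status decoding visible in the trace above, assuming a POSIX-style shell and the usual convention that a child killed by signal N yields exit status 128+N, a hypothetical helper (decode_status is an illustrative name, not part of the reproducer) might look like this:

```shell
#!/bin/bash
# Hypothetical helper, NOT the run_test script from this ticket.
# Prints the signal number and name when the exit status indicates
# death by signal (status > 128), otherwise the plain exit code.
decode_status() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        local signum=$((status - 128))
        # "kill -l N" prints the name of signal number N.
        echo "$signum $(kill -l "$signum")"
    else
        echo "exited $status"
    fi
}

decode_status 134
```

      For the status of 134 seen in the run above, this gives signal 6, which the shell names ABRT. On x86_64 Linux, SIGBUS is signal 7 and would appear as status 135.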
      

      Although it is not yet confirmed to be the same issue, two users report jobs that intermittently die with a bus error when using Lustre for I/O, which prompted me to run this against Lustre 2.15.1.

          Activity

            pjones Peter Jones added a comment -

            Olaf

            Do you have a reliable reproducer for this issue? Are you able to test the effectiveness of the patch?

            Peter

            bobijam Zhenyu Xu added a comment - edited

            Yes, I'd port it to b2_15 at https://review.whamcloud.com/c/fs/lustre-release/+/49553

            pjones Peter Jones added a comment -

            Bobijam

            Do I understand correctly that you intend to port https://review.whamcloud.com/c/fs/lustre-release/+/49534 to b2_15 and then ask LLNL to use that with their reproducer?

            Peter

            ofaaland Olaf Faaland added a comment -

            Hi, do you have an update for this issue?  It is creating problems for at least two users.  Thanks

            hxing Xing Huang added a comment -

            2022-12-03: The b2_15 patch of LU-16160 is being worked on. The master patch of LU-16064 is being reviewed.

            ofaaland Olaf Faaland added a comment -

            Thank you for the update

            hxing Xing Huang added a comment -

            2022-11-12: The b2_15 patch of LU-16160 is being updated to match the master one. The master patch of LU-16064 needs to be rebased.

            bobijam Zhenyu Xu added a comment -

            A revised patch that tries to address the review question is under review; once it passes, I'll update the backports.

            ofaaland Olaf Faaland added a comment -

            Hi Bobijam,

            What is the status of the two backports? I saw that there was a review question for one, and the other seems to have a build issue.

            thanks,
            Olaf

            bobijam Zhenyu Xu added a comment -

            Thank you for the confirmation. Yes, besides the LU-16160 patches, I think LU-16064 is another ticket addressing the read inconsistency issue.

            ofaaland Olaf Faaland added a comment -

            Hi Bobijam,

            I pulled those changes (48804 and 48805) into our 2.15.1-based patch stack and confirmed that rw_seq_cst_vs_drop_caches runs successfully now. I gave both changes my +1 to reflect that.

            rw_seq_cst_vs_drop_caches fails on our 2.12.9-based clients as well. It looks to me like the two LU-16160 patches depend on a third patch (LU-14541 llite: Check vmpage in releasepage), which Etienne backported to b2_12 in change https://review.whamcloud.com/48311 but which never got reviews and was never landed.

            Are those three patches the right ones for 2.12 to address the issue there?

            thanks


            People

              bobijam Zhenyu Xu
              ofaaland Olaf Faaland