Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.13.0
    • None
    • 3
    • 18,015

    Description

      I finally traced my debug kernel problems with later rhel releases to RCU breakage of some sort.

      The ldlm_locks slab is declared as SLAB_DESTROY_BY_RCU if that flag is defined. This goes back to bugzilla 18015 https://bugzilla.lustre.org/show_bug.cgi?id=18015 patch by BobiJam.

      Now it appears that as we schedule a free in that slab and then destroy the slab, the actual free is delayed and is executed after the slab is already freed despite rcu_barrier() being present.

      This is a clear bug that I will file a RH bugzilla ticket for.

      But in addition to that, I wonder how much we need that thing nowadays, especially considering that newer kernels renamed the flag to SLAB_TYPESAFE_BY_RCU, which we do not detect, so we just do not set it in that case.

      Should we just convert ldlm_locks into a normal slab again I wonder?
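
      For reference, a minimal sketch of how the renamed flag could be detected at compile time. This is not the Lustre code: the LDLM_SLAB_RCU_FLAG macros and both helper functions are hypothetical names made up for illustration. It assumes that in a kernel build <linux/slab.h> defines one of the two flag macros; built in userspace, neither is defined, so the fallback (a normal slab, no RCU flag) is what gets selected.

```c
#include <stdio.h>

/*
 * Assumption: in a kernel build, <linux/slab.h> provides one of these
 * macros (SLAB_DESTROY_BY_RCU before the rename, SLAB_TYPESAFE_BY_RCU
 * after it). In this userspace sketch neither is defined, so the
 * fallback branch is taken.
 */
#if defined(SLAB_TYPESAFE_BY_RCU)
# define LDLM_SLAB_RCU_FLAG      SLAB_TYPESAFE_BY_RCU
# define LDLM_SLAB_RCU_FLAG_NAME "SLAB_TYPESAFE_BY_RCU"
#elif defined(SLAB_DESTROY_BY_RCU)
# define LDLM_SLAB_RCU_FLAG      SLAB_DESTROY_BY_RCU
# define LDLM_SLAB_RCU_FLAG_NAME "SLAB_DESTROY_BY_RCU"
#else
# define LDLM_SLAB_RCU_FLAG      0UL
# define LDLM_SLAB_RCU_FLAG_NAME "none (normal slab)"
#endif

/* Hypothetical helper: the flags one would pass when creating the slab. */
unsigned long ldlm_locks_slab_flags(void)
{
	return (unsigned long)LDLM_SLAB_RCU_FLAG;
}

/* Hypothetical helper: report which branch the preprocessor selected. */
void ldlm_locks_report_slab_flags(void)
{
	printf("ldlm_locks RCU flag: %s (0x%lx)\n",
	       LDLM_SLAB_RCU_FLAG_NAME, ldlm_locks_slab_flags());
}
```

      Checking the new name first means a kernel that only knows SLAB_TYPESAFE_BY_RCU no longer silently falls through to the "flag not set" case. Dropping both branches and always using 0 would correspond to converting ldlm_locks back into a normal slab.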

          Activity

            [LU-11063] RHEL7.[345] RCU breakage
            neilb Neil Brown added a comment -

            If you want to get a patch into SUSE (all upper-case these days) you open an issue on bugzilla.suse.com, and explain what and why.  If you assign the issue to me (or put me on cc or somehow let me know about it - I'm nfbrown@suse.com in bugzilla) I can expedite it.

            Or you can ask me directly,  then I can create the bugzilla issue myself.

            What would be even better would be for these fixup patches to have been marked "Fixes: ......". Then I would have been alerted to them by our automated machinery. It's too late for that though...

             

            pjones Peter Jones added a comment -

            Perhaps neilb would care to comment?

            green Oleg Drokin added a comment -

            No, it's a different generic kernel bug. The patch for it is here (not sure how you can actually get SuSE to include it though): https://patchwork.kernel.org/project/linux-nfs/cover/cover.1568377101.git.bcodding@redhat.com/


            jamesanunez James Nunez (Inactive) added a comment -

            Oleg - Do you think the crash in parallel-scale-nfsv3 racer_on_nfs at https://testing.whamcloud.com/test_sets/6f865cfa-2575-4ccc-8677-6ff091967e56 is the same issue as this ticket? I ask because this is a SLES 15 SP2 test session.
            green Oleg Drokin added a comment -

            This appears to be a deeper case than just removing the flag as was done in this patch https://review.whamcloud.com/34147 for LU-11568 and LU-12374 does not seem to help either.

            All such crashes are in parallel NFS scale testing on RHEL only, so I must conclude this is just some deep RHEL7 bug.


            People

              wc-triage WC Triage
              green Oleg Drokin