[LU-11063] RHEL7.[345] RCU breakage Created: 29/May/18  Updated: 07/Sep/23  Resolved: 27/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-12454 parallel-scale-nfsv3 test racer_on_nf... Resolved
Related
is related to LU-11568 Get rid of SLAB_DESTROY_BY_RCU Resolved
is related to LU-12374 client went down w/ panic during lust... Resolved
is related to LU-17097 RCU stall caused by osc_quota_cleanup Resolved
Severity: 3
Bugzilla ID: 18015
Rank (Obsolete): 9223372036854775807

 Description   

I finally traced my debug kernel problems with later rhel releases to RCU breakage of some sort.

The ldlm_locks slab is declared with SLAB_DESTROY_BY_RCU if that flag is defined. This goes back to bugzilla 18015 (https://bugzilla.lustre.org/show_bug.cgi?id=18015), a patch by BobiJam.

Now it appears that when we schedule a free in that slab and then destroy the slab, the actual free is delayed and executes after the slab has already been freed, despite rcu_barrier() being present.
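
To make the ordering concern concrete, here is a purely illustrative userspace toy model in C. It does not reproduce the kernel bug and uses no kernel APIs; every name in it (toy_cache, toy_rcu, toy_call_rcu_free, toy_rcu_barrier, toy_cache_destroy, toy_teardown) is hypothetical. It only shows why a barrier that flushes pending deferred frees is expected to run before the cache is destroyed, and what the failure looks like when a free runs late.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Toy stand-in for a slab cache; NOT the kernel's struct kmem_cache. */
struct toy_cache {
    bool destroyed;
    int  live_objects;
    int  frees_after_destroy;   /* counts frees that ran too late */
};

/* Toy stand-in for a queue of deferred (RCU-style) free callbacks. */
struct toy_rcu {
    struct toy_cache *pending[MAX_PENDING];
    size_t count;
};

/* Queue a deferred free instead of freeing immediately. */
static bool toy_call_rcu_free(struct toy_rcu *rcu, struct toy_cache *c)
{
    if (rcu->count == MAX_PENDING)
        return false;
    rcu->pending[rcu->count++] = c;
    return true;
}

/* Run every queued callback; models the guarantee a barrier should give. */
static void toy_rcu_barrier(struct toy_rcu *rcu)
{
    for (size_t i = 0; i < rcu->count; i++) {
        struct toy_cache *c = rcu->pending[i];

        if (c->destroyed)
            c->frees_after_destroy++;   /* free hit a destroyed cache */
        else
            c->live_objects--;
    }
    rcu->count = 0;
}

static void toy_cache_destroy(struct toy_cache *c)
{
    c->destroyed = true;
}

/* Returns how many deferred frees ran after the cache was destroyed. */
static int toy_teardown(bool barrier_first)
{
    struct toy_cache cache = { .destroyed = false, .live_objects = 1,
                               .frees_after_destroy = 0 };
    struct toy_rcu rcu = { .count = 0 };

    toy_call_rcu_free(&rcu, &cache);
    if (barrier_first)
        toy_rcu_barrier(&rcu);      /* expected ordering: flush, then destroy */
    toy_cache_destroy(&cache);
    toy_rcu_barrier(&rcu);          /* any late callbacks run here */
    return cache.frees_after_destroy;
}
```

In this model, toy_teardown(true) reports no late frees and toy_teardown(false) reports one. The behaviour described in this ticket resembles the second case even though the barrier is present, which is why it looks like a kernel-side problem.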

This is a clear bug, and I will file an RH bugzilla ticket for it.

But in addition to that, I wonder how much we need that thing nowadays, especially considering that newer kernels renamed the flag to SLAB_TYPESAFE_BY_RCU, which we do not detect, so we just do not set it in that case.

Should we just convert ldlm_locks back into a normal slab, I wonder?
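
For reference, detecting either spelling of the flag could look roughly like the sketch below. This is only one option, not what the Lustre tree does: LDLM_SLAB_RCU_FLAG and ldlm_lock_slab_flags() are hypothetical names, and the SLAB_TYPESAFE_BY_RCU value defined at the top is a stand-in so the snippet compiles in userspace. In a real build the flag would come from <linux/slab.h>, and the sketch assumes it is visible to the preprocessor as a macro.

```c
#include <assert.h>

/*
 * Stand-in for illustration only: pretend we are on a newer kernel that
 * provides SLAB_TYPESAFE_BY_RCU. The numeric value is an assumption.
 */
#define SLAB_TYPESAFE_BY_RCU 0x00080000UL

/* Pick whichever name the kernel headers provide; fall back to no flag. */
#if defined(SLAB_TYPESAFE_BY_RCU)
#define LDLM_SLAB_RCU_FLAG SLAB_TYPESAFE_BY_RCU
#elif defined(SLAB_DESTROY_BY_RCU)
#define LDLM_SLAB_RCU_FLAG SLAB_DESTROY_BY_RCU
#else
#define LDLM_SLAB_RCU_FLAG 0UL
#endif

/* Hypothetical helper: flags that would be passed when creating the slab. */
static unsigned long ldlm_lock_slab_flags(void)
{
    return LDLM_SLAB_RCU_FLAG;
}

/* Small self-check: with the stand-in above, the new name is selected. */
static void ldlm_lock_slab_flags_selfcheck(void)
{
    assert(ldlm_lock_slab_flags() == SLAB_TYPESAFE_BY_RCU);
}
```

Without the first two branches being checked in this order, a kernel that only defines the new name silently ends up on the final fallback, which is the "just not set it" behaviour described above.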



 Comments   
Comment by Oleg Drokin [ 20/Jun/19 ]

This appears to be a deeper problem than the flag alone: removing the flag, as was done in the patch https://review.whamcloud.com/34147 for LU-11568 and LU-12374, does not seem to help either.

All such crashes are in parallel NFS scale testing on RHEL only, so I must conclude this is just some deep RHEL7 bug.

Comment by James Nunez (Inactive) [ 23/Nov/21 ]

Oleg - Do you think the crash in parallel-scale-nfsv3 racer_on_nfs at https://testing.whamcloud.com/test_sets/6f865cfa-2575-4ccc-8677-6ff091967e56 is the same issue as this ticket? I ask because this is a SLES 15 SP2 test session.

Comment by Oleg Drokin [ 24/Nov/21 ]

No, it's a different, generic kernel bug. The patch for it is here (not sure how you can actually get SuSE to include it though): https://patchwork.kernel.org/project/linux-nfs/cover/cover.1568377101.git.bcodding@redhat.com/

Comment by Peter Jones [ 24/Nov/21 ]

Perhaps neilb would care to comment?

Comment by Neil Brown [ 24/Nov/21 ]

If you want to get a patch into SUSE (all upper-case these days) you open an issue on bugzilla.suse.com, and explain what and why. If you assign the issue to me (or put me on cc or somehow let me know about it - I'm nfbrown@suse.com in bugzilla) I can expedite it.

Or you can ask me directly, and then I can create the bugzilla issue myself.

What would be even better would be for these fixup patches to have been marked "Fixes: ......". Then I would have been alerted to them by our automated machinery. It's too late for that though...

 

Generated at Sat Feb 10 02:40:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.