[LU-11063] RHEL7.[345] RCU breakage Created: 29/May/18 Updated: 07/Sep/23 Resolved: 27/Apr/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Bugzilla ID: | 18015 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I finally traced my debug kernel problems with later RHEL releases to RCU breakage of some sort. The ldlm_locks slab is declared with SLAB_DESTROY_BY_RCU if that flag is defined. This goes back to bugzilla 18015 (https://bugzilla.lustre.org/show_bug.cgi?id=18015), a patch by BobiJam. Now it appears that when we schedule a free in that slab and then destroy the slab, the actual free is delayed and executes after the slab has already been freed, despite rcu_barrier() being present. This is a clear bug, and I will file an RH bugzilla ticket for it. In addition, I wonder how much we need that flag nowadays, especially considering that newer kernels renamed it to SLAB_TYPESAFE_BY_RCU, which we do not detect, so we simply do not set it in that case. Should we just convert ldlm_locks into a normal slab again? |
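The flag-detection gap mentioned in the description can be illustrated with a minimal userspace sketch. It is not the Lustre source: `LDLM_SLAB_RCU_FLAG` and `ldlm_locks_slab_flags()` are hypothetical names, and the flag value is a stand-in, since in a kernel build the flag comes from `<linux/slab.h>` and only one of the two names exists on a given kernel.

```c
#include <assert.h>

/*
 * Userspace stand-in (assumption): in a kernel build this definition comes
 * from <linux/slab.h>. The value is illustrative only.
 */
#ifndef SLAB_TYPESAFE_BY_RCU
#define SLAB_TYPESAFE_BY_RCU 0x00080000UL
#endif

/* Hypothetical compat macro: use whichever name this kernel provides. */
#if defined(SLAB_TYPESAFE_BY_RCU)
#define LDLM_SLAB_RCU_FLAG SLAB_TYPESAFE_BY_RCU
#elif defined(SLAB_DESTROY_BY_RCU)
#define LDLM_SLAB_RCU_FLAG SLAB_DESTROY_BY_RCU
#else
#define LDLM_SLAB_RCU_FLAG 0UL /* neither name known: plain slab */
#endif

/*
 * Hypothetical helper: the flags that would be passed to
 * kmem_cache_create() for the ldlm_locks slab. Passing use_rcu == 0
 * models the "convert it back to a normal slab" option.
 */
static unsigned long ldlm_locks_slab_flags(int use_rcu)
{
	return use_rcu ? LDLM_SLAB_RCU_FLAG : 0UL;
}

/* Small self-check of the "normal slab" case. */
static void ldlm_slab_flags_selfcheck(void)
{
	assert(ldlm_locks_slab_flags(0) == 0UL);
}
```

Checking only for the old name is what silently drops the flag on kernels that have the new one; checking both names keeps the behaviour the same across kernel versions.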
| Comments |
| Comment by Oleg Drokin [ 20/Jun/19 ] |
|
This appears to be a deeper case than just removing the flag as was done in this patch https://review.whamcloud.com/34147 for All such crashes occur in parallel NFS scale testing on RHEL only, so I must conclude this is just some deep RHEL7 bug. |
| Comment by James Nunez (Inactive) [ 23/Nov/21 ] |
|
Oleg - Do you think the crash in parallel-scale-nfsv3 racer_on_nfs at https://testing.whamcloud.com/test_sets/6f865cfa-2575-4ccc-8677-6ff091967e56 is the same issue as this ticket? I ask because this is a SLES 15 SP2 test session. |
| Comment by Oleg Drokin [ 24/Nov/21 ] |
|
No, it's a different generic kernel bug. The patch for it is here (not sure how you can actually get SuSE to include it, though): https://patchwork.kernel.org/project/linux-nfs/cover/cover.1568377101.git.bcodding@redhat.com/ |
| Comment by Peter Jones [ 24/Nov/21 ] |
|
Perhaps neilb would care to comment? |
| Comment by Neil Brown [ 24/Nov/21 ] |
|
If you want to get a patch into SUSE (all upper-case these days) you open an issue on bugzilla.suse.com, and explain what and why. You can assign the issue to me (or put me on cc, or somehow let me know about it; I'm nfbrown@suse.com). Or you can ask me directly, and then I can create the bugzilla issue myself. What would be even better would be for these fixup patches to have been marked "Fixes: ......". Then I would have been alerted to them by our automated machinery. It's too late for that though...
|