[LU-17283] sles12.5 always crashes at client unmount Created: 11/Nov/23 Updated: 12/Nov/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:

Test session details:

It looks like the sles12.5 client is crashing 100% of test runs on master right at unmount:

[ 2025.506089] Lustre: Unmounted lustre-client
[ 2025.507805] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[ 2025.509455] IP: wb_workfn+0x2b/0x450
[ 2025.511544] CPU: 0 PID: 282 Comm: kworker/u4:3 Tainted: G OE 4.12.14-122.133-default #1 SLE12-SP5
[ 2025.513428] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 2025.514554] Workqueue: writeback wb_workfn
[ 2025.516554] RIP: 0010:wb_workfn+0x2b/0x450
[ 2025.529303] Call Trace:
[ 2025.532599] process_one_work+0x14c/0x390
[ 2025.533464] worker_thread+0x1c3/0x3e0
[ 2025.534241] kthread+0xf6/0x130

It looks like some kind of workqueue that is not flushed before unmount, or maybe RCU related?

This is commit v2_15_58-183-g21295b169b (2 commits before 2.15.59). |
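As a possible aid for triage, the faulting symbol and offset can be pulled out of the oops text and turned into a gdb `list *(symbol+offset)` query against a `vmlinux` that has debuginfo for 4.12.14-122.133-default. The sketch below is only an illustration: the helper names `fault_location` and `gdb_command` are hypothetical, and it assumes the oops uses the `IP:`/`RIP:` line format shown above. The fault address of 0x50 would be consistent with a field read at offset 0x50 through a NULL structure pointer, but that is an inference, not something the log proves.

```python
import re

# Matches the "IP:" or "RIP:" line of an oops, for example
# "RIP: 0010:wb_workfn+0x2b/0x450" (the "0010:" segment prefix is optional).
FAULT_RE = re.compile(
    r"\bR?IP: (?:[0-9a-f]{4}:)?"
    r"(?P<sym>[A-Za-z_][\w.]*)\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def fault_location(oops_text):
    """Return (symbol, offset, function_size) from the first IP/RIP line, or None."""
    match = FAULT_RE.search(oops_text)
    if match is None:
        return None
    return (
        match.group("sym"),
        int(match.group("off"), 16),
        int(match.group("size"), 16),
    )


def gdb_command(symbol, offset):
    """Build the gdb query that lists the source around symbol+offset."""
    return f"list *({symbol}+{offset:#x})"


# Excerpt copied from the console log in this ticket.
OOPS = """\
[ 2025.507805] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[ 2025.509455] IP: wb_workfn+0x2b/0x450
[ 2025.516554] RIP: 0010:wb_workfn+0x2b/0x450
"""

if __name__ == "__main__":
    symbol, offset, size = fault_location(OOPS)
    print(gdb_command(symbol, offset))
```

Running the printed command inside `gdb vmlinux` should show which pointer `wb_workfn` dereferences at that offset, which may help tell a missed writeback flush apart from an RCU ordering problem.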
| Comments |
| Comment by Peter Jones [ 11/Nov/23 ] |
|
Do we need to worry about this? We're supporting the latest SLES15 SPx client for 2.16 and EOL is looming for this older version - https://endoflife.date/sles |
| Comment by Andreas Dilger [ 11/Nov/23 ] |
|
It looks like this first started crashing on 2023-10-16 and has crashed for every test run since then. Patches landed at that time:

$ git log --oneline --after 2023-10-14 --before 2023-10-17
a9411a9856 LU-17076 nrs: wait for RCU completion
8d82cf1413 LU-17015 gss: bump token buffer size to 16KiB
4c6290087b LU-12896 gss: key can be unlinked when timeout expires
6f5870dd87 LU-16218 utils: add component flags "prefrd" and "prefwr"
b156790dea LU-17136 ldiskfs: increase max extent tree depth
16e4383e90 LU-17129 tests: cleanup fileset info on nodemaps
3df9e032db LU-17109 kernel: new kernel [SLES15 SP5 5.14.21-150500.55.22.1]
ce54b5281c LU-17084 lod: fix comparision in lod_striping_load()
2b3371d5ee LU-16796 target: Change struct barrier_instance to use kref
067dfd8d27 LU-8191 libcfs: convert functions to static, removed function
2d8c7027e9 LU-16962 build: cleanup configure messages
7cce9f2d1c LU-15002 utils: disable meta_bg and enable packed_meta_blocks
51529fb57f LU-16966 osd: take trunc_lock for fallocate

I don't see anything obvious that would affect the client. Possibly the SLES15 SP5 patch changed something in the configure/build, or |
| Comment by Alex Zhuravlev [ 12/Nov/23 ] |
Probably something that was already broken in the original ext4, but is now enabled and exposed? |
| Comment by Andreas Dilger [ 12/Nov/23 ] |
|
But this is a client crash, since we don't run SLES12 servers in testing. |
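One way to act on that observation is to narrow the commit list from 2023-10-14 to 2023-10-17 by subsystem tag before bisecting. The sketch below is illustrative only: `parse_oneline`, `client_candidates`, and the `ASSUMED_SERVER_ONLY` set are hypothetical names, and treating `ldiskfs`, `osd`, and `lod` as server-only is an assumption made here, not a conclusion reached in this ticket. Tags such as `nrs` and `target` are deliberately left in, since the RCU change has not been ruled out.

```python
# Subsystem tags treated as server-only in this sketch. This set is an
# assumption for illustration; adjust it to match your own judgment.
ASSUMED_SERVER_ONLY = {"ldiskfs", "osd", "lod"}


def parse_oneline(line):
    """Split '<hash> <LU-ticket> <subsystem>: <subject>' into its four parts."""
    commit, ticket, rest = line.split(maxsplit=2)
    subsystem, _, subject = rest.partition(":")
    return commit, ticket, subsystem.strip(), subject.strip()


def client_candidates(lines, excluded=ASSUMED_SERVER_ONLY):
    """Keep commits whose subsystem tag is not in the excluded set."""
    kept = []
    for line in lines:
        commit, ticket, subsystem, _subject = parse_oneline(line)
        if subsystem not in excluded:
            kept.append((commit, ticket, subsystem))
    return kept


# Commit list copied from the `git log --oneline` output in this ticket.
LOG = """\
a9411a9856 LU-17076 nrs: wait for RCU completion
8d82cf1413 LU-17015 gss: bump token buffer size to 16KiB
4c6290087b LU-12896 gss: key can be unlinked when timeout expires
6f5870dd87 LU-16218 utils: add component flags "prefrd" and "prefwr"
b156790dea LU-17136 ldiskfs: increase max extent tree depth
16e4383e90 LU-17129 tests: cleanup fileset info on nodemaps
3df9e032db LU-17109 kernel: new kernel [SLES15 SP5 5.14.21-150500.55.22.1]
ce54b5281c LU-17084 lod: fix comparision in lod_striping_load()
2b3371d5ee LU-16796 target: Change struct barrier_instance to use kref
067dfd8d27 LU-8191 libcfs: convert functions to static, removed function
2d8c7027e9 LU-16962 build: cleanup configure messages
7cce9f2d1c LU-15002 utils: disable meta_bg and enable packed_meta_blocks
51529fb57f LU-16966 osd: take trunc_lock for fallocate
""".splitlines()

if __name__ == "__main__":
    for commit, ticket, subsystem in client_candidates(LOG):
        print(commit, ticket, subsystem)
```

The remaining commits could then be reverted or bisected on a sles12.5 client, starting with the ones that touch shared code or the build.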