[LU-17283] sles12.5 always crashes at client unmount Created: 11/Nov/23  Updated: 12/Nov/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/f43b22b1-5c6c-444b-b9be-ecfe70a1c164

Test session details:
clients: https://build.whamcloud.com/job/lustre-master/4475 - 4.12.14-122.133-default
servers: https://build.whamcloud.com/job/lustre-master/4475 - 4.18.0-477.27.1.el8_lustre.x86_64

It looks like the sles12.5 client is crashing in 100% of test runs on master, right at unmount:

[ 2025.506089] Lustre: Unmounted lustre-client
[ 2025.507805] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[ 2025.509455] IP: wb_workfn+0x2b/0x450
[ 2025.511544] CPU: 0 PID: 282 Comm: kworker/u4:3 Tainted: G           OE      4.12.14-122.133-default #1 SLE12-SP5
[ 2025.513428] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 2025.514554] Workqueue: writeback wb_workfn
[ 2025.516554] RIP: 0010:wb_workfn+0x2b/0x450
[ 2025.529303] Call Trace:
[ 2025.532599]  process_one_work+0x14c/0x390
[ 2025.533464]  worker_thread+0x1c3/0x3e0
[ 2025.534241]  kthread+0xf6/0x130
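
For triage, the key fields of the oops above can be pulled out mechanically. The following is a minimal sketch, not part of the original report: the helper name summarize_oops and the dictionary keys are hypothetical, and the sample text is copied from the trace above. A fault address as small as 0x50 is usually consistent with a field being read through a NULL structure pointer, but the trace alone does not establish which structure that is.

```python
import re

# Excerpt copied from the console log in this ticket.
OOPS = """\
[ 2025.507805] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[ 2025.509455] IP: wb_workfn+0x2b/0x450
[ 2025.514554] Workqueue: writeback wb_workfn
"""


def summarize_oops(text):
    """Extract the fault address, faulting symbol, and workqueue from an oops.

    Hypothetical helper for illustration only; it assumes the 4.12-era
    message layout shown in this ticket and raises AttributeError if a
    field is missing.
    """
    fault = re.search(r"NULL pointer dereference at ([0-9a-f]+)", text)
    ip = re.search(r"IP: (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text)
    wq = re.search(r"Workqueue: (\S+) (\S+)", text)
    return {
        "fault_addr": int(fault.group(1), 16),  # address that was dereferenced
        "symbol": ip.group(1),                  # function containing the fault
        "offset": int(ip.group(2), 16),         # byte offset into that function
        "size": int(ip.group(3), 16),           # total size of that function
        "workqueue": wq.group(1),
        "work_fn": wq.group(2),
    }


summary = summarize_oops(OOPS)
print(summary)
```

Here the fault is 0x2b bytes into wb_workfn, running on the writeback workqueue, immediately after the "Unmounted lustre-client" message.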

It looks like some kind of workqueue is not being flushed before unmount, or maybe this is RCU related?

This is commit v2_15_58-183-g21295b169b (2 commits before 2.15.59).



 Comments   
Comment by Peter Jones [ 11/Nov/23 ]

Do we need to worry about this? We're supporting the latest SLES15 SPx client for 2.16 and EOL is looming for this older version - https://endoflife.date/sles

Comment by Andreas Dilger [ 11/Nov/23 ]

It looks like this first started crashing on 2023-10-16 and has crashed in every test run since then. Patches that landed around that time:

$ git log --oneline --after 2023-10-14 --before 2023-10-17
a9411a9856 LU-17076 nrs: wait for RCU completion
8d82cf1413 LU-17015 gss: bump token buffer size to 16KiB
4c6290087b LU-12896 gss: key can be unlinked when timeout expires
6f5870dd87 LU-16218 utils: add component flags "prefrd" and "prefwr"
b156790dea LU-17136 ldiskfs: increase max extent tree depth
16e4383e90 LU-17129 tests: cleanup fileset info on nodemaps
3df9e032db LU-17109 kernel: new kernel [SLES15 SP5 5.14.21-150500.55.22.1]
ce54b5281c LU-17084 lod: fix comparision in lod_striping_load()
2b3371d5ee LU-16796 target: Change struct barrier_instance to use kref
067dfd8d27 LU-8191 libcfs: convert functions to static, removed function
2d8c7027e9 LU-16962 build: cleanup configure messages
7cce9f2d1c LU-15002 utils: disable meta_bg and enable packed_meta_blocks
51529fb57f LU-16966 osd: take trunc_lock for fallocate

I don't see anything obvious that would affect the client. Possibly the SLES15 SP5 patch changed something in the configure/build, or LU-16962 caused a configure check to break and we're building with a bad kernel ABI?
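
One way to narrow the list is to set aside patches whose component tag is server-side only and look at what remains. The following is a rough sketch under stated assumptions: the SERVER_ONLY set is a guess made for illustration (it is not taken from this ticket), and the names parse_log and SERVER_ONLY are hypothetical. The commit lines are copied from the git log output above.

```python
import re

# Commit list copied from the `git log --oneline` output in this comment.
GIT_LOG = """\
a9411a9856 LU-17076 nrs: wait for RCU completion
8d82cf1413 LU-17015 gss: bump token buffer size to 16KiB
4c6290087b LU-12896 gss: key can be unlinked when timeout expires
6f5870dd87 LU-16218 utils: add component flags "prefrd" and "prefwr"
b156790dea LU-17136 ldiskfs: increase max extent tree depth
16e4383e90 LU-17129 tests: cleanup fileset info on nodemaps
3df9e032db LU-17109 kernel: new kernel [SLES15 SP5 5.14.21-150500.55.22.1]
ce54b5281c LU-17084 lod: fix comparision in lod_striping_load()
2b3371d5ee LU-16796 target: Change struct barrier_instance to use kref
067dfd8d27 LU-8191 libcfs: convert functions to static, removed function
2d8c7027e9 LU-16962 build: cleanup configure messages
7cce9f2d1c LU-15002 utils: disable meta_bg and enable packed_meta_blocks
51529fb57f LU-16966 osd: take trunc_lock for fallocate
"""

# "<sha> <LU-ticket> <component>: <subject>"
LINE = re.compile(
    r"^(?P<sha>[0-9a-f]+) (?P<ticket>LU-\d+) (?P<component>[\w-]+): (?P<subject>.+)$"
)


def parse_log(text):
    """Split each one-line commit summary into sha, ticket, component, subject."""
    return [m.groupdict() for m in map(LINE.match, text.splitlines()) if m]


# ASSUMPTION: components treated here as server-side only. Adjust as needed.
SERVER_ONLY = {"ldiskfs", "osd", "lod", "target"}

commits = parse_log(GIT_LOG)
candidates = [c for c in commits if c["component"] not in SERVER_ONLY]

for c in candidates:
    print(c["sha"], c["ticket"], c["component"])
```

The remaining commits are only a starting point for revert testing or a bisect on the SLES12 SP5 client; the filter itself does not show that any of them is the cause.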

Comment by Alex Zhuravlev [ 12/Nov/23 ]

7cce9f2d1c LU-15002 utils: disable meta_bg and enable packed_meta_blocks

Probably something broken in the original ext4 that is now enabled and exposed?

Comment by Andreas Dilger [ 12/Nov/23 ]

But this is a client crash, since we don't run SLES12 servers in testing.

Generated at Sat Feb 10 03:34:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.