[LU-16129] on umount: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 Created: 31/Aug/22  Updated: 25/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0, Lustre 2.15.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Robert Redl Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Kernel: 4.18.0-372.19.1.el8_6.x86_64


Severity: 3

 Description   

On some clients we started to see crashes like this one:

[ 3245.563036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 3245.563067] PGD 0 P4D 0 
[ 3245.563075] Oops: 0000 [#1] SMP NOPTI
[ 3245.563085] CPU: 0 PID: 21272 Comm: ldlm_bl_05 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-372.19.1.el8_6.x86_64 #1
[ 3245.563110] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[ 3245.563130] RIP: 0010:ll_lock_cancel_bits+0x34f/0x920 [lustre]
[ 3245.563167] Code: af d8 48 89 c5 48 85 c0 74 10 48 89 c7 e8 59 fa ff ff 48 89 ef e8 f1 a3 af d8 48 8b 04 24 a8 11 74 24 48 8b 43 28 48 8b 40 68 <48> 3b 58 30 74 0e 48 89 df e8 93 8e fb ff f6 04 24 11 74 08 48 89
[ 3245.563201] RSP: 0018:ffffb1cb07e5fd20 EFLAGS: 00010202
[ 3245.563213] RAX: 0000000000000000 RBX: ffff970add7f5ca0 RCX: 0000000000000000
[ 3245.563227] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff970add7f5d28
[ 3245.563240] RBP: ffff970add7f5c00 R08: ffffb1cb07e5faa0 R09: 0000000000000000
[ 3245.563253] R10: 0000000000000000 R11: ffff970a8602a800 R12: 0000000000000012
[ 3245.563266] R13: 0000000000000000 R14: ffff970d7445a400 R15: ffff970d74458cf8
[ 3245.563281] FS:  0000000000000000(0000) GS:ffff970dafc00000(0000) knlGS:0000000000000000
[ 3245.563296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3245.563308] CR2: 0000000000000030 CR3: 0000000091410003 CR4: 0000000000770ef0
[ 3245.563324] PKRU: 55555554
[ 3245.563331] Call Trace:
[ 3245.563342]  ? __wake_up_common_lock+0x89/0xc0
[ 3245.563354]  ll_md_blocking_ast+0x198/0x2f0 [lustre]
[ 3245.563384]  ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
[ 3245.563446]  ldlm_cli_cancel_local+0xcb/0x440 [ptlrpc]
[ 3245.563506]  ldlm_cli_cancel_list_local+0x108/0x300 [ptlrpc]
[ 3245.563575]  ldlm_bl_thread_main+0x832/0x920 [ptlrpc]
[ 3245.563636]  ? finish_wait+0x80/0x80
[ 3245.563645]  ? ldlm_handle_bl_callback+0x3f0/0x3f0 [ptlrpc]
[ 3245.563704]  kthread+0x10a/0x120
[ 3245.563733]  ? set_kthread_struct+0x40/0x40
[ 3245.563744]  ret_from_fork+0x35/0x40
[ 3245.563755] Modules linked in: binfmt_misc mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) ksocklnd(OE) obdclass(OE) lnet(OE) libcfs(OE) sunrpc intel_rapl_msr intel_rapl_common amd_energy kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr joydev i2c_piix4 ext4 mbcache jbd2 xfs libcrc32c sr_mod cdrom ata_generic bochs_drm drm_vram_helper sd_mod drm_kms_helper t10_pi sg syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm drm ata_piix libata crc32c_intel virtio_net serio_raw net_failover virtio_console failover virtio_scsi dm_mirror dm_region_hash dm_log dm_mod zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE)
[ 3245.563904] CR2: 0000000000000030 

This seems to happen when umount is executed, but I'm not 100% sure about that.



 Comments   
Comment by Robert Redl [ 31/Aug/22 ]

I just saw that this looks a lot like LU-15757, but it is still happening with Lustre 2.15.1.

Comment by Etienne Aujames [ 01/Sep/22 ]

Hello,

The patch https://review.whamcloud.com/47086 ("LU-15757 llite: check s_root ll_md_blocking_ast()") is not present in 2.15.1, so this is likely a duplicate.
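
For illustration only, here is a minimal user-space sketch of the kind of guard the patch title describes. It is not the actual LU-15757 change. The `mock_*` types and functions are hypothetical stand-ins for the kernel's inode, super_block and dentry, and they assume that the faulting comparison reads the root dentry through `s_root` while unmount may already have cleared it. The oops above shows RAX as 0 and CR2 as 0x30, which would fit a field at offset 0x30 being read through a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures; only the fields that
 * matter for the comparison are modelled. */
struct mock_inode;

struct mock_dentry {
    struct mock_inode *d_inode;
};

struct mock_super_block {
    struct mock_dentry *s_root;   /* assumed to be NULL once unmount has begun */
};

struct mock_inode {
    struct mock_super_block *i_sb;
};

/* Guarded check: a missing root dentry means "not the root inode"
 * instead of a NULL pointer dereference. */
static bool mock_is_root_inode(const struct mock_inode *inode)
{
    const struct mock_dentry *root = inode->i_sb->s_root;

    if (root == NULL)
        return false;
    return root->d_inode == inode;
}

/* Exercises the mounted case and the mid-unmount case. */
static void mock_example(void)
{
    struct mock_super_block sb = { .s_root = NULL };
    struct mock_inode root_inode = { .i_sb = &sb };
    struct mock_dentry root_dentry = { .d_inode = &root_inode };

    sb.s_root = &root_dentry;            /* file system still mounted */
    assert(mock_is_root_inode(&root_inode));

    sb.s_root = NULL;                    /* unmount has cleared s_root */
    assert(!mock_is_root_inode(&root_inode));
}
```

Calling `mock_example()` from a test program runs both cases. An unguarded version of the same comparison would fault in the second case, much like the trace in the description.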

Comment by Robert Redl [ 01/Sep/22 ]

Dear Etienne, thanks for pointing out that the patch for LU-15757 has not been merged into 2.15.1. I have now recompiled Lustre with this patch applied and installed the new package on a few clients. I will report back on whether the problem shows up again.

Comment by James A Simmons [ 25/Aug/23 ]

LU-15757 has been backported to 2.15.X (git commit e6dd92e27af98b38ba5bd8e7e818efa82971a145).
