Details
Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.15.0, Lustre 2.15.1
Labels: None
Environment: Kernel: 4.18.0-372.19.1.el8_6.x86_64
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
On some clients we started to see crashes like this one:
[ 3245.563036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 3245.563067] PGD 0 P4D 0
[ 3245.563075] Oops: 0000 [#1] SMP NOPTI
[ 3245.563085] CPU: 0 PID: 21272 Comm: ldlm_bl_05 Kdump: loaded Tainted: P OE --------- - - 4.18.0-372.19.1.el8_6.x86_64 #1
[ 3245.563110] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[ 3245.563130] RIP: 0010:ll_lock_cancel_bits+0x34f/0x920 [lustre]
[ 3245.563167] Code: af d8 48 89 c5 48 85 c0 74 10 48 89 c7 e8 59 fa ff ff 48 89 ef e8 f1 a3 af d8 48 8b 04 24 a8 11 74 24 48 8b 43 28 48 8b 40 68 <48> 3b 58 30 74 0e 48 89 df e8 93 8e fb ff f6 04 24 11 74 08 48 89
[ 3245.563201] RSP: 0018:ffffb1cb07e5fd20 EFLAGS: 00010202
[ 3245.563213] RAX: 0000000000000000 RBX: ffff970add7f5ca0 RCX: 0000000000000000
[ 3245.563227] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff970add7f5d28
[ 3245.563240] RBP: ffff970add7f5c00 R08: ffffb1cb07e5faa0 R09: 0000000000000000
[ 3245.563253] R10: 0000000000000000 R11: ffff970a8602a800 R12: 0000000000000012
[ 3245.563266] R13: 0000000000000000 R14: ffff970d7445a400 R15: ffff970d74458cf8
[ 3245.563281] FS: 0000000000000000(0000) GS:ffff970dafc00000(0000) knlGS:0000000000000000
[ 3245.563296] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3245.563308] CR2: 0000000000000030 CR3: 0000000091410003 CR4: 0000000000770ef0
[ 3245.563324] PKRU: 55555554
[ 3245.563331] Call Trace:
[ 3245.563342] ? __wake_up_common_lock+0x89/0xc0
[ 3245.563354] ll_md_blocking_ast+0x198/0x2f0 [lustre]
[ 3245.563384] ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
[ 3245.563446] ldlm_cli_cancel_local+0xcb/0x440 [ptlrpc]
[ 3245.563506] ldlm_cli_cancel_list_local+0x108/0x300 [ptlrpc]
[ 3245.563575] ldlm_bl_thread_main+0x832/0x920 [ptlrpc]
[ 3245.563636] ? finish_wait+0x80/0x80
[ 3245.563645] ? ldlm_handle_bl_callback+0x3f0/0x3f0 [ptlrpc]
[ 3245.563704] kthread+0x10a/0x120
[ 3245.563733] ? set_kthread_struct+0x40/0x40
[ 3245.563744] ret_from_fork+0x35/0x40
[ 3245.563755] Modules linked in: binfmt_misc mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) ksocklnd(OE) obdclass(OE) lnet(OE) libcfs(OE) sunrpc intel_rapl_msr intel_rapl_common amd_energy kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr joydev i2c_piix4 ext4 mbcache jbd2 xfs libcrc32c sr_mod cdrom ata_generic bochs_drm drm_vram_helper sd_mod drm_kms_helper t10_pi sg syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm drm ata_piix libata crc32c_intel virtio_net serio_raw net_failover virtio_console failover virtio_scsi dm_mirror dm_region_hash dm_log dm_mod zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE)
[ 3245.563904] CR2: 0000000000000030
This seems to happen when umount is executed, but I'm not 100% sure about that.