Details
Bug
Resolution: Fixed
Medium
Description
On a server running 2.14.0-ddn224, we experienced the following crash:
............................
[15512950.624462] Lustre: 3108282:0:(gss_svc_upcall.c:702:gss_svc_searchbyctx()) ctx hdl 0x8475e97e671260bd does not have mech ctx: rc = -2
[15512950.626740] Lustre: 3108282:0:(gss_svc_upcall.c:702:gss_svc_searchbyctx()) Skipped 6 previous similar messages
[15513002.737034] Lustre: 3049973:0:(gss_svc_upcall.c:702:gss_svc_searchbyctx()) ctx hdl 0x8475e97e671045b1 does not have mech ctx: rc = -2
[15513002.739346] Lustre: 3049973:0:(gss_svc_upcall.c:702:gss_svc_searchbyctx()) Skipped 12 previous similar messages
[15513017.063399] Lustre: 3049973:0:(gss_svc_upcall.c:702:gss_svc_searchbyctx()) ctx hdl 0x8475e97e67124544 does not have mech ctx: rc = -2
[15513017.065681] Lustre: 3049973:0:(gss_svc_upcall.c:702:gss_svc_searchbyctx()) Skipped 19 previous similar messages
[15513030.616096] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[15513030.618039] PGD 0
[15513030.618936] Oops: 0000 [#1] SMP NOPTI
[15513030.619952] CPU: 1 PID: 553126 Comm: mdt00_040 Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.53.1.el8_lustre.ddn17.x86_64 #1
[15513030.622581] Hardware name: DDN SFA400NVX2E, BIOS 1.16.3-20250613_205747-8ebaeaa02506 04/01/2014
[15513030.624401] RIP: 0010:hash_walk_new_entry+0x9/0x60
[15513030.625617] Code: 74 0d f7 d1 44 21 c1 83 c1 01 39 c8 0f 47 c1 29 47 18 c3 cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 77 20 <8b> 46 08 89 47 08 89 c2 48 8b 0e 25 ff 0f 00 00 c1 ea 0c 89 47 08
[15513030.629644] RSP: 0018:ff4da25f13b6bad0 EFLAGS: 00010246
[15513030.630970] RAX: 0000000000000000 RBX: ff225ccd8203bf50 RCX: 0000000000000000
[15513030.632604] RDX: 0000000000000018 RSI: 0000000000000000 RDI: ff4da25f13b6bad8
[15513030.634194] RBP: 0000000000000000 R08: ff225cceb7288fcc R09: 0000000000000008
[15513030.635803] R10: 0000000200000005 R11: ff225cd184619480 R12: ff4da25f13b6bb40
[15513030.637497] R13: ff4da25f13b6bcf0 R14: 0000000000000000 R15: ff225ccd8203bf00
[15513030.639277] FS: 0000000000000000(0000) GS:ff225cefb1640000(0000) knlGS:0000000000000000
[15513030.641166] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15513030.642626] CR2: 0000000000000008 CR3: 0000001a1dc10003 CR4: 0000000000771ee0
[15513030.644227] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15513030.645871] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[15513030.647560] PKRU: 55555554
[15513030.648391] Call Trace:
[15513030.649176] ? __die_body+0x1a/0x60
[15513030.650120] ? no_context+0x1ba/0x3f0
[15513030.651096] ? __bad_area_nosemaphore+0x157/0x180
[15513030.652283] ? do_page_fault+0x37/0x12d
[15513030.653373] ? page_fault+0x1e/0x30
[15513030.654309] ? hash_walk_new_entry+0x9/0x60
[15513030.655367] shash_ahash_update+0x41/0x70
[15513030.656442] gss_digest_hash+0x8a/0x1d0 [ptlrpc_gss]
[15513030.657630] krb5_make_checksum+0x22c/0x5d0 [ptlrpc_gss]
[15513030.659040] ? gss_crypt_generic+0x320/0x320 [ptlrpc_gss]
[15513030.660444] ? __rsc_free+0x1e/0x30 [ptlrpc_gss]
[15513030.661639] gss_verify_mic_kerberos+0xcc/0x3a0 [ptlrpc_gss]
[15513030.663002] ? gss_crypt_generic+0x320/0x320 [ptlrpc_gss]
[15513030.664378] gss_verify_msg+0xce/0x1d0 [ptlrpc_gss]
[15513030.665607] gss_svc_verify_request+0x363/0x6a0 [ptlrpc_gss]
[15513030.667008] gss_svc_accept+0x7f6/0xae0 [ptlrpc_gss]
[15513030.668223] sptlrpc_svc_unwrap_request+0x19c/0x650 [ptlrpc]
[15513030.669688] ptlrpc_server_handle_req_in+0xf8/0x8f0 [ptlrpc]
[15513030.671061] ptlrpc_main+0xaef/0x13a0 [ptlrpc]
[15513030.672267] ? ptlrpc_register_service+0xf30/0xf30 [ptlrpc]
[15513030.673603] kthread+0x134/0x150
[15513030.674452] ? set_kthread_struct+0x50/0x50
[15513030.675433] ret_from_fork+0x1f/0x40
[15513030.676392] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) ptlrpc_gss(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) binfmt_misc sctp ip6_udp_tunnel udp_tunnel libcrc32c rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm kvm_intel bochs drm_vram_helper kvm drm_ttm_helper ttm drm_kms_helper irqbypass crct10dif_pclmul syscopyarea crc32_pclmul iTCO_wdt sysfillrect ghash_clmulni_intel iTCO_vendor_support sysimgblt rapl drm i2c_i801 pcspkr joydev lpc_ich i6300esb auth_rpcgss sunrpc ext4 mbcache jbd2 sd_mod t10_pi mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sr_mod cdrom sg mlx5_core(OE) pci_hyperv_intf ahci mlxdevm(OE) psample libahci mlxfw(OE) bnxt_en libata mlx_compat(OE) virtio_net crc32c_intel tls serio_raw net_failover virtio_blk virtio_scsi failover
[15513030.676462] dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
[15513030.694799] Red Hat flags: eBPF/cgroup
[15513030.695723] CR2: 0000000000000008
So far this appears to be a one-off occurrence.
After some heavy debugging of the crash dump, cross-checked against the ptlrpc_gss and kernel crypto/scatterlist source code, it seems that we have hit a corner case where one of the GSS request message buffers (the 2nd in our case) spans 2 pages, causing an sg_table to be set up/allocated in gss_digest_hash().
Unfortunately, there seems to be a bug in gss_digest_hash(): this condition is not handled correctly, and prealloc_sg is still wrongly referenced when entering the kernel crypto code via crypto_ahash_update().
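The BUG()/Oops is then triggered because prealloc_sg, still filled with the 1st buffer's information, is used with the wrong size, so SG_END (prealloc_sg being a single scatterlist struct) is encountered too early. This is consistent with the register dump: hash_walk_new_entry() faults at address 0x8 with RSI = 0, i.e. a read at offset 8 from a NULL scatterlist pointer.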
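To illustrate the failure mode described above, here is a toy model in Python, not the actual ptlrpc_gss or kernel code. SgEntry, sg_next() and walk() are hypothetical stand-ins for a scatterlist entry, the kernel's list traversal and the hash walk, and the buffer sizes used below are invented for the example. It only sketches why a walk asked for more bytes than a single end-marked entry describes runs off the end of the list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SgEntry:
    """Toy stand-in for one scatterlist entry (hypothetical, not the kernel struct)."""
    length: int
    is_last: bool = False  # models the SG_END marker


def sg_next(table: List[SgEntry], idx: int) -> Optional[int]:
    """Return the index of the next entry, or None once the end marker is hit."""
    if table[idx].is_last:
        return None
    return idx + 1


def walk(table: List[SgEntry], nbytes: int) -> int:
    """Consume nbytes from the list, entry by entry.

    Raises RuntimeError at the point where real code would follow a
    NULL "next" entry.
    """
    idx: Optional[int] = 0
    done = 0
    while done < nbytes:
        if idx is None:
            raise RuntimeError(f"ran past SG_END after {done} of {nbytes} bytes")
        done += min(table[idx].length, nbytes - done)
        if done < nbytes:
            idx = sg_next(table, idx)
    return done


# Stale single entry still describing the 1st buffer (sizes are made up).
stale = [SgEntry(length=100, is_last=True)]
# Table actually built for the 2nd buffer, which spans two pages.
fresh = [SgEntry(length=3000), SgEntry(length=2000, is_last=True)]
```

In this model, walk(fresh, 5000) completes, while walk(stale, 5000) raises after the first 100 bytes, which mirrors the suspected sequence: the size of the 2nd buffer is requested against a list that only describes the 1st one.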
I will cook a fix proposal soon.