[LU-16727] kmod-lustre-client 2.15.2 on rockylinux 8.7(4.18.0-425.19.2.el8_7.x86_64) freezes and locks cpu Created: 11/Apr/23  Updated: 23/Jun/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Kevin Konzem Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

fresh install of rockylinux 8.7(4.18.0-425.19.2.el8_7.x86_64) on a vm


Attachments: PNG File MicrosoftTeams-image (8).png    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I have been working on upgrading some systems to Rocky Linux 8.7 and Lustre 2.15.2. When I try to upgrade a client, the system freezes hard the moment I try to modprobe either lustre or lnet. I have to hard power down the system in VMware to get it to boot again. My coworker suspects a kernel module verification failure, similar to this issue:
https://stackoverflow.com/questions/24975377/kvm-module-verification-failed-signature-and-or-required-key-missing-taintin
But I am unable to tell if that is truly the problem.

I see in the changelog for 2.15.2 that kernels are only supported up to 4.18.0-425.3.1.el8, and I am able to get 2.15.2 working by downgrading the kernel to 425.3.1, but my security department is rather draconian and will not let me run a non-current kernel. Is there anything I can try to get the Lustre client working on 425.19.2?



 Comments   
Comment by Andreas Dilger [ 11/Apr/23 ]

I can't comment on why the existing modules are failing (having a serial console or netconsole attached to the client when it hangs would help), but you can always build your own kernel modules for your specific kernel; that's what open source is all about.
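
For reference, a client-only build against a specific kernel might look roughly like the sketch below. It assumes the matching kernel-devel package is installed under /usr/src/kernels, that the public lustre-release repository is used, and that the release tag is named v2_15_2. The helper lustre_client_build_cmds is hypothetical and only prints the commands so they can be reviewed before being run.

```shell
#!/bin/bash
# Print (not run) the commands for a client-only Lustre build.
# lustre_client_build_cmds is a hypothetical helper, not part of Lustre.
lustre_client_build_cmds() {
    local kver="${1:-$(uname -r)}"   # kernel release to build against
    local tag="${2:-v2_15_2}"        # assumed tag name for Lustre 2.15.2
    cat <<EOF
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout ${tag}
sh autogen.sh
./configure --disable-server --with-linux=/usr/src/kernels/${kver}
make rpms
EOF
}

lustre_client_build_cmds 4.18.0-425.19.2.el8_7.x86_64
```

The resulting RPMs install their modules under the directory for the kernel they were built against rather than being linked in through weak-updates.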

Comment by Kevin Konzem [ 12/Apr/23 ]

We were able to get an strace of running modprobe lustre. We are not against building our own modules, but we prefer to use precompiled RPMs whenever possible, since that makes our patching process much easier.

Comment by Andreas Dilger [ 16/Apr/23 ]

The strace shows that the hang occurs right at the first module, libcfs, and that the module is loaded from the weak-updates directory, so it was definitely not built for the running kernel.
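
A quick way to confirm where a module comes from is to look at its path: entries under weak-updates are links to modules built for a different kernel that was judged kABI-compatible. The sketch below classifies a path of that form; module_origin is a hypothetical helper, and the modinfo commands in the comments are what one would run on a live client.

```shell
#!/bin/bash
# Classify a module path by the directory it is loaded from.
# module_origin is a hypothetical helper for illustration only.
module_origin() {
    case "$1" in
        */weak-updates/*) echo "weak-updates" ;;  # built for another kernel
        */extra/*)        echo "extra" ;;         # installed for this kernel
        *)                echo "other" ;;
    esac
}

module_origin /lib/modules/4.18.0-425.19.2.el8_7.x86_64/weak-updates/lustre-client/net/libcfs.ko

# On a live client (not run here):
#   modinfo -n libcfs            # path that modprobe would load
#   modinfo -F vermagic libcfs   # kernel release the module was built for
#   uname -r                     # running kernel release
```

If the vermagic string differs from the output of uname -r, the module was built for another kernel.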

However, this doesn't show anything about why the module is hanging.  That would be shown on the serial console.  If the node is still accessible, then "dmesg" would show the stack trace associated with the "watchdog: soft lockup" message.

If the node is inaccessible after the module load, then you need to attach a (virtual) serial console either to a real serial port, IPMI over ethernet, or configure netconsole to send the kernel messages to another node.
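
As a rough sketch of the netconsole option, the module takes a parameter of the form [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]. The helper netconsole_param below is hypothetical, and the addresses, interface name, and port are placeholders that must be replaced with real values; the command is printed rather than run.

```shell
#!/bin/bash
# Build the netconsole= module parameter:
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# netconsole_param is a hypothetical helper; all values are placeholders.
netconsole_param() {
    local src_ip="$1" dev="$2" tgt_ip="$3" tgt_mac="${4:-}" port="${5:-6666}"
    echo "netconsole=${port}@${src_ip}/${dev},${port}@${tgt_ip}/${tgt_mac}"
}

# Print the command for the hanging client instead of running it.
echo "modprobe netconsole $(netconsole_param 192.0.2.10 eth0 192.0.2.20)"
```

On the receiving node, a UDP listener on the same port (for example nc -u -l 6666, depending on the netcat variant) should then capture the kernel messages, including the stack trace printed by the lockup watchdog.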

Comment by Gregoire Pichon [ 21/Jun/23 ]

I have reproduced this issue on a Red Hat 8.7 system with kernel 4.18.0-425.19.2.el8_7.x86_64 and Lustre 2.15.2 (lustre-client-2.15.2-1.el8.x86_64 rpm from the Whamcloud download site).

Here are the dmesg traces from the crash dump after loading the libcfs module:

# modprobe -v libcfs
insmod /lib/modules/4.18.0-425.19.2.el8_7.x86_64/weak-updates/lustre-client/net/libcfs.ko 

 

[  946.664603] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30Modules linked in: libcfs(OE+) ptlnet(OE) bxi_portals(OE) bxi(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp ast drm_vram_helper drm_ttm_helper coretemp iTCO_wdt kvm_intel kvm ttm irqbypass ipmi_ssif drm_kms_helper syscopyarea sysfillrect sysimgblt iTCO_vendor_support fb_sys_fops crct10dif_pclmul crc32_pclmul mei_me mei ghash_clmulni_intel drm joydev sunrpc acpi_ipmi rapl lpc_ich intel_cstate ipmi_si intel_uncore ipmi_devintf ipmi_msghandler ioatdma pcspkr i2c_i801 wmi acpi_power_meter acpi_pad ext4 mbcache jbd2 sd_mod t10_pi sg ahci igb libahci crc32c_intel libata i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: bxi]
[  946.664625] CPU: 30 PID: 23972 Comm: modprobe Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.19.2.el8_7.x86_64 #1
[  946.664625] Hardware name: Bull SAS R424-E4/X10DRT-P, BIOS 3.2 11/19/2019
[  946.664626] RIP: 0010:native_queued_spin_lock_slowpath+0x61/0x1c0
[  946.664627] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 e9 4e b6 aa 00 8b 37 81 fe 00
[  946.664627] RSP: 0018:ffffadc70894fc00 EFLAGS: 00000002
[  946.664628] RAX: 0000000000000101 RBX: ffff9cb9c4436000 RCX: 0000000000000010
[  946.664629] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9cb9c4436000
[  946.664629] RBP: ffffadc70894fc98 R08: 0000000000007604 R09: ffff9cb9c4435780
[  946.664630] R10: 0000000000000000 R11: ffff9cc8ffca9c04 R12: ffff9cb9c4436000
[  946.664630] R13: 0000000000000000 R14: ffffffffc0c57e50 R15: 0000000000000000
[  946.664631] FS:  00007fc409180740(0000) GS:ffff9cc8ffc80000(0000) knlGS:0000000000000000
[  946.664631] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  946.664631] CR2: 00007ffe0eca07d0 CR3: 0000000125994002 CR4: 00000000001706e0
[  946.664632] Call Trace:
[  946.664632]  _raw_spin_lock_irq+0x25/0x2c
[  946.664632]  cfs_trace_lock_tcd+0x75/0x80 [libcfs]
[  946.664633]  cfs_tracefile_exit+0xb2/0x2a0 [libcfs]
[  946.664633]  ? cfs_debug_init+0x1d/0x1d [libcfs]
[  946.664633]  libcfs_debug_cleanup+0x29/0x40 [libcfs]
[  946.664634]  libcfs_init+0x318/0x320 [libcfs]
[  946.664634]  do_one_initcall+0x46/0x1d0
[  946.664634]  ? do_init_module+0x22/0x230
[  946.664635]  ? kmem_cache_alloc_trace+0x142/0x280
[  946.664635]  do_init_module+0x5a/0x230
[  946.664635]  load_module+0x14bf/0x17f0
[  946.664635]  ? __do_sys_finit_module+0xb1/0x110
[  946.664636]  __do_sys_finit_module+0xb1/0x110
[  946.664636]  do_syscall_64+0x5b/0x1b0
[  946.664637]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  946.664637] RIP: 0033:0x7fc4080949bd
[  946.664638] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9b 64 38 00 f7 d8 64 89 01 48
[  946.664639] RSP: 002b:00007ffe0eca37f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  946.664639] RAX: ffffffffffffffda RBX: 000055fd6a0d40f0 RCX: 00007fc4080949bd
[  946.664640] RDX: 0000000000000000 RSI: 000055fd692128b6 RDI: 0000000000000003
[  946.664640] RBP: 000055fd692128b6 R08: 0000000000000000 R09: 0000000000000000
[  946.664641] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
[  946.664641] R13: 000055fd6a0d4090 R14: 0000000000040000 R15: 0000000000000000
[  946.664642] Kernel panic - not syncing: Hard LOCKUP
[  946.664642] CPU: 30 PID: 23972 Comm: modprobe Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.19.2.el8_7.x86_64 #1
[  946.664643] Hardware name: Bull SAS R424-E4/X10DRT-P, BIOS 3.2 11/19/2019
[  946.664643] Call Trace:
[  946.664643]  <NMI>
[  946.664644]  dump_stack+0x41/0x60
[  946.664644]  panic+0xe7/0x2ac
[  946.664644]  nmi_panic.cold.11+0xc/0xc
[  946.664644]  watchdog_overflow_callback.cold.7+0x5c/0x70
[  946.664645]  __perf_event_overflow+0x52/0x100
[  946.664645]  handle_pmi_common+0x1f7/0x2c0
[  946.664646]  ? __set_pte_vaddr+0x32/0x50
[  946.664646]  ? __native_set_fixmap+0x24/0x40
[  946.664646]  intel_pmu_handle_irq+0xeb/0x420
[  946.664646]  perf_event_nmi_handler+0x2d/0x50
[  946.664647]  nmi_handle+0x63/0x110
[  946.664647]  default_do_nmi+0x49/0x110
[  946.664647]  do_nmi+0x1af/0x220
[  946.664648]  end_repeat_nmi+0x16/0x69

 

 

Comment by James A Simmons [ 21/Jun/23 ]

Can you try the patch at https://review.whamcloud.com/c/fs/lustre-release/+/50992 ?

Comment by Gregoire Pichon [ 23/Jun/23 ]

I applied the patch mentioned in the previous comment on top of Lustre 2.15.2, but again hit a "soft lockup".

# insmod /root/rpmbuild/BUILDROOT/lustre-client-2.15.2-1.el8.x86_64/lib/modules/4.18.0-425.19.2.el8_7.x86_64/extra/lustre-client/net/libcfs.ko
# modprobe -v lnet
insmod /lib/modules/4.18.0-425.19.2.el8_7.x86_64/weak-updates/lustre-client/net/lnet.ko
# lctl net up
Message from syslogd@quito7 at Jun 23 11:18:19 ...
 kernel:watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [lctl:2442] 
[  268.397701] libcfs: loading out-of-tree module taints kernel.
[  268.403673] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[  268.417815] LNet: HW NUMA nodes: 2, HW CPU cores: 48, npartitions: 2
[  268.426107] alg: No test for adler32 (adler32-zlib)
[  269.182362] Key type ._llcrypt registered
[  269.186371] Key type .llcrypt registered
[  287.320016] LNet: Added LNI 10.0.1.8@tcp [8/256/0/180]
[  287.325271] LNet: Accept secure, port 988
[  312.093524] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [lctl:2442]
[  312.100302] Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ast drm_vram_helper drm_ttm_helper kvm ipmi_ssif ttm irqbypass iTCO_wdt iTCO_vendor_support drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul sysfillrect sysimgblt fb_sys_fops ghash_clmulni_intel drm rapl joydev intel_cstate acpi_ipmi wmi intel_uncore pcspkr ipmi_si mei_me mei ioatdma ipmi_devintf lpc_ich acpi_power_meter i2c_i801 ipmi_msghandler acpi_pad ext4 mbcache jbd2 sd_mod t10_pi sg igb ahci libahci i2c_algo_bit crc32c_intel libata dca dm_mirror dm_region_hash dm_log dm_mod
[  312.164237] CPU: 0 PID: 2442 Comm: lctl Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.19.2.el8_7.x86_64 #1
[  312.175693] Hardware name: Bull SAS R424-E4/X10DRT-P, BIOS 3.2 11/19/2019
[  312.182472] RIP: 0010:native_queued_spin_lock_slowpath+0x5f/0x1c0
[  312.188563] Code: 71 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 e9 4e b6 aa 00 8b 37 81
[  312.207300] RSP: 0018:ffffa976481dbba8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[  312.214858] RAX: 0000000000000101 RBX: ffffa976481dbca0 RCX: 0000000000000000
[  312.221982] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9a31c5b974e0
[  312.229106] RBP: ffff9a31c5b974f0 R08: ffff9a22c3453e10 R09: ffff9a29312c2880
[  312.236230] R10: 0000000000608040 R11: 0000000000000246 R12: ffffa976481dbcf0
[  312.243355] R13: 0000000000000000 R14: ffff9a31c5b974f0 R15: ffff9a29312c2880
[  312.250480] FS:  00007f44ea12c740(0000) GS:ffff9a317f800000(0000) knlGS:0000000000000000
[  312.258558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  312.264295] CR2: 0000564468043f44 CR3: 00000001834c4003 CR4: 00000000001706f0
[  312.271419] Call Trace:
[  312.273862]  _raw_spin_lock+0x1e/0x30
[  312.277520]  lnet_ptl_attach_md+0xe0/0x5a0 [lnet]
[  312.282255]  ? lnet_res_lh_initialize+0x53/0x70 [lnet]
[  312.287400]  LNetMDAttach+0xd5/0x230 [lnet]
[  312.291586]  lnet_ping_target_setup+0x114/0x2c0 [lnet]
[  312.296725]  ? lnet_ping_target_fini+0xd0/0xd0 [lnet]
[  312.301779]  LNetNIInit+0x7cd/0xcf0 [lnet]
[  312.305887]  ? _cond_resched+0x15/0x30
[  312.309632]  lnet_configure+0x4e/0x70 [lnet]
[  312.313912]  lnet_ioctl+0x9a/0x260 [lnet]
[  312.317933]  notifier_call_chain+0x47/0x70
[  312.322025]  blocking_notifier_call_chain+0x42/0x60
[  312.326896]  libcfs_psdev_ioctl+0x34a/0x590 [libcfs]
[  312.331862]  do_vfs_ioctl+0xa4/0x690
[  312.335432]  ? syscall_trace_enter+0x1ff/0x2d0
[  312.339869]  ksys_ioctl+0x64/0xa0
[  312.343179]  __x64_sys_ioctl+0x16/0x20
[  312.346924]  do_syscall_64+0x5b/0x1b0
[  312.350581]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  312.355624] RIP: 0033:0x7f44e86d17cb
[  312.359197] Code: 73 01 c3 48 8b 0d bd 66 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 66 38 00 f7 d8 64 89 01 48
[  312.377933] RSP: 002b:00007ffe3e71a718 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[  312.385491] RAX: ffffffffffffffda RBX: 00007f44e9f0fb80 RCX: 00007f44e86d17cb
[  312.392615] RDX: 00007ffe3e71a750 RSI: 00000000c008653b RDI: 0000000000000003
[  312.399740] RBP: 00000000c008653b R08: 00007f44ea13d5e0 R09: 0000000000000003
[  312.406864] R10: 000000000000000f R11: 0000000000000206 R12: 0000555867a0625e
[  312.413988] R13: 00007ffe3e71a750 R14: 0000000000000002 R15: 0000000000000000 

 

Generated at Sat Feb 10 03:29:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.