[LU-16773] lustre-initialization-1: watchdog/crash in cfs_crypto_performance_test() Created: 26/Apr/23 Updated: 05/Aug/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run:
lustre-initialization failed with the following error several times on master since 2023-04-25:
"onyx-32vm1 crashed during lustre-initialization-1"
[ 131.782070] libcfs: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 2
[ 131.840422] alg: No test for adler32 (adler32-zlib)
[ 156.040120] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:7372]
[ 156.047450] Modules linked in: libcfs(OE+) rpcsec_gss_krb5 auth_rpcgss nfsv4
dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4
virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata
crc32c_intel serio_raw virtio_net net_failover failover virtio_blk
[ 156.053201] CPU: 0 PID: 7372 Comm: modprobe Kdump: loaded 4.18.0-425.3.1.el8.x86_64 #1
[ 156.055346] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 156.056463] RIP: 0010:native_safe_halt+0xe/0x20
[ 156.073070] Call Trace:
[ 156.073591] kvm_wait+0x58/0x60
[ 156.074244] __pv_queued_spin_lock_slowpath+0x268/0x2a0
[ 156.075265] cfs_trace_unlock_tcd+0x20/0x70 [libcfs]
[ 156.076265] libcfs_debug_msg+0x98d/0xb10 [libcfs]
[ 156.079433] cfs_crypto_performance_test+0x202/0x320 [libcfs]
[ 156.082984] cfs_crypto_register+0x33/0x50 [libcfs]
[ 156.083953] libcfs_init+0x192/0x32e [libcfs]
[ 156.084828] do_one_initcall+0x46/0x1d0
[ 156.087318] do_init_module+0x5a/0x230
[ 156.088074] load_module+0x14bf/0x17f0
Test session details: |
| Comments |
| Comment by Arshad Hussain [ 16/May/23 ] |
|
+1 on Master (https://testing.whamcloud.com/test_sessions/e66c0e5a-b8be-4637-b275-7d040277f4d9) |
| Comment by James A Simmons [ 17/May/23 ] |
|
I wonder if https://review.whamcloud.com/#/c/fs/lustre-release/+/50992 will resolve this? |
| Comment by Andreas Dilger [ 05/Aug/23 ] |
|
Just hit this again on master/x86_64 with the 50992 patch applied. In hindsight, that patch relates to lockdep warnings, whereas this failure is caused by the crypto performance test running too long at module load time. We probably need to reduce the amount of data being checked and/or push the checks to a work queue (one work item per compression type, or maybe a few per type to get better stats) so that they do not block module loading. This would also speed up module loading and allow the checks to run in parallel while other modules are loaded. We just need to make sure the work items are waited for or cancelled if the module is unloaded quickly (though that should be rare). |
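For discussion, here is a rough userspace sketch of the work-queue idea from the comment above. It is not libcfs code. POSIX threads stand in for kernel work items (in the kernel this would presumably map onto INIT_WORK()/schedule_work() at load time and cancel_work_sync() or flush_work() at unload). All names (perf_work, perf_tests_start, perf_tests_stop, sum32_sw), the 64 KiB buffer, and the 50 ms per-algorithm budget are assumptions made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PERF_BUF_SIZE  (64 * 1024) /* assumption: small buffer keeps each pass short */
#define PERF_BUDGET_MS 50.0        /* assumption: time budget per algorithm */

typedef uint32_t (*hash_fn)(const unsigned char *buf, size_t len);

/* One deferred test per algorithm (stand-in for a kernel work item). */
struct perf_work {
	const char *name;
	hash_fn     fn;
	pthread_t   thread;
	int         started;
	atomic_int  cancel;     /* set on "unload" to stop the test early */
	double      mb_per_sec; /* result; valid once the worker is joined */
};

static unsigned char perf_buf[PERF_BUF_SIZE];

/* Plain software Adler-32, used only as a sample workload. */
static uint32_t adler32_sw(const unsigned char *buf, size_t len)
{
	uint32_t a = 1, b = 0;
	for (size_t i = 0; i < len; i++) {
		a = (a + buf[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

/* Hypothetical second "algorithm": a trivial byte sum. */
static uint32_t sum32_sw(const unsigned char *buf, size_t len)
{
	uint32_t s = 0;
	for (size_t i = 0; i < len; i++)
		s += buf[i];
	return s;
}

static double now_ms(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Runs off the load path: hash the buffer until the budget expires or cancel is set. */
static void *perf_worker(void *arg)
{
	struct perf_work *w = arg;
	double start = now_ms(), elapsed;
	size_t bytes = 0;
	volatile uint32_t sink = 0; /* keeps the hash calls from being optimized away */

	assert(w->fn != NULL);
	do {
		sink ^= w->fn(perf_buf, sizeof(perf_buf));
		bytes += sizeof(perf_buf);
		elapsed = now_ms() - start;
	} while (elapsed < PERF_BUDGET_MS && !atomic_load(&w->cancel));

	if (elapsed > 0)
		w->mb_per_sec = (bytes / (1024.0 * 1024.0)) / (elapsed / 1000.0);
	return NULL;
}

/* "Module load": queue one test per algorithm and return immediately. */
static size_t perf_tests_start(struct perf_work *works, size_t n)
{
	size_t ok = 0;

	memset(perf_buf, 0xAB, sizeof(perf_buf)); /* filled before any worker starts */
	for (size_t i = 0; i < n; i++) {
		atomic_store(&works[i].cancel, 0);
		works[i].mb_per_sec = 0.0;
		works[i].started = pthread_create(&works[i].thread, NULL,
						  perf_worker, &works[i]) == 0;
		ok += works[i].started;
	}
	return ok;
}

/* "Module unload": optionally cancel, then wait for every started test. */
static void perf_tests_stop(struct perf_work *works, size_t n, int cancel)
{
	for (size_t i = 0; i < n; i++)
		if (cancel)
			atomic_store(&works[i].cancel, 1);
	for (size_t i = 0; i < n; i++) {
		if (works[i].started) {
			pthread_join(works[i].thread, NULL);
			works[i].started = 0;
		}
	}
}
```

In this sketch the caller would run perf_tests_start() from its init path and perf_tests_stop(works, n, 1) from its exit path, so a quick unload never leaves a test running against freed state. Whether one item or several per algorithm gives better statistics is left open here, as it is in the comment.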