[LU-16773] lustre-initialization-1: watchdog/crash in cfs_crypto_performance_test() Created: 26/Apr/23  Updated: 05/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/80638d38-67a2-4a86-b9b5-473cd0bffa0a

lustre-initialization failed with the following error several times on master since 2023-04-25:

"onyx-32vm1 crashed during lustre-initialization-1"
[  131.782070] libcfs: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 2
[  131.840422] alg: No test for adler32 (adler32-zlib)
[  156.040120] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:7372]
[  156.047450] Modules linked in: libcfs(OE+) rpcsec_gss_krb5 auth_rpcgss nfsv4
      dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common
      crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr i2c_piix4
      virtio_balloon sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata
      crc32c_intel serio_raw virtio_net net_failover failover virtio_blk
[  156.053201] CPU: 0 PID: 7372 Comm: modprobe Kdump: loaded 4.18.0-425.3.1.el8.x86_64 #1
[  156.055346] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  156.056463] RIP: 0010:native_safe_halt+0xe/0x20
[  156.073070] Call Trace:
[  156.073591]  kvm_wait+0x58/0x60
[  156.074244]  __pv_queued_spin_lock_slowpath+0x268/0x2a0
[  156.075265]  cfs_trace_unlock_tcd+0x20/0x70 [libcfs]
[  156.076265]  libcfs_debug_msg+0x98d/0xb10 [libcfs]
[  156.079433]  cfs_crypto_performance_test+0x202/0x320 [libcfs]
[  156.082984]  cfs_crypto_register+0x33/0x50 [libcfs]
[  156.083953]  libcfs_init+0x192/0x32e [libcfs]
[  156.084828]  do_one_initcall+0x46/0x1d0
[  156.087318]  do_init_module+0x5a/0x230
[  156.088074]  load_module+0x14bf/0x17f0

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/94335 - 4.18.0-425.3.1.el8.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/94335 - 4.18.0-425.10.1.el8_lustre.x86_64



 Comments   
Comment by Arshad Hussain [ 16/May/23 ]

+1 on Master (https://testing.whamcloud.com/test_sessions/e66c0e5a-b8be-4637-b275-7d040277f4d9)

Comment by James A Simmons [ 17/May/23 ]

I wonder if https://review.whamcloud.com/#/c/fs/lustre-release/+/50992 will resolve this?

Comment by Andreas Dilger [ 05/Aug/23 ]

Just hit this again on master/x86_64 with the 50992 patch applied:
https://testing.whamcloud.com/test_sets/846a9c20-e17a-491d-92e9-a02ac7faa46b

In hindsight, that patch relates to lockdep warnings, whereas this lockup is caused by the crypto performance tests running too long at module load time. We probably need to reduce the amount of data being checked and/or push the checks to a work queue (one work item per compression type, maybe a few per type to get better stats) so that they don't block module loading.

This would also have the benefit of speeding up module loading and allowing the tests to run in parallel while other modules are loaded. We just need to make sure the work items are waited for or cancelled if the module is unloaded quickly (though that should be rare).

Generated at Sat Feb 10 03:29:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.