[LU-17406] sanity-flr test_50A: watchdog: BUG: soft lockup - CPU#0 stuck for 22s Created: 09/Jan/24 Updated: 09/Jan/24
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:

test_50A failed with the following error after both CPUs were locked up:

onyx-99vm1 crash during sanity-flr test_50A

Test session details:

Lustre: DEBUG MARKER: == sanity-flr test 50A: mirror split update layout generation ===== 19:25:25 (1704741925)
Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
Lustre: Failing over lustre-MDT0000
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ldlm_bl_02:77462]
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ldlm_bl_03:80014]
CPU: 1 PID: 80014 Comm: ldlm_bl_03 4.18.0-477.27.1.el8_lustre.x86_64 #1
CPU: 0 PID: 77462 Comm: ldlm_bl_02 4.18.0-477.27.1.el8_lustre.x86_64 #1
RIP: 0010:cfs_hash_for_each_relax+0x17b/0x480 [libcfs]
Call Trace:
 kvm_wait+0x58/0x60
 __pv_queued_spin_lock_slowpath+0x268/0x2a0
 cfs_hash_for_each_nolock+0x126/0x1f0 [libcfs]
 ldlm_reprocess_recovery_done+0x8b/0x100 [ptlrpc]
 _raw_spin_lock+0x1e/0x30
 cfs_hash_for_each_relax+0x14a/0x480 [libcfs]
 cfs_hash_for_each_nolock+0x126/0x1f0 [libcfs]
 ldlm_reprocess_recovery_done+0x8b/0x100 [ptlrpc]
 ldlm_export_cancel_locks+0x172/0x180 [ptlrpc]
 ldlm_export_cancel_locks+0x172/0x180 [ptlrpc]
 ldlm_bl_thread_main+0x6df/0x940 [ptlrpc]
 ldlm_bl_thread_main+0x6df/0x940 [ptlrpc]
 kthread+0x134/0x150
 kthread+0x134/0x150
 ret_from_fork+0x35/0x40
 ret_from_fork+0x35/0x40

The duplicated lines in the stack trace appear to be because both CPUs are printing to the console at the same time; both threads are in ldlm_export_cancel_locks() and contending on the same spinlock.

This similar stack also appeared in

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
| Comments |
| Comment by Andreas Dilger [ 09/Jan/24 ] |
|
I think that there shouldn't be more than one thread evicting a client at once, so there should be some kind of flag on the export that puts the other thread to sleep (or it just returns) while the first thread cancels all of the locks. |