[LU-17406] sanity-flr test_50A: watchdog: BUG: soft lockup - CPU#0 stuck for 22s Created: 09/Jan/24  Updated: 09/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17349 sanity-quota test_81: Kernel panic - ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/112570ae-2e64-4c60-bd13-b1447c7934fa

test_50A failed with the following error after both CPUs were locked up:

onyx-99vm1 crash during sanity-flr test_50A

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/101181 - 4.18.0-477.27.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/101181 - 4.18.0-477.27.1.el8_lustre.x86_64

 Lustre: DEBUG MARKER: == sanity-flr test 50A: mirror split update layout generation ===== 19:25:25 (1704741925)
 Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
 Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
 Lustre: Failing over lustre-MDT0000
 watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ldlm_bl_02:77462]
 watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ldlm_bl_03:80014]
 CPU: 1 PID: 80014 Comm: ldlm_bl_03 4.18.0-477.27.1.el8_lustre.x86_64 #1
 CPU: 0 PID: 77462 Comm: ldlm_bl_02 4.18.0-477.27.1.el8_lustre.x86_64 #1
 RIP: 0010:cfs_hash_for_each_relax+0x17b/0x480 [libcfs]
 Call Trace:
  kvm_wait+0x58/0x60
  __pv_queued_spin_lock_slowpath+0x268/0x2a0
  cfs_hash_for_each_nolock+0x126/0x1f0 [libcfs]
  ldlm_reprocess_recovery_done+0x8b/0x100 [ptlrpc]
  _raw_spin_lock+0x1e/0x30
  cfs_hash_for_each_relax+0x14a/0x480 [libcfs]
  cfs_hash_for_each_nolock+0x126/0x1f0 [libcfs]
  ldlm_reprocess_recovery_done+0x8b/0x100 [ptlrpc]
  ldlm_export_cancel_locks+0x172/0x180 [ptlrpc]
  ldlm_export_cancel_locks+0x172/0x180 [ptlrpc]
  ldlm_bl_thread_main+0x6df/0x940 [ptlrpc]
  ldlm_bl_thread_main+0x6df/0x940 [ptlrpc]
  kthread+0x134/0x150
  kthread+0x134/0x150
  ret_from_fork+0x35/0x40
  ret_from_fork+0x35/0x40

The duplicated lines in the stack trace appear to be an artifact of both CPUs printing to the console at the same time. Both threads are in ldlm_export_cancel_locks() and are contending on the same spinlock.
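For reference, the soft-lockup signature itself can be illustrated generically: two kthreads contending on one spinlock whose holder never reaches a scheduling point will trip the watchdog on both CPUs, just like the paired CPU#0/CPU#1 reports above. A minimal standalone demo module (not Lustre code; all names here are made up for illustration, and error handling is elided):

  #include <linux/delay.h>
  #include <linux/kthread.h>
  #include <linux/module.h>
  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(demo_lock);
  static struct task_struct *demo_task[2];

  /* Both threads fight over demo_lock.  The winner busy-waits for 30s
   * with the lock held and no scheduling point; the loser spins on the
   * lock.  Neither CPU schedules, so the watchdog reports a soft lockup
   * on both after ~20s. */
  static int demo_thread(void *unused)
  {
          while (!kthread_should_stop()) {
                  spin_lock(&demo_lock);
                  mdelay(30000);          /* 30s busy-wait, lock held */
                  spin_unlock(&demo_lock);
          }
          return 0;
  }

  static int __init demo_init(void)
  {
          demo_task[0] = kthread_run(demo_thread, NULL, "lockup_demo_0");
          demo_task[1] = kthread_run(demo_thread, NULL, "lockup_demo_1");
          return 0;
  }

  static void __exit demo_exit(void)
  {
          kthread_stop(demo_task[0]);
          kthread_stop(demo_task[1]);
  }

  module_init(demo_init);
  module_exit(demo_exit);
  MODULE_LICENSE("GPL");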

A similar stack also appeared in LU-17349.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-flr test_50A - onyx-99vm1 crashed during sanity-flr test_50A



 Comments   
Comment by Andreas Dilger [ 09/Jan/24 ]

I think there shouldn't be more than one thread evicting a client at once, so there should be some kind of flag on the export that puts a second thread to sleep, or has it simply return, while the first thread cancels all of the locks; see the sketch below.
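A minimal sketch of that guard, assuming a hypothetical atomic flag word on struct obd_export (exp_flags and EXP_CANCELING_BIT do not exist in the tree under these names; the stand-in struct is only for illustration):

  #include <linux/bitops.h>

  /* Hypothetical bit number reserved for "lock cancel in progress" */
  #define EXP_CANCELING_BIT 0

  struct obd_export {
          unsigned long exp_flags;        /* hypothetical atomic flag word */
          /* ... existing export fields ... */
  };

  void ldlm_export_cancel_locks_guarded(struct obd_export *exp)
  {
          /* atomic test-and-set: exactly one thread sees the bit clear
           * and proceeds; any concurrent caller returns immediately */
          if (test_and_set_bit(EXP_CANCELING_BIT, &exp->exp_flags))
                  return;

          /* ... existing ldlm_export_cancel_locks() body: walk the
           * export's lock hash and cancel each lock ... */

          clear_bit(EXP_CANCELING_BIT, &exp->exp_flags);
  }

If the second thread must not return until cancellation completes, wait_on_bit()/wake_up_bit() on the same bit would give the "put the other thread to sleep" variant instead of the early return.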
