[LU-17349] sanity-quota test_81: Kernel panic - not syncing: softlockup: hung tasks Created: 08/Dec/23  Updated: 09/Jan/24  Resolved: 20/Dec/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Sergey Cheremencev Assignee: Sergey Cheremencev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17406 sanity-flr test_50A: watchdog: BUG: s... Open
Severity: 3

Description

test_81 failed with the following error:

trevis-68vm7 crashed during sanity-quota test_81

Dmesg from vmcore:

[42692.257956] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-quota test 81: Race qmt_start_pool_recalc with qmt_pool_free =============================== 05:00:18 \(1700888418\)
[42692.536584] Lustre: DEBUG MARKER: == sanity-quota test 81: Race qmt_start_pool_recalc with qmt_pool_free =============================== 05:00:18 (1700888418)
[42692.749799] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
[42693.171476] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osp.*.destroys_in_flight
[42694.725049] Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0
[42695.134885] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
[42695.547239] Lustre: DEBUG MARKER: lfs --list-commands
[42695.915540] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.ost=ugp
[42701.359171] Lustre: DEBUG MARKER: /usr/sbin/lctl mark User quota \(block hardlimit:20 MB\)
[42701.570888] Lustre: DEBUG MARKER: User quota (block hardlimit:20 MB)
[42701.765689] Lustre: DEBUG MARKER: lctl pool_new lustre.qpool1
[42704.157598] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.qpool1 				2>/dev/null || echo foo
[42704.557285] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x80000A07 fail_val=30
[42704.949475] Lustre: DEBUG MARKER: /usr/sbin/lctl pool_add lustre.qpool1 lustre-OST[0-0/1]
[42706.444500] LustreError: 12680:0:(fail.c:130:__cfs_fail_timeout_set()) cfs_fail_timeout id a07 sleeping for 30000ms
[42707.325568] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.qpool1 |
				sort -u | tr '\n' ' ' 
[42707.702378] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[42708.083111] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
[42711.531547] sched: RT throttling activated
[42734.489940] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [ldlm_bl_03:9064]
[42734.492361] Modules linked in: nfsd nfs_acl dm_flakey lustre(OE) obdecho(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper joydev ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi
[42734.888235]  virtio_blk virtio_net net_failover failover crct10dif_pclmul crct10dif_common crc32c_intel ata_piix serio_raw libata virtio_pci virtio_ring virtio floppy [last unloaded: dm_flakey]
[42734.926022] CPU: 1 PID: 9064 Comm: ldlm_bl_03 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.83.1.el7_lustre.ddn17.x86_64 #1
[42734.928412] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[42734.929490] task: ffff88382c5f1080 ti: ffff88382f584000 task.ti: ffff88382f584000
[42734.930874] RIP: 0010:[<ffffffff931b9e1b>]  [<ffffffff931b9e1b>] native_safe_halt+0xb/0x30
[42735.066856] RSP: 0018:ffff88382f587c50  EFLAGS: 00000202
[42735.067898] RAX: 0000000000000005 RBX: ffff88382f587c10 RCX: 0000000000000000
[42735.069221] RDX: 0000000000000005 RSI: 0000000000000005 RDI: ffff93b84581c000
[42735.070544] RBP: ffff88382f587c50 R08: 0000000000000000 R09: 00000000000000b0
[42735.071866] R10: 0000000000000000 R11: 0000000000000001 R12: 0000090900000010
[42735.073188] R13: 0000236800000000 R14: 000000000000154f R15: 0000000000533f52
[42735.074498] FS:  0000000000000000(0000) GS:ffff88387fd00000(0000) knlGS:0000000000000000
[42735.076001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42735.077070] CR2: 0000000000519001 CR3: 00000000bb992000 CR4: 00000000000606e0
[42735.089216] Call Trace:
[42735.097974]  [<ffffffff92a6fd95>] kvm_wait+0x65/0x70
[42735.103299]  [<ffffffff92b1e95e>] __pv_queued_spin_lock_slowpath+0x29e/0x2e0
[42735.111331]  [<ffffffff931ac21a>] queued_spin_lock_slowpath+0xb/0x13
[42735.112554]  [<ffffffff931ba734>] _raw_spin_lock+0x24/0x30
[42735.114265]  [<ffffffffc06cdec9>] cfs_hash_spin_lock+0x9/0x10 [libcfs]
[42735.115552]  [<ffffffffc06cea57>] cfs_hash_for_each_relax+0x147/0x450 [libcfs]
[42735.581459]  [<ffffffffc0bb9e30>] ? ldlm_lock_mode_downgrade+0x340/0x340 [ptlrpc]
[42735.582936]  [<ffffffffc0bb9e30>] ? ldlm_lock_mode_downgrade+0x340/0x340 [ptlrpc]
[42735.584409]  [<ffffffffc06d1ed5>] cfs_hash_for_each_nolock+0x75/0x1d0 [libcfs]
[42735.586451]  [<ffffffffc0bbac8c>] ldlm_reprocess_recovery_done+0x3c/0x120 [ptlrpc]
[42735.588445]  [<ffffffffc0bbba3c>] ldlm_export_cancel_locks+0x17c/0x190 [ptlrpc]
[42735.598443]  [<ffffffffc0be72f0>] ldlm_bl_thread_main+0x780/0xab0 [ptlrpc]
[42735.600524]  [<ffffffff92acc680>] ? wake_up_atomic_t+0x40/0x40
[42735.601684]  [<ffffffffc0be6b70>] ? ldlm_handle_bl_callback+0x400/0x400 [ptlrpc]
[42735.603122]  [<ffffffff92acb511>] kthread+0xd1/0xe0
[42735.604049]  [<ffffffff92acb440>] ? insert_kthread_work+0x40/0x40
[42735.605692]  [<ffffffff931c51f7>] ret_from_fork_nospec_begin+0x21/0x21
[42735.606909]  [<ffffffff92acb440>] ? insert_kthread_work+0x40/0x40
[42735.608052] Code: ff ff 66 0f 1f 44 00 00 0f 00 2d fd e7 04 00 e9 14 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 90 55 48 89 e5 66 66 66 66 90 fb f4 <5d> c3 cc cc cc cc 0f 1f 80 00 00 00 00 0f 00 2d 21 f8 04 00 eb 
[42735.613869] Kernel panic - not syncing: softlockup: hung tasks
[42735.614975] CPU: 1 PID: 9064 Comm: ldlm_bl_03 Kdump: loaded Tainted: G           OEL ------------   3.10.0-1160.83.1.el7_lustre.ddn17.x86_64 #1
[42735.617327] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[42735.618420] Call Trace:
[42735.618895]  <IRQ>  [<ffffffff931b1bec>] dump_stack+0x19/0x1f
[42735.620023]  [<ffffffff931ab708>] panic+0xe8/0x21f
[42735.626270]  [<ffffffff92b56f5a>] watchdog_timer_fn+0x20a/0x220
[42735.627420]  [<ffffffff92b56d50>] ? watchdog+0x50/0x50
[42735.628385]  [<ffffffff92acfa0e>] __hrtimer_run_queues+0x10e/0x270
[42735.629567]  [<ffffffff92acff9f>] hrtimer_interrupt+0xaf/0x1e0
[42735.636517]  [<ffffffff92a5fa3b>] local_apic_timer_interrupt+0x3b/0x70
[42735.637771]  [<ffffffff931ca053>] smp_apic_timer_interrupt+0x43/0x60
[42735.638948]  [<ffffffff931c63f2>] apic_timer_interrupt+0x172/0x180
[42735.640111]  <EOI>  [<ffffffff931b9e1b>] ? native_safe_halt+0xb/0x30
[42735.641339]  [<ffffffff92a6fd95>] kvm_wait+0x65/0x70
[42735.642265]  [<ffffffff92b1e95e>] __pv_queued_spin_lock_slowpath+0x29e/0x2e0
[42735.643592]  [<ffffffff931ac21a>] queued_spin_lock_slowpath+0xb/0x13
[42735.644781]  [<ffffffff931ba734>] _raw_spin_lock+0x24/0x30
[42735.645820]  [<ffffffffc06cdec9>] cfs_hash_spin_lock+0x9/0x10 [libcfs]
[42735.647043]  [<ffffffffc06cea57>] cfs_hash_for_each_relax+0x147/0x450 [libcfs]
[42735.648417]  [<ffffffffc0bb9e30>] ? ldlm_lock_mode_downgrade+0x340/0x340 [ptlrpc]
[42735.649822]  [<ffffffffc0bb9e30>] ? ldlm_lock_mode_downgrade+0x340/0x340 [ptlrpc]
[42735.651229]  [<ffffffffc06d1ed5>] cfs_hash_for_each_nolock+0x75/0x1d0 [libcfs]
[42735.652587]  [<ffffffffc0bbac8c>] ldlm_reprocess_recovery_done+0x3c/0x120 [ptlrpc]
[42735.654041]  [<ffffffffc0bbba3c>] ldlm_export_cancel_locks+0x17c/0x190 [ptlrpc]
[42735.655419]  [<ffffffffc0be72f0>] ldlm_bl_thread_main+0x780/0xab0 [ptlrpc]
[42735.656699]  [<ffffffff92acc680>] ? wake_up_atomic_t+0x40/0x40
[42735.657818]  [<ffffffffc0be6b70>] ? ldlm_handle_bl_callback+0x400/0x400 [ptlrpc]
[42735.659216]  [<ffffffff92acb511>] kthread+0xd1/0xe0
[42735.660139]  [<ffffffff92acb440>] ? insert_kthread_work+0x40/0x40
[42735.661276]  [<ffffffff931c51f7>] ret_from_fork_nospec_begin+0x21/0x21
[42735.662488]  [<ffffffff92acb440>] ? insert_kthread_work+0x40/0x40 
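
The markers above show how the test opens the race window: fail point 0x80000A07 with fail_val=30 holds the quota pool recalc thread (the 30000ms sleep in __cfs_fail_timeout_set) while the forced unmount of the MDT runs qmt_pool_free concurrently. During that teardown the ldlm_bl thread in the stack above, cancelling export locks via ldlm_export_cancel_locks() -> cfs_hash_for_each_nolock(), spins on the cfs_hash spinlock until the 21 s soft-lockup watchdog fires and panics the node. Reconstructed from the debug markers (device and pool names are from this run), the reproducer is roughly:

    # inject a 30 s delay into the quota pool recalc path (fail_loc 0x80000A07)
    lctl set_param fail_loc=0x80000A07 fail_val=30
    lctl pool_new lustre.qpool1
    # pool_add starts pool recalc, which sleeps in __cfs_fail_timeout_set()
    lctl pool_add lustre.qpool1 lustre-OST[0-0/1]
    # force-unmount the MDT while the recalc thread is still held,
    # so qmt_pool_free races with qmt_start_pool_recalc
    umount -d -f /mnt/lustre-mds1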


Comments
Comment by Gerrit Updater [ 08/Dec/23 ]

"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53384
Subject: LU-17349 tests: sanity-quota_81 decrease timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e82c155c83c9aae3191ef040103e223497dce5ad
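
The subject indicates a test-side fix: the injected fail_val of 30 s is longer than the 21 s soft-lockup threshold seen in the trace, so the watchdog can panic the node while the recalc thread is still held. Decreasing the timeout keeps the qmt_start_pool_recalc/qmt_pool_free race window open without tripping the watchdog. Illustrative sketch only; the actual value is in the change above:

    # before (as captured in the dmesg): recalc held for 30 s, past the 21 s watchdog
    lctl set_param fail_loc=0x80000A07 fail_val=30
    # after: a shorter hold still races recalc against pool_free
    # (5 here is a placeholder; see change 53384 for the real value)
    lctl set_param fail_loc=0x80000A07 fail_val=5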

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53384/
Subject: LU-17349 tests: sanity-quota_81 decrease timeout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b58219ef1edebcb266cbe0dfede491ba5de491d1

Comment by Peter Jones [ 20/Dec/23 ]

Landed for 2.16

Comment by Andreas Dilger [ 09/Jan/24 ]

Hit another lockup (which triggered a crash) like this in sanity-flr:
https://testing.whamcloud.com/test_sets/112570ae-2e64-4c60-bd13-b1447c7934fa

[ 5945.178891] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ldlm_bl_02:77462]
[ 5945.182887] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ldlm_bl_03:80014]

I've filed that as a separate bug in LU-17406, but it looks very similar to this one. I think there is a problem with threads contending on a DLM spinlock.
