[LU-16343] soft lockups ptlrpcd Created: 25/Nov/22  Updated: 30/Jun/23  Resolved: 27/Jan/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Dneg (Inactive) Assignee: Alex Zhuravlev
Resolution: Incomplete Votes: 0
Labels: None

Attachments: File 2022-11-09-syslog.log.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]

full version: 2.12.8_6_g5457c37-1.el7

Can you let me know what debugging options I should turn on to get the info needed to diagnose the issue?



 Comments   
Comment by Alex Zhuravlev [ 25/Nov/22 ]

any stack trace following that message?

Comment by Dneg (Inactive) [ 25/Nov/22 ]

Hi Alex, yes, sorry, pasted below:

Nov  9 03:11:25 foxtrot3 kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]
Nov  9 03:11:25 foxtrot3 kernel: Modules linked in: rpcsec_gss_krb5 vfat fat mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase nfsv3 nfs fscache mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) dell_rbu libcfs(OE) binfmt_misc bonding iTCO_wdt iTCO_vendor_support dcdbas joydev sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass sg ipmi_si ipmi_devintf ipmi_msghandler acpi_pad wmi acpi_power_meter mei_me mei lpc_ich nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp mrp stp llc mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common bnx2x crc32_pclmul ahci crc32c_intel scsi_transport_iscsi ghash_clmulni_intel libahci
Nov  9 03:11:25 foxtrot3 kernel: drm aesni_intel libata lrw gf128mul glue_helper ablk_helper cryptd megaraid_sas drm_panel_orientation_quirks dm_multipath ptp pps_core mdio libcrc32c sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: usb_storage]
Nov  9 03:11:25 foxtrot3 kernel: CPU: 23 PID: 3531 Comm: ptlrpcd_01_10 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.49.1.el7.x86_64 #1
Nov  9 03:11:25 foxtrot3 kernel: Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 2.5.4 01/22/2016
Nov  9 03:11:25 foxtrot3 kernel: task: ffff9b81766a5280 ti: ffff9b817f080000 task.ti: ffff9b817f080000
Nov  9 03:11:25 foxtrot3 kernel: RIP: 0010:[<ffffffffa3b17aa2>]  [<ffffffffa3b17aa2>] native_queued_spin_lock_slowpath+0x122/0x200
Nov  9 03:11:25 foxtrot3 kernel: RSP: 0018:ffff9b817f083ad0  EFLAGS: 00000246
Nov  9 03:11:25 foxtrot3 kernel: RAX: 0000000000000000 RBX: ffff9b8ab3ab6a00 RCX: 0000000000b90000
Nov  9 03:11:25 foxtrot3 kernel: RDX: ffff9ba17f15b8c0 RSI: 0000000000590001 RDI: ffff9b95f1d56de4
Nov  9 03:11:25 foxtrot3 kernel: RBP: ffff9b817f083ad0 R08: ffff9ba17f2db8c0 R09: 0000000000000000
Nov  9 03:11:25 foxtrot3 kernel: R10: 0000000000000001 R11: ffff9b8ab3ab6a00 R12: ffffffffc09175f8
Nov  9 03:11:25 foxtrot3 kernel: R13: ffff9b8ab3ab6a00 R14: ffff9b817e897000 R15: ffffffffa3c26900
Nov  9 03:11:25 foxtrot3 kernel: FS:  0000000000000000(0000) GS:ffff9ba17f2c0000(0000) knlGS:0000000000000000
Nov  9 03:11:25 foxtrot3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov  9 03:11:25 foxtrot3 kernel: CR2: 000055f56a905fb8 CR3: 000000293f08e000 CR4: 00000000000607e0
Nov  9 03:11:25 foxtrot3 kernel: Call Trace:
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa417dcf3>] queued_spin_lock_slowpath+0xb/0xf
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa418baa0>] _raw_spin_lock+0x20/0x30
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc1041bec>] osc_page_delete+0x1fc/0x500 [osc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc0ce1550>] cl_page_delete0+0x80/0x220 [obdclass]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc0ce1723>] cl_page_delete+0x33/0x110 [obdclass]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc1041861>] discard_pagevec+0x91/0x130 [osc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc104263a>] osc_lru_shrink+0x74a/0x7c0 [osc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc104364c>] lru_queue_work+0x4c/0x230 [osc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc0eae31a>] work_interpreter+0x3a/0xf0 [ptlrpc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc0eab231>] ptlrpc_check_set.part.23+0x481/0x1dd0 [ptlrpc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa3ae26ec>] ? set_next_entity+0x3c/0xe0
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc0eacbdb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc0ed810b>] ptlrpcd_check+0x4ab/0x590 [ptlrpc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc0ed84f0>] ptlrpcd+0x300/0x560 [ptlrpc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa3adadf0>] ? wake_up_state+0x20/0x20
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc0ed81f0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa3ac5e61>] kthread+0xd1/0xe0
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa3ac5d90>] ? insert_kthread_work+0x40/0x40
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa4195df7>] ret_from_fork_nospec_begin+0x21/0x21
Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa3ac5d90>] ? insert_kthread_work+0x40/0x40
Nov  9 03:11:25 foxtrot3 kernel: Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 c0 b8 01 00 48 03 14 c5 60 15 75 a4 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b
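
For anyone triaging similar reports, the watchdog line and the call trace above can be reduced to a short summary with a small script. The following is only a sketch: the function names (parse_lockup, parse_frames) are hypothetical, and the regular expressions assume the el7-style syslog format shown in this paste, so other kernels may need different patterns. Frames marked with "?" are ones the kernel found on the stack but could not confirm as part of the call chain, so they are skipped by default.

```python
import re

# Watchdog line, e.g. "soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
# Trace frame, e.g. "[<ffffffffc1041bec>] osc_page_delete+0x1fc/0x500 [osc]"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\? )?(?P<func>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?"
)


def parse_lockup(line):
    """Return CPU, duration, thread name and PID from a soft-lockup line, or None."""
    m = LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m["cpu"]),
        "seconds": int(m["secs"]),
        "comm": m["comm"],
        "pid": int(m["pid"]),
    }


def parse_frames(lines, include_unreliable=False):
    """Return (function, module) pairs; frames without a module tag are 'kernel'."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m is None:
            continue
        if m["unreliable"] and not include_unreliable:
            continue
        frames.append((m["func"], m["module"] or "kernel"))
    return frames


# A few lines copied from the trace above.
sample = [
    "Nov  9 03:11:25 foxtrot3 kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]",
    "Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa418baa0>] _raw_spin_lock+0x20/0x30",
    "Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffc1041bec>] osc_page_delete+0x1fc/0x500 [osc]",
    "Nov  9 03:11:25 foxtrot3 kernel: [<ffffffffa3ae26ec>] ? set_next_entity+0x3c/0xe0",
]
print(parse_lockup(sample[0]))
# {'cpu': 23, 'seconds': 22, 'comm': 'ptlrpcd_01_10', 'pid': 3531}
print(parse_frames(sample))
# [('_raw_spin_lock', 'kernel'), ('osc_page_delete', 'osc')]
```

Read top-down, the reduced trace shows ptlrpcd spinning in _raw_spin_lock called from osc_page_delete, which matches the trace as pasted.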

Comment by Dneg (Inactive) [ 14/Dec/22 ]

Hi Alex,

Do you need any further information?

Kind regards,
Campbell

Comment by Alex Zhuravlev [ 19/Dec/22 ]

Can you please attach the full dmesg/syslog output? Something bad probably happened earlier.

Comment by Dneg (Inactive) [ 29/Dec/22 ]

Hi Alex,

Full syslog file attached.

Comment by Alex Zhuravlev [ 14/Jan/23 ]

dneg thanks for the log. Unfortunately the log contains only this one trace, so I can't identify the other thread holding the spinlock. Ideally we need a crashdump or a full set of traces (echo t >/proc/sysrq-trigger) to find which process was holding the spinlock and blocking ptlrpcd.
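
If it helps, the trace collection can be scripted so it is ready the next time a lockup is seen. This is a minimal sketch, not a tested procedure: request_task_dump and save_kernel_log are hypothetical helper names, the script has to run as root while the node is still responsive, and it assumes the dmesg command is available. On a busy node the kernel ring buffer may be too small to hold every trace, in which case the syslog copy is the better source.

```python
import subprocess
from pathlib import Path


def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Write 't' to the sysrq trigger so the kernel logs a stack trace per task."""
    # Same effect as: echo t >/proc/sysrq-trigger
    Path(trigger_path).write_text("t\n")


def save_kernel_log(out_path, dmesg_cmd=("dmesg",)):
    """Save the output of dmesg_cmd to out_path and return the number of lines."""
    result = subprocess.run(
        list(dmesg_cmd), capture_output=True, text=True, check=True
    )
    Path(out_path).write_text(result.stdout)
    return len(result.stdout.splitlines())


# Intended use on the affected client, as root (file name is an example only):
#   request_task_dump()
#   save_kernel_log("task-traces.log")
```

Both paths are parameters so the helpers can be tried against ordinary files before being pointed at /proc.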

Comment by Dneg (Inactive) [ 16/Jan/23 ]

Hi Alex,

We have had only one ptlrpcd lockup since the beginning of December last year. I think we could close this ticket for now and open a new one if needed.

Thanks,
Campbell

Generated at Sat Feb 10 03:26:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.