[LU-16343] soft lockups ptlrpcd Created: 25/Nov/22 Updated: 30/Jun/23 Resolved: 27/Jan/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Dneg (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]
Full version: 2.12.8_6_g5457c37-1.el7
Can you let me know what debugging options I should turn on to get the info needed to diagnose the issue? |
| Comments |
| Comment by Alex Zhuravlev [ 25/Nov/22 ] |
|
Is there any stack trace following that message? |
| Comment by Dneg (Inactive) [ 25/Nov/22 ] |
|
Hi Alex, yes, sorry, pasted below:
Nov 9 03:11:25 foxtrot3 kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]
Nov 9 03:11:25 foxtrot3 kernel: Modules linked in: rpcsec_gss_krb5 vfat fat mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase nfsv3 nfs fscache mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) dell_rbu libcfs(OE) binfmt_misc bonding iTCO_wdt iTCO_vendor_support dcdbas joydev sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass sg ipmi_si ipmi_devintf ipmi_msghandler acpi_pad wmi acpi_power_meter mei_me mei lpc_ich nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp mrp stp llc mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common bnx2x crc32_pclmul ahci crc32c_intel scsi_transport_iscsi ghash_clmulni_intel libahci
Nov 9 03:11:25 foxtrot3 kernel: drm aesni_intel libata lrw gf128mul glue_helper ablk_helper cryptd megaraid_sas drm_panel_orientation_quirks dm_multipath ptp pps_core mdio libcrc32c sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: usb_storage]
Nov 9 03:11:25 foxtrot3 kernel: CPU: 23 PID: 3531 Comm: ptlrpcd_01_10 Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.49.1.el7.x86_64 #1
Nov 9 03:11:25 foxtrot3 kernel: Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 2.5.4 01/22/2016
Nov 9 03:11:25 foxtrot3 kernel: task: ffff9b81766a5280 ti: ffff9b817f080000 task.ti: ffff9b817f080000
Nov 9 03:11:25 foxtrot3 kernel: RIP: 0010:[<ffffffffa3b17aa2>] [<ffffffffa3b17aa2>] native_queued_spin_lock_slowpath+0x122/0x200
Nov 9 03:11:25 foxtrot3 kernel: RSP: 0018:ffff9b817f083ad0 EFLAGS: 00000246
Nov 9 03:11:25 foxtrot3 kernel: RAX: 0000000000000000 RBX: ffff9b8ab3ab6a00 RCX: 0000000000b90000
Nov 9 03:11:25 foxtrot3 kernel: RDX: ffff9ba17f15b8c0 RSI: 0000000000590001 RDI: ffff9b95f1d56de4
Nov 9 03:11:25 foxtrot3 kernel: RBP: ffff9b817f083ad0 R08: ffff9ba17f2db8c0 R09: 0000000000000000
Nov 9 03:11:25 foxtrot3 kernel: R10: 0000000000000001 R11: ffff9b8ab3ab6a00 R12: ffffffffc09175f8
Nov 9 03:11:25 foxtrot3 kernel: R13: ffff9b8ab3ab6a00 R14: ffff9b817e897000 R15: ffffffffa3c26900
Nov 9 03:11:25 foxtrot3 kernel: FS: 0000000000000000(0000) GS:ffff9ba17f2c0000(0000) knlGS:0000000000000000
Nov 9 03:11:25 foxtrot3 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 9 03:11:25 foxtrot3 kernel: CR2: 000055f56a905fb8 CR3: 000000293f08e000 CR4: 00000000000607e0
Nov 9 03:11:25 foxtrot3 kernel: Call Trace:
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffa417dcf3>] queued_spin_lock_slowpath+0xb/0xf
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffa418baa0>] _raw_spin_lock+0x20/0x30
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc1041bec>] osc_page_delete+0x1fc/0x500 [osc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc0ce1550>] cl_page_delete0+0x80/0x220 [obdclass]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc0ce1723>] cl_page_delete+0x33/0x110 [obdclass]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc1041861>] discard_pagevec+0x91/0x130 [osc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc104263a>] osc_lru_shrink+0x74a/0x7c0 [osc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc104364c>] lru_queue_work+0x4c/0x230 [osc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc0eae31a>] work_interpreter+0x3a/0xf0 [ptlrpc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc0eab231>] ptlrpc_check_set.part.23+0x481/0x1dd0 [ptlrpc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffa3ae26ec>] ? set_next_entity+0x3c/0xe0
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc0eacbdb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc0ed810b>] ptlrpcd_check+0x4ab/0x590 [ptlrpc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc0ed84f0>] ptlrpcd+0x300/0x560 [ptlrpc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffa3adadf0>] ? wake_up_state+0x20/0x20
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffc0ed81f0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffa3ac5e61>] kthread+0xd1/0xe0
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffa3ac5d90>] ? insert_kthread_work+0x40/0x40
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffa4195df7>] ret_from_fork_nospec_begin+0x21/0x21
Nov 9 03:11:25 foxtrot3 kernel: [<ffffffffa3ac5d90>] ? insert_kthread_work+0x40/0x40
Nov 9 03:11:25 foxtrot3 kernel: Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 c0 b8 01 00 48 03 14 c5 60 15 75 a4 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b
|
| Comment by Dneg (Inactive) [ 14/Dec/22 ] |
|
Hi Alex, Do you need any further information? Kind regards, |
| Comment by Alex Zhuravlev [ 19/Dec/22 ] |
|
Can you please attach the full dmesg/syslog output? Something bad probably happened before the lockup. |
| Comment by Dneg (Inactive) [ 29/Dec/22 ] |
|
Hi Alex, full syslog file attached. |
| Comment by Alex Zhuravlev [ 14/Jan/23 ] |
|
Dneg, thanks for the log. Unfortunately the log contains only this one trace, so I can't identify the other thread holding the spinlock. Ideally we need a crashdump or a full set of traces (echo t >/proc/sysrq-trigger) to find which process was holding the spinlock and thereby blocking ptlrpcd. |
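To check quickly whether a syslog file holds more than one lockup trace, something like the following Python sketch could help. It is not part of Lustre or the kernel: the helper name `extract_soft_lockups` and both regular expressions are hypothetical and assume the el7 log layout pasted earlier in this ticket (a "soft lockup" header, then "Call Trace:", then frames up to the "Code:" line).

```python
import re

# Hypothetical patterns, assuming the el7 syslog layout quoted in this ticket.
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def extract_soft_lockups(lines):
    """Return one dict per soft lockup report found in an iterable of log lines."""
    events = []
    current = None
    in_trace = False
    for line in lines:
        header = LOCKUP_RE.search(line)
        if header:
            current = {
                "cpu": int(header.group(1)),
                "stuck_s": int(header.group(2)),
                "task": header.group(3),
                "pid": int(header.group(4)),
                "frames": [],
            }
            events.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame is None:
                # First non-frame line (for example "Code: ...") ends the trace.
                in_trace = False
                continue
            if frame.group(1):
                # Frames marked "?" are unreliable stack entries; skip them.
                continue
            # Store (function, module); module is None for core kernel symbols.
            current["frames"].append((frame.group(2), frame.group(3)))
    return events
```

For example, `extract_soft_lockups(open("/var/log/messages"))` (the path is an assumption) returns a list whose length shows how many lockup reports were logged. A single entry means the lock holder still has to come from a crashdump or a sysrq task dump.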
| Comment by Dneg (Inactive) [ 16/Jan/23 ] |
|
Hi Alex, We have had only one ptlrpcd lockup since the beginning of December last year. I think we could close this ticket for now and open a new one if needed. Thanks, |