[LU-16180] lustre 2.14.0_ddn54 + 5.15 kernel soft cpu lockups Created: 22/Sep/22  Updated: 27/Oct/22  Resolved: 16/Oct/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing Lustre client code on a system running a 5.15 kernel, soft CPU lockups occurred during FIO-based tests:

[Wed Aug 10 13:40:59 2022] watchdog: BUG: soft lockup - CPU#9 stuck for 48s! [ptlrpcd_04_01:1734]
[Wed Aug 10 13:40:59 2022] CPU: 9 PID: 1734 Comm: ptlrpcd_04_01 Tainted: G O L 5.15.43.hrtdev #1
[Wed Aug 10 13:40:59 2022] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.12.0-1 04/01/2014
[Wed Aug 10 13:40:59 2022] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x30
[Wed Aug 10 13:40:59 2022] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 75 02 5d c3 fb 66 0f 1f 44 00 00 <5d> c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48
[Wed Aug 10 13:40:59 2022] RSP: 0018:ffffb91341a4bba8 EFLAGS: 00000206
[Wed Aug 10 13:40:59 2022] RAX: ffffe4a150371b80 RBX: ffffe4a150371b80 RCX: 00000000ffffffff
[Wed Aug 10 13:40:59 2022] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffa406f4e9b050
[Wed Aug 10 13:40:59 2022] RBP: ffffb91341a4bba8 R08: 000000000000007d R09: 00000000000b6557
[Wed Aug 10 13:40:59 2022] R10: 0000000000000009 R11: ffffb91341a4bb78 R12: ffffa406f4e9b000
[Wed Aug 10 13:40:59 2022] R13: 0000000000000002 R14: 0000000000000003 R15: 0000000000000002
[Wed Aug 10 13:40:59 2022] FS: 0000000000000000(0000) GS:ffffa40d51a40000(0000) knlGS:0000000000000000
[Wed Aug 10 13:40:59 2022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Aug 10 13:40:59 2022] CR2: 0000000000e319cc CR3: 00000001099dc000 CR4: 00000000003506e0
[Wed Aug 10 13:40:59 2022] Call Trace:
[Wed Aug 10 13:40:59 2022] <TASK>
[Wed Aug 10 13:40:59 2022] __page_cache_release+0x1d5/0x220
[Wed Aug 10 13:40:59 2022] __put_page+0x3a/0x90
[Wed Aug 10 13:40:59 2022] ptlrpc_release_bulk_page_pin+0x51/0x90 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_free_bulk+0x95/0x500 [ptlrpc]
[Wed Aug 10 13:40:59 2022] __ptlrpc_req_finished+0x350/0x730 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_free_request+0x65/0x70 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_free_committed+0x110/0x6f0 [ptlrpc]
[Wed Aug 10 13:40:59 2022] after_reply+0x8ea/0xd80 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_check_set+0xb29/0x1c90 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpcd_check+0x399/0x580 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ? timer_update_keys+0x40/0x40
[Wed Aug 10 13:40:59 2022] ptlrpcd+0x3c9/0x4d0 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ? wait_woken+0x70/0x70
[Wed Aug 10 13:40:59 2022] ? ptlrpcd_check+0x580/0x580 [ptlrpc]

This is readily reproducible by running fio with buffered writes.



 Comments   
Comment by Gerrit Updater [ 22/Sep/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48629
Subject: LU-16180 ptlrpc: add cond_resched after ptlrpc_free_request
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 50feeddfc0c54720b87735c8c2eba6a98d00b7a4

Comment by Gerrit Updater [ 15/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48629/
Subject: LU-16180 ptlrpc: reduce lock contention in ptlrpc_free_committed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d3074511f3ee322d841c0c0e7f644422e85a543e
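
The two patch subjects above name two generic techniques: yielding the CPU with cond_resched() after freeing requests, and reducing lock contention in ptlrpc_free_committed(). The ticket does not include the merged diff, so the following is only a userspace C sketch of that general pattern, not the Lustre change itself. All names in it (struct req, struct import_sketch, import_init, import_add, free_committed_batch, free_committed) are hypothetical stand-ins. A pthread mutex stands in for the kernel spinlock, and sched_yield() stands in for cond_resched().

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

/* Hypothetical stand-in for a replayable request; not a Lustre structure. */
struct req {
    unsigned long transno;
    struct req *next;
};

/* Hypothetical stand-in for an import with its replay list. */
struct import_sketch {
    pthread_mutex_t lock;          /* plays the role of the list lock */
    struct req *replay_list;       /* kept in ascending transno order */
    unsigned long last_committed;  /* highest committed transno */
};

void import_init(struct import_sketch *imp)
{
    pthread_mutex_init(&imp->lock, NULL);
    imp->replay_list = NULL;
    imp->last_committed = 0;
}

/* Append a request at the tail; callers add ascending transnos. */
int import_add(struct import_sketch *imp, unsigned long transno)
{
    struct req *r = malloc(sizeof(*r));
    struct req **pp;

    if (r == NULL)
        return -1;
    r->transno = transno;
    r->next = NULL;

    pthread_mutex_lock(&imp->lock);
    for (pp = &imp->replay_list; *pp != NULL; pp = &(*pp)->next)
        ;
    *pp = r;
    pthread_mutex_unlock(&imp->lock);
    return 0;
}

/* Unlink at most `batch` committed requests while holding the lock,
 * then free them after the lock is dropped. Returns the number freed. */
int free_committed_batch(struct import_sketch *imp, int batch)
{
    struct req *done = NULL, **tail = &done;
    int n = 0;

    assert(batch > 0);

    pthread_mutex_lock(&imp->lock);
    while (imp->replay_list != NULL && n < batch &&
           imp->replay_list->transno <= imp->last_committed) {
        struct req *r = imp->replay_list;

        imp->replay_list = r->next;  /* unlink from the shared list */
        r->next = NULL;
        *tail = r;                   /* append to the private list */
        tail = &r->next;
        n++;
    }
    pthread_mutex_unlock(&imp->lock);

    /* Potentially expensive teardown runs without the lock held. */
    while (done != NULL) {
        struct req *r = done;

        done = r->next;
        free(r);
    }
    return n;
}

/* Free all committed requests, yielding the CPU between batches. */
int free_committed(struct import_sketch *imp, int batch)
{
    int total = 0, n;

    while ((n = free_committed_batch(imp, batch)) > 0) {
        total += n;
        sched_yield();  /* userspace analogue of cond_resched() */
    }
    return total;
}
```

With a list holding transnos 1 to 10, last_committed set to 7 and a batch size of 3, free_committed() frees seven entries and leaves transno 8 at the head of the list. Bounding the work done per lock hold and yielding between batches is what keeps a single thread from monopolising a CPU in a long free loop like the one in the trace above.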

Comment by Peter Jones [ 16/Oct/22 ]

Landed for 2.16
