Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.16.0
-
None
-
3
-
9223372036854775807
Description
While testing the Lustre client code on a system running a 5.15 kernel, FIO-based tests triggered soft CPU lockups:
[Wed Aug 10 13:40:59 2022] watchdog: BUG: soft lockup - CPU#9 stuck for 48s! [ptlrpcd_04_01:1734]
[Wed Aug 10 13:40:59 2022] CPU: 9 PID: 1734 Comm: ptlrpcd_04_01 Tainted: G O L 5.15.43.hrtdev #1
[Wed Aug 10 13:40:59 2022] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.12.0-1 04/01/2014
[Wed Aug 10 13:40:59 2022] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x30
[Wed Aug 10 13:40:59 2022] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 75 02 5d c3 fb 66 0f 1f 44 00 00 <5d> c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48
[Wed Aug 10 13:40:59 2022] RSP: 0018:ffffb91341a4bba8 EFLAGS: 00000206
[Wed Aug 10 13:40:59 2022] RAX: ffffe4a150371b80 RBX: ffffe4a150371b80 RCX: 00000000ffffffff
[Wed Aug 10 13:40:59 2022] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffa406f4e9b050
[Wed Aug 10 13:40:59 2022] RBP: ffffb91341a4bba8 R08: 000000000000007d R09: 00000000000b6557
[Wed Aug 10 13:40:59 2022] R10: 0000000000000009 R11: ffffb91341a4bb78 R12: ffffa406f4e9b000
[Wed Aug 10 13:40:59 2022] R13: 0000000000000002 R14: 0000000000000003 R15: 0000000000000002
[Wed Aug 10 13:40:59 2022] FS: 0000000000000000(0000) GS:ffffa40d51a40000(0000) knlGS:0000000000000000
[Wed Aug 10 13:40:59 2022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Aug 10 13:40:59 2022] CR2: 0000000000e319cc CR3: 00000001099dc000 CR4: 00000000003506e0
[Wed Aug 10 13:40:59 2022] Call Trace:
[Wed Aug 10 13:40:59 2022] <TASK>
[Wed Aug 10 13:40:59 2022] __page_cache_release+0x1d5/0x220
[Wed Aug 10 13:40:59 2022] __put_page+0x3a/0x90
[Wed Aug 10 13:40:59 2022] ptlrpc_release_bulk_page_pin+0x51/0x90 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_free_bulk+0x95/0x500 [ptlrpc]
[Wed Aug 10 13:40:59 2022] __ptlrpc_req_finished+0x350/0x730 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_free_request+0x65/0x70 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_free_committed+0x110/0x6f0 [ptlrpc]
[Wed Aug 10 13:40:59 2022] after_reply+0x8ea/0xd80 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_check_set+0xb29/0x1c90 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpcd_check+0x399/0x580 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ? timer_update_keys+0x40/0x40
[Wed Aug 10 13:40:59 2022] ptlrpcd+0x3c9/0x4d0 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ? wait_woken+0x70/0x70
[Wed Aug 10 13:40:59 2022] ? ptlrpcd_check+0x580/0x580 [ptlrpc]
This is fairly easy to reproduce: run fio with buffered writes against the Lustre mount.
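The report does not include the fio job that was used, so the following is only a sketch of what a buffered-write reproducer might look like. It builds a fio command line for sequential buffered writes (`--direct=0`, so the writes go through the page cache). The job name, target path, block size, file size, and job count are all assumptions chosen for illustration, not values from this ticket.

```python
import shlex


def build_fio_cmd(directory, size="4g", block_size="1m", jobs=4):
    """Return a fio argv list for sequential buffered writes.

    Every value here is an illustrative assumption; the original
    report does not state the fio parameters that were used.
    """
    return [
        "fio",
        "--name=buffered-write",     # hypothetical job name
        f"--directory={directory}",  # a directory on the Lustre client mount
        "--rw=write",                # sequential writes
        "--ioengine=psync",          # plain synchronous pwrite() calls
        "--direct=0",                # buffered I/O: writes go through the page cache
        f"--bs={block_size}",
        f"--size={size}",
        f"--numjobs={jobs}",
        "--group_reporting",
    ]


if __name__ == "__main__":
    # Print the command rather than running it.
    # /mnt/lustre/fio-test is a placeholder path, not one from the report.
    print(shlex.join(build_fio_cmd("/mnt/lustre/fio-test")))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; on an affected client the printed line could be pasted into a shell while watching `dmesg` for the soft lockup message.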
Activity
Fix Version/s | New: Lustre 2.16.0 [ 15190 ] | |
Resolution | New: Fixed [ 1 ] | |
Status | Original: In Progress [ 3 ] | New: Resolved [ 5 ] |
Assignee | Original: WC Triage [ wc-triage ] | New: Jian Yu [ yujian ] |
Key |
Original:
|
New:
|
Affects Version/s | New: Lustre 2.16.0 [ 15190 ] | |
Affects Version/s | Original: ES6.1.0 [ 15395 ] | |
Workflow | Original: Software Simplified Workflow for Project EX [ 89876 ] | New: Sub-task Blocking [ 89880 ] |
Project | Original: Exascaler [ 12911 ] | New: Lustre [ 10000 ] |
Status | Original: To Do [ 10206 ] | New: In Progress [ 3 ] |
Description |
Original:
We are testing the latest 2.14.0_ddn54 client code against a 5.15 kernel system, and when doing some FIO based tests, we are able to quite easily cause soft cpu lockups, i.e.
{noformat}
[Wed Aug 10 13:40:59 2022] watchdog: BUG: soft lockup - CPU#9 stuck for 48s! [ptlrpcd_04_01:1734]
[Wed Aug 10 13:40:59 2022] CPU: 9 PID: 1734 Comm: ptlrpcd_04_01 Tainted: G O L 5.15.43.hrtdev #1
[Wed Aug 10 13:40:59 2022] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.12.0-1 04/01/2014
[Wed Aug 10 13:40:59 2022] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x30
[Wed Aug 10 13:40:59 2022] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 75 02 5d c3 fb 66 0f 1f 44 00 00 <5d> c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48
[Wed Aug 10 13:40:59 2022] RSP: 0018:ffffb91341a4bba8 EFLAGS: 00000206
[Wed Aug 10 13:40:59 2022] RAX: ffffe4a150371b80 RBX: ffffe4a150371b80 RCX: 00000000ffffffff
[Wed Aug 10 13:40:59 2022] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffa406f4e9b050
[Wed Aug 10 13:40:59 2022] RBP: ffffb91341a4bba8 R08: 000000000000007d R09: 00000000000b6557
[Wed Aug 10 13:40:59 2022] R10: 0000000000000009 R11: ffffb91341a4bb78 R12: ffffa406f4e9b000
[Wed Aug 10 13:40:59 2022] R13: 0000000000000002 R14: 0000000000000003 R15: 0000000000000002
[Wed Aug 10 13:40:59 2022] FS: 0000000000000000(0000) GS:ffffa40d51a40000(0000) knlGS:0000000000000000
[Wed Aug 10 13:40:59 2022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Aug 10 13:40:59 2022] CR2: 0000000000e319cc CR3: 00000001099dc000 CR4: 00000000003506e0
[Wed Aug 10 13:40:59 2022] Call Trace:
[Wed Aug 10 13:40:59 2022] <TASK>
[Wed Aug 10 13:40:59 2022] __page_cache_release+0x1d5/0x220
[Wed Aug 10 13:40:59 2022] __put_page+0x3a/0x90
[Wed Aug 10 13:40:59 2022] ptlrpc_release_bulk_page_pin+0x51/0x90 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_free_bulk+0x95/0x500 [ptlrpc]
[Wed Aug 10 13:40:59 2022] __ptlrpc_req_finished+0x350/0x730 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_free_request+0x65/0x70 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_free_committed+0x110/0x6f0 [ptlrpc]
[Wed Aug 10 13:40:59 2022] after_reply+0x8ea/0xd80 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpc_check_set+0xb29/0x1c90 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ptlrpcd_check+0x399/0x580 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ? timer_update_keys+0x40/0x40
[Wed Aug 10 13:40:59 2022] ptlrpcd+0x3c9/0x4d0 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ? wait_woken+0x70/0x70
[Wed Aug 10 13:40:59 2022] ? ptlrpcd_check+0x580/0x580 [ptlrpc]
{noformat}
This is pretty reproducible by just running fio and doing buffered writes.
I'll add more detail but here's some quick client info:
{noformat}
scrusan@shalinux-vmrome7:/tmp/lustre_debug.shalinux-vmrome7.202208091956.sAX$ lctl get_param version
version=2.14.0_ddn54
scrusan@shalinux-vmrome7:/tmp/lustre_debug.shalinux-vmrome7.202208091956.sAX$ uname -r
5.15.43.hrtdev
scrusan@shalinux-vmrome7:/tmp/lustre_debug.shalinux-vmrome7.202208091956.sAX$ cat /etc/debian_version
10.12
{noformat}
I am not able to reproduce these problems on older lustre versions, but we cannot run < 2.14.0_ddn54 on the 5.15 kernel due to https://jira.whamcloud.com/browse/LU-15933
-Steve |
Link | New: This issue is related to DDN-3288 [ DDN-3288 ] |