[LU-16180] lustre 2.14.0_ddn54 + 5.15 kernel soft cpu lockups

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.16.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      While testing Lustre client code against a 5.15 kernel system, fio-based tests triggered soft CPU lockups:

      [Wed Aug 10 13:40:59 2022] watchdog: BUG: soft lockup - CPU#9 stuck for 48s! [ptlrpcd_04_01:1734]
      [Wed Aug 10 13:40:59 2022] CPU: 9 PID: 1734 Comm: ptlrpcd_04_01 Tainted: G O L 5.15.43.hrtdev #1
      [Wed Aug 10 13:40:59 2022] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.12.0-1 04/01/2014
      [Wed Aug 10 13:40:59 2022] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x30
      [Wed Aug 10 13:40:59 2022] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 75 02 5d c3 fb 66 0f 1f 44 00 00 <5d> c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48
      [Wed Aug 10 13:40:59 2022] RSP: 0018:ffffb91341a4bba8 EFLAGS: 00000206
      [Wed Aug 10 13:40:59 2022] RAX: ffffe4a150371b80 RBX: ffffe4a150371b80 RCX: 00000000ffffffff
      [Wed Aug 10 13:40:59 2022] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffa406f4e9b050
      [Wed Aug 10 13:40:59 2022] RBP: ffffb91341a4bba8 R08: 000000000000007d R09: 00000000000b6557
      [Wed Aug 10 13:40:59 2022] R10: 0000000000000009 R11: ffffb91341a4bb78 R12: ffffa406f4e9b000
      [Wed Aug 10 13:40:59 2022] R13: 0000000000000002 R14: 0000000000000003 R15: 0000000000000002
      [Wed Aug 10 13:40:59 2022] FS: 0000000000000000(0000) GS:ffffa40d51a40000(0000) knlGS:0000000000000000
      [Wed Aug 10 13:40:59 2022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [Wed Aug 10 13:40:59 2022] CR2: 0000000000e319cc CR3: 00000001099dc000 CR4: 00000000003506e0
      [Wed Aug 10 13:40:59 2022] Call Trace:
      [Wed Aug 10 13:40:59 2022] <TASK>
      [Wed Aug 10 13:40:59 2022] __page_cache_release+0x1d5/0x220
      [Wed Aug 10 13:40:59 2022] __put_page+0x3a/0x90
      [Wed Aug 10 13:40:59 2022] ptlrpc_release_bulk_page_pin+0x51/0x90 [ptlrpc]
      [Wed Aug 10 13:40:59 2022] ptlrpc_free_bulk+0x95/0x500 [ptlrpc]
      [Wed Aug 10 13:40:59 2022] __ptlrpc_req_finished+0x350/0x730 [ptlrpc]
      [Wed Aug 10 13:40:59 2022] ptlrpc_free_request+0x65/0x70 [ptlrpc]
      [Wed Aug 10 13:40:59 2022] ptlrpc_free_committed+0x110/0x6f0 [ptlrpc]
      [Wed Aug 10 13:40:59 2022] after_reply+0x8ea/0xd80 [ptlrpc]
      [Wed Aug 10 13:40:59 2022] ptlrpc_check_set+0xb29/0x1c90 [ptlrpc]
      [Wed Aug 10 13:40:59 2022] ptlrpcd_check+0x399/0x580 [ptlrpc]
      [Wed Aug 10 13:40:59 2022] ? timer_update_keys+0x40/0x40
      [Wed Aug 10 13:40:59 2022] ptlrpcd+0x3c9/0x4d0 [ptlrpc]
      [Wed Aug 10 13:40:59 2022] ? wait_woken+0x70/0x70
      [Wed Aug 10 13:40:59 2022] ? ptlrpcd_check+0x580/0x580 [ptlrpc]
      

      This reproduces readily by running fio with buffered writes.
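
When comparing several of these reports, it can help to pull the key fields out of the watchdog output mechanically. The following is a minimal sketch, assuming the log lines keep the `[timestamp] message` layout shown above; the names `parse_soft_lockup`, `LOCKUP_RE`, `FRAME_RE` and `SAMPLE` are hypothetical helpers written for this illustration and are not part of Lustre or the kernel. It extracts the CPU, stall duration, task name and PID from the soft lockup header, then lists the call-trace frames, marking frames the kernel printed with a leading `?` as unreliable.

```python
import re

# Watchdog header, e.g.
# "watchdog: BUG: soft lockup - CPU#9 stuck for 48s! [ptlrpcd_04_01:1734]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# Call-trace frame following the timestamp, e.g.
# "ptlrpc_free_bulk+0x95/0x500 [ptlrpc]" or "? wait_woken+0x70/0x70".
FRAME_RE = re.compile(
    r"\]\s+(?P<guess>\? )?(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def parse_soft_lockup(log_text):
    """Return (lockup, frames) for the first soft lockup found in log_text.

    lockup is a dict with cpu, stuck_secs, comm and pid, or None if no
    header was seen. frames is a list of (function, module, reliable)
    tuples; module is None for core-kernel symbols.
    """
    lockup = None
    frames = []
    for line in log_text.splitlines():
        header = LOCKUP_RE.search(line)
        if header:
            lockup = {
                "cpu": int(header["cpu"]),
                "stuck_secs": int(header["secs"]),
                "comm": header["comm"],
                "pid": int(header["pid"]),
            }
            continue
        frame = FRAME_RE.search(line)
        # Only collect frames once a lockup header has been seen.
        if frame and lockup is not None:
            frames.append((frame["func"], frame["module"], frame["guess"] is None))
    return lockup, frames


# A few lines copied from the trace in this report.
SAMPLE = """\
[Wed Aug 10 13:40:59 2022] watchdog: BUG: soft lockup - CPU#9 stuck for 48s! [ptlrpcd_04_01:1734]
[Wed Aug 10 13:40:59 2022] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x30
[Wed Aug 10 13:40:59 2022] __page_cache_release+0x1d5/0x220
[Wed Aug 10 13:40:59 2022] __put_page+0x3a/0x90
[Wed Aug 10 13:40:59 2022] ptlrpc_release_bulk_page_pin+0x51/0x90 [ptlrpc]
[Wed Aug 10 13:40:59 2022] ? wait_woken+0x70/0x70
"""

if __name__ == "__main__":
    lockup, frames = parse_soft_lockup(SAMPLE)
    print(lockup)
    for func, module, reliable in frames:
        print(func, module or "kernel", "" if reliable else "(unreliable)")
```

The `RIP:` line is deliberately not treated as a frame, so the frame list starts at the first entry under `Call Trace:`.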

        Activity

          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.16.0 [ 15190 ]
          Resolution New: Fixed [ 1 ]
          Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
          yujian Jian Yu made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Jian Yu [ yujian ]
          yujian Jian Yu made changes -
          Key Original: EX-5995 New: LU-16180
          Affects Version/s New: Lustre 2.16.0 [ 15190 ]
          Affects Version/s Original: ES6.1.0 [ 15395 ]
          Workflow Original: Software Simplified Workflow for Project EX [ 89876 ] New: Sub-task Blocking [ 89880 ]
          Project Original: Exascaler [ 12911 ] New: Lustre [ 10000 ]
          Status Original: To Do [ 10206 ] New: In Progress [ 3 ]
          yujian Jian Yu made changes -
          Description Original: We are testing the latest 2.14.0_ddn54 client code against a 5.15 kernel system, and when doing some FIO based tests, we are able to quite easily cause soft CPU lockups; the trace is the one quoted in the Description above.

          This is pretty reproducible by just running fio and doing buffered writes. I'll add more detail but here's some quick client info:

          {noformat}
          scrusan@shalinux-vmrome7:/tmp/lustre_debug.shalinux-vmrome7.202208091956.sAX$ lctl get_param version
          version=2.14.0_ddn54
          scrusan@shalinux-vmrome7:/tmp/lustre_debug.shalinux-vmrome7.202208091956.sAX$ uname -r
          5.15.43.hrtdev
          scrusan@shalinux-vmrome7:/tmp/lustre_debug.shalinux-vmrome7.202208091956.sAX$ cat /etc/debian_version
          10.12
          {noformat}

          I am not able to reproduce these problems on older Lustre versions, but we cannot run versions older than 2.14.0_ddn54 on the 5.15 kernel due to https://jira.whamcloud.com/browse/LU-15933

          -Steve
          yujian Jian Yu made changes -
          Link New: This issue is related to DDN-3288 [ DDN-3288 ]
          yujian Jian Yu created issue -

          People

            Assignee: yujian Jian Yu
            Reporter: yujian Jian Yu
            Votes: 0
            Watchers: 5
