
Soft locks on OSS servers with fail over with MDS.

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.12.6
    • OSS server running RHEL7 3.10.0-1160.49.1.el7.x86_64 with ZFS 2.0.7.
      Lustre version 2.12.6
    • 3

    Description

      Communication between the MDS and OSS servers failed, so recovery started (IR is disabled). Recovery on the OSS server then failed with a soft lockup:

      NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ll_ost_io01_074:30838]                   

      With stack trace:

      [95181.855433] CPU: 3 PID: 30807 Comm: ll_ost_io01_070 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1160.49.1.el7.x86_64 #1
      [95181.855433] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
      [95181.855434] task: ffff93abd41c6300 ti: ffff93abdc84c000 task.ti: ffff93abdc84c000
      [95181.855435] RIP: 0010:[<ffffffff85f17aa2>]  
      [95181.855441]  [<ffffffff85f17aa2>] native_queued_spin_lock_slowpath+0x122/0x200
      [95181.855441] RSP: 0018:ffff93abdc84fcb8  EFLAGS: 00000246
      [95181.855442] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000190000
      [95181.855443] RDX: ffff93c4dd69b8c0 RSI: 0000000000290000 RDI: ffff93c367157830
      [95181.855443] RBP: ffff93abdc84fcb8 R08: ffff93c4dd65b8c0 R09: 0000000000000000
      [95181.855444] R10: ffff93c4dd65f160 R11: fffff40dfe6a9200 R12: ffff93abdc84fc58
      [95181.855444] R13: ffff93c3b5c66000 R14: ffff93b8aaa89850 R15: ffffffffc17a7b96
      [95181.855445] FS:  0000000000000000(0000) GS:ffff93c4dd640000(0000) knlGS:0000000000000000
      [95181.855446] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [95181.855447] CR2: 000000c002f51000 CR3: 0000002fd1d42000 CR4: 00000000007607e0
      [95181.855448] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [95181.855448] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [95181.855449] PKRU: 00000000
      [95181.855449] Call Trace:
      [95181.855454]  [<ffffffff8657dcf3>] queued_spin_lock_slowpath+0xb/0xf
      [95181.855459]  [<ffffffff8658baa0>] _raw_spin_lock+0x20/0x30
      [95181.855518]  [<ffffffffc1463232>] ptlrpc_server_drop_request+0x1c2/0x6d0 [ptlrpc]
      [95181.855545]  [<ffffffffc14637d2>] ptlrpc_server_finish_active_request+0x92/0x140 [ptlrpc]
      [95181.855572]  [<ffffffffc1465a41>] ptlrpc_server_handle_request+0x401/0xab0 [ptlrpc]
      [95181.855597]  [<ffffffffc14626a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [95181.855600]  [<ffffffff85ed3233>] ? __wake_up+0x13/0x20
      [95181.855625]  [<ffffffffc14691f4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [95181.855650]  [<ffffffffc14686c0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [95181.855653]  [<ffffffff85ec5e61>] kthread+0xd1/0xe0
      [95181.855655]  [<ffffffff85ec5d90>] ? insert_kthread_work+0x40/0x40
      [95181.855657]  [<ffffffff86595ddd>] ret_from_fork_nospec_begin+0x7/0x21
      [95181.855659]  [<ffffffff85ec5d90>] ? insert_kthread_work+0x40/0x40
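
For triage, the watchdog lines can be tallied per service and CPT. The sketch below is only an illustration: it assumes the usual ptlrpc thread naming of service name, two-digit CPT index, and three-digit thread id (so `ll_ost_io01_074` would be thread 74 of `ll_ost_io` on CPT 1), and the helper names `parse_lockups` and `lockups_per_cpt` are hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "BUG: soft lockup - CPU#1 stuck for 22s! [ll_ost_io01_074:30838]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
# Assumed thread naming: <service><2-digit CPT>_<3-digit thread id>
THREAD_RE = re.compile(r"^(?P<svc>.+?)(?P<cpt>\d{2})_(?P<tid>\d{3})$")


def parse_lockups(log_text):
    """Return one dict per soft-lockup line found in log_text."""
    events = []
    for m in LOCKUP_RE.finditer(log_text):
        event = {
            "cpu": int(m.group("cpu")),
            "secs": int(m.group("secs")),
            "comm": m.group("comm"),
            "pid": int(m.group("pid")),
            "service": None,
            "cpt": None,
        }
        t = THREAD_RE.match(event["comm"])
        if t:  # only threads that follow the assumed naming get a CPT
            event["service"] = t.group("svc")
            event["cpt"] = int(t.group("cpt"))
        events.append(event)
    return events


def lockups_per_cpt(events):
    """Count soft lockups per (service, CPT) pair."""
    return Counter(
        (e["service"], e["cpt"]) for e in events if e["cpt"] is not None
    )


sample = (
    "NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! "
    "[ll_ost_io01_074:30838]"
)
events = parse_lockups(sample)
print(events[0])
print(lockups_per_cpt(events))
```

If the naming assumption holds, both threads reported above (`ll_ost_io01_074` and `ll_ost_io01_070`) belong to the same CPT, which would be consistent with contention on a per-partition lock.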

        Activity

          [LU-15510] Soft locks on OSS servers with fail over with MDS.
          simmonsja James A Simmons made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]

          simmonsja James A Simmons added a comment -

          Created more CPTs to reduce the contention.

          dustb100 Dustin Leverman added a comment -

          Peter,

          To answer your question from above, this is an operational issue that impacted production. We reverted the code change, so things are now stable. I know that some patches were applied to 2.12.6, but I'm not sure which ones. James would know the details.

          Thanks,

          Dustin

          simmonsja James A Simmons added a comment -

          We just hit this bug again. Currently our CPT layout looks like this:

          0:   0 2 4 6 8 10 12 14 16 18 20 22

          1:   1 3 5 7 9 11 13 15 17 19 21 23

          I wonder if doubling the CPT count would lower the lock contention.
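
As a rough illustration of what doubling the CPT count could look like, the sketch below splits each partition in half and prints a candidate `cpu_pattern` line for the libcfs module options. It assumes the intended layout is the even-numbered cores 0 to 22 in CPT 0 and the odd-numbered cores 1 to 23 in CPT 1. The helper names are hypothetical, and the resulting split is not necessarily the layout that was actually deployed.

```python
def parse_cpt_table(text):
    """Parse lines like '0: 0 2 4' into {cpt_index: [cpu, ...]}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        index, _, cpus = line.partition(":")
        table[int(index)] = [int(c) for c in cpus.split()]
    return table


def split_partitions(table, factor=2):
    """Split every partition into `factor` contiguous chunks of its CPU list."""
    new_table = {}
    next_index = 0
    for index in sorted(table):
        cpus = table[index]
        if not cpus or len(cpus) % factor:
            raise ValueError(
                f"CPT {index} has {len(cpus)} CPUs, cannot split by {factor}"
            )
        size = len(cpus) // factor
        for start in range(0, len(cpus), size):
            new_table[next_index] = cpus[start:start + size]
            next_index += 1
    return new_table


def to_cpu_pattern(table):
    """Render a table as 'idx[cpu,cpu,...] idx[...]'."""
    return " ".join(
        f"{index}[{','.join(str(c) for c in cpus)}]"
        for index, cpus in sorted(table.items())
    )


# Assumed current layout: even cores in CPT 0, odd cores in CPT 1.
current = """
0: 0 2 4 6 8 10 12 14 16 18 20 22
1: 1 3 5 7 9 11 13 15 17 19 21 23
"""

pattern = to_cpu_pattern(split_partitions(parse_cpt_table(current)))
print(f'options libcfs cpu_pattern="{pattern}"')
```

Splitting each list contiguously keeps every new partition inside the cores of its original partition. Setting `cpu_npartitions=4` instead would leave the core assignment to libcfs.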

          simmonsja James A Simmons added a comment -

          I have never seen this bug before, so I was hoping you had run into it before.
          pjones Peter Jones made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Peter Jones [ pjones ]
          pjones Peter Jones added a comment -

          James

          Is this something that you are working on or an operational issue? If the latter, are there any logs available? Are any patches applied to the vanilla 2.12.6 release?

          Peter

          simmonsja James A Simmons made changes -
          Labels Original: ORNL New: ORNL ornl
          simmonsja James A Simmons made changes -
          Labels New: ORNL
          simmonsja James A Simmons created issue -

          People

            pjones Peter Jones
            simmonsja James A Simmons