Lustre / LU-15510

Soft lockups on OSS servers during failover with the MDS

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.12.6
    • OSS server running RHEL7 3.10.0-1160.49.1.el7.x86_64 with ZFS 2.0.7.
      Lustre version 2.12.6
    • 3

    Description

      Communication between the MDS and OSS servers failed, so recovery started (Imperative Recovery, IR, is disabled). Recovery on the OSS server then failed with a soft lockup:

      NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ll_ost_io01_074:30838]
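
When many threads report the same lockup, it can help to pull the CPU, stall time, thread name, and PID out of each watchdog line. The following is a minimal sketch, assuming the message format quoted above; the function name `parse_soft_lockup` and the regular expression are illustrative and not part of Lustre or the kernel.

```python
import re

# Assumed format, taken from the watchdog line quoted in this report.
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return CPU, stall seconds, thread name and PID, or None if no match."""
    m = SOFT_LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "seconds": int(m.group("secs")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
    }


sample = "NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ll_ost_io01_074:30838]"
print(parse_soft_lockup(sample))
```

Running this over a full console log and grouping by `comm` would show whether the lockups are confined to the `ll_ost_io` service threads.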

      With stack trace:

      [95181.855433] CPU: 3 PID: 30807 Comm: ll_ost_io01_070 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1160.49.1.el7.x86_64 #1
      [95181.855433] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
      [95181.855434] task: ffff93abd41c6300 ti: ffff93abdc84c000 task.ti: ffff93abdc84c000
      [95181.855435] RIP: 0010:[<ffffffff85f17aa2>]  
      [95181.855441]  [<ffffffff85f17aa2>] native_queued_spin_lock_slowpath+0x122/0x200
      [95181.855441] RSP: 0018:ffff93abdc84fcb8  EFLAGS: 00000246
      [95181.855442] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000190000
      [95181.855443] RDX: ffff93c4dd69b8c0 RSI: 0000000000290000 RDI: ffff93c367157830
      [95181.855443] RBP: ffff93abdc84fcb8 R08: ffff93c4dd65b8c0 R09: 0000000000000000
      [95181.855444] R10: ffff93c4dd65f160 R11: fffff40dfe6a9200 R12: ffff93abdc84fc58
      [95181.855444] R13: ffff93c3b5c66000 R14: ffff93b8aaa89850 R15: ffffffffc17a7b96
      [95181.855445] FS:  0000000000000000(0000) GS:ffff93c4dd640000(0000) knlGS:0000000000000000
      [95181.855446] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [95181.855447] CR2: 000000c002f51000 CR3: 0000002fd1d42000 CR4: 00000000007607e0
      [95181.855448] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [95181.855448] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [95181.855449] PKRU: 00000000
      [95181.855449] Call Trace:
      [95181.855454]  [<ffffffff8657dcf3>] queued_spin_lock_slowpath+0xb/0xf
      [95181.855459]  [<ffffffff8658baa0>] _raw_spin_lock+0x20/0x30
      [95181.855518]  [<ffffffffc1463232>] ptlrpc_server_drop_request+0x1c2/0x6d0 [ptlrpc]
      [95181.855545]  [<ffffffffc14637d2>] ptlrpc_server_finish_active_request+0x92/0x140 [ptlrpc]
      [95181.855572]  [<ffffffffc1465a41>] ptlrpc_server_handle_request+0x401/0xab0 [ptlrpc]
      [95181.855597]  [<ffffffffc14626a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [95181.855600]  [<ffffffff85ed3233>] ? __wake_up+0x13/0x20
      [95181.855625]  [<ffffffffc14691f4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [95181.855650]  [<ffffffffc14686c0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [95181.855653]  [<ffffffff85ec5e61>] kthread+0xd1/0xe0
      [95181.855655]  [<ffffffff85ec5d90>] ? insert_kthread_work+0x40/0x40
      [95181.855657]  [<ffffffff86595ddd>] ret_from_fork_nospec_begin+0x7/0x21
      [95181.855659]  [<ffffffff85ec5d90>] ? insert_kthread_work+0x40/0x40
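
The trace above shows the thread spinning in `_raw_spin_lock`, called from `ptlrpc_server_drop_request`. To compare traces from several stuck threads, the frames can be extracted mechanically. This is a rough sketch, assuming the RHEL7-style frame layout shown above; `parse_frames`, the regular expression, and the `"kernel"` label for frames without a module are all illustrative choices. Frames marked `?` are ones the kernel's unwinder could not confirm as part of the call chain.

```python
import re

# Matches frames such as:
#   [<ffffffffc1463232>] ptlrpc_server_drop_request+0x1c2/0x6d0 [ptlrpc]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\? )?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (function, module, reliable) for each frame after 'Call Trace:'."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        m = FRAME_RE.search(line)
        if m is None:
            break  # the first non-frame line ends the trace
        module = m.group("module") or "kernel"  # label for built-in symbols (assumption)
        frames.append((m.group("func"), module, m.group("unreliable") is None))
    return frames


# Excerpt of the trace quoted in this report.
sample = """\
[95181.855449] Call Trace:
[95181.855454]  [<ffffffff8657dcf3>] queued_spin_lock_slowpath+0xb/0xf
[95181.855459]  [<ffffffff8658baa0>] _raw_spin_lock+0x20/0x30
[95181.855518]  [<ffffffffc1463232>] ptlrpc_server_drop_request+0x1c2/0x6d0 [ptlrpc]
[95181.855597]  [<ffffffffc14626a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
"""

frames = parse_frames(sample)
# First confirmed frame inside the ptlrpc module, i.e. the caller of the spinlock.
lock_caller = next(f for f, mod, ok in frames if mod == "ptlrpc" and ok)
print(lock_caller)
```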

          People

            pjones Peter Jones
            simmonsja James A Simmons