  1. Lustre
  2. LU-3596

deadlock in kiblnd

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.9
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9116

    Description

      We had a few OSSes crash at NOAA recently due to an NMI deadlock detected error. I was able to get a vmcore from one and analyze it, and it looks like we are hitting LU-78:
      crash> bt
      PID: 15515 TASK: ffff810c39c237e0 CPU: 0 COMMAND: "kiblnd_connd"
      #0 [ffffffff804b8dc0] crash_kexec at ffffffff800b1192
      #1 [ffffffff804b8e80] die_nmi at ffffffff80065285
      #2 [ffffffff804b8ea0] nmi_watchdog_tick at ffffffff80065a66
      #3 [ffffffff804b8ef0] default_do_nmi at ffffffff80065609
      #4 [ffffffff804b8f40] do_nmi at ffffffff800658f1
      #5 [ffffffff804b8f50] nmi at ffffffff80064ecf
      [exception RIP: __write_lock_failed+15]
      RIP: ffffffff80062197 RSP: ffff81061978dc90 RFLAGS: 00000087
      RAX: ffffc20000000000 RBX: ffff8107eb940140 RCX: 0000000000000001
      RDX: 0000000000006000 RSI: 0000000000000003 RDI: ffffffff8032d42c
      RBP: ffffc20000000000 R8: 0000000000000000 R9: ffff810c39c237e0
      R10: ffff8105645d2000 R11: 0000000000002000 R12: ffffffffffffffff
      R13: 0000000000007000 R14: 0000000000000001 R15: 0000000000000002
      ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
      --- <NMI exception stack> ---
      #6 [ffff81061978dc90] __write_lock_failed at ffffffff80062197
      #7 [ffff81061978dc90] _write_lock at ffffffff80064a7d
      #8 [ffff81061978dc98] __get_vm_area_node at ffffffff800d51c1
      #9 [ffff81061978dcd8] __vmalloc_node at ffffffff800d5952
      #10 [ffff81061978dcf8] kiblnd_create_tx_pool at ffffffff8b127e0e
      #11 [ffff81061978dd68] kiblnd_pool_alloc_node at ffffffff8b1247a9
      #12 [ffff81061978ddc8] kiblnd_get_idle_tx at ffffffff8b12d8d0
      #13 [ffff81061978ddd8] kiblnd_check_sends at ffffffff8b12e9b7
      #14 [ffff81061978dde8] kiblnd_check_txs at ffffffff8b12c22c
      #15 [ffff81061978de48] kiblnd_check_conns at ffffffff8b12ea68
      #16 [ffff81061978dea8] kiblnd_connd at ffffffff8b136063
      #17 [ffff81061978df48] kernel_thread at ffffffff8005dfc1
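
      One detail in the exception frame can be checked mechanically: the RFLAGS value 00000087 does not have the interrupt flag (IF, bit 9) set, which suggests interrupts were disabled on CPU 0 while the thread was spinning in __write_lock_failed. The following is a minimal sketch of that check, not a crash utility feature; the helper name decode_rflags is hypothetical, and the bit positions are the standard x86-64 RFLAGS layout.

```python
# Decode the RFLAGS value from the exception frame in the backtrace.
# Bit positions are the architectural x86-64 RFLAGS status/control bits.
# decode_rflags is a hypothetical helper written for this illustration.
RFLAGS_BITS = {
    0: "CF",   # carry
    2: "PF",   # parity
    4: "AF",   # auxiliary carry
    6: "ZF",   # zero
    7: "SF",   # sign
    8: "TF",   # trap
    9: "IF",   # interrupt enable
    10: "DF",  # direction
    11: "OF",  # overflow
}


def decode_rflags(value):
    """Return the names of the flag bits that are set in value."""
    return [name for bit, name in sorted(RFLAGS_BITS.items())
            if value & (1 << bit)]


rflags = 0x00000087  # "RFLAGS: 00000087" from the trace
flags = decode_rflags(rflags)
interrupts_enabled = "IF" in flags

print("flags set:", flags)
print("interrupts enabled:", interrupts_enabled)
```

      With IF clear, the CPU cannot service timer interrupts while it waits for the lock, which is the condition under which the NMI watchdog fires. This does not by itself show which caller disabled interrupts.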

      It appears that the LU-78 fix was identified as needed for 1.8.x but never landed on that branch. If this crash is related to that bug, would it be possible to get an updated patch for potential inclusion in the next 1.8 release?

      Thanks.

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: orentas Oz Rentas