  1. Lustre
  2. LU-6991

LBUG in ptlrpc_connection_put(): ASSERTION( atomic_read(&conn->c_refcount) > 1 )

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.5.4
    • None
    • RHEL6.6 + Lustre 2.5.3.90 w/ bull patches
    • 3

    Description

      We hit the following LBUG on one of our MDS nodes.

      <3>LustreError: 12255:0:(sec.c:379:import_sec_validate_get()) import ffff880903ac1000 (FULL) with no sec
      <0>LustreError: 19842:0:(connection.c:104:ptlrpc_connection_put()) ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed:
      <0>LustreError: 19842:0:(connection.c:104:ptlrpc_connection_put()) LBUG
      <4>Pid: 19842, comm: obd_zombid
      <4>
      <4>Call Trace:
      <4> [<ffffffffa04f2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa04f2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa082938b>] ptlrpc_connection_put+0x1db/0x1e0 [ptlrpc]
      <4> [<ffffffffa066c70d>] class_import_destroy+0x5d/0x420 [obdclass]
      <4> [<ffffffffa067067b>] obd_zombie_impexp_cull+0xcb/0x5d0 [obdclass]
      <4> [<ffffffffa0670be5>] obd_zombie_impexp_thread+0x65/0x190 [obdclass]
      <4> [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0670b80>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
      <4> [<ffffffff8109e71e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e680>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 19842, comm: obd_zombid Not tainted 2.6.32-504.16.2.el6.Bull.74.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8152a2bd>] ? panic+0xa7/0x16f
      <4> [<ffffffffa04f2eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa082938b>] ? ptlrpc_connection_put+0x1db/0x1e0 [ptlrpc]
      <4> [<ffffffffa066c70d>] ? class_import_destroy+0x5d/0x420 [obdclass]
      <4> [<ffffffffa067067b>] ? obd_zombie_impexp_cull+0xcb/0x5d0 [obdclass]
      <4> [<ffffffffa0670be5>] ? obd_zombie_impexp_thread+0x65/0x190 [obdclass]
      <4> [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa0670b80>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
      <4> [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109e680>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
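
The failed check, `atomic_read(&conn->c_refcount) > 1`, suggests that a connection is expected to still hold at least one reference beyond the one being dropped. The sketch below is a toy Python model of that invariant, not Lustre code: the class name, the idea that a cache holds the first reference, and the double-put scenario are all assumptions made for illustration. It shows only one way such a check can trip (a put with no matching get) and does not establish that this is what happened on the MDS.

```python
class CachedConnection:
    """Toy model of a refcounted connection kept alive by a cache (hypothetical)."""

    def __init__(self):
        # Assumption: the cache itself owns the first reference.
        self.refcount = 1

    def get(self):
        # A user (for example an import) takes its own reference.
        self.refcount += 1
        return self

    def put(self):
        # Same condition as the failed assertion: a user may only drop a
        # reference that exists on top of the cache's own.
        if self.refcount <= 1:
            raise AssertionError("refcount > 1 failed: put() without matching get()")
        self.refcount -= 1
        # True when only the cache's reference is left.
        return self.refcount == 1


conn = CachedConnection()
conn.get()   # user takes a reference: refcount is now 2
conn.put()   # user drops it: refcount is back to 1

try:
    conn.put()   # unbalanced second put for the same reference
    double_put_detected = False
except AssertionError:
    double_put_detected = True

print(conn.refcount, double_put_detected)
```

In this model the second `put()` is rejected before the counter is touched, which is the same point at which the LBUG fires in the trace above.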
      

      The MDS appears to have been overloaded at that time; note the load average reported by crash:

      crash> sys
            KERNEL: /usr/lib/debug/lib/modules/2.6.32-504.16.2.el6.Bull.74.x86_64/vmlinux
          DUMPFILE: vmcore  [PARTIAL DUMP]
              CPUS: 32
              DATE: Sun Jun  7 06:41:03 2015
            UPTIME: 4 days, 19:14:07
      LOAD AVERAGE: 646.72, 556.57, 328.17
             TASKS: 2556
          NODENAME: mds111
           RELEASE: 2.6.32-504.16.2.el6.Bull.74.x86_64
           VERSION: #1 SMP Tue Apr 28 01:43:42 CEST 2015
           MACHINE: x86_64  (2266 Mhz)
            MEMORY: 64 GB
             PANIC: "Kernel panic - not syncing: LBUG"
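
To put the load average in proportion to the machine size, the short Python sketch below divides the three values by the CPU count, using only the two lines copied from the `sys` output above. Treat the result as a rough indicator: on Linux the load average also counts tasks in uninterruptible sleep, so it does not measure CPU demand alone.

```python
import re

# Two lines copied from the `crash> sys` output in this ticket.
sys_output = """\
CPUS: 32
LOAD AVERAGE: 646.72, 556.57, 328.17
"""

cpus = int(re.search(r"CPUS:\s*(\d+)", sys_output).group(1))
load_field = re.search(r"LOAD AVERAGE:\s*(.+)", sys_output).group(1)
loads = [float(value) for value in load_field.split(",")]

# 1-, 5- and 15-minute load averages per CPU, rounded to two decimals.
per_cpu = [round(load / cpus, 2) for load in loads]
print(per_cpu)
```

The one-minute figure comes out at roughly 20 queued or blocked tasks per CPU.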
      

      I have attached some traces from my analysis of the vmcore. The customer is a black site, so I cannot provide the vmcore itself.

      It appears that the import involved in the LBUG is ffff880903ac1000. In the console, right before the LBUG, we can observe a LustreError involving the same import. It is also reported in the debug log:

      02000000:00020000:28.0F:1433652063.227347:0:12255:0:(sec.c:379:import_sec_validate_get()) import ffff880903ac1000 (FULL) with no sec
      00000100:00040000:28.0:1433652063.248671:0:19842:0:(connection.c:104:ptlrpc_connection_put()) ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed:
      00000100:00040000:28.0:1433652063.259452:0:19842:0:(connection.c:104:ptlrpc_connection_put()) LBUG
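
The two debug log entries can be compared more precisely by parsing them. The Python sketch below is a minimal parser for the lines quoted above; the field names in the regular expression (`subsys`, `mask`, `cpu`, `stack`, `pid`, `extpid`) reflect my reading of the colon-separated prefix and should be treated as assumptions, not as a reference for the log format.

```python
import re
from datetime import datetime, timezone
from decimal import Decimal

# Lines copied from the MDS debug log quoted in this ticket.
lines = [
    "02000000:00020000:28.0F:1433652063.227347:0:12255:0:(sec.c:379:import_sec_validate_get()) import ffff880903ac1000 (FULL) with no sec",
    "00000100:00040000:28.0:1433652063.248671:0:19842:0:(connection.c:104:ptlrpc_connection_put()) ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed:",
]

# Field names are assumptions made for this sketch.
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<time>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)

entries = [LINE_RE.match(line).groupdict() for line in lines]

# Decimal keeps the microsecond timestamps exact when subtracting.
delta = Decimal(entries[1]["time"]) - Decimal(entries[0]["time"])

# Whole-second part of the assertion timestamp, rendered in UTC.
assert_utc = datetime.fromtimestamp(int(Decimal(entries[1]["time"])), tz=timezone.utc)

print(entries[0]["func"], entries[0]["pid"])
print(entries[1]["func"], entries[1]["pid"])
print(delta, assert_utc.isoformat())
```

With these inputs the "no sec" error (PID 12255) precedes the assertion (PID 19842) by about 21 ms on the same CPU, and the assertion timestamp is 04:41:03 UTC on 7 June 2015. That matches the 06:41:03 DATE reported by crash if the node's local time was UTC+2, which is an assumption.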
      

      This import corresponds to Lustre client compute2823. Here is the log output from that compute node:

      00000001:02000400:11.0:1433651401.837787:0:30301:0:(debug.c:339:libcfs_debug_mark_buffer()) DEBUG MARKER: Sun Jun  7 06:30:01 2015
      
      00000001:02000400:11.0:1433651701.267541:0:31155:0:(debug.c:339:libcfs_debug_mark_buffer()) DEBUG MARKER: Sun Jun  7 06:35:01 2015
      
      00000001:02000400:10.0:1433652001.829909:0:31973:0:(debug.c:339:libcfs_debug_mark_buffer()) DEBUG MARKER: Sun Jun  7 06:40:01 2015
      
      00000800:00020000:31.0:1433652118.496756:0:15908:0:(o2iblnd_cb.c:3018:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 4 seconds
      00000800:00020000:31.0:1433652118.505625:0:15908:0:(o2iblnd_cb.c:3081:kiblnd_check_conns()) Timed out RDMA with X.Y.Z.42@o2ib11 (54): c: 0, oc: 0, rc: 8
      00000100:00000400:19.0:1433652118.517268:0:15965:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118]  req@ffff880c0cdf1400 x1502885054441032/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      00000100:00000400:7.0:1433652118.518199:0:15961:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118]  req@ffff880a81a1b800 x1502885054432068/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      00000100:00000400:3.0:1433652118.518222:0:15945:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118]  req@ffff8805270aa800 x1502885054436340/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      00000100:00000400:10.0:1433652118.518231:0:15960:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118]  req@ffff880c02ee5c00 x1502885054439124/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      00000100:00000400:27.0:1433652118.518233:0:15949:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118]  req@ffff880a85e69800 x1502885054432296/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      [...]
      00000100:02020000:14.0:1433653897.305027:0:15941:0:(import.c:1359:ptlrpc_import_recovery_state_machine()) 167-0: ptmp2-MDT0000-mdc-ffff88047c753800: This client was evicted by ptmp2-MDT0000; in progress operations using this service will fail.
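
The client-side timestamps can be lined up against the MDS assertion with a little arithmetic. The Python sketch below uses only values copied from the logs in this ticket and assumes the client and MDS clocks were synchronized, which the ticket does not confirm. The variable names are mine.

```python
import re

# Fragments copied from the logs quoted in this ticket.
client_request = "[sent 1433652063/real 1433652118]"
mds_assertion_time = "1433652063.248671"    # MDS debug log, ptlrpc_connection_put()
client_eviction_time = "1433653897.305027"  # client log, "This client was evicted"

match = re.search(r"\[sent (\d+)/real (\d+)\]", client_request)
sent, real = (int(value) for value in match.groups())

# Keep only the whole-second part of each timestamp.
lbug_second = int(mds_assertion_time.split(".")[0])
evicted_second = int(client_eviction_time.split(".")[0])

sent_vs_lbug = sent - lbug_second            # request sent relative to the LBUG
sent_to_real = real - sent                   # gap between the two bracketed times
lbug_to_eviction = evicted_second - lbug_second

print(sent_vs_lbug, sent_to_real, lbug_to_eviction)
```

Under the clock assumption, the expired requests were sent in the same second as the LBUG, the two bracketed timestamps are 55 seconds apart, and the eviction message appears 1834 seconds (about 30 minutes) after the LBUG.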
      

      This client was evicted by the failover MDS, and it is the only evicted client we found. We also have a vmcore for this node.

      06/08/2015 06:46 AM
      compute2823: /proc/fs/lustre/mdc/ptmp2-MDT0000-mdc-ffff88047c753800/state:current_state: EVICTED
      

      I have not been able to determine the root cause of the LBUG.

      Please let me know if you need further details.

      Attachments

        1. import
          13 kB
        2. note
          21 kB

          People

            bzzz Alex Zhuravlev
            bruno.travouillon Bruno Travouillon (Inactive)