Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Affects Version: Lustre 2.5.4
Environment: RHEL6.6 + Lustre 2.5.3.90 w/ bull patches
Severity: 3
Description
We hit the following LBUG on one of our MDS nodes.
<3>LustreError: 12255:0:(sec.c:379:import_sec_validate_get()) import ffff880903ac1000 (FULL) with no sec
<0>LustreError: 19842:0:(connection.c:104:ptlrpc_connection_put()) ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed:
<0>LustreError: 19842:0:(connection.c:104:ptlrpc_connection_put()) LBUG
<4>Pid: 19842, comm: obd_zombid
<4>
<4>Call Trace:
<4> [<ffffffffa04f2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa04f2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa082938b>] ptlrpc_connection_put+0x1db/0x1e0 [ptlrpc]
<4> [<ffffffffa066c70d>] class_import_destroy+0x5d/0x420 [obdclass]
<4> [<ffffffffa067067b>] obd_zombie_impexp_cull+0xcb/0x5d0 [obdclass]
<4> [<ffffffffa0670be5>] obd_zombie_impexp_thread+0x65/0x190 [obdclass]
<4> [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0670b80>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
<4> [<ffffffff8109e71e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109e680>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 19842, comm: obd_zombid Not tainted 2.6.32-504.16.2.el6.Bull.74.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8152a2bd>] ? panic+0xa7/0x16f
<4> [<ffffffffa04f2eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa082938b>] ? ptlrpc_connection_put+0x1db/0x1e0 [ptlrpc]
<4> [<ffffffffa066c70d>] ? class_import_destroy+0x5d/0x420 [obdclass]
<4> [<ffffffffa067067b>] ? obd_zombie_impexp_cull+0xcb/0x5d0 [obdclass]
<4> [<ffffffffa0670be5>] ? obd_zombie_impexp_thread+0x65/0x190 [obdclass]
<4> [<ffffffff81064bc0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0670b80>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
<4> [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109e680>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
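For reference, here is my reading of the code path in the stack above, abridged and paraphrased from the stock 2.5.x source (lustre/ptlrpc/connection.c and lustre/obdclass/genops.c); our tree carries Bull patches, so exact line numbers may differ. obd_zombie_impexp_cull() -> class_import_destroy() drops references on the import's connections, and ptlrpc_connection_put() asserts that the refcount is still above the cached baseline of 1 when it is called:

int ptlrpc_connection_put(struct ptlrpc_connection *conn)
{
        int rc = 0;

        if (!conn)
                return rc;

        /* connection.c:104 - the assertion that fired: a put while
         * c_refcount is already at (or below) the cached baseline of 1 */
        LASSERT(atomic_read(&conn->c_refcount) > 1);

        /* the connection itself stays cached in the hash table;
         * the last "real" reference just drops the count back to 1 */
        if (atomic_dec_return(&conn->c_refcount) == 1)
                rc = 1;
        /* ... rest unchanged ... */
}

static void class_import_destroy(struct obd_import *imp)
{
        /* called from obd_zombid via obd_zombie_impexp_cull();
         * puts the import's main connection ... */
        ptlrpc_put_connection_superhack(imp->imp_connection);

        /* ... and one reference per entry on imp_conn_list */
        while (!list_empty(&imp->imp_conn_list)) {
                struct obd_import_conn *imp_conn;

                imp_conn = list_entry(imp->imp_conn_list.next,
                                      struct obd_import_conn, oic_item);
                list_del(&imp_conn->oic_item);
                ptlrpc_put_connection_superhack(imp_conn->oic_conn);
                OBD_FREE(imp_conn, sizeof(*imp_conn));
        }
        /* ... rest unchanged ... */
}

In other words, by the time obd_zombid culled this import, one of its connections had already dropped to the baseline refcount, which is exactly what the assertion guards against.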
It appears that the MDS was overloaded at that time.
crash> sys
KERNEL: /usr/lib/debug/lib/modules/2.6.32-504.16.2.el6.Bull.74.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 32
DATE: Sun Jun 7 06:41:03 2015
UPTIME: 4 days, 19:14:07
LOAD AVERAGE: 646.72, 556.57, 328.17
TASKS: 2556
NODENAME: mds111
RELEASE: 2.6.32-504.16.2.el6.Bull.74.x86_64
VERSION: #1 SMP Tue Apr 28 01:43:42 CEST 2015
MACHINE: x86_64 (2266 Mhz)
MEMORY: 64 GB
PANIC: "Kernel panic - not syncing: LBUG"
You will find attached some traces from my analysis of the vmcore. The customer site is classified (black site), so I cannot provide the vmcore itself.
It appears that the import involved in the LBUG is ffff880903ac1000. In the console, right before the LBUG, we can observe a LustreError involving the same import. It is also reported in the debug log:
02000000:00020000:28.0F:1433652063.227347:0:12255:0:(sec.c:379:import_sec_validate_get()) import ffff880903ac1000 (FULL) with no sec
00000100:00040000:28.0:1433652063.248671:0:19842:0:(connection.c:104:ptlrpc_connection_put()) ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed:
00000100:00040000:28.0:1433652063.259452:0:19842:0:(connection.c:104:ptlrpc_connection_put()) LBUG
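For context, the "with no sec" message logged about 21 ms before the assertion comes from import_sec_validate_get() in lustre/ptlrpc/sec.c. Abridged and paraphrased from the 2.5.x source (again, line numbers may differ from our patched tree), it fires when an import that is still in FULL state no longer holds a ptlrpc_sec reference:

static int import_sec_validate_get(struct obd_import *imp,
                                   struct ptlrpc_sec **sec)
{
        int rc;

        if (unlikely(imp->imp_sec_expire)) {
                rc = import_sec_check_expire(imp);
                if (rc)
                        return rc;
        }

        *sec = sptlrpc_import_sec_ref(imp);
        if (*sec == NULL) {
                /* sec.c:379 - "import ffff880903ac1000 (FULL) with no sec" */
                CERROR("import %p (%s) with no sec\n",
                       imp, ptlrpc_import_state_name(imp->imp_state));
                return -EACCES;
        }
        /* ... dying-sec check omitted ... */
        return 0;
}

Both messages refer to the same import, ffff880903ac1000.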
This import corresponds to the Lustre client compute2823. Here is the log output from that compute node:
00000001:02000400:11.0:1433651401.837787:0:30301:0:(debug.c:339:libcfs_debug_mark_buffer()) DEBUG MARKER: Sun Jun 7 06:30:01 2015
00000001:02000400:11.0:1433651701.267541:0:31155:0:(debug.c:339:libcfs_debug_mark_buffer()) DEBUG MARKER: Sun Jun 7 06:35:01 2015
00000001:02000400:10.0:1433652001.829909:0:31973:0:(debug.c:339:libcfs_debug_mark_buffer()) DEBUG MARKER: Sun Jun 7 06:40:01 2015
00000800:00020000:31.0:1433652118.496756:0:15908:0:(o2iblnd_cb.c:3018:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 4 seconds
00000800:00020000:31.0:1433652118.505625:0:15908:0:(o2iblnd_cb.c:3081:kiblnd_check_conns()) Timed out RDMA with X.Y.Z.42@o2ib11 (54): c: 0, oc: 0, rc: 8
00000100:00000400:19.0:1433652118.517268:0:15965:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118] req@ffff880c0cdf1400 x1502885054441032/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
00000100:00000400:7.0:1433652118.518199:0:15961:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118] req@ffff880a81a1b800 x1502885054432068/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
00000100:00000400:3.0:1433652118.518222:0:15945:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118] req@ffff8805270aa800 x1502885054436340/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
00000100:00000400:10.0:1433652118.518231:0:15960:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118] req@ffff880c02ee5c00 x1502885054439124/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
00000100:00000400:27.0:1433652118.518233:0:15949:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1433652063/real 1433652118] req@ffff880a85e69800 x1502885054432296/t0(0) o103->ptmp2-MDT0000-mdc-ffff88047c753800@X.Y.Z.42@o2ib11:17/18 lens 328/224 e 0 to 1 dl 1433652672 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
[...]
00000100:02020000:14.0:1433653897.305027:0:15941:0:(import.c:1359:ptlrpc_import_recovery_state_machine()) 167-0: ptmp2-MDT0000-mdc-ffff88047c753800: This client was evicted by ptmp2-MDT0000; in progress operations using this service will fail.
This client was evicted by the failover MDS, and it is the only client we found evicted. We also have a vmcore for this node.
06/08/2015 06:46 AM compute2823: /proc/fs/lustre/mdc/ptmp2-MDT0000-mdc-ffff88047c753800/state:current_state: EVICTED
I have not been able to understand the root cause of the LBUG.
Please let me know if you need further details.