Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.8.0
-
lola
build: https://build.hpdd.intel.com/job/lustre-b2_8/8/
-
3
-
9223372036854775807
Description
The error occurred during soak testing of build '20160224' (b2_8 RC2) (see:
https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20150224). DNE is enabled.
MDSes were formatted using ldiskfs, OSTs using zfs. The MDSes are configured in an active-active HA failover configuration.
Sequence of events:
- 2016-02-27 02:04:02,121:fsmgmt.fsmgmt:INFO mds_failover just completed (lola-10 ---> lola-11)
- Feb 27 02:06:44 lola-10 kernel: Lustre: soaked-MDT0005: Recovery over after 2:42, of 16 clients 14 recovered and 2 were evicted.
- Feb 27 02:12:06 lola-10 kernel: Lustre: soaked-MDT0004: Recovery over after 8:02, of 16 clients 11 recovered and 5 were evicted.
- 2016-02-27 02:12:58 lola-9 (different HA pair) crashed
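The spacing of the events listed above can be worked out with a short sketch. It assumes the syslog lines (which carry no year) belong to 2016 like the other two timestamps, and it drops the milliseconds from the first entry. The names `EVENTS` and `offsets_from_first` are hypothetical and only serve the illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the sequence of events; the year of the two
# syslog entries is an assumption, and milliseconds are dropped.
EVENTS = [
    ("mds_failover completed (lola-10 -> lola-11)", "2016-02-27 02:04:02"),
    ("soaked-MDT0005 recovery over", "2016-02-27 02:06:44"),
    ("soaked-MDT0004 recovery over", "2016-02-27 02:12:06"),
    ("lola-9 crashed", "2016-02-27 02:12:58"),
]


def offsets_from_first(events):
    """Return (name, seconds since the first event) for each event."""
    t0 = datetime.strptime(events[0][1], FMT)
    return [
        (name, int((datetime.strptime(stamp, FMT) - t0).total_seconds()))
        for name, stamp in events
    ]


if __name__ == "__main__":
    for name, seconds in offsets_from_first(EVENTS):
        print(f"+{seconds:4d}s  {name}")
```

Under these assumptions the crash on lola-9 falls 536 s after the failover completed and 52 s after recovery of soaked-MDT0004 ended on lola-10.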
The error reads as:
<0>LustreError: 5003:0:(ldlm_lock.c:810:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_writers > 0 ) failed:
<0>LustreError: 5003:0:(ldlm_lock.c:810:ldlm_lock_decref_internal_nolock()) LBUG
<4>Pid: 5003, comm: mdt02_007
<4>
<4>Call Trace:
<4> [<ffffffffa0748875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0748e77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0a2ef0f>] ldlm_lock_decref_internal_nolock+0x17f/0x180 [ptlrpc]
<4> [<ffffffffa0a3102d>] ldlm_lock_decref_internal+0x4d/0xa80 [ptlrpc]
<4> [<ffffffffa083f935>] ? class_handle2object+0x95/0x190 [obdclass]
<4> [<ffffffffa0a325a0>] ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
<4> [<ffffffffa1164c67>] mdt_object_unlock+0xa7/0x2e0 [mdt]
<4> [<ffffffffa11867ca>] mdt_reint_rename_or_migrate+0xf3a/0x2600 [mdt]
<4> [<ffffffffa0ab7bdd>] ? null_alloc_rs+0xcd/0x320 [ptlrpc]
<4> [<ffffffffa0876cbc>] ? upcall_cache_get_entry+0x29c/0x880 [obdclass]
<4> [<ffffffffa087bbf0>] ? lu_ucred+0x20/0x30 [obdclass]
<4> [<ffffffffa0a7d100>] ? lustre_pack_reply_v2+0x180/0x280 [ptlrpc]
<4> [<ffffffffa117d50f>] ? ucred_set_jobid+0x5f/0x70 [mdt]
<4> [<ffffffffa1187ec3>] mdt_reint_rename+0x13/0x20 [mdt]
<4> [<ffffffffa118118d>] mdt_reint_rec+0x5d/0x200 [mdt]
<4> [<ffffffffa116cddb>] mdt_reint_internal+0x62b/0x9f0 [mdt]
<4> [<ffffffffa116d63b>] mdt_reint+0x6b/0x120 [mdt]
<4> [<ffffffffa0ae0c2c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
<4> [<ffffffffa0a8dc61>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
<4> [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
<4> [<ffffffffa0a8cf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
<4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 5003, comm: mdt02_007 Tainted: P --------------- 2.6.32-504.30.3.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81529c9c>] ? panic+0xa7/0x16f
<4> [<ffffffffa0748ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0a2ef0f>] ? ldlm_lock_decref_internal_nolock+0x17f/0x180 [ptlrpc]
<4> [<ffffffffa0a3102d>] ? ldlm_lock_decref_internal+0x4d/0xa80 [ptlrpc]
<4> [<ffffffffa083f935>] ? class_handle2object+0x95/0x190 [obdclass]
<4> [<ffffffffa0a325a0>] ? ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
<4> [<ffffffffa1164c67>] ? mdt_object_unlock+0xa7/0x2e0 [mdt]
<4> [<ffffffffa11867ca>] ? mdt_reint_rename_or_migrate+0xf3a/0x2600 [mdt]
<4> [<ffffffffa0ab7bdd>] ? null_alloc_rs+0xcd/0x320 [ptlrpc]
<4> [<ffffffffa0876cbc>] ? upcall_cache_get_entry+0x29c/0x880 [obdclass]
<4> [<ffffffffa087bbf0>] ? lu_ucred+0x20/0x30 [obdclass]
<4> [<ffffffffa0a7d100>] ? lustre_pack_reply_v2+0x180/0x280 [ptlrpc]
<4> [<ffffffffa117d50f>] ? ucred_set_jobid+0x5f/0x70 [mdt]
<4> [<ffffffffa1187ec3>] ? mdt_reint_rename+0x13/0x20 [mdt]
<4> [<ffffffffa118118d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
<4> [<ffffffffa116cddb>] ? mdt_reint_internal+0x62b/0x9f0 [mdt]
<4> [<ffffffffa116d63b>] ? mdt_reint+0x6b/0x120 [mdt]
<4> [<ffffffffa0ae0c2c>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
<4> [<ffffffffa0a8dc61>] ? ptlrpc_main+0xd21/0x1800 [ptlrpc]
<4> [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
<4> [<ffffffffa0a8cf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
<4> [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
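For readers unfamiliar with the failed check, the toy model below illustrates the kind of invariant that ASSERTION( lock->l_writers > 0 ) expresses: a writer reference may only be dropped while at least one is held. This is not Lustre code. `ToyLock`, its methods and `unbalanced_decref_detected` are hypothetical names, and the double drop shown is only one possible way such a check can fire; this report does not identify the actual cause.

```python
class ToyLock:
    """Toy stand-in for a lock with a writer reference count (not Lustre code)."""

    def __init__(self):
        self.writers = 0

    def addref_write(self):
        # Take one writer reference.
        self.writers += 1

    def decref_write(self):
        # Same shape as the failed check: the count must be positive
        # before a writer reference is dropped.
        if self.writers <= 0:
            raise AssertionError("writers > 0 failed: unbalanced decref")
        self.writers -= 1


def unbalanced_decref_detected():
    """Drop one reference twice and report whether the check caught it."""
    lock = ToyLock()
    lock.addref_write()
    lock.decref_write()      # balanced: count returns to 0
    try:
        lock.decref_write()  # second drop with no reference held
    except AssertionError:
        return True
    return False


if __name__ == "__main__":
    print(unbalanced_decref_detected())
```

In the kernel the equivalent failure is fatal: as the trace shows, the assertion leads to LBUG and then to a kernel panic in the mdt02_007 service thread.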
Attached: message and console logs of MDS nodes lola-9 and lola-10, and also vmcore-dmesg.txt.
The crash file will be saved separately.