[LU-7825] ldlm_lock.c:810:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_writers > 0 Created: 27/Feb/16  Updated: 16/Mar/16  Resolved: 16/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: soak
Environment:

lola
build: https://build.hpdd.intel.com/job/lustre-b2_8/8/


Attachments: File console-lola-10.log.bz2     File console-lola-9.log.bz2     File lola-9-vmcore-dmesg.txt.bz2     File messages-lola-10.log.bz2     File messages-lola-9.log.bz2    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error happened during soak testing of build '20160224' (b2_8 RC2) (see:
https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20150224). DNE is enabled.
The MDSes were formatted with ldiskfs and the OSTs with zfs. The MDSes are configured in an active-active HA failover configuration.

Sequence of events:

  • 2016-02-27 02:04:02,121:fsmgmt.fsmgmt:INFO mds_failover just completed (lola-10 ---> lola-11)
  • Feb 27 02:06:44 lola-10 kernel: Lustre: soaked-MDT0005: Recovery over after 2:42, of 16 clients 14 recovered and 2 were evicted.
  • Feb 27 02:12:06 lola-10 kernel: Lustre: soaked-MDT0004: Recovery over after 8:02, of 16 clients 11 recovered and 5 were evicted.
  • 2016-02-27 02:12:58 lola-9 (different HA pair) crashed

The error reads as:

<0>LustreError: 5003:0:(ldlm_lock.c:810:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_writers > 0 ) failed: 
<0>LustreError: 5003:0:(ldlm_lock.c:810:ldlm_lock_decref_internal_nolock()) LBUG
<4>Pid: 5003, comm: mdt02_007
<4>
<4>Call Trace:
<4> [<ffffffffa0748875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0748e77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0a2ef0f>] ldlm_lock_decref_internal_nolock+0x17f/0x180 [ptlrpc]
<4> [<ffffffffa0a3102d>] ldlm_lock_decref_internal+0x4d/0xa80 [ptlrpc]
<4> [<ffffffffa083f935>] ? class_handle2object+0x95/0x190 [obdclass]
<4> [<ffffffffa0a325a0>] ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
<4> [<ffffffffa1164c67>] mdt_object_unlock+0xa7/0x2e0 [mdt]
<4> [<ffffffffa11867ca>] mdt_reint_rename_or_migrate+0xf3a/0x2600 [mdt]
<4> [<ffffffffa0ab7bdd>] ? null_alloc_rs+0xcd/0x320 [ptlrpc]
<4> [<ffffffffa0876cbc>] ? upcall_cache_get_entry+0x29c/0x880 [obdclass]
<4> [<ffffffffa087bbf0>] ? lu_ucred+0x20/0x30 [obdclass]
<4> [<ffffffffa0a7d100>] ? lustre_pack_reply_v2+0x180/0x280 [ptlrpc]
<4> [<ffffffffa117d50f>] ? ucred_set_jobid+0x5f/0x70 [mdt]
<4> [<ffffffffa1187ec3>] mdt_reint_rename+0x13/0x20 [mdt]
<4> [<ffffffffa118118d>] mdt_reint_rec+0x5d/0x200 [mdt]
<4> [<ffffffffa116cddb>] mdt_reint_internal+0x62b/0x9f0 [mdt]
<4> [<ffffffffa116d63b>] mdt_reint+0x6b/0x120 [mdt]
<4> [<ffffffffa0ae0c2c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
<4> [<ffffffffa0a8dc61>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
<4> [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
<4> [<ffffffffa0a8cf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
<4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 5003, comm: mdt02_007 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81529c9c>] ? panic+0xa7/0x16f
<4> [<ffffffffa0748ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0a2ef0f>] ? ldlm_lock_decref_internal_nolock+0x17f/0x180 [ptlrpc]
<4> [<ffffffffa0a3102d>] ? ldlm_lock_decref_internal+0x4d/0xa80 [ptlrpc]
<4> [<ffffffffa083f935>] ? class_handle2object+0x95/0x190 [obdclass]
<4> [<ffffffffa0a325a0>] ? ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
<4> [<ffffffffa1164c67>] ? mdt_object_unlock+0xa7/0x2e0 [mdt]
<4> [<ffffffffa11867ca>] ? mdt_reint_rename_or_migrate+0xf3a/0x2600 [mdt]
<4> [<ffffffffa0ab7bdd>] ? null_alloc_rs+0xcd/0x320 [ptlrpc]
<4> [<ffffffffa0876cbc>] ? upcall_cache_get_entry+0x29c/0x880 [obdclass]
<4> [<ffffffffa087bbf0>] ? lu_ucred+0x20/0x30 [obdclass]
<4> [<ffffffffa0a7d100>] ? lustre_pack_reply_v2+0x180/0x280 [ptlrpc]
<4> [<ffffffffa117d50f>] ? ucred_set_jobid+0x5f/0x70 [mdt]
<4> [<ffffffffa1187ec3>] ? mdt_reint_rename+0x13/0x20 [mdt]
<4> [<ffffffffa118118d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
<4> [<ffffffffa116cddb>] ? mdt_reint_internal+0x62b/0x9f0 [mdt]
<4> [<ffffffffa116d63b>] ? mdt_reint+0x6b/0x120 [mdt]
<4> [<ffffffffa0ae0c2c>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
<4> [<ffffffffa0a8dc61>] ? ptlrpc_main+0xd21/0x1800 [ptlrpc]
<4> [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
<4> [<ffffffffa0a8cf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
<4> [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20

Attached are the messages and console logs of MDS nodes lola-9 and lola-10, as well as vmcore-dmesg.txt.
The crash dump will be saved separately.
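
For context, l_readers and l_writers on an LDLM lock count the references currently held in read and write modes, and ldlm_lock_decref_internal_nolock() asserts that the matching counter is still positive before decrementing it. Below is a minimal, simplified sketch of that reference-counting pattern; the struct and function here are illustrative stand-ins, not the actual Lustre source.

#include <assert.h>

/* Illustrative stand-in for the LDLM lock reference counters; the real
 * struct ldlm_lock carries many more fields. */
struct lock_refs {
	int l_readers;	/* references held in read (PR/CR) modes  */
	int l_writers;	/* references held in write (PW/EX) modes */
};

/* Simplified analogue of ldlm_lock_decref_internal_nolock(): dropping a
 * reference that was never taken (or was already dropped) trips the
 * assertion, which is what the LBUG above corresponds to. */
void decref_nolock_sketch(struct lock_refs *lock, int write_mode)
{
	if (write_mode) {
		assert(lock->l_writers > 0);
		lock->l_writers--;
	} else {
		assert(lock->l_readers > 0);
		lock->l_readers--;
	}
}

int main(void)
{
	struct lock_refs lock = { 0, 0 };

	lock.l_writers++;		/* take one write reference ...    */
	decref_nolock_sketch(&lock, 1);	/* ... balanced decref is fine     */
	decref_nolock_sketch(&lock, 1);	/* unbalanced decref aborts here   */
	return 0;
}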



 Comments   
Comment by Frank Heckes (Inactive) [ 27/Feb/16 ]

The crash file has been saved at lhn.hpdd.intel.com:/scratch/crashdumps/lu-7825/lola-9/127.0.0.1-2016-02-27-02\:12\:58/.

Comment by Di Wang [ 27/Feb/16 ]

Hmm, it looks like the lock is not released correctly in the error handling path of mdt_reint_rename_internal(). Will cook a patch.
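
In other words, the MDT rename path takes write-mode locks on the parent directories and must release each one exactly once; if an error branch drops a parent lock and the common cleanup code then unlocks it again, the second decref hits the l_writers assertion. A hypothetical sketch of that double-release pattern follows; the helper names are illustrative only, not the actual mdt_reint_rename_internal() code.

#include <assert.h>

/* Hypothetical sketch of a double release on an error path; lock_take(),
 * lock_put() and rename_step() are illustrative stand-ins only. */
struct lock_handle { int writers; };

static void lock_take(struct lock_handle *lh) { lh->writers++; }

static void lock_put(struct lock_handle *lh)
{
	assert(lh->writers > 0);	/* mirrors the LASSERT that fired */
	lh->writers--;
}

static int rename_step(void) { return -1; }	/* force the error branch */

static int reint_rename_sketch(struct lock_handle *parent)
{
	int rc;

	lock_take(parent);
	rc = rename_step();
	if (rc != 0)
		lock_put(parent);	/* released once in the error branch */
	/* falls through to common cleanup ... */
	lock_put(parent);		/* ... and released again: the second
					 * decref underflows and asserts, so
					 * the lock must be released on
					 * exactly one path                  */
	return rc;
}

int main(void)
{
	struct lock_handle parent = { 0 };
	return reint_rename_sketch(&parent);
}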

Comment by Gerrit Updater [ 27/Feb/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18707
Subject: LU-7825 mdt: release parent lock correctly for rename
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6a240c713c30cd5b167d32e5c2a163f6b18d8ef6

Comment by Gerrit Updater [ 01/Mar/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18707/
Subject: LU-7825 mdt: release parent lock correctly for rename
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 30ece848c046dda5c450dc49c6b146360c077a22

Comment by Joseph Gmitter (Inactive) [ 16/Mar/16 ]

Landed to master and b2_8. Is present in the 2.8.0 release.
