[LU-8592] MDS crashed with ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) Created: 08/Sep/16  Updated: 14/Oct/16  Resolved: 14/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: soak
Environment:

lola
build: https://build.hpdd.intel.com/job/lustre-master/3431/ tag 2.8.57 for el6.7


Attachments: File console-lola-10.log.bz2     File console-lola-11.log.bz2     File console-lola-8.log.bz2     File console-lola-9.log.bz2     File lola-11-vmcore-dmesg.txt.bz2     File message-lola-10.log.bz2     File message-lola-11.log.bz2     File message-lola-8.log.bz2     File message-lola-9.log.bz2    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error happened during soak testing of build '20160902' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160902).
The configuration is as follows:
4 MDS with 1 MDT per MDS, formatted with ldiskfs and configured pairwise in an active-active HA configuration
6 OSS with 4 OSTs per OSS, formatted with ldiskfs and configured pairwise in an active-active HA configuration
DNE is enabled

Sequence of events

  • 2016-09-06 02:51:28,201:fsmgmt.fsmgmt:INFO triggering fault mds_failover ( lola-8 (mdt-0) --> lola-9)
  • 2016-09-06 03:41:33,479:fsmgmt.fsmgmt:INFO mds_failover just completed (mdt-0 failed back to lola-8)
  • 2016-09-06 03:41:17 MDS lola-11 crashed with error message:
    <0>LustreError: 6666:0:(lu_object.h:716:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
    <0>LustreError: 6666:0:(lu_object.h:716:lu_object_get()) LBUG
    <4>Pid: 6666, comm: mdt03_002
    <4>
    <4>Call Trace:
    <4> [<ffffffffa07f0875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
    <4> [<ffffffffa07f0e77>] lbug_with_loc+0x47/0xb0 [libcfs]
    <4> [<ffffffffa1203071>] mdt_remote_object_lock+0x491/0x4a0 [mdt]
    <4> [<ffffffffa12298a0>] mdt_reint_open+0x2b90/0x3180 [mdt]
    <4> [<ffffffffa1211ead>] mdt_reint_rec+0x5d/0x200 [mdt]
    <4> [<ffffffffa11fd5db>] mdt_reint_internal+0x62b/0xa50 [mdt]
    <4> [<ffffffffa11fdbf6>] mdt_intent_reint+0x1f6/0x440 [mdt]
    <4> [<ffffffffa11fb8be>] mdt_intent_policy+0x4be/0xc70 [mdt]
    <4> [<ffffffffa0ab77c7>] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
    <4> [<ffffffffa0ae2c27>] ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
    <4> [<ffffffffa0b68b21>] tgt_enqueue+0x61/0x230 [ptlrpc]
    <4> [<ffffffffa0b69ccc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
    <4> [<ffffffffa0b16501>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
    <4> [<ffffffffa0b157d0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
    <4> [<ffffffff810a138e>] kthread+0x9e/0xc0
    <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
    <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
    <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
    <4>
    <0>Kernel panic - not syncing: LBUG
    <4>Pid: 6666, comm: mdt03_002 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
    <4>Call Trace:
    <4> [<ffffffff81539407>] ? panic+0xa7/0x16f
    <4> [<ffffffffa07f0ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
    <4> [<ffffffffa1203071>] ? mdt_remote_object_lock+0x491/0x4a0 [mdt]
    <4> [<ffffffffa12298a0>] ? mdt_reint_open+0x2b90/0x3180 [mdt]
    <4> [<ffffffffa1211ead>] ? mdt_reint_rec+0x5d/0x200 [mdt]
    <4> [<ffffffffa11fd5db>] ? mdt_reint_internal+0x62b/0xa50 [mdt]
    <4> [<ffffffffa11fdbf6>] ? mdt_intent_reint+0x1f6/0x440 [mdt]
    <4> [<ffffffffa11fb8be>] ? mdt_intent_policy+0x4be/0xc70 [mdt]
    <4> [<ffffffffa0ab77c7>] ? ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
    <4> [<ffffffffa0ae2c27>] ? ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
    <4> [<ffffffffa0b68b21>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
    <4> [<ffffffffa0b69ccc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
    <4> [<ffffffffa0b16501>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc]
    <4> [<ffffffffa0b157d0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
    <4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
    <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
    <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
    <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
    

    Attached files:
    console and message logs of all MDS nodes; vmcore-dmesg.txt of lola-11.
    A crash dump is available.
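
For context: the assertion at lu_object.h:716 is the sanity check in lu_object_get() that refuses to take a new reference on an object whose header refcount has already dropped to zero. Roughly, as a user-space model of the check (a sketch, not the exact source of this build):

#include <assert.h>
#include <stdatomic.h>

struct lu_object_header_model { atomic_int loh_ref; };
struct lu_object_model        { struct lu_object_header_model *lo_header; };

/* Model of lu_object_get(): a new reference may only be taken while the
 * object is still alive, i.e. someone else already holds a reference. */
static void lu_object_get_model(struct lu_object_model *o)
{
        assert(atomic_load(&o->lo_header->loh_ref) > 0);  /* the failed ASSERTION */
        atomic_fetch_add(&o->lo_header->loh_ref, 1);
}

In other words, by the time mdt_remote_object_lock() tried to take a reference, the object's refcount was already zero, meaning something had released it once too often or torn it down underneath the caller.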



 Comments   
Comment by Frank Heckes (Inactive) [ 08/Sep/16 ]

The crash dump has been saved to lhn.hpdd.intel.com:/scratch/crashdumps/lu-8592/lola-11/27.0.0.1-2016-09-06-03:41:17

Comment by nasf (Inactive) [ 12/Sep/16 ]

The issue may be related to the patch http://review.whamcloud.com/#/c/19041, which caches metadata attributes on the remote MDT. To invalidate the cached attributes, it makes the remote ibits lock callback data reference the cached object; that reference is released when mdt_remote_blocking_ast() is triggered.
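
A minimal sketch of that reference pattern, using hypothetical names rather than the actual patch code: the lock's callback data pins the cached object when the remote ibits lock is set up, and the pin is dropped only when the blocking AST fires.

struct cached_obj  { int refcount; };                 /* stands in for loh_ref    */
struct remote_lock { struct cached_obj *cb_data; };   /* ibits lock callback data */

/* On enqueue: the callback data takes a reference on the cached object. */
static void remote_lock_setup(struct remote_lock *lock, struct cached_obj *obj)
{
        obj->refcount++;
        lock->cb_data = obj;
}

/* On blocking AST (the mdt_remote_blocking_ast() step described above):
 * the only place this scheme releases the reference. */
static void remote_blocking_ast(struct remote_lock *lock)
{
        lock->cb_data->refcount--;
        lock->cb_data = NULL;
}

Any teardown path that frees the lock or the object without going through the blocking AST can therefore leave the reference count wrong, which is the suspicion described below.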

According to the logs on lola-11, just before the ASSERTION the OSP detected an exception:

<3>LustreError: 167-0: soaked-MDT0000-osp-MDT0003: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
<3>LustreError: 32471:0:(ldlm_resource.c:878:ldlm_resource_complain()) soaked-MDT0000-osp-MDT0003: namespace resource [0x2c00013a4:0x97e0:0x0].0x0 (ffff88078ba0f2c0) refcount nonzero (1) after lock cleanup; forcing cleanup.

That means the connection from MDT3 to MDT0 was evicted by MDT0, and the resulting IMP_EVENT_INVALIDATE event triggered ldlm_namespace_cleanup(). Unfortunately, at that time some upper-layer user was still referencing the resource [0x2c00013a4:0x97e0:0x0]. It is suspected that this resource (object) was cleaned up by force, leaving the related object's reference count wrong, so the subsequent mdt_remote_object_lock() hit the ASSERTION. I do not yet have a detailed scenario to reproduce the failure, but I will make a patch to harden the related logic and check what happens.
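
A compact model of the suspected sequence (a user-space sketch with hypothetical names; the real scenario has not yet been reproduced):

#include <assert.h>

static int loh_ref = 1;   /* the one reference kept for the cached remote object */

/* IMP_EVENT_INVALIDATE path: ldlm_namespace_cleanup() tears the resource
 * down by force, without going through the blocking-AST release. */
static void forced_cleanup(void)
{
        loh_ref = 0;
}

/* A later mdt_remote_object_lock() still believes the object is valid and
 * calls lu_object_get(), which trips ASSERTION(loh_ref > 0). */
static void later_remote_object_lock(void)
{
        assert(loh_ref > 0);
        loh_ref++;
}

int main(void)
{
        forced_cleanup();               /* eviction-driven namespace cleanup   */
        later_remote_object_lock();     /* aborts here, analogous to the LBUG  */
        return 0;
}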

Comment by Gerrit Updater [ 12/Sep/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22438
Subject: LU-8592 dne: cache remote ibits lock properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8bae476f5258e78be6cc7a58a19be8f95fc59209

Comment by nasf (Inactive) [ 12/Sep/16 ]

Frank,

Would you please try this patch?
Thanks!

Comment by Frank Heckes (Inactive) [ 12/Sep/16 ]

Sure, I will. It could take until Wednesday before I start, as I have to reproduce an error for EE-3.1 first.

Comment by Frank Heckes (Inactive) [ 20/Sep/16 ]

Installed build containing http://review.whamcloud.com/22438 patchset #4 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160916)
The test session has been running since 2016-09-17. There has been no occurrence of this bug so far, but due to LU-8580 (which hits roughly every 2 hours) it is difficult to say whether your fix resolves the problem.

Comment by Gerrit Updater [ 13/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22438/
Subject: LU-8592 mdt: hold mdt_device::mdt_md_root until service stop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1bf196aaf5adb9a1c67886c0cd6a780ec6838040

Comment by Peter Jones [ 14/Oct/16 ]

Landed for 2.9
