LU-8592: MDS crashed with ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 )

Details


    Description

      The error happened during soak testing of build '20160902' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160902).
      The configuration is as follows:
      4 MDS nodes with 1 MDT per MDS, formatted with ldiskfs and configured pairwise in an active-active HA configuration
      6 OSS nodes with 4 OSTs per OSS, formatted with ldiskfs and configured pairwise in an active-active HA configuration
      DNE is enabled

      Sequence of events

      • 2016-09-06 02:51:28,201:fsmgmt.fsmgmt:INFO triggering fault mds_failover ( lola-8 (mdt-0) --> lola-9)
      • 2016-09-06 03:41:33,479:fsmgmt.fsmgmt:INFO mds_failover just completed (mdt-0 failed back to lola-8)
      • 2016-09-06 03:41:17 MDS lola-11 crashed with error message:
        <0>LustreError: 6666:0:(lu_object.h:716:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
        <0>LustreError: 6666:0:(lu_object.h:716:lu_object_get()) LBUG
        <4>Pid: 6666, comm: mdt03_002
        <4>
        <4>Call Trace:
        <4> [<ffffffffa07f0875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
        <4> [<ffffffffa07f0e77>] lbug_with_loc+0x47/0xb0 [libcfs]
        <4> [<ffffffffa1203071>] mdt_remote_object_lock+0x491/0x4a0 [mdt]
        <4> [<ffffffffa12298a0>] mdt_reint_open+0x2b90/0x3180 [mdt]
        <4> [<ffffffffa1211ead>] mdt_reint_rec+0x5d/0x200 [mdt]
        <4> [<ffffffffa11fd5db>] mdt_reint_internal+0x62b/0xa50 [mdt]
        <4> [<ffffffffa11fdbf6>] mdt_intent_reint+0x1f6/0x440 [mdt]
        <4> [<ffffffffa11fb8be>] mdt_intent_policy+0x4be/0xc70 [mdt]
        <4> [<ffffffffa0ab77c7>] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
        <4> [<ffffffffa0ae2c27>] ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
        <4> [<ffffffffa0b68b21>] tgt_enqueue+0x61/0x230 [ptlrpc]
        <4> [<ffffffffa0b69ccc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
        <4> [<ffffffffa0b16501>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
        <4> [<ffffffffa0b157d0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
        <4> [<ffffffff810a138e>] kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
        <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
        <4>
        <0>Kernel panic - not syncing: LBUG
        <4>Pid: 6666, comm: mdt03_002 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
        <4>Call Trace:
        <4> [<ffffffff81539407>] ? panic+0xa7/0x16f
        <4> [<ffffffffa07f0ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
        <4> [<ffffffffa1203071>] ? mdt_remote_object_lock+0x491/0x4a0 [mdt]
        <4> [<ffffffffa12298a0>] ? mdt_reint_open+0x2b90/0x3180 [mdt]
        <4> [<ffffffffa1211ead>] ? mdt_reint_rec+0x5d/0x200 [mdt]
        <4> [<ffffffffa11fd5db>] ? mdt_reint_internal+0x62b/0xa50 [mdt]
        <4> [<ffffffffa11fdbf6>] ? mdt_intent_reint+0x1f6/0x440 [mdt]
        <4> [<ffffffffa11fb8be>] ? mdt_intent_policy+0x4be/0xc70 [mdt]
        <4> [<ffffffffa0ab77c7>] ? ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
        <4> [<ffffffffa0ae2c27>] ? ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
        <4> [<ffffffffa0b68b21>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
        <4> [<ffffffffa0b69ccc>] ? tgt_request_handle+0x8ec/0x1440 [ptlrpc]
        <4> [<ffffffffa0b16501>] ? ptlrpc_main+0xd31/0x1800 [ptlrpc]
        <4> [<ffffffffa0b157d0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
        <4> [<ffffffff810a138e>] ? kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
        <4> [<ffffffff810a12f0>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
        

        Attached files:
        console and message logs of all MDS nodes; vmcore-dmesg.txt of lola-11.
        A crash dump is available.
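
        For reference, the failing check is the LASSERT() in lu_object_get(); a simplified sketch of that helper (based on lustre/include/lu_object.h, shown here only to illustrate what the assertion means) looks like:

        static inline void lu_object_get(struct lu_object *o)
        {
                /* Taking a new reference is only legal while the object is
                 * still live; loh_ref == 0 means it is already being freed. */
                LASSERT(atomic_read(&o->lo_header->loh_ref) > 0);
                atomic_inc(&o->lo_header->loh_ref);
        }

        So the crash means the mdt_reint_open() -> mdt_remote_object_lock() path tried to take a reference on an object whose last reference had already been dropped.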

      Attachments

        1. console-lola-10.log.bz2
          54 kB
        2. console-lola-11.log.bz2
          54 kB
        3. console-lola-8.log.bz2
          94 kB
        4. console-lola-9.log.bz2
          63 kB
        5. lola-11-vmcore-dmesg.txt.bz2
          29 kB
        6. message-lola-10.log.bz2
          188 kB
        7. message-lola-11.log.bz2
          183 kB
        8. message-lola-8.log.bz2
          398 kB
        9. message-lola-9.log.bz2
          405 kB

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22438/
          Subject: LU-8592 mdt: hold mdt_device::mdt_md_root until service stop
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 1bf196aaf5adb9a1c67886c0cd6a780ec6838040

          heckes Frank Heckes (Inactive) added a comment - - edited

          Installed a build containing http://review.whamcloud.com/22438 patch set #4 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160916).
          The test session has been running since 2016-09-17. There has been no occurrence of this bug so far, but due to LU-8580 (which happens roughly every 2 hours) it is difficult to say whether your fix resolves the problem.


          heckes Frank Heckes (Inactive) added a comment -

          Sure, I'll do that. It could take until Wednesday before I start, as I have to reproduce an error for EE-3.1 first.


          yong.fan nasf (Inactive) added a comment -

          Frank,

          Would you please try this patch?
          Thanks!


          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22438
          Subject: LU-8592 dne: cache remote ibits lock properly
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 8bae476f5258e78be6cc7a58a19be8f95fc59209

          yong.fan nasf (Inactive) added a comment - - edited

          The issue may be related to the patch http://review.whamcloud.com/#/c/19041. That patch caches metadata attributes on the remote MDT. To invalidate the cached attributes, it makes the remote ibits lock callback data reference the cached object. That reference is released when mdt_remote_blocking_ast() is triggered.

          According to the logs on lola-11, just before the ASSERTION, the OSP detected an exception:

          <3>LustreError: 167-0: soaked-MDT0000-osp-MDT0003: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
          <3>LustreError: 32471:0:(ldlm_resource.c:878:ldlm_resource_complain()) soaked-MDT0000-osp-MDT0003: namespace resource [0x2c00013a4:0x97e0:0x0].0x0 (ffff88078ba0f2c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
          

          That means the connection from MDT3 to MDT0 was evicted by MDT0, and the resulting IMP_EVENT_INVALIDATE event triggered ldlm_namespace_cleanup(). Unfortunately, at that time, some upper-layer user was still referencing the resource [0x2c00013a4:0x97e0:0x0]. It is suspected that this resource (object) was cleaned up by force, which left the related object's reference count wrong, so the subsequent mdt_remote_object_lock() hit the ASSERTION. Currently, I have no detailed scenario to reproduce the failure, but I will make a patch to strengthen the related logic and check what happens.
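
          To make the suspected sequence easier to follow, here is a small, purely hypothetical user-space model (not Lustre code; all names are made up) of how a forced cleanup that ignores an outstanding pin can lead a later "get" to trip a reference-count assertion:

          /* Hypothetical user-space model of the suspected race; not Lustre code. */
          #include <assert.h>
          #include <stdio.h>

          struct object {
                  int refcount;                     /* models lo_header->loh_ref */
          };

          static struct object *object_get(struct object *o)
          {
                  assert(o->refcount > 0);          /* models LASSERT() in lu_object_get() */
                  o->refcount++;
                  return o;
          }

          static void object_put(struct object *o)
          {
                  if (--o->refcount == 0)
                          printf("object freed\n");
          }

          int main(void)
          {
                  struct object cached = { .refcount = 1 };    /* cache holds one reference */
                  struct object *pin = object_get(&cached);    /* lock callback data pins it */

                  /* Normal path: the blocking AST would drop the pin here.
                   * Modelled failure: eviction-driven forced cleanup drops both
                   * references even though the callback data still points at it. */
                  object_put(&cached);
                  object_put(&cached);

                  /* Later, a new lock attempt takes a reference through the stale
                   * pointer and the refcount assertion fires, as in the crash. */
                  object_get(pin);
                  return 0;
          }

          In the real crash the equivalent of that last step would be mdt_remote_object_lock() calling lu_object_get() on the force-cleaned object.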


          heckes Frank Heckes (Inactive) added a comment -

          The crash dump has been saved to lhn.hpdd.intel.com:/scratch/crashdumps/lu-8592/lola-11/27.0.0.1-2016-09-06-03:41:17


          People

            yong.fan nasf (Inactive)
            heckes Frank Heckes (Inactive)
            Votes:
            0
            Watchers:
            8
