[LU-8592] MDS crashed with ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) Created: 08/Sep/16 Updated: 14/Oct/16 Resolved: 14/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
lola |
||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
The error happened during soak testing of build '20160902' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160902). Sequence of events
|
| Comments |
| Comment by Frank Heckes (Inactive) [ 08/Sep/16 ] |
|
The crash dump has been saved to lhn.hpdd.intel.com:/scratch/crashdumps/lu-8592/lola-11/27.0.0.1-2016-09-06-03:41:17 |
| Comment by nasf (Inactive) [ 12/Sep/16 ] |
|
The issue may be related to the patch http://review.whamcloud.com/#/c/19041. That patch caches metadata attributes on the remote MDT. To invalidate the cached attributes, it makes the callback data of the remote ibits lock reference the cached object. That reference is released when mdt_remote_blocking_ast() is triggered.

According to the logs on lola-11, just before the ASSERTION the OSP detected an exception:

<3>LustreError: 167-0: soaked-MDT0000-osp-MDT0003: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
<3>LustreError: 32471:0:(ldlm_resource.c:878:ldlm_resource_complain()) soaked-MDT0000-osp-MDT0003: namespace resource [0x2c00013a4:0x97e0:0x0].0x0 (ffff88078ba0f2c0) refcount nonzero (1) after lock cleanup; forcing cleanup.

This means the connection from MDT3 to MDT0 was evicted by MDT0, and the resulting IMP_EVENT_INVALIDATE event triggered ldlm_namespace_cleanup(). Unfortunately, at that time some upper-layer user was still referencing the resource [0x2c00013a4:0x97e0:0x0]. I suspect that the resource (object) was cleaned up by force, which left the related object's reference count wrong, so that the subsequent mdt_remote_object_lock() hit the ASSERTION.

Currently I have no detailed scenario to reproduce the failure, but I will make a patch to harden the related logic and check what happens. |
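To make the suspected failure mode concrete, here is a minimal single-threaded sketch of a lock whose callback data pins a reference-counted object, with two paths (a blocking callback and a forced cleanup) that may both try to drop that reference. All names (`sketch_object`, `sketch_lock`, `sketch_lock_attach`, `sketch_lock_release`) are hypothetical and are not Lustre APIs. This is not the change made in http://review.whamcloud.com/22438; it only illustrates why the reference has to be released exactly once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached remote object (not the Lustre lu_object). */
struct sketch_object {
    int ref; /* plays the role of loh_ref in the ASSERTION */
};

/* Hypothetical stand-in for a remote ibits lock with callback data. */
struct sketch_lock {
    struct sketch_object *cb_data; /* NULL once the pinned reference is gone */
};

static void sketch_object_get(struct sketch_object *o)
{
    o->ref++;
}

static void sketch_object_put(struct sketch_object *o)
{
    /* Same shape as the check that fired on the MDS. */
    assert(o->ref > 0);
    o->ref--;
}

/* Granting the lock: the callback data takes a reference on the object. */
static void sketch_lock_attach(struct sketch_lock *lk, struct sketch_object *o)
{
    sketch_object_get(o);
    lk->cb_data = o;
}

/*
 * Called from both the blocking-callback path and the forced-cleanup path.
 * Clearing cb_data before the put guarantees that only the first caller
 * drops the reference. Returns true if this call released it.
 * A real implementation would need locking or an atomic exchange here.
 */
static bool sketch_lock_release(struct sketch_lock *lk)
{
    struct sketch_object *o = lk->cb_data;

    if (o == NULL)
        return false;
    lk->cb_data = NULL;
    sketch_object_put(o);
    return true;
}
```

Without the `cb_data` check, a cleanup that released the reference followed by a later callback doing the same would decrement the count twice, and a subsequent user of the object could then trip a `ref > 0` assertion of this kind.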
| Comment by Gerrit Updater [ 12/Sep/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22438 |
| Comment by nasf (Inactive) [ 12/Sep/16 ] |
|
Frank, would you please try this patch? |
| Comment by Frank Heckes (Inactive) [ 12/Sep/16 ] |
|
Sure, I will. It could take until Wednesday before I start, as I have to reproduce an error for EE-3.1 first. |
| Comment by Frank Heckes (Inactive) [ 20/Sep/16 ] |
|
Installed build containing http://review.whamcloud.com/22438 patchset #4 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160916) |
| Comment by Gerrit Updater [ 13/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22438/ |
| Comment by Peter Jones [ 14/Oct/16 ] |
|
Landed for 2.9 |