[LU-5163] (lu_object.h:852:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed Created: 09/Jun/14  Updated: 10/Jul/18  Resolved: 05/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.11.0, Lustre 2.10.2
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Blocker
Reporter: John Hammond Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: dne2, mdd, migration

Issue Links:
Duplicate
is duplicated by LU-5242 Test hang sanity test_132, test_133: ... Resolved
Related
is related to LU-5125 lu_object_attr() should return 0 for ... Closed
is related to LU-5388 Interop 2.5.2<->2.6 failure on test s... Resolved
is related to LU-11135 racer: ASSERTION( ((o)->lo_header->lo... Resolved
Severity: 3
Rank (Obsolete): 14233

 Description   

Running racer on 2.5.59-66-g47cde80 with MDSCOUNT=4 and file_create.sh modified to do fewer writes.

[ 2383.017816] LustreError: 16406:0:(lu_object.h:852:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: 
[ 2383.020138] LustreError: 16406:0:(lu_object.h:852:lu_object_attr()) LBUG
[ 2383.021290] Pid: 16406, comm: mdt00_002
[ 2383.021909] 
[ 2383.021910] Call Trace:
[ 2383.022612]  [<ffffffffa02be8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 2383.023915]  [<ffffffffa02beec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 2383.025200]  [<ffffffffa0bc6e8b>] mdd_migrate_entries+0xffb/0x1100 [mdd]
[ 2383.026503]  [<ffffffffa0bc7af7>] mdd_migrate+0xb67/0x13d0 [mdd]
[ 2383.027666]  [<ffffffffa0afc61b>] ? osd_object_read_unlock+0x8b/0xd0 [osd_ldiskfs]
[ 2383.029154]  [<ffffffffa0c25c78>] mdt_reint_migrate_internal+0x15c8/0x1b40 [mdt]
[ 2383.030578]  [<ffffffff815547cb>] ? _spin_unlock+0x2b/0x40
[ 2383.031524]  [<ffffffffa0c299d3>] mdt_reint_rename_or_migrate+0x2a3/0x660 [mdt]
[ 2383.032776]  [<ffffffffa0c03c95>] ? mdt_ucred+0x15/0x20 [mdt]
[ 2383.033739]  [<ffffffffa0c1e4cc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
[ 2383.034849]  [<ffffffffa06b7026>] ? __req_capsule_get+0x166/0x6e0 [ptlrpc]
[ 2383.035986]  [<ffffffffa0c29da3>] mdt_reint_migrate+0x13/0x20 [mdt]
[ 2383.037045]  [<ffffffffa0c223e1>] mdt_reint_rec+0x41/0xe0 [mdt]
[ 2383.038083]  [<ffffffffa0c07e43>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
[ 2383.039234]  [<ffffffffa0c086cb>] mdt_reint+0x6b/0x120 [mdt]
[ 2383.040228]  [<ffffffffa06f0a85>] tgt_request_handle+0x245/0xad0 [ptlrpc]
[ 2383.041371]  [<ffffffffa06a18d1>] ptlrpc_main+0xce1/0x1970 [ptlrpc]
[ 2383.042435]  [<ffffffffa06a0bf0>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
[ 2383.043468]  [<ffffffff8109eab6>] kthread+0x96/0xa0
[ 2383.044261]  [<ffffffff8100c30a>] child_rip+0xa/0x20
[ 2383.045066]  [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40
[ 2383.045986]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[ 2383.046914]  [<ffffffff8109ea20>] ? kthread+0x0/0xa0
[ 2383.047770]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
[ 2383.048666] 
[ 2383.049141] Kernel panic - not syncing: LBUG

This is from mdd_migrate_entries() line 3473.

                child = mdd_object_find(env, mdd, &ent->lde_fid);
                if (IS_ERR(child))
                        GOTO(out, rc = PTR_ERR(child));

                is_dir = S_ISDIR(lu_object_attr(&child->mod_obj.mo_lu));


 Comments   
Comment by Andreas Dilger [ 09/Jun/14 ]

Does this have the patch from LU-5069 or is it a new issue?

Comment by John Hammond [ 09/Jun/14 ]

This is on today's master which has that patch. This is a new issue which is not related to rename sanity checking.

Comment by Di Wang [ 10/Jun/14 ]

Hmm, it seems the name entry is a dangling entry, and we can probably skip it during migration. But I am not sure how this dangling entry was created, or whether it is related to migration.

Comment by Di Wang [ 11/Jun/14 ]

Hmm, I ran a few tests, and this seems related to http://review.whamcloud.com/9538, which does the is_subdir check without LDLM lock protection. I am not sure, though, since it is not easy to get a debug log to analyze. I tried rewriting the patch to use lookup(..) instead of is_subdir, and that seems to fix the LBUG; at least I cannot reproduce it with the patch http://review.whamcloud.com/10673

Comment by John Hammond [ 11/Jun/14 ]

I cherry-picked http://review.whamcloud.com/10673 onto today's master (2.5.59-79-g5c4573e) and I see the same LBUG with the same stack trace.

Comment by John Hammond [ 11/Jun/14 ]

Regardless of LDLM issues, this FID is coming straight from the disk so we should not be asserting on the existence of the object. I know that some may want to leave this in until they understand the underlying issue but I would prefer to have a less crashy implementation of an already best-effort type operation like migration.

Comment by Alex Zhuravlev [ 11/Jun/14 ]

My take on this is that it's better to crash and restart than to silently corrupt a filesystem (potentially, given we don't understand everything at the moment).

Comment by Andreas Dilger [ 08/Jan/15 ]

Hit this again running racer on my single-node test system (2x MDT, 3x OST). The stack is a little different than the original one, so I thought I'd post it here:

LustreError: 25072:0:(lu_object.h:859:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
LustreError: 25072:0:(lu_object.h:859:lu_object_attr()) LBUG
Pid: 25072, comm: mdt00_005

Call Trace:
 [<ffffffffa13a7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa13a7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa1015e07>] lu_object_attr+0x47/0x50 [mdt]
 [<ffffffffa10186b3>] mdt_reint_migrate_internal+0x6a3/0x1b50 [mdt]
 [<ffffffffa101d7cb>] mdt_reint_rename_or_migrate+0x3cb/0x6c0 [mdt]
 [<ffffffffa101dad3>] mdt_reint_migrate+0x13/0x20 [mdt]
 [<ffffffffa1015fcd>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa0ffa19b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
 [<ffffffffa0ffa9fb>] mdt_reint+0x6b/0x120 [mdt]
 [<ffffffffa0c5398e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
 [<ffffffffa0c03721>] ptlrpc_main+0xe41/0x1960 [ptlrpc]

In particular, this is failing in mdt_reint_migrate_internal() rather than in mdd_migrate_entries():

(gdb) list *(mdt_reint_migrate_internal+0x6a3)
0x266e3 is in mdt_reint_migrate_internal (/usr/src/lustre-head/lustre/mdt/mdt_reint.c:1518).
1508            mold = mdt_object_find(info->mti_env, info->mti_mdt, old_fid);
1509            if (IS_ERR(mold))
1510                    GOTO(out_unlock_parent, rc = PTR_ERR(mold));
1511
1512            if (mdt_object_remote(mold)) {
1513                    CERROR("%s: source "DFID" is on the remote MDT\n",
1514                           mdt_obd_name(info->mti_mdt), PFID(old_fid));
1515                    GOTO(out_put_child, rc = -EREMOTE);
1516            }
1517
1518            if (S_ISREG(lu_object_attr(&mold->mot_obj)) &&
1519                !mdt_object_remote(msrcdir)) {
1520                    CERROR("%s: parent "DFID" is still on the same"
1521                           " MDT, which should be migrated first:"
1522                           " rc = %d\n", mdt_obd_name(info->mti_mdt),

Is mold locked at this point after mdd_object_find->lu_object_find? Otherwise it seems entirely possible to delete the object between the time it is looked up and when the LASSERT() trips in lu_object_attr().

Comment by Di Wang [ 15/Jan/15 ]

Well, the parent has been locked, and the mold object is obtained by name->FID lookup. If the name entry exists but the object (mold) does not, it means the name entry became a dangling entry during racer.

Comment by Di Wang [ 15/Jan/15 ]

This is probably because we did not lock all of the children when migrating the directory. Hmm.

Comment by Lai Siyao [ 13/Apr/17 ]

I hit this in a local test and am looking into it now.

Comment by Lai Siyao [ 14/Apr/17 ]

This looks to be in mdd_migrate_entries() -> mdd_object_type(child), which asserts that child exists. That is not always true, because child is not locked. I'll push a patch soon.

Comment by Gerrit Updater [ 14/Apr/17 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/26620
Subject: LU-5163 mdd: obtain type from dirent
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c6fc00290924905e6c8c62bdecf283b04bc6f712

Comment by James Casper [ 14/Nov/17 ]

Seen in master 2.10.55 b3667:

https://testing.hpdd.intel.com/test_sessions/0a9fe899-314e-4874-bd82-8a966bc7ad88

Comment by Bob Glossman (Inactive) [ 16/Nov/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/77756eb8-cb0a-11e7-9c63-52540065bddc

Comment by Sarah Liu [ 29/Nov/17 ]

on 2.10.2
https://testing.hpdd.intel.com/test_sets/c087a034-d4bb-11e7-8027-52540065bddc

Comment by Gerrit Updater [ 04/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26620/
Subject: LU-5163 mdd: migrated entry may not exist
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 542b76d142c491a0a1bf8a2f4fd22af4733f59cb

Comment by Peter Jones [ 04/Jan/18 ]

Lai

Can this ticket be marked as resolved now or is Di's patch still needed too?

Peter

Comment by Lai Siyao [ 05/Jan/18 ]

Peter, it can be marked as resolved now, and Di's patch is not needed IMHO.

Comment by Peter Jones [ 05/Jan/18 ]

ok - thanks!

Comment by Minh Diep [ 12/Feb/18 ]

+1 on b2_10

https://testing.hpdd.intel.com/test_sets/cbf69ae0-0ed3-11e8-a6ad-52540065bddc

Comment by Gerrit Updater [ 12/Feb/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31281
Subject: LU-5163 mdd: migrated entry may not exist
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: dac6aa7caf537d4d3e7a772c36fbdb355da47e56

Comment by Gerrit Updater [ 19/Mar/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31281/
Subject: LU-5163 mdd: migrated entry may not exist
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 8af27a326ffebc7c0f40734767035d353bafa43c
