[LU-5163] (lu_object.h:852:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed
Created: 09/Jun/14 Updated: 10/Jul/18 Resolved: 05/Jan/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0, Lustre 2.11.0, Lustre 2.10.2 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | John Hammond | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne2, mdd, migration |
| Severity: | 3 |
| Rank (Obsolete): | 14233 |
| Description |
|
Running racer on 2.5.59-66-g47cde80 with MDSCOUNT=4 and file_create.sh modified to do fewer writes.

[ 2383.017816] LustreError: 16406:0:(lu_object.h:852:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
[ 2383.020138] LustreError: 16406:0:(lu_object.h:852:lu_object_attr()) LBUG
[ 2383.021290] Pid: 16406, comm: mdt00_002
[ 2383.021909]
[ 2383.021910] Call Trace:
[ 2383.022612] [<ffffffffa02be8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 2383.023915] [<ffffffffa02beec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 2383.025200] [<ffffffffa0bc6e8b>] mdd_migrate_entries+0xffb/0x1100 [mdd]
[ 2383.026503] [<ffffffffa0bc7af7>] mdd_migrate+0xb67/0x13d0 [mdd]
[ 2383.027666] [<ffffffffa0afc61b>] ? osd_object_read_unlock+0x8b/0xd0 [osd_ldiskfs]
[ 2383.029154] [<ffffffffa0c25c78>] mdt_reint_migrate_internal+0x15c8/0x1b40 [mdt]
[ 2383.030578] [<ffffffff815547cb>] ? _spin_unlock+0x2b/0x40
[ 2383.031524] [<ffffffffa0c299d3>] mdt_reint_rename_or_migrate+0x2a3/0x660 [mdt]
[ 2383.032776] [<ffffffffa0c03c95>] ? mdt_ucred+0x15/0x20 [mdt]
[ 2383.033739] [<ffffffffa0c1e4cc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
[ 2383.034849] [<ffffffffa06b7026>] ? __req_capsule_get+0x166/0x6e0 [ptlrpc]
[ 2383.035986] [<ffffffffa0c29da3>] mdt_reint_migrate+0x13/0x20 [mdt]
[ 2383.037045] [<ffffffffa0c223e1>] mdt_reint_rec+0x41/0xe0 [mdt]
[ 2383.038083] [<ffffffffa0c07e43>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
[ 2383.039234] [<ffffffffa0c086cb>] mdt_reint+0x6b/0x120 [mdt]
[ 2383.040228] [<ffffffffa06f0a85>] tgt_request_handle+0x245/0xad0 [ptlrpc]
[ 2383.041371] [<ffffffffa06a18d1>] ptlrpc_main+0xce1/0x1970 [ptlrpc]
[ 2383.042435] [<ffffffffa06a0bf0>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
[ 2383.043468] [<ffffffff8109eab6>] kthread+0x96/0xa0
[ 2383.044261] [<ffffffff8100c30a>] child_rip+0xa/0x20
[ 2383.045066] [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40
[ 2383.045986] [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[ 2383.046914] [<ffffffff8109ea20>] ? kthread+0x0/0xa0
[ 2383.047770] [<ffffffff8100c300>] ? child_rip+0x0/0x20
[ 2383.048666]
[ 2383.049141] Kernel panic - not syncing: LBUG

This is from mdd_migrate_entries() line 3473.

child = mdd_object_find(env, mdd, &ent->lde_fid);
if (IS_ERR(child))
	GOTO(out, rc = PTR_ERR(child));
is_dir = S_ISDIR(lu_object_attr(&child->mod_obj.mo_lu));
|
| Comments |
| Comment by Andreas Dilger [ 09/Jun/14 ] |
|
Does this have the patch from |
| Comment by John Hammond [ 09/Jun/14 ] |
|
This is on today's master which has that patch. This is a new issue which is not related to rename sanity checking. |
| Comment by Di Wang [ 10/Jun/14 ] |
|
Hmm, it seems the name entry is a dangling entry, and we can probably skip this name entry during migration. But I am not sure how this dangling entry was created, or whether it is related to migration at all. |
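The "skip the dangling entry" idea above can be sketched in isolation. This is a minimal illustration, not code from any Lustre patch: the `sketch_*` types, the `SKETCH_LOHA_EXISTS` value, and `sketch_migrate_entries()` are hypothetical stand-ins for the real `mdd` structures, and the actual migration work is reduced to a counter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the LOHA_EXISTS flag; the value is illustrative. */
#define SKETCH_LOHA_EXISTS (1u << 0)

/* Hypothetical directory entry: a name plus the attribute word of the
 * object that the entry's FID resolved to. */
struct sketch_entry {
	const char *name;
	uint32_t    loh_attr;
};

struct sketch_result {
	size_t migrated; /* entries whose object exists */
	size_t skipped;  /* dangling entries: name present, object gone */
};

/* Walk the entries and skip any whose object does not exist, instead of
 * asserting on existence. Real migration work is elided. */
static struct sketch_result
sketch_migrate_entries(const struct sketch_entry *ents, size_t n)
{
	struct sketch_result res = { 0, 0 };

	for (size_t i = 0; i < n; i++) {
		if ((ents[i].loh_attr & SKETCH_LOHA_EXISTS) == 0) {
			res.skipped++;   /* dangling entry: do not touch it */
			continue;
		}
		res.migrated++;          /* placeholder for the real work */
	}
	return res;
}
```

Whether silently skipping is acceptable is a separate question; the sketch only shows that the existence check can be made before any attribute is read.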
| Comment by Di Wang [ 11/Jun/14 ] |
|
Hmm, I did a few tests, and it seems this is related to http://review.whamcloud.com/9538, which does the is_subdir check without LDLM lock protection. I am not so sure, though, since it is not easy to get a debug log to analyze it. I tried rewriting the patch to use lookup(..) instead of is_subdir, and that seems to fix the LBUG; at least I cannot reproduce the LBUG with the patch http://review.whamcloud.com/10673 |
| Comment by John Hammond [ 11/Jun/14 ] |
|
I cherry-picked http://review.whamcloud.com/10673 onto today's master (2.5.59-79-g5c4573e) and I see the same LBUG with the same stack trace. |
| Comment by John Hammond [ 11/Jun/14 ] |
|
Regardless of LDLM issues, this FID is coming straight from the disk so we should not be asserting on the existence of the object. I know that some may want to leave this in until they understand the underlying issue but I would prefer to have a less crashy implementation of an already best-effort type operation like migration. |
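A minimal sketch of the non-asserting alternative described above, assuming simplified stand-ins for the real structures: `sketch_header`, `sketch_object`, `sketch_object_is_dir()` and the `SKETCH_*` constants are hypothetical names, and the flag and mode values are illustrative rather than copied from `lu_object.h`. The point is only that a missing object becomes an error the caller can handle rather than an LBUG.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; values are illustrative only. */
#define SKETCH_LOHA_EXISTS (1u << 0)
#define SKETCH_S_IFMT      0170000u /* file-type mask, as on Linux */
#define SKETCH_S_IFDIR     0040000u /* directory type bits, as on Linux */

struct sketch_header { uint32_t loh_attr; };
struct sketch_object { struct sketch_header *lo_header; };

/* Report whether the object is a directory.
 * Returns 0 and sets *is_dir on success, or -ENOENT if the object does
 * not exist (for example, the FID came from a dangling name entry).
 * *is_dir is left untouched on error. */
static int sketch_object_is_dir(const struct sketch_object *o, bool *is_dir)
{
	uint32_t attr = o->lo_header->loh_attr;

	if ((attr & SKETCH_LOHA_EXISTS) == 0)
		return -ENOENT;

	*is_dir = (attr & SKETCH_S_IFMT) == SKETCH_S_IFDIR;
	return 0;
}
```

A caller in the style of the snippet from the description would check the return value and bail out through its normal error path instead of reading the attribute unconditionally.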
| Comment by Alex Zhuravlev [ 11/Jun/14 ] |
|
my take on this is that it's better to crash and restart instead of corrupting a filesystem silently (potentially, given we don't understand everything at the moment). |
| Comment by Andreas Dilger [ 08/Jan/15 ] |
|
Hit this again running racer on my single-node test system (2x MDT, 3x OST). The stack is a little different than the original one, so I thought I'd post it here:

LustreError: 25072:0:(lu_object.h:859:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
LustreError: 25072:0:(lu_object.h:859:lu_object_attr()) LBUG
Pid: 25072, comm: mdt00_005
Call Trace:
[<ffffffffa13a7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa13a7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa1015e07>] lu_object_attr+0x47/0x50 [mdt]
[<ffffffffa10186b3>] mdt_reint_migrate_internal+0x6a3/0x1b50 [mdt]
[<ffffffffa101d7cb>] mdt_reint_rename_or_migrate+0x3cb/0x6c0 [mdt]
[<ffffffffa101dad3>] mdt_reint_migrate+0x13/0x20 [mdt]
[<ffffffffa1015fcd>] mdt_reint_rec+0x5d/0x200 [mdt]
[<ffffffffa0ffa19b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
[<ffffffffa0ffa9fb>] mdt_reint+0x6b/0x120 [mdt]
[<ffffffffa0c5398e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
[<ffffffffa0c03721>] ptlrpc_main+0xe41/0x1960 [ptlrpc]

In particular, this is failing in mdt_reint_migrate_internal() rather than in mdd_migrate_entries():

(gdb) list *(mdt_reint_migrate_internal+0x6a3)
0x266e3 is in mdt_reint_migrate_internal (/usr/src/lustre-head/lustre/mdt/mdt_reint.c:1518).
1508		mold = mdt_object_find(info->mti_env, info->mti_mdt, old_fid);
1509		if (IS_ERR(mold))
1510			GOTO(out_unlock_parent, rc = PTR_ERR(mold));
1511
1512		if (mdt_object_remote(mold)) {
1513			CERROR("%s: source "DFID" is on the remote MDT\n",
1514			       mdt_obd_name(info->mti_mdt), PFID(old_fid));
1515			GOTO(out_put_child, rc = -EREMOTE);
1516		}
1517
1518		if (S_ISREG(lu_object_attr(&mold->mot_obj)) &&
1519		    !mdt_object_remote(msrcdir)) {
1520			CERROR("%s: parent "DFID" is still on the same"
1521			       " MDT, which should be migrated first:"
1522			       " rc = %d\n", mdt_obd_name(info->mti_mdt),

Is mold locked at this point after mdd_object_find->lu_object_find? Otherwise it seems entirely possible to delete the object between the time it is looked up and when the LASSERT() trips in lu_object_attr(). |
| Comment by Di Wang [ 15/Jan/15 ] |
|
Well, the parent has been locked, and the mold object is obtained by name->FID lookup. If the name entry exists but the object (mold) does not, it means the name entry became a dangling entry during racer. |
| Comment by Di Wang [ 15/Jan/15 ] |
|
This is probably because we did not lock all of the children when migrating the directory. Hmm. |
| Comment by Lai Siyao [ 13/Apr/17 ] |
|
I hit this in a local test and am looking into it now. |
| Comment by Lai Siyao [ 14/Apr/17 ] |
|
This looks to be in mdd_migrate_entries() -> mdd_object_type(child), which asserts that child exists. That is not always true, because child is not locked. I'll push a patch soon. |
| Comment by Gerrit Updater [ 14/Apr/17 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/26620 |
| Comment by James Casper [ 14/Nov/17 ] |
|
Seen in master 2.10.55 b3667: https://testing.hpdd.intel.com/test_sessions/0a9fe899-314e-4874-bd82-8a966bc7ad88 |
| Comment by Bob Glossman (Inactive) [ 16/Nov/17 ] |
|
another on master: |
| Comment by Sarah Liu [ 29/Nov/17 ] |
|
on 2.10.2 |
| Comment by Gerrit Updater [ 04/Jan/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26620/ |
| Comment by Peter Jones [ 04/Jan/18 ] |
|
Lai, can this ticket be marked as resolved now, or is Di's patch still needed too? Peter |
| Comment by Lai Siyao [ 05/Jan/18 ] |
|
Peter, it can be marked as resolved now, and Di's patch is not needed IMHO. |
| Comment by Peter Jones [ 05/Jan/18 ] |
|
ok - thanks! |
| Comment by Minh Diep [ 12/Feb/18 ] |
|
+1 on b2_10 https://testing.hpdd.intel.com/test_sets/cbf69ae0-0ed3-11e8-a6ad-52540065bddc |
| Comment by Gerrit Updater [ 12/Feb/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31281 |
| Comment by Gerrit Updater [ 19/Mar/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31281/ |