[LU-5069] Hit LBUG in DNE racer test: (lu_object.h:852:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed Created: 15/May/14  Updated: 03/Jun/14  Resolved: 03/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Critical
Reporter: Sarah Liu Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None
Environment:

2 MDSs with 4 MDTs
8 OSTs
1 client
build # http://build.whamcloud.com/job/lustre-reviews/23893/


Severity: 3
Rank (Obsolete): 13997

 Description   

MDS2 hit the following LBUG:

LustreError: 2958:0:(mdd_dir.c:3954:mdd_migrate()) Skipped 15 previous similar messages
LustreError: 2830:0:(lu_object.h:852:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: 
LustreError: 2830:0:(lu_object.h:852:lu_object_attr()) LBUG
Pid: 2830, comm: mdt00_003

Call Trace:

 [<ffffffffa0399895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0399e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0f3ae97>] mdd_is_subdir+0x277/0x280 [mdd]
 [<ffffffffa0e0f2ef>] mdt_rename_sanity+0xff/0x4a0 [mdt]
 [<ffffffffa0e1321c>] mdt_reint_rename_internal+0xdc/0x1a80 [mdt]
 [<ffffffffa06e46f8>] ? ldlm_lock_enqueue+0x1c8/0x930 [ptlrpc]
 [<ffffffffa0703edb>] ? ldlm_cli_enqueue_local+0x28b/0x5e0 [ptlrpc]
 [<ffffffffa0e14e04>] mdt_reint_rename_or_migrate+0x244/0x660 [mdt]
 [<ffffffffa0702bc0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa0704230>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
 [<ffffffffa0e15250>] mdt_reint_rename+0x10/0x20 [mdt]
 [<ffffffffa0e0d881>] mdt_reint_rec+0x41/0xe0 [mdt]
 [<ffffffffa0df2e93>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
 [<ffffffffa0df371b>] mdt_reint+0x6b/0x120 [mdt]
 [<ffffffffa078fe5c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa073faea>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffffa073edd0>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff8109ab56>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1400188459.2830

Message from syslogd@client-18 at May 15 14:14:19 ...
 kernel:LustreError: 2830:0:(lu_object.h:852:lu_object_attr()) ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed: 

Message from syslogd@client-18 at May 15 14:14:19 ...
 kernel:LustreError: 2830:0:(lu_object.h:852:lu_object_attr()) LBUG


 Comments   
Comment by Di Wang [ 15/May/14 ]

This seems to be caused by the following change:

LU-4725 mdt: child-parent lock ordering in rename

change rename so that it always has parent-child lock ordering,
otherwise it may deadlock with other operations.

Signed-off-by: Vitaly Fertman <vitaly_fertman@xyratex.com>
Signed-off-by: Hongchao Zhang <hongchao.zhang@intel.com>
Change-Id: If676da82ca50a20a4bb3aadef0f81c9c5ed3cbcb
Xyratex-bug-id: MRP-1700
Reviewed-on: http://review.whamcloud.com/9538
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: wangdi <di.wang@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

Hmm, running mdt_rename_sanity() without ldlm lock protection seems a bit risky. At the least, it needs to check whether the object exists before calling mdo_is_subdir():

static int mdt_rename_sanity(struct mdt_thread_info *info,
                             const struct lu_fid *dir_fid,
                             const struct lu_fid *fid)
{
        ...

        /* <--- check whether the object (dst) exists here */
        rc = mdo_is_subdir(info->mti_env,
                           mdt_object_child(dst), fid,
                           &dst_fid);
        mdt_object_put(info->mti_env, dst);
}

I will cook a patch.
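
The idea described above (test for existence first, and only then walk the ancestry) can be illustrated with a small standalone sketch. This is not the patch from review 10340 and it does not use Lustre code: `mock_object`, `MOCK_LOHA_EXISTS`, `mock_object_exists()`, `mock_is_subdir()` and `mock_rename_sanity()` are hypothetical stand-ins, and the `-ENOENT`/`-EINVAL` return values are assumptions chosen for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are Lustre definitions. */
#define MOCK_LOHA_EXISTS 0x1u

struct mock_object {
        unsigned int attr;                /* flag bits, playing the role of loh_attr */
        const struct mock_object *parent; /* NULL at the root */
};

/* Non-asserting existence test: safe to call on an object that raced away. */
static int mock_object_exists(const struct mock_object *o)
{
        return o != NULL && (o->attr & MOCK_LOHA_EXISTS) != 0;
}

/* Returns 1 if 'anc' is 'o' itself or one of its ancestors, 0 otherwise.
 * Like the code path that hit the assertion, it assumes 'o' exists. */
static int mock_is_subdir(const struct mock_object *o,
                          const struct mock_object *anc)
{
        for (; o != NULL; o = o->parent)
                if (o == anc)
                        return 1;
        return 0;
}

/* Sanity check for moving 'src' under 'dst'.
 * Assumed return values: 0 if allowed, -ENOENT if 'dst' no longer exists,
 * -EINVAL if 'dst' lies inside 'src'. */
static int mock_rename_sanity(const struct mock_object *dst,
                              const struct mock_object *src)
{
        /* Check existence first, so a concurrently removed object
         * yields an error instead of tripping an assertion. */
        if (!mock_object_exists(dst))
                return -ENOENT;

        if (mock_is_subdir(dst, src))
                return -EINVAL;

        return 0;
}
```

With this ordering, an object whose existence flag has been cleared by a concurrent operation produces an ordinary error return, and the subdirectory walk is only reached for objects that are known to exist at the time of the check.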

Comment by Di Wang [ 15/May/14 ]

http://review.whamcloud.com/10340

Comment by Peter Jones [ 03/Jun/14 ]

Landed for 2.6

Generated at Sat Feb 10 01:48:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.