[LU-11952] open+create resend can recreate a file after unlink Created: 11/Feb/19 Updated: 13/Dec/21 Resolved: 13/Dec/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | Andriy Skulysh |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While testing, the following LBUG was hit on the MDS:

[ 365.369900] Kernel panic - not syncing: LBUG
[ 365.369902] CPU: 1 PID: 11263 Comm: mdt00_026 Tainted: G OE ------------ 3.10.0-327.36.3.x3.0.22.x86_64 #1
[ 365.369903] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 365.369906] ffffffffa04d7a03 00000000df5aaa35 ffff880099457530 ffffffff81634b81
[ 365.369908] ffff8800994575b0 ffffffff8162e410 ffffffff00000008 ffff8800994575c0
[ 365.369909] ffff880099457560 00000000df5aaa35 ffffffffa08ed140 ffffffffa04e2525
[ 365.369910] Call Trace:
[ 365.369914] [<ffffffff81634b81>] dump_stack+0x19/0x1b
[ 365.369916] [<ffffffff8162e410>] panic+0xd8/0x1e7
[ 365.369924] [<ffffffffa04b6013>] lbug_with_loc+0xb3/0xc0 [libcfs]
[ 365.369968] [<ffffffffa08b69a7>] tgt_add_reply_data+0x217/0x220 [ptlrpc]
[ 365.370003] [<ffffffffa08b8bc2>] tgt_last_rcvd_update+0x742/0xe40 [ptlrpc]
[ 365.370036] [<ffffffffa08bda82>] tgt_txn_stop_cb+0x1a2/0x4a0 [ptlrpc]
[ 365.370045] [<ffffffffa0a88b18>] ? osd_attr_set+0x218/0x960 [osd_ldiskfs]
[ 365.370066] [<ffffffffa05ebc43>] dt_txn_hook_stop+0x63/0x80 [obdclass]
[ 365.370076] [<ffffffffa0a8a5fd>] osd_trans_stop+0x12d/0x430 [osd_ldiskfs]
[ 365.370096] [<ffffffffa060993e>] ? lu_ucred+0x1e/0x30 [obdclass]
[ 365.370104] [<ffffffffa0d9eace>] ? mdd_changelog_ns_store+0x2e/0x650 [mdd]
[ 365.370111] [<ffffffffa0d325a2>] lod_trans_stop+0x52/0x890 [lod]
[ 365.370118] [<ffffffffa0dbc270>] mdd_trans_stop+0x20/0x34 [mdd]
[ 365.370125] [<ffffffffa0da6225>] mdd_create+0x1075/0x12a0 [mdd]
[ 365.370138] [<ffffffffa0c79062>] mdt_reint_open+0x1f92/0x2e00 [mdt]
[ 365.370167] [<ffffffffa0857da0>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc]
[ 365.370179] [<ffffffffa0c7a48d>] mdt_reconstruct_open+0x5bd/0xb80 [mdt]
[ 365.370192] [<ffffffffa0c7025c>] mdt_reconstruct+0x4c/0x160 [mdt]
[ 365.370204] [<ffffffffa0c4cbd8>] mdt_reint_internal+0x998/0xb30 [mdt]
[ 365.370213] [<ffffffffa0c4ced2>] mdt_intent_reint+0x162/0x420 [mdt]
[ 365.370222] [<ffffffffa0c508f5>] mdt_intent_opc+0x215/0xb30 [mdt]
[ 365.370258] [<ffffffffa085c480>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
[ 365.370268] [<ffffffffa0c58668>] mdt_intent_policy+0x138/0x320 [mdt]
[ 365.370298] [<ffffffffa0807373>] ldlm_lock_enqueue+0x343/0xae0 [ptlrpc]
[ 365.370307] [<ffffffffa04c9cc0>] ? cfs_hash_lookup+0x90/0xc0 [libcfs]
[ 365.370339] [<ffffffffa082ef23>] ldlm_handle_enqueue0+0x9c3/0x1840 [ptlrpc]
[ 365.370375] [<ffffffffa085c500>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[ 365.370411] [<ffffffffa08c0cd2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 365.370443] [<ffffffffa08c59e9>] tgt_request_handle+0x949/0x1240 [ptlrpc]
[ 365.370480] [<ffffffffa08676cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[ 365.370519] [<ffffffffa0864fa5>] ? ptlrpc_wait_event+0xa5/0x350 [ptlrpc]
[ 365.370549] [<ffffffffa086b00e>] ptlrpc_main+0xc1e/0x2300 [ptlrpc]
[ 365.370551] [<ffffffff810c22ee>] ? dequeue_task_fair+0x42e/0x640
[ 365.370553] [<ffffffff810bb7d5>] ? sched_clock_cpu+0x85/0xc0
[ 365.370588] [<ffffffffa086a3f0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
[ 365.370590] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[ 365.370593] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[ 365.370595] [<ffffffff81644fd8>] ret_from_fork+0x58/0x90
[ 365.370597] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140

This assert checks that the xid never steps down within a reply slot. The original idea was that this is a regression from a patch, but later we decided it is a bug in the open replay code.
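For clarity, here is a minimal userspace model of the invariant the LBUG enforces; this is not the Lustre source, and reply_slot/slot_update() are hypothetical stand-ins for the per-slot bookkeeping in tgt_add_reply_data():

#include <assert.h>
#include <stdint.h>

/* one reply slot per client slot/tag; lrd_xid mirrors lsd_reply_data */
struct reply_slot {
	uint64_t lrd_xid;	/* highest xid recorded in this slot */
};

/* record a new reply in the slot; the assertion models the "no xid
 * step down" rule the server LBUGs on */
static void slot_update(struct reply_slot *slot, uint64_t new_xid)
{
	assert(new_xid >= slot->lrd_xid);
	slot->lrd_xid = new_xid;
}

int main(void)
{
	struct reply_slot slot = { .lrd_xid = 0 };

	slot_update(&slot, 100);	/* original execution */
	slot_update(&slot, 99);		/* a re-executed resend carrying an
					 * older xid trips the assertion */
	return 0;
}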
The bug appears to cover a valid situation in which the object may not 'exist', or the object FID differs from what the client requested.
if (mdt_get_disposition(ldlm_rep, DISP_OPEN_CREATE)) {
	struct obd_export *exp = req->rq_export;
	/*
	 * We failed after creation, but we do not know in which step
	 * we failed. So try to check the child object.
	 */
	parent = mdt_object_find(env, mdt, rr->rr_fid1);
	if (IS_ERR(parent)) {
		rc = PTR_ERR(parent);
		LCONSOLE_WARN("Parent "DFID" lookup error %d."
			      " Evicting client %s with export %s.\n",
			      PFID(rr->rr_fid1), rc,
			      obd_uuid2str(&exp->exp_client_uuid),
			      obd_export_nid2str(exp));
		mdt_export_evict(exp);
		RETURN_EXIT;
	}
	child = mdt_object_find(env, mdt, rr->rr_fid2);
	if (IS_ERR(child)) {
		rc = PTR_ERR(child);
		LCONSOLE_WARN("cannot lookup child "DFID": rc = %d; "
			      "evicting client %s with export %s\n",
			      PFID(rr->rr_fid2), rc,
			      obd_uuid2str(&exp->exp_client_uuid),
			      obd_export_nid2str(exp));
		mdt_object_put(env, parent);
		mdt_export_evict(exp);
		RETURN_EXIT;
	}
	if (unlikely(mdt_object_remote(child))) {
		/* the child object was created on remote server */
		if (!mdt_is_dne_client(exp)) {
			/* Return -EIO for old client */
			mdt_object_put(env, parent);
			mdt_object_put(env, child);
			GOTO(out, rc = -EIO);
		}
		repbody->mbo_fid1 = *rr->rr_fid2;
		repbody->mbo_valid |= (OBD_MD_FLID | OBD_MD_MDS);
		rc = 0;
	} else {
		if (mdt_object_exists(child)) {
			mdt_prep_ma_buf_from_rep(info, child, ma);
			rc = mdt_attr_get_complex(info, child, ma);
			if (rc == 0)
				rc = mdt_finish_open(info, parent,
						     child, open_flags,
						     1, ldlm_rep);
		} else {
			/* the child does not exist, we should do
			 * regular open */
			mdt_object_put(env, parent);
			mdt_object_put(env, child);
			GOTO(regular_open, 0);   <<<
		}
	}
	mdt_object_put(env, parent);
	mdt_object_put(env, child);
The operation order was restored from the request history hash; that is the whole sequence. A variant of this case: a) without the unlink, point 2 needs to find the object by name and reply with a different FID, so the resend will not find a child by the fid2 from the request (the object is created without LOHA_EXIST). |
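To make the execute-once violation concrete, here is a toy userspace model of the race described above; it mimics only a single name-to-FID binding, and every name in it is hypothetical:

#include <stdio.h>

static unsigned long name_fid;		/* fid bound to the name, 0 = no entry */
static unsigned long next_fid = 1;

/* the DISP_OPEN_CREATE path: create the name if it does not exist */
static unsigned long do_open_create(void)
{
	if (name_fid == 0)
		name_fid = next_fid++;
	return name_fid;
}

static void do_unlink(void)
{
	name_fid = 0;
}

int main(void)
{
	unsigned long first, resend;

	first = do_open_create();	/* 1: executed on the MDT, reply lost */
	do_unlink();			/* 2: another client unlinks the name */
	resend = do_open_create();	/* 3: the resend is re-executed instead
					 * of reconstructed, recreating the
					 * file after the unlink */

	printf("fid %lu was unlinked, yet the name exists again as fid %lu\n",
	       first, resend);
	return 0;
}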
| Comments |
| Comment by Alex Zhuravlev [ 11/Feb/19 ] |
|
If fileA was opened (i.e. step 2 was executed, except for the successful reply), then the open count is non-zero and unlink can't destroy the object. |
| Comment by Alex Zhuravlev [ 11/Feb/19 ] |
|
Forgot to mention that open normally stores the FID in last_rcvd so that reconstruction can find the object without the name. IIRC |
| Comment by Mikhail Pershin [ 11/Feb/19 ] |
|
Alex, I think the opened file is still open-unlinked, but its FID is FID1 from the create at step 1, while the resend is made with FID2 from client2. At step 2 the server found the file by name and opened it (its FID is FID1), but during the resend the name no longer exists, and a new file is created with the same name and FID2. It looks like a possible case, but I don't see why this is a critical error after all; it is a valid race. How is it different from the case when the unlink from client1 comes earlier than the open from client2? |
| Comment by Alexey Lyashkov [ 11/Feb/19 ] |
|
Alex, open reconstruction doesn't take any FIDs from last_rcvd and doesn't store anything except a disposition.
struct lsd_reply_data {
	__u64	lrd_transno;	/* transaction number */
	__u64	lrd_xid;	/* transmission id */
	__u64	lrd_data;	/* per-operation data */
	__u32	lrd_result;	/* request result */
	__u32	lrd_client_gen;	/* client generation */
};

struct tg_reply_data {
	/** chain of reply data anchored in tg_export_data */
	struct list_head	trd_list;
	/** copy of on-disk reply data */
	struct lsd_reply_data	trd_reply;
	/** versions for Version Based Recovery */
	__u64			trd_pre_versions[4];
	/** slot index in reply_data file */
	int			trd_index;
	/** tag the client used */
	__u16			trd_tag;
};
The best solution would be for the reply cache to store an open handle and the file xid, but it stores neither now. |
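A minimal sketch of that direction, reusing the structures quoted above: the trd_open_handle and trd_object_fid fields are hypothetical illustrations of what the reply cache could carry, not the fields the eventual patch added; struct lu_fid and struct lustre_handle are abbreviated from the Lustre headers for self-containment.

struct lu_fid {
	__u64	f_seq;	/* sequence */
	__u32	f_oid;	/* object id within the sequence */
	__u32	f_ver;	/* version */
};

struct lustre_handle {
	__u64	cookie;
};

struct tg_reply_data {
	struct list_head	trd_list;
	struct lsd_reply_data	trd_reply;
	__u64			trd_pre_versions[4];
	int			trd_index;
	__u16			trd_tag;
	/* hypothetical in-memory additions: enough state for open
	 * reconstruction to find the opened/created object without
	 * a lookup by name */
	struct lustre_handle	trd_open_handle;
	struct lu_fid		trd_object_fid;
};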
| Comment by Alexey Lyashkov [ 11/Feb/19 ] |
|
Mikhail, it is a problem from the recovery point of view and for execute-once semantics: the result will differ depending on whether a failure hits between the unlink and the resend or not. |
| Comment by Alex Zhuravlev [ 11/Feb/19 ] |
|
Sure, I don't recall all the details, but we do have the XID in the mfd:
struct mdt_file_data {
	/** portals handle must be first */
	struct portals_handle	mfd_open_handle;
	/** open mode provided by client */
	u64			mfd_open_flags;
	/** protected by med_open_lock */
	struct list_head	mfd_list;
	/** xid of the open request */
	__u64			mfd_xid;
which should be enough to find the object and reconstruct the reply? |
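A sketch of what such a lookup could look like, assuming mfd_list is anchored in the export's med_open_head under med_open_lock (as in mdt_export_data) and that the mfd also records the opened object; mdt_mfd_lookup_by_xid() is a hypothetical helper, not an existing function:

/* hypothetical helper: find the mfd the original execution of a resent
 * open created, by matching the request xid against mfd_xid */
static struct mdt_file_data *
mdt_mfd_lookup_by_xid(struct mdt_export_data *med, __u64 xid)
{
	struct mdt_file_data *mfd;

	spin_lock(&med->med_open_lock);
	list_for_each_entry(mfd, &med->med_open_head, mfd_list) {
		if (mfd->mfd_xid == xid) {
			spin_unlock(&med->med_open_lock);
			/* mfd->mfd_object identifies the opened child,
			 * so the reply can be rebuilt without rr_fid2 */
			return mfd;
		}
	}
	spin_unlock(&med->med_open_lock);
	return NULL;
}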
| Comment by Alexey Lyashkov [ 11/Feb/19 ] |
|
Alex, the current code does the mfd search too late, after the rr_fid2 checks; that is the problem. As for a solution, the same approach was chosen, but several problems exist. 2) The mfd list is large in most cases; scanning it can take a lot of time, especially if it is scanned twice (the second time in mdt_finish_open). 3) I worry about remote create in the open(O_CREAT) case: what should we do there? Is it possible to return a FID different from the fid2 provided in the client's open? As for 1/2, a solution looks possible: tg_reply_data is a completely in-memory object, so it is easy to extend with the two additional FIDs. (Several hacks/optimisations are possible: since open always schedules a difficult reply, we are able to tell whether the reply was delivered or not in mdt_steal_locks, which scans for outstanding difficult replies.) |
| Comment by Alexey Lyashkov [ 12/Feb/19 ] |
|
A cross-MDT open(O_CREAT) doesn't have an mfd either: it creates an object on the remote MDT, but no mfd is created at the same time (mdt_finish_open is skipped). So it would be good to save the real child FID in this case, to avoid the lookup by name and the side effects of this change. |
| Comment by Gerrit Updater [ 15/Feb/19 ] |
|
|
| Comment by Gerrit Updater [ 09/Jun/19 ] |
|
Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/35112 |
| Comment by Cory Spitz [ 20/Oct/19 ] |
|
I understand that https://review.whamcloud.com/35112 is moving slowly, but it should be considered for 2.13 if at all possible. |
| Comment by Cory Spitz [ 26/Feb/20 ] |
|
I guess we can shoot for 2.14.0 now. |
| Comment by Cory Spitz [ 26/Mar/21 ] |
|
2.15.0? |
| Comment by Gerrit Updater [ 13/Dec/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/35112/ |
| Comment by Peter Jones [ 13/Dec/21 ] |
|
Landed for 2.15 |