[LU-11952] open+create resend can recreate a file after unlink Created: 11/Feb/19  Updated: 13/Dec/21  Resolved: 13/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Andriy Skulysh
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-11444 RPC resend may corrupt the data Resolved
is related to LU-11836 DOM read-open resend vs getattr deadlock Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing a LU-11444 with a racer, we found a crash related to the assert in this patch.

[  365.369900] Kernel panic - not syncing: LBUG
[  365.369902] CPU: 1 PID: 11263 Comm: mdt00_026 Tainted: G           OE  ------------   3.10.0-327.36.3.x3.0.22.x86_64 #1
[  365.369903] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  365.369906]  ffffffffa04d7a03 00000000df5aaa35 ffff880099457530 ffffffff81634b81
[  365.369908]  ffff8800994575b0 ffffffff8162e410 ffffffff00000008 ffff8800994575c0
[  365.369909]  ffff880099457560 00000000df5aaa35 ffffffffa08ed140 ffffffffa04e2525
[  365.369910] Call Trace:
[  365.369914]  [<ffffffff81634b81>] dump_stack+0x19/0x1b
[  365.369916]  [<ffffffff8162e410>] panic+0xd8/0x1e7
[  365.369924]  [<ffffffffa04b6013>] lbug_with_loc+0xb3/0xc0 [libcfs]
[  365.369968]  [<ffffffffa08b69a7>] tgt_add_reply_data+0x217/0x220 [ptlrpc]
[  365.370003]  [<ffffffffa08b8bc2>] tgt_last_rcvd_update+0x742/0xe40 [ptlrpc]
[  365.370036]  [<ffffffffa08bda82>] tgt_txn_stop_cb+0x1a2/0x4a0 [ptlrpc]
[  365.370045]  [<ffffffffa0a88b18>] ? osd_attr_set+0x218/0x960 [osd_ldiskfs]
[  365.370066]  [<ffffffffa05ebc43>] dt_txn_hook_stop+0x63/0x80 [obdclass]
[  365.370076]  [<ffffffffa0a8a5fd>] osd_trans_stop+0x12d/0x430 [osd_ldiskfs]
[  365.370096]  [<ffffffffa060993e>] ? lu_ucred+0x1e/0x30 [obdclass]
[  365.370104]  [<ffffffffa0d9eace>] ? mdd_changelog_ns_store+0x2e/0x650 [mdd]
[  365.370111]  [<ffffffffa0d325a2>] lod_trans_stop+0x52/0x890 [lod]
[  365.370118]  [<ffffffffa0dbc270>] mdd_trans_stop+0x20/0x34 [mdd]
[  365.370125]  [<ffffffffa0da6225>] mdd_create+0x1075/0x12a0 [mdd]
[  365.370138]  [<ffffffffa0c79062>] mdt_reint_open+0x1f92/0x2e00 [mdt]
[  365.370167]  [<ffffffffa0857da0>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc]
[  365.370179]  [<ffffffffa0c7a48d>] mdt_reconstruct_open+0x5bd/0xb80 [mdt]
[  365.370192]  [<ffffffffa0c7025c>] mdt_reconstruct+0x4c/0x160 [mdt]
[  365.370204]  [<ffffffffa0c4cbd8>] mdt_reint_internal+0x998/0xb30 [mdt]
[  365.370213]  [<ffffffffa0c4ced2>] mdt_intent_reint+0x162/0x420 [mdt]
[  365.370222]  [<ffffffffa0c508f5>] mdt_intent_opc+0x215/0xb30 [mdt]
[  365.370258]  [<ffffffffa085c480>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
[  365.370268]  [<ffffffffa0c58668>] mdt_intent_policy+0x138/0x320 [mdt]
[  365.370298]  [<ffffffffa0807373>] ldlm_lock_enqueue+0x343/0xae0 [ptlrpc]
[  365.370307]  [<ffffffffa04c9cc0>] ? cfs_hash_lookup+0x90/0xc0 [libcfs]
[  365.370339]  [<ffffffffa082ef23>] ldlm_handle_enqueue0+0x9c3/0x1840 [ptlrpc]
[  365.370375]  [<ffffffffa085c500>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[  365.370411]  [<ffffffffa08c0cd2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[  365.370443]  [<ffffffffa08c59e9>] tgt_request_handle+0x949/0x1240 [ptlrpc]
[  365.370480]  [<ffffffffa08676cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[  365.370519]  [<ffffffffa0864fa5>] ? ptlrpc_wait_event+0xa5/0x350 [ptlrpc]
[  365.370549]  [<ffffffffa086b00e>] ptlrpc_main+0xc1e/0x2300 [ptlrpc]
[  365.370551]  [<ffffffff810c22ee>] ? dequeue_task_fair+0x42e/0x640
[  365.370553]  [<ffffffff810bb7d5>] ? sched_clock_cpu+0x85/0xc0
[  365.370588]  [<ffffffffa086a3f0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
[  365.370590]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[  365.370593]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[  365.370595]  [<ffffffff81644fd8>] ret_from_fork+0x58/0x90
[  365.370597]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140

This assert have checked we have no xid step down for the slot.

Original idea was regression from a patch, but later we have decided it's bug in open replay code. Bug looks addressed to the valid situation when object may don't 'exist' or object fid is different than client requested.

        if (mdt_get_disposition(ldlm_rep, DISP_OPEN_CREATE)) {
                struct obd_export *exp = req->rq_export;
                /*
                 * We failed after creation, but we do not know in which step
                 * we failed. So try to check the child object.
                 */
                parent = mdt_object_find(env, mdt, rr->rr_fid1);
                if (IS_ERR(parent)) {
                        rc = PTR_ERR(parent);
                        LCONSOLE_WARN("Parent "DFID" lookup error %d."
                                      " Evicting client %s with export %s.\n",
                                      PFID(rr->rr_fid1), rc,
                                      obd_uuid2str(&exp->exp_client_uuid),
                                      obd_export_nid2str(exp));
                        mdt_export_evict(exp);
                        RETURN_EXIT;
                }

                child = mdt_object_find(env, mdt, rr->rr_fid2);
                if (IS_ERR(child)) {
                        rc = PTR_ERR(child);
                        LCONSOLE_WARN("cannot lookup child "DFID": rc = %d; "
                                      "evicting client %s with export %s\n",
                                      PFID(rr->rr_fid2), rc,
                                      obd_uuid2str(&exp->exp_client_uuid),
                                      obd_export_nid2str(exp));
                        mdt_object_put(env, parent);
                        mdt_export_evict(exp);
                        RETURN_EXIT;
                }

                if (unlikely(mdt_object_remote(child))) {
                        /* the child object was created on remote server */
                        if (!mdt_is_dne_client(exp)) {
                                /* Return -EIO for old client */
                                mdt_object_put(env, parent);
                                mdt_object_put(env, child);
                                GOTO(out, rc = -EIO);
                        }
                        repbody->mbo_fid1 = *rr->rr_fid2;
                        repbody->mbo_valid |= (OBD_MD_FLID | OBD_MD_MDS);
                        rc = 0;
                } else {
                        if (mdt_object_exists(child)) {
                                mdt_prep_ma_buf_from_rep(info, child, ma);
                                rc = mdt_attr_get_complex(info, child, ma);
                                if (rc == 0)
                                        rc = mdt_finish_open(info, parent,
                                                             child, open_flags,
                                                             1, ldlm_rep);
                        } else {
                                /* the child does not exist, we should do
                                 * regular open */
                                mdt_object_put(env, parent);
                                mdt_object_put(env, child);
                                GOTO(regular_open, 0); <<<
                        }
                }
                mdt_object_put(env, parent);
                mdt_object_put(env, child);

operation order restored from requests history hash
1) fileA created from client 1
2) fileA open(O_CREATE) from client 2
3) link to the client2 failed before reply send (from client view)
4) unlink(fileA) from client1, it have blocked until reply arrived to the client2
5) client2 started to resend a open(O_CREATE)
6) unlink(fileA) was finished
– it looks lu_site cache remove this object from cache in this time.
7) resend handled by MDT and found request->fid2 isn't exist, so full create needs.

---- thats all.

variant of this case.

a) without unlink - p.2 need to found an object by name and reply with different fid, so resend will don't found a child by fid2 from request (object created without LOHA_EXIST)



 Comments   
Comment by Alex Zhuravlev [ 11/Feb/19 ]

if fileA was opened (i.e. step 2 were executed except successful reply), then opencount is non-zero and unlink can't destroy the object.
at the moment I don't understand how opencount can get to 0 unless client2 is evicted.

Comment by Alex Zhuravlev [ 11/Feb/19 ]

forgot to mention, that open normally stores FID in last_rcvd so that reconstruct can find the object w/o name. IIRC

Comment by Mikhail Pershin [ 11/Feb/19 ]

Alex, I think the opened file is still open-unlinked but its FID is FID1 from create at step 1, while resend is made with FID2 from client2. At step 2 it found file by name and open it (its FID is FID1) but during resend name doesn't exist and new file is created with the same name and FID2. It looks like possible case but I don't see why this is critical error after all, that is valid race, how is it different from the case when unlink from client 1 comes earlier than open from client2?

Comment by Alexey Lyashkov [ 11/Feb/19 ]

Alex, reconstruct open isn't take any fid's from last_rcvd and don't store anything except a disposition.

struct lsd_reply_data {
        __u64 lrd_transno;      /* transaction number */
        __u64 lrd_xid;          /* transmission id */
        __u64 lrd_data;         /* per-operation data */
        __u32 lrd_result;       /* request result */
        __u32 lrd_client_gen;   /* client generation */
};
struct tg_reply_data {
        /** chain of reply data anchored in tg_export_data */
        struct list_head        trd_list;
        /** copy of on-disk reply data */
        struct lsd_reply_data   trd_reply;
        /** versions for Version Based Recovery */
        __u64                   trd_pre_versions[4];
        /** slot index in reply_data file */
        int                     trd_index;
        /** tag the client used */
        __u16                   trd_tag;
};

it will be best solution in case if reply cache will store an open handle and file xid, but nothing now.

Comment by Alexey Lyashkov [ 11/Feb/19 ]

Mikhail it's problem from recovery view and execute-once semantic. As different result will be in case fail hit and don't hit between unlink and resend.
object will created in case fail don't hit, but if crash it exist we don't have an object created. because create request can removed from replay queue due commit update and don't join into recovery. So it's not a valid race I think.
MDT have create an object when none expect to have this reply.

Comment by Alex Zhuravlev [ 11/Feb/19 ]

sure, I don't recall all the details, but we do have XID in mfd:

struct mdt_file_data {
	/**  portals handle must be first */
	struct portals_handle	mfd_open_handle;
	/** open mode provided by client */
	u64			mfd_open_flags;
	/** protected by med_open_lock */
	struct list_head	mfd_list;
	/** xid of the open request */
	__u64			mfd_xid;

which should be enough to find the object and reconstruct the reply?

Comment by Alexey Lyashkov [ 11/Feb/19 ]

Alex, current code have mfd search too later after req->r_fid2 checks. It's problem. As about solution, same way was chosen but several problems exist.
1) mfd was don't exist with resend after recovery. if no open lock provided. client start create before fail, mdt failed before reply send. Client have think this is resend but oops. But say way will be in case client able to close this file in situation.
client doing open, link fail, client have resend a open and send close immediately. But server able to delivery a reply (server don't know about link flap, it's just client side view). Reply arrived to the same open request as XID isn't changed.
Based on LNet ideas - we should don't have assumptions about request order processing - so processing close before open resend is possible. mfd will destroyed in this case.. OOPS.
so we need to separate mfd close vs mfd don't created.

2) mfd list is large in most cases, scanning this list can take a lot time a specially if it twice (second one in mdt_finish_open).

3) I worry about remote create in open(O_CREATE) case - what we should do in this case? did it possible to return different fid as provided as fid2 in client open?

as about 1/2 it looks solution exist. tg_reply_data is completely 'in memory' object so easy to extent to the two additional fid.
first one - mfd_exist, should resolve a problem with (1) and separate resend after replay (should be handled as replay) and clear resend.
second one - open handle - support to fast search mfd via mdt_handle2mfd.

(several hack's / optimisations possible - as open is schedule difficult reply always so we able to separate reply is delivered or not in mdt_steal_locks which scan for outstanding difficult reply's)

Comment by Alexey Lyashkov [ 12/Feb/19 ]

cross open(O_CREATE) don't have mfd also, it have create an object on remote mdt but no MFD was created in same time (mdt_finish_open skip).

So good to save a real child fid in this case to avoid lookup by name and side effects of this change.

Comment by Gerrit Updater [ 15/Feb/19 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch:
https://review.whamcloud.com/34266
Subject: LU-11952 mdt: fix reconstruct_open
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 02ff43d8954460d43228fd866c274fdda81bfdc6

Comment by Gerrit Updater [ 09/Jun/19 ]

Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/35112
Subject: LU-11952 mdt: fix reconstruct open
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c0d28df84c5b946e1f8f35134a78b6e567ef5e15

Comment by Cory Spitz [ 20/Oct/19 ]

I understand that https://review.whamcloud.com/35112 is moving slowly, but it should be considered for L2.13 if at all possible.

Comment by Cory Spitz [ 26/Feb/20 ]

I guess we can shoot for 2.14.0 now.

Comment by Cory Spitz [ 26/Mar/21 ]

2.15.0?

Comment by Gerrit Updater [ 13/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/35112/
Subject: LU-11952 mdt: fix reconstruct open
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cf6ce3329f92f146206be8b63846d2c6e45c92d6

Comment by Peter Jones [ 13/Dec/21 ]

Landed for 2.15

Generated at Sat Feb 10 02:48:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.