Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11952

open+create resend can recreate a file after unlink

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.15.0
    • Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0
    • 3
    • 9223372036854775807

    Description

      While testing a LU-11444 with a racer, we found a crash related to the assert in this patch.

      [  365.369900] Kernel panic - not syncing: LBUG
      [  365.369902] CPU: 1 PID: 11263 Comm: mdt00_026 Tainted: G           OE  ------------   3.10.0-327.36.3.x3.0.22.x86_64 #1
      [  365.369903] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  365.369906]  ffffffffa04d7a03 00000000df5aaa35 ffff880099457530 ffffffff81634b81
      [  365.369908]  ffff8800994575b0 ffffffff8162e410 ffffffff00000008 ffff8800994575c0
      [  365.369909]  ffff880099457560 00000000df5aaa35 ffffffffa08ed140 ffffffffa04e2525
      [  365.369910] Call Trace:
      [  365.369914]  [<ffffffff81634b81>] dump_stack+0x19/0x1b
      [  365.369916]  [<ffffffff8162e410>] panic+0xd8/0x1e7
      [  365.369924]  [<ffffffffa04b6013>] lbug_with_loc+0xb3/0xc0 [libcfs]
      [  365.369968]  [<ffffffffa08b69a7>] tgt_add_reply_data+0x217/0x220 [ptlrpc]
      [  365.370003]  [<ffffffffa08b8bc2>] tgt_last_rcvd_update+0x742/0xe40 [ptlrpc]
      [  365.370036]  [<ffffffffa08bda82>] tgt_txn_stop_cb+0x1a2/0x4a0 [ptlrpc]
      [  365.370045]  [<ffffffffa0a88b18>] ? osd_attr_set+0x218/0x960 [osd_ldiskfs]
      [  365.370066]  [<ffffffffa05ebc43>] dt_txn_hook_stop+0x63/0x80 [obdclass]
      [  365.370076]  [<ffffffffa0a8a5fd>] osd_trans_stop+0x12d/0x430 [osd_ldiskfs]
      [  365.370096]  [<ffffffffa060993e>] ? lu_ucred+0x1e/0x30 [obdclass]
      [  365.370104]  [<ffffffffa0d9eace>] ? mdd_changelog_ns_store+0x2e/0x650 [mdd]
      [  365.370111]  [<ffffffffa0d325a2>] lod_trans_stop+0x52/0x890 [lod]
      [  365.370118]  [<ffffffffa0dbc270>] mdd_trans_stop+0x20/0x34 [mdd]
      [  365.370125]  [<ffffffffa0da6225>] mdd_create+0x1075/0x12a0 [mdd]
      [  365.370138]  [<ffffffffa0c79062>] mdt_reint_open+0x1f92/0x2e00 [mdt]
      [  365.370167]  [<ffffffffa0857da0>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc]
      [  365.370179]  [<ffffffffa0c7a48d>] mdt_reconstruct_open+0x5bd/0xb80 [mdt]
      [  365.370192]  [<ffffffffa0c7025c>] mdt_reconstruct+0x4c/0x160 [mdt]
      [  365.370204]  [<ffffffffa0c4cbd8>] mdt_reint_internal+0x998/0xb30 [mdt]
      [  365.370213]  [<ffffffffa0c4ced2>] mdt_intent_reint+0x162/0x420 [mdt]
      [  365.370222]  [<ffffffffa0c508f5>] mdt_intent_opc+0x215/0xb30 [mdt]
      [  365.370258]  [<ffffffffa085c480>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
      [  365.370268]  [<ffffffffa0c58668>] mdt_intent_policy+0x138/0x320 [mdt]
      [  365.370298]  [<ffffffffa0807373>] ldlm_lock_enqueue+0x343/0xae0 [ptlrpc]
      [  365.370307]  [<ffffffffa04c9cc0>] ? cfs_hash_lookup+0x90/0xc0 [libcfs]
      [  365.370339]  [<ffffffffa082ef23>] ldlm_handle_enqueue0+0x9c3/0x1840 [ptlrpc]
      [  365.370375]  [<ffffffffa085c500>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      [  365.370411]  [<ffffffffa08c0cd2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [  365.370443]  [<ffffffffa08c59e9>] tgt_request_handle+0x949/0x1240 [ptlrpc]
      [  365.370480]  [<ffffffffa08676cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [  365.370519]  [<ffffffffa0864fa5>] ? ptlrpc_wait_event+0xa5/0x350 [ptlrpc]
      [  365.370549]  [<ffffffffa086b00e>] ptlrpc_main+0xc1e/0x2300 [ptlrpc]
      [  365.370551]  [<ffffffff810c22ee>] ? dequeue_task_fair+0x42e/0x640
      [  365.370553]  [<ffffffff810bb7d5>] ? sched_clock_cpu+0x85/0xc0
      [  365.370588]  [<ffffffffa086a3f0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
      [  365.370590]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [  365.370593]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [  365.370595]  [<ffffffff81644fd8>] ret_from_fork+0x58/0x90
      [  365.370597]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      

      This assert have checked we have no xid step down for the slot.

      Original idea was regression from a patch, but later we have decided it's bug in open replay code. Bug looks addressed to the valid situation when object may don't 'exist' or object fid is different than client requested.

              if (mdt_get_disposition(ldlm_rep, DISP_OPEN_CREATE)) {
                      struct obd_export *exp = req->rq_export;
                      /*
                       * We failed after creation, but we do not know in which step
                       * we failed. So try to check the child object.
                       */
                      parent = mdt_object_find(env, mdt, rr->rr_fid1);
                      if (IS_ERR(parent)) {
                              rc = PTR_ERR(parent);
                              LCONSOLE_WARN("Parent "DFID" lookup error %d."
                                            " Evicting client %s with export %s.\n",
                                            PFID(rr->rr_fid1), rc,
                                            obd_uuid2str(&exp->exp_client_uuid),
                                            obd_export_nid2str(exp));
                              mdt_export_evict(exp);
                              RETURN_EXIT;
                      }
      
                      child = mdt_object_find(env, mdt, rr->rr_fid2);
                      if (IS_ERR(child)) {
                              rc = PTR_ERR(child);
                              LCONSOLE_WARN("cannot lookup child "DFID": rc = %d; "
                                            "evicting client %s with export %s\n",
                                            PFID(rr->rr_fid2), rc,
                                            obd_uuid2str(&exp->exp_client_uuid),
                                            obd_export_nid2str(exp));
                              mdt_object_put(env, parent);
                              mdt_export_evict(exp);
                              RETURN_EXIT;
                      }
      
                      if (unlikely(mdt_object_remote(child))) {
                              /* the child object was created on remote server */
                              if (!mdt_is_dne_client(exp)) {
                                      /* Return -EIO for old client */
                                      mdt_object_put(env, parent);
                                      mdt_object_put(env, child);
                                      GOTO(out, rc = -EIO);
                              }
                              repbody->mbo_fid1 = *rr->rr_fid2;
                              repbody->mbo_valid |= (OBD_MD_FLID | OBD_MD_MDS);
                              rc = 0;
                      } else {
                              if (mdt_object_exists(child)) {
                                      mdt_prep_ma_buf_from_rep(info, child, ma);
                                      rc = mdt_attr_get_complex(info, child, ma);
                                      if (rc == 0)
                                              rc = mdt_finish_open(info, parent,
                                                                   child, open_flags,
                                                                   1, ldlm_rep);
                              } else {
                                      /* the child does not exist, we should do
                                       * regular open */
                                      mdt_object_put(env, parent);
                                      mdt_object_put(env, child);
                                      GOTO(regular_open, 0); <<<
                              }
                      }
                      mdt_object_put(env, parent);
                      mdt_object_put(env, child);
      

      operation order restored from requests history hash
      1) fileA created from client 1
      2) fileA open(O_CREATE) from client 2
      3) link to the client2 failed before reply send (from client view)
      4) unlink(fileA) from client1, it have blocked until reply arrived to the client2
      5) client2 started to resend a open(O_CREATE)
      6) unlink(fileA) was finished
      – it looks lu_site cache remove this object from cache in this time.
      7) resend handled by MDT and found request->fid2 isn't exist, so full create needs.

      ---- thats all.

      variant of this case.

      a) without unlink - p.2 need to found an object by name and reply with different fid, so resend will don't found a child by fid2 from request (object created without LOHA_EXIST)

      Attachments

        Issue Links

          Activity

            [LU-11952] open+create resend can recreate a file after unlink
            pjones Peter Jones added a comment -

            Landed for 2.15

            pjones Peter Jones added a comment - Landed for 2.15

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/35112/
            Subject: LU-11952 mdt: fix reconstruct open
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cf6ce3329f92f146206be8b63846d2c6e45c92d6

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/35112/ Subject: LU-11952 mdt: fix reconstruct open Project: fs/lustre-release Branch: master Current Patch Set: Commit: cf6ce3329f92f146206be8b63846d2c6e45c92d6
            spitzcor Cory Spitz added a comment -

            2.15.0?

            spitzcor Cory Spitz added a comment - 2.15.0?
            spitzcor Cory Spitz added a comment -

            I guess we can shoot for 2.14.0 now.

            spitzcor Cory Spitz added a comment - I guess we can shoot for 2.14.0 now.
            spitzcor Cory Spitz added a comment -

            I understand that https://review.whamcloud.com/35112 is moving slowly, but it should be considered for L2.13 if at all possible.

            spitzcor Cory Spitz added a comment - I understand that https://review.whamcloud.com/35112 is moving slowly, but it should be considered for L2.13 if at all possible.

            Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/35112
            Subject: LU-11952 mdt: fix reconstruct open
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c0d28df84c5b946e1f8f35134a78b6e567ef5e15

            gerrit Gerrit Updater added a comment - Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/35112 Subject: LU-11952 mdt: fix reconstruct open Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: c0d28df84c5b946e1f8f35134a78b6e567ef5e15
            gerrit Gerrit Updater added a comment - - edited

            Alexey Lyashkov (c17817@cray.com) uploaded a new patch:
            https://review.whamcloud.com/34266
            Subject: LU-11952 mdt: fix reconstruct_open
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 02ff43d8954460d43228fd866c274fdda81bfdc6

            gerrit Gerrit Updater added a comment - - edited Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/34266 Subject: LU-11952 mdt: fix reconstruct_open Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 02ff43d8954460d43228fd866c274fdda81bfdc6

            cross open(O_CREATE) don't have mfd also, it have create an object on remote mdt but no MFD was created in same time (mdt_finish_open skip).

            So good to save a real child fid in this case to avoid lookup by name and side effects of this change.

            shadow Alexey Lyashkov added a comment - cross open(O_CREATE) don't have mfd also, it have create an object on remote mdt but no MFD was created in same time (mdt_finish_open skip). So good to save a real child fid in this case to avoid lookup by name and side effects of this change.
            shadow Alexey Lyashkov added a comment - - edited

            Alex, current code have mfd search too later after req->r_fid2 checks. It's problem. As about solution, same way was chosen but several problems exist.
            1) mfd was don't exist with resend after recovery. if no open lock provided. client start create before fail, mdt failed before reply send. Client have think this is resend but oops. But say way will be in case client able to close this file in situation.
            client doing open, link fail, client have resend a open and send close immediately. But server able to delivery a reply (server don't know about link flap, it's just client side view). Reply arrived to the same open request as XID isn't changed.
            Based on LNet ideas - we should don't have assumptions about request order processing - so processing close before open resend is possible. mfd will destroyed in this case.. OOPS.
            so we need to separate mfd close vs mfd don't created.

            2) mfd list is large in most cases, scanning this list can take a lot time a specially if it twice (second one in mdt_finish_open).

            3) I worry about remote create in open(O_CREATE) case - what we should do in this case? did it possible to return different fid as provided as fid2 in client open?

            as about 1/2 it looks solution exist. tg_reply_data is completely 'in memory' object so easy to extent to the two additional fid.
            first one - mfd_exist, should resolve a problem with (1) and separate resend after replay (should be handled as replay) and clear resend.
            second one - open handle - support to fast search mfd via mdt_handle2mfd.

            (several hack's / optimisations possible - as open is schedule difficult reply always so we able to separate reply is delivered or not in mdt_steal_locks which scan for outstanding difficult reply's)

            shadow Alexey Lyashkov added a comment - - edited Alex, current code have mfd search too later after req->r_fid2 checks. It's problem. As about solution, same way was chosen but several problems exist. 1) mfd was don't exist with resend after recovery. if no open lock provided. client start create before fail, mdt failed before reply send. Client have think this is resend but oops. But say way will be in case client able to close this file in situation. client doing open, link fail, client have resend a open and send close immediately. But server able to delivery a reply (server don't know about link flap, it's just client side view). Reply arrived to the same open request as XID isn't changed. Based on LNet ideas - we should don't have assumptions about request order processing - so processing close before open resend is possible. mfd will destroyed in this case.. OOPS. so we need to separate mfd close vs mfd don't created. 2) mfd list is large in most cases, scanning this list can take a lot time a specially if it twice (second one in mdt_finish_open). 3) I worry about remote create in open(O_CREATE) case - what we should do in this case? did it possible to return different fid as provided as fid2 in client open? as about 1/2 it looks solution exist. tg_reply_data is completely 'in memory' object so easy to extent to the two additional fid. first one - mfd_exist, should resolve a problem with (1) and separate resend after replay (should be handled as replay) and clear resend. second one - open handle - support to fast search mfd via mdt_handle2mfd. (several hack's / optimisations possible - as open is schedule difficult reply always so we able to separate reply is delivered or not in mdt_steal_locks which scan for outstanding difficult reply's)

            sure, I don't recall all the details, but we do have XID in mfd:

            struct mdt_file_data {
            	/**  portals handle must be first */
            	struct portals_handle	mfd_open_handle;
            	/** open mode provided by client */
            	u64			mfd_open_flags;
            	/** protected by med_open_lock */
            	struct list_head	mfd_list;
            	/** xid of the open request */
            	__u64			mfd_xid;
            

            which should be enough to find the object and reconstruct the reply?

            bzzz Alex Zhuravlev added a comment - sure, I don't recall all the details, but we do have XID in mfd: struct mdt_file_data { /** portals handle must be first */ struct portals_handle mfd_open_handle; /** open mode provided by client */ u64 mfd_open_flags; /** protected by med_open_lock */ struct list_head mfd_list; /** xid of the open request */ __u64 mfd_xid; which should be enough to find the object and reconstruct the reply?

            People

              askulysh Andriy Skulysh
              shadow Alexey Lyashkov
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: