[LU-7115] fld_client_rpc() may run into deadloop Created: 08/Sep/15  Updated: 24/Jan/17  Resolved: 24/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Major
Reporter: Niu Yawei (Inactive) Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5312 sanity test_161a: cannot create regul... Resolved
is related to LU-6419 Fld client lookup should retry anothe... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In fld_client_rpc():

        if (rc != 0) {
                if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
                        /* Since LWP is not replayable, so it will keep
                         * trying unless umount happens, otherwise it would
                         * cause unecessary failure of the application. */
                        ptlrpc_req_finished(req);
                        rc = 0;
                        goto again;
                }
                GOTO(out_req, rc);
        }

If the connection is broken, this function will run into a dead loop. I think we should reshape the function somehow to make it interruptible; otherwise, if the connection is never established, the caller will be stuck in this function forever.

It seems fld_update_from_controller() has a similar problem.
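
For illustration, here is a minimal, self-contained user-space model of the loop (the types, names, and the retry cap are stand-ins invented for this sketch, not the real Lustre structures); it shows that nothing inside the loop itself can stop the retries, only a change of import state from outside:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the real Lustre import; illustrative names only. */
enum mock_imp_state { MOCK_IMP_FULL, MOCK_IMP_DISCON, MOCK_IMP_CLOSED };

struct mock_import {
        enum mock_imp_state     imp_state;
        bool                    imp_deactive;
};

/* Models the retry logic in fld_client_rpc(): a failed RPC is retried for as
 * long as the import is neither closed nor deactivated.  If the connection is
 * broken and nothing ever closes or deactivates the import (no umount, no
 * admin action), the loop never exits and the calling thread is stuck; the
 * max_tries cap exists only so this model itself terminates. */
static int mock_fld_client_rpc(struct mock_import *imp, long max_tries)
{
        long tries = 0;
        int rc;

again:
        rc = -ENOTCONN;         /* pretend the RPC always fails: link is down */
        if (rc != 0) {
                if (imp->imp_state != MOCK_IMP_CLOSED && !imp->imp_deactive) {
                        if (++tries >= max_tries)       /* not in the real code */
                                return rc;
                        goto again;
                }
                return rc;
        }
        return 0;
}

int main(void)
{
        struct mock_import broken = {
                .imp_state = MOCK_IMP_DISCON,
                .imp_deactive = false,
        };

        /* Spins until the artificial cap; the real function would never return. */
        printf("broken connection: rc = %d\n",
               mock_fld_client_rpc(&broken, 1000000));

        /* Deactivating the import (e.g. on umount) is what lets the loop exit. */
        broken.imp_deactive = true;
        printf("import deactivated: rc = %d\n",
               mock_fld_client_rpc(&broken, 1000000));
        return 0;
}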



 Comments   
Comment by Peter Jones [ 08/Sep/15 ]

Yang Sheng

Could you please look into this issue?

Thanks

Peter

Comment by Di Wang [ 08/Sep/15 ]

I thought the point here is to not fail for LWP unless it is being unmounted or deactivated. I'm not sure how making it interruptible would help here, since it is the connection between MDTs. Am I missing something?

But we do need to check whether the import is for LWP, i.e. only do this "try again" for an LWP connection, not for MDC or other imports.
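
A minimal, self-contained sketch of that idea (user-space mock types; identifying an LWP import by a "lwp" type-name string is an assumption of this sketch, not necessarily how the eventual patch does it): the retry would only be taken when the import belongs to an LWP device.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-ins; the real obd_import/obd_device/obd_type structures
 * and the LWP type name are assumptions for this sketch. */
struct mock_obd_type { const char *typ_name; };
struct mock_obd      { struct mock_obd_type *obd_type; };
struct mock_import   { struct mock_obd *imp_obd; };

/* The suggested gate: keep the "try again" behaviour only when the import
 * belongs to an LWP device; MDC (or any other) imports fail straight back to
 * the caller instead of looping. */
static bool mock_retry_allowed(const struct mock_import *imp)
{
        return strcmp(imp->imp_obd->obd_type->typ_name, "lwp") == 0;
}

int main(void)
{
        struct mock_obd_type lwp_type = { .typ_name = "lwp" };
        struct mock_obd_type mdc_type = { .typ_name = "mdc" };
        struct mock_obd lwp_obd = { .obd_type = &lwp_type };
        struct mock_obd mdc_obd = { .obd_type = &mdc_type };
        struct mock_import lwp_imp = { .imp_obd = &lwp_obd };
        struct mock_import mdc_imp = { .imp_obd = &mdc_obd };

        printf("LWP import: retry allowed = %d\n", mock_retry_allowed(&lwp_imp));
        printf("MDC import: retry allowed = %d\n", mock_retry_allowed(&mdc_imp));
        return 0;
}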

Comment by Niu Yawei (Inactive) [ 09/Sep/15 ]

I mean that if the connection is broken, it will run into a dead loop; then there is no way to terminate the thread which calls this function, and we won't be able to shut down the MDT/OST in the end.

I think it's not a serious problem, but it doesn't look easy to fix.

Comment by Di Wang [ 09/Sep/15 ]

Oh, it can break out of the loop, see:

               if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
                        /* Since LWP is not replayable, so it will keep
                         * trying unless umount happens, otherwise it would
                         * cause unecessary failure of the application. */
                        ptlrpc_req_finished(req);
                        rc = 0;
                        goto again;
                }

It will check the import state here. Also, the point right now is that we do not break the connection between MDTs until umount happens or an admin steps in, so this implementation actually fits here.
Though we still need to check for LWP here, as I said in the previous comment.

Comment by Andreas Dilger [ 15/Sep/15 ]

Is this really a bug or could this be closed?

Comment by Niu Yawei (Inactive) [ 16/Sep/15 ]

As Di mentioned, the thread can be terminated by unmounting the target, which is fine with me; we can just leave it as it is.

This function will be called by the client as well, so we may need to check whether it's called by the client (not from LWP but from an MDC device) and break the loop for the non-LWP device case.

Comment by Gerrit Updater [ 04/Nov/15 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/17041
Subject: LU-7115 fld: don't try again for no LWP device
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4fb4d98b6432e2a9d2e0397599421a1c2032e51e

Comment by Gerrit Updater [ 24/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/17041/
Subject: LU-7115 fld: don't retry for no LWP device
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9407629a816feff9f773517f90b615164319642f

Comment by Peter Jones [ 24/Jan/17 ]

Landed for 2.10
