[LU-1716] Race in setting connection flags and using them on 2.x client connect Created: 07/Aug/12  Updated: 22/Dec/12  Resolved: 26/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.4

Type: Bug Priority: Blocker
Reporter: Andriy Skulysh Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4475

 Description   

A Lustre 2.1 client fails to connect to a Lustre 2.2 server:

> c0-0c2s6n1 LustreError: 11-0: an error occurred while communicating with 10.149.3.5@o2ib. The mgs_config_read operation failed with -524
> c0-0c2s6n1 LustreError: 4645:0:(mgc_request.c:1917:mgc_process_config()) Cannot process recover llog -524
> c0-0c2s6n1 LustreError: 15c-8: MGC10.149.3.5@o2ib: The configuration from log 'snxs2-client' failed (-524). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> c0-0c2s6n1 LustreError: 4645:0:(llite_lib.c:983:ll_fill_super()) Unable to process log: -524

The race can be reproduced with the following patch:

diff --git a/lustre/ptlrpc/import.c b/lustre/ptlrpc/import.c
index 2953352..a69e6b9 100644
--- a/lustre/ptlrpc/import.c
+++ b/lustre/ptlrpc/import.c
@@ -805,6 +805,7 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
                 } else {
                         IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL);
                         ptlrpc_activate_import(imp);
+                        OBD_FAIL_TIMEOUT(0x5555, 2);
                 }
 
                 GOTO(finish, rc = 0);


 Comments   
Comment by Andreas Dilger [ 07/Aug/12 ]

How is this different than LU-887?

Comment by Andriy Skulysh [ 07/Aug/12 ]

It looks like the answer is no. My case addresses the issue in setting connect flags. We should set the connect flags before setting the import to the FULL state.
Patch http://review.whamcloud.com/#change,3555
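The ordering problem described above can be sketched in a simplified, single-threaded model. This is not the Lustre code: the struct, the flag values, and the function names are all hypothetical, and the `observe()` call stands in for a concurrent request thread running at the point where the reproducer patch inserts `OBD_FAIL_TIMEOUT`. It also assumes, as a simplification, that the server's reply simply masks the flags the client requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the real Lustre definitions. */
enum imp_state { IMP_CONNECTING, IMP_FULL };

#define FLAG_VERSION       0x1ULL
#define FLAG_REMOTE_CLIENT 0x2ULL /* requested by default, may be refused */

struct import {
    enum imp_state state;
    uint64_t connect_flags; /* starts as what the client requested */
};

/* What the first request thread to see a FULL import would read. */
struct snapshot {
    bool saw_full;
    uint64_t flags_seen;
};

/* Models another thread that starts sending as soon as the import is FULL. */
static void observe(const struct import *imp, struct snapshot *snap)
{
    if (imp->state == IMP_FULL && !snap->saw_full) {
        snap->saw_full = true;
        snap->flags_seen = imp->connect_flags;
    }
}

/* Racy order: the import becomes FULL before the reply flags are stored. */
static void connect_reply_racy(struct import *imp, uint64_t server_flags,
                               struct snapshot *snap)
{
    assert(imp != NULL && snap != NULL);
    imp->state = IMP_FULL;
    observe(imp, snap);                 /* window widened by the delay */
    imp->connect_flags &= server_flags;
    observe(imp, snap);
}

/* Fixed order: store the reply flags first, then mark the import FULL. */
static void connect_reply_fixed(struct import *imp, uint64_t server_flags,
                                struct snapshot *snap)
{
    assert(imp != NULL && snap != NULL);
    imp->connect_flags &= server_flags;
    observe(imp, snap);                 /* not FULL yet: nothing recorded */
    imp->state = IMP_FULL;
    observe(imp, snap);
}
```

In this model the racy variant lets the observer read a flag the server never granted, while the fixed variant only exposes the negotiated set. A real multi-threaded implementation would additionally need appropriate locking or memory ordering, which this sketch does not attempt to show.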

Comment by Bob Glossman (Inactive) [ 07/Aug/12 ]

I tried to reproduce the reported problem with just repeated mount/unmount cycles of a 2.3 filesystem on a 2.1 client. Couldn't make it happen. Didn't add in your suggested OBD_FAIL_* hook to try to force it to happen. Should I have expected to see it or is there something else that needs to be done to cause it?

Comment by Andreas Dilger [ 09/Aug/12 ]

Bob/Andriy, it would be useful to add a proper OBD_FAIL value for this test, and write a recovery-small.sh test to trigger the timeout, and submit it as a patch to autotest with the following directives in the commit comment (see http://wiki.whamcloud.com/display/PUB/Changing+Test+Parameters+with+Gerrit+Commit+Messages for details):

Test-Parameters: fortestonly clientbuildno=??? list=recovery-small,recovery-small,recovery-small

Where clientbuildno=<2.1.2 release build number>. This is without the actual fix applied, just the test case. That would give us an indication of how easily this problem can be hit. If it fails during testing, the test case should be added to the patch, which will presumably allow the test to pass.

Comment by Andriy Skulysh [ 15/Aug/12 ]

Added test http://review.whamcloud.com/#change,3654 for b2_1.
In fact, it reproduces the bug in another place:
00000100:00000001:0.0:1344768393.872519:0:4160:0:(client.c:617:__ptlrpc_request_bufs_pack()) Process leaving (rc=0 : 0 : 0)
00000000:00040000:0.0:1344768393.872521:0:4160:0:(mdc_request.c:239:mdc_getattr()) ASSERTION(client_is_remote(exp)) failed
00000000:00040000:0.0:1344768393.872527:0:4160:0:(mdc_request.c:239:mdc_getattr()) LBUG

but it is also a connection flags race. The LBUG disappears with my fix.

Comment by Bob Glossman (Inactive) [ 15/Aug/12 ]

Andreas, I can incorporate the test from http://review.whamcloud.com/#change,3654 into the fix for master. Since this bug is an interop problem it won't prove anything with the test only in 2.3 client & server, but should be good to prove the problem doesn't come back for testing interop of 2.3 with future versions. Will also rework the patch to address your style and cleanup comments and resubmit.

Comment by Bob Glossman (Inactive) [ 15/Aug/12 ]

Never mind. I see I am playing catch up here. Looks like Andriy has already addressed the style and cleanup issues but not the test. That being the case should I grab the 2.1 test into master even though it doesn't prove anything in this release?

Comment by Andriy Skulysh [ 15/Aug/12 ]

I haven't added the test to the master branch because the fix changes the order of assigning connect flags and setting the import to FULL state, so OBD_FAIL_TIMEOUT() will not catch anything.

Comment by Oleg Drokin [ 25/Aug/12 ]

Hm, actually I now believe the issue is more serious than interop. Since flags like CAPA, GSS and remote client are set in a client by default and are reset only after the server rejects them, this could occur without any interop as well, though it is strange that we have not seen it before (does it need a lot of superfast cores on the client to show up the race?)

I will change the title to reflect this.
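To illustrate why this is not limited to interop, here is a minimal sketch of which flag bits are exposed to the race. It assumes the client requests a default set and the server grants only the subset it has enabled. The `SK_*` names and bit values are hypothetical and are not the real `OBD_CONNECT_*` definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; not the real OBD_CONNECT_* values. */
#define SK_CONNECT_VERSION    (1ULL << 0)
#define SK_CONNECT_CAPA       (1ULL << 1)
#define SK_CONNECT_GSS        (1ULL << 2)
#define SK_CONNECT_RMT_CLIENT (1ULL << 3)

/* Flags the client asks for by default, before it hears from the server. */
static uint64_t sk_default_request(void)
{
    return SK_CONNECT_VERSION | SK_CONNECT_CAPA |
           SK_CONNECT_GSS | SK_CONNECT_RMT_CLIENT;
}

/* The server grants only the requested features it has enabled. */
static uint64_t sk_granted(uint64_t requested, uint64_t server_enabled)
{
    return requested & server_enabled;
}

/* Bits that stay set until the reply is applied and are cleared afterwards. */
static uint64_t sk_stale_bits(uint64_t requested, uint64_t server_enabled)
{
    uint64_t granted = sk_granted(requested, server_enabled);

    assert((granted & ~requested) == 0); /* the mask never adds bits */
    return requested & ~granted;
}
```

Any nonzero result from `sk_stale_bits()` marks a feature that a request thread could wrongly believe is active if it runs before the reply flags are stored, even when the client and server run the same version.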

Comment by Andriy Skulysh [ 25/Aug/12 ]

b2_1 patch http://review.whamcloud.com/3783

Comment by Peter Jones [ 26/Aug/12 ]

Landed for 2.3 and 2.4

Comment by Nathan Rutman [ 21/Nov/12 ]

Xyratex-bug-id: MRP-577
