[LU-3452] 1.8 <> 2.4 interop broken due to job id patch. Created: 11/Jun/13  Updated: 22/Nov/14  Resolved: 22/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3, Lustre 1.8.x (1.8.0 - 1.8.5), Lustre 1.8.9
Fix Version/s: Lustre 2.1.4, Lustre 1.8.9

Type: Bug Priority: Major
Reporter: Alexey Lyashkov Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-2309 Uninteroperable conf_params not docum... Resolved
Severity: 3
Rank (Obsolete): 8632

 Description   

A 1.8 client is not able to start with 2.4 now (and with 2.3, if I understand correctly).

LustreError: 152-6: Ignoring deprecated mount option 'acl'.
Lustre: 5036:0:(obd_config.c:1127:class_config_llog_handler()) skipping 'lmv' config: cmd=cf001,clilmv:lmv
Lustre: 5036:0:(obd_config.c:875:class_process_config()) Ignoring unknown param jobid_var=procname_uid
LustreError: 5036:0:(obd_config.c:1199:class_config_llog_handler()) Err -22 on cfg command:
Lustre:    cmd=cf00f 0:<NULL>  1:sys.jobid_var=procname_uid  2:procname_uid  
LustreError: 15b-f: MGC192.168.69.5@tcp: The configuration from log 'lustre-client' failed (-22). Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 15c-8: MGC192.168.69.5@tcp: The configuration from log 'lustre-client' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.


 Comments   
Comment by Andreas Dilger [ 11/Jun/13 ]

This is only true if the JobID feature is enabled, which it isn't by default. The jobid feature is enabled by default by the test framework but will not be when the filesystem is formatted by a user, or during an upgrade.

There is a patch for b1_8 and b2_1 to allow the client to skip unknown conf_param settings which should be applied to clients that need to interoperate with servers that have JobId enabled. There isn't a patch to backport JobId to those clients.
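The difference between failing on an unknown setting and skipping it can be sketched as follows. This is a minimal illustration, not the actual b1_8/b2_1 patch: the function names and the list of known parameters are hypothetical, and only the -22 (-EINVAL) result and the jobid_var=procname_uid setting come from the log in the description.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical list of parameters this (older) client understands. */
static const char *const known_params[] = { "timeout", "ldlm_timeout" };

/* Return 1 if the key part of "key=value" appears in known_params. */
static int param_is_known(const char *setting)
{
    size_t keylen = strcspn(setting, "=");
    size_t i;

    for (i = 0; i < sizeof(known_params) / sizeof(known_params[0]); i++) {
        if (strlen(known_params[i]) == keylen &&
            strncmp(known_params[i], setting, keylen) == 0)
            return 1;
    }
    return 0;
}

/* Strict behaviour: an unknown parameter fails config processing. */
int process_param_strict(const char *setting)
{
    return param_is_known(setting) ? 0 : -EINVAL;
}

/* Tolerant behaviour: an unknown parameter is logged and skipped. */
int process_param_tolerant(const char *setting)
{
    if (!param_is_known(setting))
        fprintf(stderr, "ignoring unknown param %s\n", setting);
    return 0;
}
```

With the strict variant, "jobid_var=procname_uid" yields -EINVAL and the whole config log is rejected; with the tolerant variant the setting is skipped and processing continues.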

Comment by Alexey Lyashkov [ 11/Jun/13 ]

Andreas,

I know about that patch, but this breaks interoperability testing for now. I understand why it is hit; my question is rather why the job_id parameter is sent to non-2.4 clients.
Some customers have a lot of 1.8 or older 2.x clients without job ID support or the patch to ignore invalid parameters, so this should be fixed in Lustre for interoperability to work.

Comment by Andreas Dilger [ 11/Jun/13 ]

JobID is not a "required" feature for users (i.e. it won't break if it isn't enabled, and it can be fixed with a small patch in older clients).

As with many new Lustre features, there is no easy way to add new fixes to older clients (e.g. DNE, large xattrs, HSM, etc), so those features should not be enabled by sysadmins if the clients do not support them, and JobID is not enabled by default. It is unfortunate that this interoperability problem was not found before 2.3.0 was released, but it is impossible for us to test every possible combination of features enabled and disabled with every combination of client version and server version, and in this case we didn't catch the problem in time.

The conf_param setting in the MGS config log does not check for compatibility of llog entries between client versions. Making the MGS understand Lustre feature compatibility is much more complex than having the client ignore configuration parameters that it doesn't understand. The client should have already been ignoring such unknown parameters, but there was a bug in that code (LU-2309 was fixed in 2.1.4 and 2.4.0, and landed to b1_8, though after the 1.8.9 release).

If you can think of a simple way to have the MGS hide this config record from the client, that might be useful. Clients already connect to the OSS and MDS with OBD_CONNECT_JOBSTATS, so that the client/server can decide what size RPC message they should use, but they don't pass this flag to the MGS. The difficulty is that adding a check for OBD_CONNECT_JOBSTATS on the MGS now would break the JobID code for 2.3.0 and 2.4.0 clients as well, since they do not send that flag to the MGS. It might be enough to have the MGS check for OBD_CONNECT_FULL20, which would leave only the 2.1.0-2.1.3 clients unable to interoperate, but they are much more likely to be updated to a newer release that can ignore the conf_param errors than older 1.8 clients.
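The trade-off between the two candidate checks can be sketched as a simple flag test. This is only an illustration under stated assumptions: the DEMO_ flag names stand in for OBD_CONNECT_FULL20 and OBD_CONNECT_JOBSTATS, their bit values are made up, and the record structure and function are hypothetical rather than existing MGS code.

```c
#include <stdint.h>

/* Illustrative bit positions only; NOT the real values from the Lustre headers. */
#define DEMO_CONNECT_FULL20   (UINT64_C(1) << 0)
#define DEMO_CONNECT_JOBSTATS (UINT64_C(1) << 1)

/* Hypothetical config record tagged with the connect flags a client
 * must have advertised to the MGS in order to receive it. */
struct demo_cfg_record {
    const char *setting;
    uint64_t    required_flags;
};

/* Return 1 if the record should be sent to a client that connected
 * to the MGS with client_flags, 0 if it should be hidden. */
int demo_record_visible(const struct demo_cfg_record *rec,
                        uint64_t client_flags)
{
    return (client_flags & rec->required_flags) == rec->required_flags;
}
```

Gating the jobid_var record on the FULL20 stand-in hides it from a 1.8 client (no flags set) but still sends it to any 2.x client, including 2.1.0-2.1.3. Gating it on the JOBSTATS stand-in would also hide it from 2.3.0 and 2.4.0 clients, because they do not pass that flag to the MGS.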

Comment by Andreas Dilger [ 22/Nov/14 ]

Since a patch to fix this was landed to b1_8 (after 1.8.9) and for 2.1.4, and this problem is only hit if the sysadmin explicitly enables jobstats, I'm going to mark this fixed.
