[LU-2157] rolling downgrade from 2.3.0 to 1.8.8-wc1 failed Created: 12/Oct/12  Updated: 28/Feb/18  Resolved: 28/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.8
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: Jian Yu
Assignee: WC Triage
Resolution: Won't Fix
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-2308 class_process_config() prints confusi... Resolved
Severity: 3
Rank (Obsolete): 5182

 Description   

After a successful rolling upgrade from Lustre 1.8.8-wc1 to 2.3.0 RC2 in the order OSS->MDS->Client, the rolling downgrade in the order Client->MDS->OSS failed at the stage of mounting the 1.8.8-wc1 client:

mount.lustre: mount fat-amd-1:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.

dmesg showed:

Lustre: Server MGS version (2.3.0.0) is much newer than client version (1.8.8)
Lustre: 6967:0:(obd_config.c:875:class_process_config()) Ignoring unknown param jobid_var=procname_uid
LustreError: 6967:0:(obd_config.c:1199:class_config_llog_handler()) Err -22 on cfg command:
Lustre:    cmd=cf00f 0:(null)  1:sys.jobid_var=procname_uid  2:procname_uid  
LustreError: 15b-f: MGC10.10.4.132@tcp: The configuration from log 'lustre-client' failed (-22). Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 15c-8: MGC10.10.4.132@tcp: The configuration from log 'lustre-client' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 6955:0:(llite_lib.c:1095:ll_fill_super()) Unable to process log: -22
LustreError: 6955:0:(lov_obd.c:1009:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
LustreError: 6955:0:(mdc_request.c:1498:mdc_precleanup()) client import never connected
LustreError: 6955:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 6955:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client lustre-client(ffff880331cbc000) umount complete
LustreError: 6955:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount  (-22)
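
To make the failing record easier to spot in a long syslog, the following is a minimal Python sketch that pairs each "Err N on cfg command" line with the "cmd=..." line that follows it. The function name find_failed_cfg_commands and the regular expressions are illustrative assumptions based only on the log excerpt above; they are not part of Lustre or its tooling, and other Lustre versions may format these lines differently.

```python
import re

# Excerpt of the dmesg output quoted in this report.
DMESG = """\
Lustre: 6967:0:(obd_config.c:875:class_process_config()) Ignoring unknown param jobid_var=procname_uid
LustreError: 6967:0:(obd_config.c:1199:class_config_llog_handler()) Err -22 on cfg command:
Lustre:    cmd=cf00f 0:(null)  1:sys.jobid_var=procname_uid  2:procname_uid
"""

# Patterns inferred from the excerpt above (assumptions, not a Lustre spec).
ERR_RE = re.compile(r"class_config_llog_handler\(\)\) Err (-?\d+) on cfg command")
CMD_RE = re.compile(r"cmd=(\w+)\s+(.*)")
ARG_RE = re.compile(r"(\d+):(\S+)")


def find_failed_cfg_commands(log_text):
    """Return a list of (rc, cmd, args) for each failed cfg command record."""
    failures = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        err = ERR_RE.search(line)
        if not err or i + 1 >= len(lines):
            continue
        # The command details are printed on the line after the error.
        cmd = CMD_RE.search(lines[i + 1])
        if not cmd:
            continue
        args = {int(n): v for n, v in ARG_RE.findall(cmd.group(2))}
        failures.append((int(err.group(1)), cmd.group(1), args))
    return failures


if __name__ == "__main__":
    for rc, cmd, args in find_failed_cfg_commands(DMESG):
        print(rc, cmd, args.get(1))
```

On the excerpt above this reports return code -22 (EINVAL on Linux, which matches the "Invalid argument" printed by mount.lustre) for the record carrying sys.jobid_var=procname_uid.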

Here is the test report for the rolling upgrade: https://maloo.whamcloud.com/test_sets/c3ef59ee-142a-11e2-af8d-52540035b04c



 Comments   
Comment by Jian Yu [ 12/Oct/12 ]

The jobstats feature is disabled by default on 2.3.0, but test-framework.sh enables it while the auster test suite is running. The rolling upgrade test above ran parallel-scale, so jobstats was enabled after the upgrade.

When jobstats was not enabled on 2.3.0 after the upgrade, the downgrade passed.

Comment by Andreas Dilger [ 13/Oct/12 ]

I think the major problem here is that the client code does not skip the unknown config command as it should. This is bad for two reasons:

  • upgrade and downgrade may always introduce new parameters and remove or rename old ones, so if this causes a mount failure it is a very serious problem
  • there is not currently any sanity checking of conf_param names, since the parameters only exist on a subset of nodes, and the MGS does not have access to them

We need to verify whether this is a problem in only 1.8.8 or if it is also in newer releases. In my local Lustre filesystem (1.8.6) I have such a parameter for the OST that is properly ignored, so I wonder whether this is a regression in 1.8.8 or if the problem exists on the client only?

Since it is unlikely that we will make another 2.3 release I would strongly prefer to fix this before the release.
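
The difference between failing on an unknown parameter and skipping it can be illustrated with a toy model in Python. This is only a sketch of the behavior described in the comment above, not Lustre's actual config log handler: the names apply_param and process_config_log, the strict flag, and the parameter table are all hypothetical.

```python
import errno


def apply_param(known, setting):
    """Apply one 'name=value' setting; return 0, or -EINVAL if the name is unknown."""
    name, _, value = setting.partition("=")
    if name not in known:
        return -errno.EINVAL
    known[name] = value
    return 0


def process_config_log(known, records, strict=False):
    """Replay config records against the table of known parameters.

    strict=True models the failure in this report: the first unknown
    parameter aborts processing with -EINVAL, so the mount fails.
    strict=False models the expected behavior: unknown parameters are
    recorded and skipped, and processing continues.
    """
    skipped = []
    for record in records:
        rc = apply_param(known, record)
        if rc == -errno.EINVAL:
            if strict:
                return rc, skipped
            skipped.append(record)
    return 0, skipped


if __name__ == "__main__":
    # Hypothetical parameter table for an older client that predates jobid_var.
    params = {"timeout": "100"}
    print(process_config_log(params, ["timeout=20", "jobid_var=procname_uid"]))
    print(process_config_log(params, ["jobid_var=procname_uid"], strict=True))
```

In the tolerant mode the older client still applies the settings it understands and merely notes the one it does not, which is what a rolling upgrade or downgrade needs when parameters are added, removed, or renamed between releases.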
