[LU-1848] interop issue: 2.2 clients can't talk to 2.3 servers Created: 06/Sep/12  Updated: 10/Sep/12  Resolved: 10/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jinshan Xiong (Inactive) Assignee: Lai Siyao
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4263

 Description   

A 2.2.0 client fails to mount a filesystem served by 2.3 servers, with the following error message:

[root@client-17 ~]# uname -r
2.6.32-220.4.2.el6_lustre.g45b2fe8.x86_64
[root@client-17 ~]# rpm -qa |grep lustre
lustre-2.2.0-2.6.32_220.4.2.el6_lustre.g45b2fe8.x86_64_g25a1427.x86_64
kernel-2.6.32-220.4.2.el6_lustre.g45b2fe8.x86_64
lustre-modules-2.2.0-2.6.32_220.4.2.el6_lustre.g45b2fe8.x86_64_g25a1427.x86_64
lustre-ldiskfs-3.3.0-2.6.32_220.4.2.el6_lustre.g45b2fe8.x86_64_g25a1427.x86_64
[root@client-17 ~]# mount -t lustre client-18@tcp:/lustre /mnt/lustre
mount.lustre: mount client-18@tcp:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
[root@client-17 ~]# dmesg
Lustre: MGC10.10.4.18@tcp: Reactivating import
Lustre: 6833:0:(obd_config.c:1002:class_process_config()) Ignoring unknown param jobid_var=procname_uid
LustreError: 6833:0:(obd_config.c:1362:class_config_llog_handler()) Err -22 on cfg command:
Lustre:    cmd=cf00f 0:(null)  1:sys.jobid_var=procname_uid  2:procname_uid  
LustreError: 15b-f: MGC10.10.4.18@tcp: The configuration from log 'lustre-client' failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 15c-8: MGC10.10.4.18@tcp: The configuration from log 'lustre-client' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 6821:0:(llite_lib.c:978:ll_fill_super()) Unable to process log: -22
LustreError: 6736:0:(lov_obd.c:928:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
LustreError: 6736:0:(lov_obd.c:928:lov_cleanup()) Skipped 3 previous similar messages
LustreError: 6821:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 6821:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff880329563000 umount complete
LustreError: 6821:0:(obd_mount.c:2349:lustre_fill_super()) Unable to mount  (-22)

We need to fix this by:
1. detecting the client version and deciding whether to provide new config options; a similar change is here: http://review.whamcloud.com/3836
2. adding a sanity test case to check that clients can skip unknown config options.
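For point 1, a minimal sketch of the kind of version gate test-framework.sh could use. The `version_code()` here is a simplified stand-in for the helper of the same name in test-framework.sh, and 2.3.0 is assumed as the release that introduced jobid_var:

```shell
# Encode "major.minor.patch" as a single integer, in the spirit of
# test-framework.sh's version_code(), so versions compare numerically.
version_code() {
	local IFS='.'
	set -- $1
	echo $(( ($1 << 16) | ($2 << 8) | ${3:-0} ))
}

client_ver=$(version_code 2.2.0)     # oldest client in the cluster
jobstats_ver=$(version_code 2.3.0)   # first release understanding jobid_var

# Only push the new config option when every client understands it.
if [ "$client_ver" -lt "$jobstats_ver" ]; then
	echo "skipping jobid_var: pre-2.3 client present"
else
	echo "setting jobid_var"
fi
```

With a 2.2.0 client present this prints the "skipping" branch, which is exactly the behavior the 2.2/2.3 interop case needs.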



 Comments   
Comment by Peter Jones [ 06/Sep/12 ]

Sarah will look into this one

Comment by Oleg Drokin [ 06/Sep/12 ]

This might also be helped by: http://review.whamcloud.com/3806

Comment by Jinshan Xiong (Inactive) [ 06/Sep/12 ]

After taking a closer look, we don't need a connect bit or anything similar to address this issue. The only culprit is test-framework.sh, which is too eager to set jobid_var...

Comment by Peter Jones [ 06/Sep/12 ]

Lai will take care of this

Comment by Sarah Liu [ 07/Sep/12 ]

I cannot reproduce this issue with the following config:

MDS and OST 2.3-tag2.2.94
client1: 2.2.0
client2: master

https://maloo.whamcloud.com/test_sessions/7948cadc-f8bf-11e1-b9a7-52540035b04c

Comment by Lai Siyao [ 07/Sep/12 ]

Sarah, it occurs between 2.2 client and 2.3 (not master) server.

Comment by Sarah Liu [ 07/Sep/12 ]

"Sarah, it occurs between 2.2 client and 2.3 (not master) server."

I used 2.3 as the servers; master was just another client, which is the setup Jinshan described to me.

Comment by Jinshan Xiong (Inactive) [ 07/Sep/12 ]

Hi Sarah, I started my cluster with auster and then mounted a 2.2 client manually. From what I have seen, the following piece of code has run:

In test-framework.sh, function init_param_vars():

        local jobid_var
        if [ -z "$(lctl get_param -n mdc.*.connect_flags | grep jobstats)" ]; then
                jobid_var="none"
        elif [ $JOBSTATS_AUTO -ne 0 ]; then
                echo "enable jobstats, set job scheduler as $JOBID_VAR"
                jobid_var=$JOBID_VAR
        else
                jobid_var=`$LCTL get_param -n jobid_var`
                if [ $jobid_var != "disable" ]; then
                        echo "disable jobstats as required"
                        jobid_var="disable"
                else
                        jobid_var="none"
                fi
        fi

        if [ $jobid_var == $JOBID_VAR -o $jobid_var == "disable" ]; then
                do_facet mgs $LCTL conf_param $FSNAME.sys.jobid_var=$jobid_var
                wait_update $HOSTNAME "$LCTL get_param -n jobid_var" \
                        $jobid_var || return 1
        fi

By default JOBSTATS_AUTO is 1. This caused lctl conf_param to be called to set jobid_var, and 2.2 clients certainly don't understand this config item.
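A runnable simulation of that branch (the lctl calls are stubbed out; variable names are taken from init_param_vars() above, and "lustre" stands in for $FSNAME) shows how the default fires the conf_param and plants an option in the MGS config log that 2.2 clients cannot parse:

```shell
# Simulation of the init_param_vars() branch quoted above, assuming the
# jobstats connect flag is present on the servers. With the default
# JOBSTATS_AUTO=1, jobid_var is set to JOBID_VAR, so the conf_param
# branch fires and sys.jobid_var lands in the MGS config log.
JOBSTATS_AUTO=1
JOBID_VAR="procname_uid"

if [ "$JOBSTATS_AUTO" -ne 0 ]; then
	jobid_var=$JOBID_VAR
else
	jobid_var="none"
fi

if [ "$jobid_var" == "$JOBID_VAR" -o "$jobid_var" == "disable" ]; then
	# Stubbed: on a real cluster this would be
	#   do_facet mgs $LCTL conf_param $FSNAME.sys.jobid_var=$jobid_var
	echo "would run: lctl conf_param lustre.sys.jobid_var=$jobid_var"
fi
```

Setting JOBSTATS_AUTO=0 (and leaving the server's jobid_var at its default) takes the other branch and leaves the config log untouched, which is why a cluster set up without jobstats still serves 2.2 clients.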

Comment by Lai Siyao [ 09/Sep/12 ]

Jinshan, I think that before you tested the 2.2 client, you had set up the system from a 2.3 client, which enabled jobid stats. The right way to test a 2.2 client against 2.3 servers is to set up the system from the 2.2 client (which works fine). So this should not be a bug.
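For a config log already poisoned this way, a hypothetical recovery sketch (untested against a live MGS; LCTL defaults to a dry-run echo here so the command is visible without real hardware, and "lustre" stands in for the actual fsname):

```shell
# Delete the permanent jobid_var parameter from the MGS config log so
# that pre-2.3 clients can process the log again. Run on the MGS node.
FSNAME=${FSNAME:-lustre}
LCTL=${LCTL:-"echo lctl"}   # dry run by default; set LCTL=lctl on a real MGS

$LCTL conf_param -d ${FSNAME}.sys.jobid_var
```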

Comment by Jinshan Xiong (Inactive) [ 09/Sep/12 ]

Yes I agree.

Generated at Sat Feb 10 01:20:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.