Lustre / LU-7410

After downgrade from 2.8 to 2.5.5, hit unsupported incompat filesystem feature(s) 400

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Environment:
      before downgrade: lustre-master #3226 RHEL6.7
      after downgrade: lustre-b2_5_fe #62 RHEL6.6

    Description

      1. upgrade system from 2.5.5 RHEL6.6 to master RHEL6.7: PASS
      2. downgrade system from master RHEL6.7 to 2.5.5 RHEL6.6: FAIL

      mount MDS failed

      Lustre: DEBUG MARKER: == upgrade-downgrade End == 15:01:41 (1447110101)
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
      Lustre: MGC10.2.4.47@tcp: Connection restored to MGS (at 0@lo)
      Lustre: lustre-MDT0000: used disk, loading
      LustreError: 12684:0:(mdt_recovery.c:263:mdt_server_data_init()) lustre-MDT0000: unsupported incompat filesystem feature(s) 400
      LustreError: 12684:0:(obd_config.c:572:class_setup()) setup lustre-MDT0000 failed (-22)
      LustreError: 12684:0:(obd_config.c:1629:class_config_llog_handler()) MGC10.2.4.47@tcp: cfg command failed: rc = -22
      Lustre:    cmd=cf003 0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f  
      LustreError: 15b-f: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000'failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
      LustreError: 15c-8: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      LustreError: 12589:0:(obd_mount_server.c:1254:server_start_targets()) failed to start server lustre-MDT0000: -22
      LustreError: 12589:0:(obd_mount_server.c:1737:server_fill_super()) Unable to start targets: -22
      LustreError: 12589:0:(obd_mount_server.c:847:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
      LustreError: 12589:0:(obd_mount_server.c:1422:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
      LustreError: 12589:0:(obd_config.c:619:class_cleanup()) Device 5 not setup
      Lustre: 12589:0:(client.c:1943:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1447110105/real 1447110105]  req@ffff8808352bac00 x1517404919169064/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1447110111 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: server umount lustre-MDT0000 complete
      LustreError: 12589:0:(obd_mount.c:1330:lustre_fill_super()) Unable to mount  (-22)
      Lustre: DEBUG MARKER: Using TIMEOUT=100
      Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
      Lustre: MGC10.2.4.47@tcp: Connection restored to MGS (at 0@lo)
      Lustre: lustre-MDT0000: used disk, loading
      LustreError: 13112:0:(mdt_recovery.c:263:mdt_server_data_init()) lustre-MDT0000: unsupported incompat filesystem feature(s) 400
      LustreError: 13112:0:(obd_config.c:572:class_setup()) setup lustre-MDT0000 failed (-22)
      LustreError: 13112:0:(obd_config.c:1629:class_config_llog_handler()) MGC10.2.4.47@tcp: cfg command failed: rc = -22
      Lustre:    cmd=cf003 0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f  
      LustreError: 15b-f: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000'failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
      LustreError: 15c-8: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      LustreError: 13025:0:(obd_mount_server.c:1254:server_start_targets()) failed to start server lustre-MDT0000: -22
      LustreError: 13025:0:(obd_mount_server.c:1737:server_fill_super()) Unable to start targets: -22
      LustreError: 13025:0:(obd_mount_server.c:847:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
      LustreError: 13025:0:(obd_mount_server.c:1422:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
      LustreError: 13025:0:(obd_config.c:619:class_cleanup()) Device 5 not setup
      Lustre: 13025:0:(client.c:1943:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1447110256/real 1447110256]  req@ffff88081d67dc00 x1517404919169104/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1447110262 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: server umount lustre-MDT0000 complete
      LustreError: 13025:0:(obd_mount.c:1330:lustre_fill_super()) Unable to mount  (-22)
      Lustre: DEBUG MARKER: Using TIMEOUT=100
      Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
      [root@onyx-25 ~]# 
      

      Attachments

        1. trace-after
          482 kB
        2. dmesg-before
          93 kB
        3. dmesg-after
          93 kB
        4. debug-after
          43 kB

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            I think that we can safely close this out from a community release point of view. Upgrade/downgrade from 2.5.x to 2.9 is outside the official scope of the release and there is a viable workaround for those who want to try this anyway.

            sarah Sarah Liu added a comment - - edited

            Thank you YuJian for the information.

            Niu, I tried with the "-f" option (step 7) and it seems to work: upgrade from EE2.4.2.2 RHEL6.8 to EE3.0.1 RHEL7.2 and downgrade again:
            MDS

            [root@onyx-27 ~]# lr_reader /dev/sdb1
            Reading last_rcvd
            UUID lustre-MDT0000_UUID
            Feature compat=0xc
            Feature incompat=0x21c
            Feature rocompat=0x1
            Last transaction 17179869184
            target index 0
            MDS, index 0
            [root@onyx-27 ~]# mount
            /dev/sda1 on / type ext3 (rw)
            proc on /proc type proc (rw)
            sysfs on /sys type sysfs (rw)
            devpts on /dev/pts type devpts (rw,gid=5,mode=620)
            tmpfs on /dev/shm type tmpfs (rw)
            none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
            sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
            nfsd on /proc/fs/nfsd type nfsd (rw)
            onyx-4.onyx.hpdd.intel.com:/export/scratch on /scratch type nfs (rw,vers=4,addr=10.2.0.2,clientaddr=10.2.4.65)
            /dev/sdb1 on /mnt/mds1 type lustre (rw,acl,user_xattr)
            

            I also recorded the lr_reader output at each step for reference:
            1. first time mount system under EE2.4.2.2 RHEL6.8

            [root@onyx-27 ~]# lr_reader /dev/sdb1
            Reading last_rcvd
            UUID lustre-MDT0000_UUID
            Feature compat=0xc
            Feature incompat=0x21c
            Feature rocompat=0x1
            Last transaction 4294967296
            target index 0
            MDS, index 0
            

            2. umount MDS

            [root@onyx-27 ~]# lr_reader /dev/sdb1
            Reading last_rcvd
            UUID lustre-MDT0000_UUID
            Feature compat=0xc
            Feature incompat=0x21c
            Feature rocompat=0x1
            Last transaction 4294967307
            target index 0
            MDS, index 0
            

            3. after upgrade to EE3.0.1 RHEL7 and remount

            [root@onyx-27 ~]# lr_reader /dev/sdb1
            last_rcvd:
              uuid: lustre-MDT0000_UUID
              feature_compat: 0xc
              feature_incompat: 0x61c
              feature_rocompat: 0x1
              last_transaction: 8589934592
              target_index: 0
              mount_count: 2
            [root@onyx-27 ~]# 
            

            4. umount again

            [root@onyx-27 ~]# lr_reader /dev/sdb1
            last_rcvd:
              uuid: lustre-MDT0000_UUID
              feature_compat: 0xc
              feature_incompat: 0x21c
              feature_rocompat: 0x1
              last_transaction: 8589934594
              target_index: 0
              mount_count: 2
            [root@onyx-27 ~]# 
            

            5. remount with abort_recovery

            [root@onyx-27 ~]# mount -t lustre -o abort_recovery /dev/sdb1 /mnt/mds1
            [ 1098.170025] LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
            [ 1098.827347] LustreError: 23424:0:(mdt_handler.c:5840:mdt_iocontrol()) lustre-MDT0000: Aborting recovery for device
            [root@onyx-27 ~]# mountg[ 1103.554471] Lustre: 23226:0:(client.c:2029:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1471453302/real 1471453302]  req@ffff8807fd682d00 x1542930245353916/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.4.74@tcp:28/4 lens 520/544 e 0 to 1 dl 1471453307 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 1103.595823] Lustre: 23226:0:(client.c:2029:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [root@onyx-27 ~]# lr_reader /dev/sdb1
            last_rcvd:
              uuid: lustre-MDT0000_UUID
              feature_compat: 0xc
              feature_incompat: 0x21c
              feature_rocompat: 0x1
              last_transaction: 12884901888
              target_index: 0
              mount_count: 3
            

            6. umount with "-f"

            [root@onyx-27 ~]# umount -f /mnt/mds1
            [ 1321.101538] Lustre: server umount lustre-MDT0000 complete
            [root@onyx-27 ~]# lr_reader /dev/sdb1
            last_rcvd:
              uuid: lustre-MDT0000_UUID
              feature_compat: 0xc
              feature_incompat: 0x21c
              feature_rocompat: 0x1
              last_transaction: 12884901888
              target_index: 0
              mount_count: 3
            [root@onyx-27 ~]# 
            

            7. downgrade the system to EE2.4.2.2 and mount again

            [root@onyx-27 ~]# lr_reader /dev/sdb1
            Reading last_rcvd
            UUID lustre-MDT0000_UUID
            Feature compat=0xc
            Feature incompat=0x21c
            Feature rocompat=0x1
            Last transaction 17179869184
            target index 0
            MDS, index 0
            
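            A quick arithmetic check on the lr_reader values recorded above: while the target is mounted under the newer release the incompat mask reads 0x61c, and after a clean umount (steps 4, 6 and 7) it drops back to 0x21c. The difference is exactly one bit, 0x400, the OBD_INCOMPAT_MULTI_RPCS flag named elsewhere in this ticket. A minimal sketch in Python (values copied from the output above):

```python
# feature_incompat values copied from the lr_reader output above.
INCOMPAT_RUNNING_NEW = 0x61c   # step 3: mounted under EE3.0.1
INCOMPAT_AFTER_UMOUNT = 0x21c  # steps 4, 6, 7: after a clean umount

# The masks differ by exactly one bit: 0x400, the flag that the
# older (2.5.x-based) MDS rejects at mount time.
diff = INCOMPAT_RUNNING_NEW ^ INCOMPAT_AFTER_UMOUNT
print(hex(diff))  # -> 0x400
```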
            sarah Sarah Liu added a comment -

            Hi Niu,

            No, I didn't use "-f". I will try today and get back to you. Thank you!


            niu Niu Yawei (Inactive) added a comment -

            Hi Sarah,

            How did you umount the MDT in step 7?

            6. do additional step, remounting the MDS with abort_recovery option
            7. umount the MDS again
            

            If you didn't use "umount -f", could you try with "umount -f" to see whether the problem can be reproduced?
            yujian Jian Yu added a comment -

            Hi Gregoire,

            I performed a basic clean upgrade/downgrade test from EE 2.4.2.2 (tag 2.5.42.15) to EE 3.0.1.0 (tag 2.7.16.5) with the following steps:

            1. format and mount EE 2.4.2.2 filesystem with 1 MGS/MDS (1 MDT), 1 OSS (1 OST) and 1 Client
              # lr_reader /dev/sdc
              Reading last_rcvd
              UUID lustre-MDT0000_UUID
              Feature compat=0xc
              Feature incompat=0x21c
              Feature rocompat=0x1
              Last transaction 4294967296
              target index 0
              MDS, index 0
              
            2. unmount the whole filesystem
              # lr_reader /dev/sdc
              Reading last_rcvd
              UUID lustre-MDT0000_UUID
              Feature compat=0xc
              Feature incompat=0x21c
              Feature rocompat=0x1
              Last transaction 4294967296
              target index 0
              MDS, index 0
              
            3. upgrade the whole filesystem to EE 3.0.1.0
            4. mount the whole filesystem
            5. unmount the whole filesystem
              # lr_reader /dev/sdc
              last_rcvd:
                uuid: lustre-MDT0000_UUID
                feature_compat: 0xc
                feature_incompat: 0x61c
                feature_rocompat: 0x1
                last_transaction: 8589934592
                target_index: 0
                mount_count: 2
              
            6. remount MDS and OSS with "-o abort_recovery" option
              # lr_reader /dev/sdc
              last_rcvd:
                uuid: lustre-MDT0000_UUID
                feature_compat: 0xc
                feature_incompat: 0x61c
                feature_rocompat: 0x1
                last_transaction: 12884901888
                target_index: 0
                mount_count: 3
              
            7. unmount MDS and OSS
              # lr_reader /dev/sdc
              last_rcvd:
                uuid: lustre-MDT0000_UUID
                feature_compat: 0xc
                feature_incompat: 0x61c
                feature_rocompat: 0x1
                last_transaction: 12884901888
                target_index: 0
                mount_count: 3
              
            8. downgrade the whole filesystem to EE 2.4.2.2
              # lr_reader /dev/sdc
              Reading last_rcvd
              UUID lustre-MDT0000_UUID
              Feature compat=0xc
              Feature incompat=0x61c
              Feature rocompat=0x1
              Last transaction 12884901888
              target index 0
              MDS, index 0
              
            9. mount MDS still failed:
              LustreError: 24312:0:(mdt_recovery.c:263:mdt_server_data_init()) lustre-MDT0000: unsupported incompat filesystem feature(s) 400
              
              # lr_reader /dev/sdc
              Reading last_rcvd
              UUID lustre-MDT0000_UUID
              Feature compat=0xc
              Feature incompat=0x61c
              Feature rocompat=0x1
              Last transaction 12884901888
              target index 0
              MDS, index 0
              

            pichong Gregoire Pichon added a comment -

            Could you add to the script some calls to the command "lr_reader <mdt-target-device>" at different points (between steps 5-6, 6-7, 7-8 and 8-9, for example)?
            This could help identify why the incompatibility flag is not cleared.

            The output of the command looks like this:

            # lr_reader  /dev/sdc
            last_rcvd:
              uuid: fs3-MDT0000_UUID
              feature_compat: 0x8
              feature_incompat: 0x61c
              feature_rocompat: 0x1
              last_transaction: 30064771072
              target_index: 0
              mount_count: 44
            

            The flag OBD_INCOMPAT_MULTI_RPCS = 0x400 can be checked within the feature_incompat value.
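            The flag test described above is a one-line bitwise check. A small Python illustration (the 0x61c/0x21c masks are the values seen in this ticket; the helper name is ours):

```python
OBD_INCOMPAT_MULTI_RPCS = 0x400  # flag named in the comment above

def multi_rpcs_set(feature_incompat):
    """True if the MULTI_RPCS incompat bit is present in the mask."""
    return bool(feature_incompat & OBD_INCOMPAT_MULTI_RPCS)

print(multi_rpcs_set(0x61c))  # True  -> an older MDS refuses to mount
print(multi_rpcs_set(0x21c))  # False -> flag cleared, downgrade-safe
```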
            sarah Sarah Liu added a comment - - edited

            MDS logs before and after downgrade

            update: I tried today with b2_8/build #11, manually running those steps without the script, and it doesn't hit the problem.

            sarah Sarah Liu added a comment - - edited

            the complete test case is:
            1. format and set up the system with 1 MDS (1 MDT), 1 OSS (1 OST) and 2 clients with Lustre 2.5.5 RHEL6.6; create some data
            2. shut down the whole system, umount all nodes
            3. upgrade the whole system to b2_8/build #8; only clear the boot disk, keep the data disk untouched
            4. remount the whole system, check the data; works fine
            5. shut down the whole system again, umount all nodes
            6. additional step: remount the MDS with the abort_recovery option
            7. umount the MDS again
            8. downgrade all servers and clients to 2.5.5 again without touching the data disk
            9. mount MDS failed as above.

            Please see the attachments for more logs ("before" means before downgrade; "after" means after downgrade).


            pichong Gregoire Pichon added a comment -

            Could you provide the complete test case that was executed?
            How is the filesystem laid out (nodes hosting the MGT, MDTs, OSTs, client nodes...)?
            It would also be helpful to provide the full MDS Lustre log, both before and after the downgrade.

            Thanks.
            sarah Sarah Liu added a comment -

            Hello Gregoire,

            I hit the same issue recently, on master/tag-2.7.66 and b2_8/tag-2.7.90. I did remount the MDS with the "abort_recovery" option and umount it again before downgrading; here is what I saw. The same test passed on tag-2.7.64. Do you have any idea why this happens?

            on MDS

            [root@onyx-25 ~]# mount -t lustre -o abort_recovery /dev/sdb1 /mnt/mds1
            LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
            Lustre: MGS: Connection restored to MGC10.2.4.47@tcp_0 (at 0@lo)
            Lustre: Skipped 4 previous similar messages
            LustreError: 45919:0:(mdt_handler.c:5735:mdt_iocontrol()) lustre-MDT0000: Aborting recovery for device
            [root@onyx-25 ~]# mount
            /dev/sda1 on / type ext3 (rw)
            proc on /proc type proc (rw)
            sysfs on /sys type sysfs (rw)
            devpts on /dev/pts type devpts (rw,gid=5,mode=620)
            tmpfs on /dev/shm type tmpfs (rw)
            none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
            sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
            nfsd on /proc/fs/nfsd type nfsd (rw)
            /dev/sdb1 on /mnt/mds1 type lustre (rw,abort_recovery)
            [root@onyx-25 ~]# Lustre: 23885:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455732424/real 1455732424]  req@ffff8808074dfcc0 x1526383585120268/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.4.56@tcp:28/4 lens 520/544 e 0 to 1 dl 1455732429 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: lustre-MDT0000: Connection restored to MGC10.2.4.47@tcp_0 (at 0@lo)
            
            
            [root@onyx-25 ~]# umount /mnt/mds1
            Lustre: Failing over lustre-MDT0000
            
            Lustre: 46030:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455732461/real 1455732461]  req@ffff88080d158cc0 x1526383585120452/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1455732467 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: server umount lustre-MDT0000 complete
            [root@onyx-25 ~]# 
            [root@onyx-25 ~]# mount
            /dev/sda1 on / type ext3 (rw)
            proc on /proc type proc (rw)
            sysfs on /sys type sysfs (rw)
            devpts on /dev/pts type devpts (rw,gid=5,mode=620)
            tmpfs on /dev/shm type tmpfs (rw)
            none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
            sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
            nfsd on /proc/fs/nfsd type nfsd (rw)
            

            dmesg of MDS

            LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
            Lustre: MGC10.2.4.47@tcp: Connection restored to MGS (at 0@lo)
            Lustre: lustre-MDT0000: used disk, loading
            LustreError: 10899:0:(mdt_recovery.c:263:mdt_server_data_init()) lustre-MDT0000: unsupported incompat filesystem feature(s) 400
            LustreError: 10899:0:(obd_config.c:572:class_setup()) setup lustre-MDT0000 failed (-22)
            LustreError: 10899:0:(obd_config.c:1629:class_config_llog_handler()) MGC10.2.4.47@tcp: cfg command failed: rc = -22
            Lustre:    cmd=cf003 0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f  
            LustreError: 15b-f: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000'failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
            LustreError: 15c-8: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            LustreError: 10804:0:(obd_mount_server.c:1254:server_start_targets()) failed to start server lustre-MDT0000: -22
            LustreError: 10804:0:(obd_mount_server.c:1737:server_fill_super()) Unable to start targets: -22
            LustreError: 10804:0:(obd_mount_server.c:847:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
            LustreError: 10804:0:(obd_mount_server.c:1422:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
            LustreError: 10804:0:(obd_config.c:619:class_cleanup()) Device 5 not setup
            Lustre: 10804:0:(client.c:1943:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455737214/real 1455737214]  req@ffff8808181aec00 x1526451090227240/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1455737220 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: server umount lustre-MDT0000 complete
            LustreError: 10804:0:(obd_mount.c:1330:lustre_fill_super()) Unable to mount  (-22)
            Lustre: DEBUG MARKER: Using TIMEOUT=100
            Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
            Slow work thread pool: Starting up
            Slow work thread pool: Ready
            FS-Cache: Loaded
            NFS: Registering the id_resolver key type
            FS-Cache: Netfs 'nfs' registered for caching
            LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
            Lustre: MGC10.2.4.47@tcp: Connection restored to MGS (at 0@lo)
            Lustre: lustre-MDT0000: used disk, loading
            LustreError: 34262:0:(mdt_recovery.c:263:mdt_server_data_init()) lustre-MDT0000: unsupported incompat filesystem feature(s) 400
            LustreError: 34262:0:(obd_config.c:572:class_setup()) setup lustre-MDT0000 failed (-22)
            LustreError: 34262:0:(obd_config.c:1629:class_config_llog_handler()) MGC10.2.4.47@tcp: cfg command failed: rc = -22
            Lustre:    cmd=cf003 0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f  
            LustreError: 15b-f: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000'failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
            LustreError: 15c-8: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            LustreError: 34110:0:(obd_mount_server.c:1254:server_start_targets()) failed to start server lustre-MDT0000: -22
            LustreError: 34110:0:(obd_mount_server.c:1737:server_fill_super()) Unable to start targets: -22
            LustreError: 34110:0:(obd_mount_server.c:847:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
            LustreError: 34110:0:(obd_mount_server.c:1422:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
            LustreError: 34110:0:(obd_config.c:619:class_cleanup()) Device 5 not setup
            Lustre: 34110:0:(client.c:1943:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455737367/real 1455737367]  req@ffff880412fb3800 x1526451090227280/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1455737373 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: server umount lustre-MDT0000 complete
            LustreError: 34110:0:(obd_mount.c:1330:lustre_fill_super()) Unable to mount  (-22)
            Lustre: DEBUG MARKER: Using TIMEOUT=100
            Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
            [root@onyx-25 ~]# 
            
            sarah Sarah Liu added a comment - Hello Gregoire, I hit the same issue recently on master/tag-2.7.66 and b2_8/tag-2.7.90. I remounted the MDS with the "abort_recovery" option and unmounted it again before downgrading; here is what I saw. The same test passed on tag-2.7.64. Do you have any idea why this happens?

            on MDS:

            [root@onyx-25 ~]# mount -t lustre -o abort_recovery /dev/sdb1 /mnt/mds1
            LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
            Lustre: MGS: Connection restored to MGC10.2.4.47@tcp_0 (at 0@lo)
            Lustre: Skipped 4 previous similar messages
            LustreError: 45919:0:(mdt_handler.c:5735:mdt_iocontrol()) lustre-MDT0000: Aborting recovery for device
            [root@onyx-25 ~]# mount
            /dev/sda1 on / type ext3 (rw)
            proc on /proc type proc (rw)
            sysfs on /sys type sysfs (rw)
            devpts on /dev/pts type devpts (rw,gid=5,mode=620)
            tmpfs on /dev/shm type tmpfs (rw)
            none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
            sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
            nfsd on /proc/fs/nfsd type nfsd (rw)
            /dev/sdb1 on /mnt/mds1 type lustre (rw,abort_recovery)
            [root@onyx-25 ~]# Lustre: 23885:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455732424/real 1455732424] req@ffff8808074dfcc0 x1526383585120268/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.4.56@tcp:28/4 lens 520/544 e 0 to 1 dl 1455732429 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: lustre-MDT0000: Connection restored to MGC10.2.4.47@tcp_0 (at 0@lo)
            [root@onyx-25 ~]# umount /mnt/mds1
            Lustre: Failing over lustre-MDT0000
            Lustre: 46030:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455732461/real 1455732461] req@ffff88080d158cc0 x1526383585120452/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1455732467 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: server umount lustre-MDT0000 complete
            [root@onyx-25 ~]# 
            [root@onyx-25 ~]# mount
            /dev/sda1 on / type ext3 (rw)
            proc on /proc type proc (rw)
            sysfs on /sys type sysfs (rw)
            devpts on /dev/pts type devpts (rw,gid=5,mode=620)
            tmpfs on /dev/shm type tmpfs (rw)
            none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
            sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
            nfsd on /proc/fs/nfsd type nfsd (rw)

            dmesg of MDS:

            LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
            Lustre: MGC10.2.4.47@tcp: Connection restored to MGS (at 0@lo)
            Lustre: lustre-MDT0000: used disk, loading
            LustreError: 10899:0:(mdt_recovery.c:263:mdt_server_data_init()) lustre-MDT0000: unsupported incompat filesystem feature(s) 400
            LustreError: 10899:0:(obd_config.c:572:class_setup()) setup lustre-MDT0000 failed (-22)
            LustreError: 10899:0:(obd_config.c:1629:class_config_llog_handler()) MGC10.2.4.47@tcp: cfg command failed: rc = -22
            Lustre:    cmd=cf003 0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f  
            LustreError: 15b-f: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000' failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
            LustreError: 15c-8: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            LustreError: 10804:0:(obd_mount_server.c:1254:server_start_targets()) failed to start server lustre-MDT0000: -22
            LustreError: 10804:0:(obd_mount_server.c:1737:server_fill_super()) Unable to start targets: -22
            LustreError: 10804:0:(obd_mount_server.c:847:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
            LustreError: 10804:0:(obd_mount_server.c:1422:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
            LustreError: 10804:0:(obd_config.c:619:class_cleanup()) Device 5 not setup
            Lustre: 10804:0:(client.c:1943:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455737214/real 1455737214] req@ffff8808181aec00 x1526451090227240/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1455737220 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: server umount lustre-MDT0000 complete
            LustreError: 10804:0:(obd_mount.c:1330:lustre_fill_super()) Unable to mount (-22)
            Lustre: DEBUG MARKER: Using TIMEOUT=100
            Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
            Slow work thread pool: Starting up
            Slow work thread pool: Ready
            FS-Cache: Loaded
            NFS: Registering the id_resolver key type
            FS-Cache: Netfs 'nfs' registered for caching
            LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
            Lustre: MGC10.2.4.47@tcp: Connection restored to MGS (at 0@lo)
            Lustre: lustre-MDT0000: used disk, loading
            LustreError: 34262:0:(mdt_recovery.c:263:mdt_server_data_init()) lustre-MDT0000: unsupported incompat filesystem feature(s) 400
            LustreError: 34262:0:(obd_config.c:572:class_setup()) setup lustre-MDT0000 failed (-22)
            LustreError: 34262:0:(obd_config.c:1629:class_config_llog_handler()) MGC10.2.4.47@tcp: cfg command failed: rc = -22
            Lustre:    cmd=cf003 0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f  
            LustreError: 15b-f: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000' failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
            LustreError: 15c-8: MGC10.2.4.47@tcp: The configuration from log 'lustre-MDT0000' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            LustreError: 34110:0:(obd_mount_server.c:1254:server_start_targets()) failed to start server lustre-MDT0000: -22
            LustreError: 34110:0:(obd_mount_server.c:1737:server_fill_super()) Unable to start targets: -22
            LustreError: 34110:0:(obd_mount_server.c:847:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
            LustreError: 34110:0:(obd_mount_server.c:1422:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
            LustreError: 34110:0:(obd_config.c:619:class_cleanup()) Device 5 not setup
            Lustre: 34110:0:(client.c:1943:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1455737367/real 1455737367] req@ffff880412fb3800 x1526451090227280/t0(0) o251->MGC10.2.4.47@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1455737373 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: server umount lustre-MDT0000 complete
            LustreError: 34110:0:(obd_mount.c:1330:lustre_fill_super()) Unable to mount (-22)
            Lustre: DEBUG MARKER: Using TIMEOUT=100
            Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: NAME=ncli not mounted
            [root@onyx-25 ~]#
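            The "400" in the mdt_server_data_init() error above is a hexadecimal bitmask of on-disk incompatible feature flags (kept in the MDT's last_rcvd server data) that the 2.5.5 code does not recognize. A minimal sketch of how to read that mask, assuming the flag table from lustre/include/lustre_disk.h in the master/2.8 tree, where 0x400 is OBD_INCOMPAT_MULTI_RPCS (the multiple-modify-RPCs-in-flight feature); the table and function name below are illustrative, not part of Lustre:

```python
# Hedged sketch (not Lustre code): interpret the "unsupported incompat
# filesystem feature(s) 400" value logged by mdt_server_data_init().
# The flag name is an assumption taken from lustre/include/lustre_disk.h
# in the master branch, where 0x400 = OBD_INCOMPAT_MULTI_RPCS.
INCOMPAT_FLAGS = {
    0x400: "OBD_INCOMPAT_MULTI_RPCS",  # multiple modify RPCs in flight (2.8)
}

def decode_incompat(mask_hex: str) -> list:
    """Return the names of incompat feature bits set in the logged mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in sorted(INCOMPAT_FLAGS.items()) if mask & bit]

print(decode_incompat("400"))  # → ['OBD_INCOMPAT_MULTI_RPCS']
```

            This is consistent with the reproducer: the flag is set once a 2.8-stream server has mounted the target, and a 2.5.5 server then refuses the device with -22.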

            People

              Assignee: pichong Gregoire Pichon
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 10
