LU-8608: Rolling upgrade between 2.8.x and master failed: Upon upgrading OSS, OSS restarts when mounted

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.9.0
    • Environment: Rolling Upgrade: Old version - b2_8_fe build# 25; New version - master build# 3431
    • Severity: 3

    Description

      While performing rolling-upgrade testing, the OSS restarted when the target was mounted after the upgrade.
      The following steps were taken (a command sketch follows the list):
      1. The OSS, MDS, and 2 clients were built with b2_8_fe build# 25 and the Lustre file system was set up.
      2. The OST was unmounted and the OSS was upgraded to master build# 3431.
      3. After the upgrade on the OSS was complete, the target was mounted back.
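
      A minimal sketch of the commands behind steps 2 and 3, assuming an RPM-based install (the package names and the reboot are assumptions; the device and mount point are taken from the log below):

      umount /mnt/ost0                                        # step 2: unmount the OST
      yum upgrade 'lustre*' 'kmod-lustre*'                    # step 2: upgrade OSS packages to the new build (names assumed)
      reboot                                                  # boot the kernel shipped with the new build
      mount -t lustre -o acl,user_xattr /dev/sdb1 /mnt/ost0   # step 3: mount the target back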

      Upon mounting, the OSS restarted abruptly.
      The following is the OSS console log from when the mount command was run (the trailing [    0.000000] boot messages show the node restarting):

      [root@onyx-26 ~]# mount -t lustre -o acl,user_xattr /dev/sdb1 /mnt/ost0
      mount.lustre: increased /sys/block/sdb/queue/max_sectors_kb from 512 to 16384
      mount.lustre: change scheduler of /sys/block/sdb/queue/scheduler from cfq to deadline
      [   79.285538] libcfs: module verification failed: signature and/or required key missing - tainting kernel
      [   79.302042] LNet: HW CPU cores: 32, npartitions: 4
      [   79.311423] alg: No test for adler32 (adler32-zlib)
      [   79.318433] alg: No test for crc32 (crc32-table)
      [   87.529705] Lustre: Lustre: Build Version: 2.8.57
      [   87.721568] LNet: Added LNI 10.2.4.56@tcp [8/256/0/180]
      [   87.728741] LNet: Accept secure, port 988
      [   88.022628] LDISKFS-fs (sdb1): file extents enabled, maximum tree depth=5
      [   88.426512] LDISKFS-fs (sdb1): recovery complete
      [   88.485928] LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: acl,user_xattr,,errors=remount-ro,no_mbcache
      [   88.864640] LustreError: 3112:0:(mgc_request.c:257:do_config_log_add()) MGC10.2.4.47@tcp: failed processing log, type 4: rc = -22
      [   88.971376] LustreError: 3368:0:(nodemap_storage.c:368:nodemap_idx_nodemap_add_update()) cannot add nodemap config to non-existing MGS.
      [   88.988471] LustreError: 3368:0:(nodemap_storage.c:1313:nodemap_fs_init()) lustre-OST0000: error loading nodemap config file, file must be removed via ldiskfs: rc = -22
      [   89.067996] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) header@ffff8800b67832c0[0x0, 1, [0x1:0x0:0x0] hash exist]{
      [   89.085810] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....local_storage@ffff8800b6783310
      [   89.101070] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....osd-ldiskfs@ffff880035899c00osd-ldiskfs-object@ffff880035899c00(i:ffff880410851e88:81/3977440011)[plain]
      [   89.125243] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) } header@ffff8800b67832c0
      [   89.139953] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) header@ffff880823297380[0x0, 1, [0x200000003:0x0:0x0] hash exist]{
      [   89.159766] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....local_storage@ffff8808232973d0
      [   89.174780] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....osd-ldiskfs@ffff880426ff8500osd-ldiskfs-object@ffff880426ff8500(i:ffff880426368d88:80/3977439977)[plain]
      [   89.198510] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) } header@ffff880823297380
      [   89.213998] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) header@ffff8800b6782b40[0x0, 1, [0xa:0x0:0x0] hash exist]{
      [   89.231128] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....local_storage@ffff8800b6782b90
      [   89.245899] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....osd-ldiskfs@ffff880035899400osd-ldiskfs-object@ffff880035899400(i:ffff88041085af88:82/3977440045)[plain]
      [   89.269322] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) } header@ffff8800b6782b40
      [   89.283367] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) header@ffff880802cd1c80[0x0, 1, [0x200000003:0x8:0x0] hash exist]{
      [   89.302572] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....local_storage@ffff880802cd1cd0
      [   89.317098] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....osd-ldiskfs@ffff880823d2d900osd-ldiskfs-object@ffff880823d2d900(i:ffff8808163400c8:98/2123498910)[lfix]
      [   89.340058] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) } header@ffff880802cd1c80
      [   89.355308] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) header@ffff8800b67829c0[0x0, 1, [0xa:0xa:0x0] hash exist]{
      [   89.372243] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....local_storage@ffff8800b6782a10
      [   89.386873] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....osd-ldiskfs@ffff880035899f00osd-ldiskfs-object@ffff880035899f00(i:ffff88041085b808:83/2755944006)[plain]
      [   89.410071] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) } header@ffff8800b67829c0
      [   89.424408] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) header@ffff8800b6782c00[0x0, 1, [0x200000001:0x1017:0x0] hash exist]{
      [   89.443666] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....local_storage@ffff8800b6782c50
      [   89.458132] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) ....osd-ldiskfs@ffff880035899d00osd-ldiskfs-object@ffff880035899d00(i:ffff880035f15a08:12/2606405092)[plain]
      [   89.481146] LustreError: 3368:0:(ofd_dev.c:248:ofd_stack_fini()) } header@ffff8800b6782c00
      [    0.000000] Initializing cgroup subsys cpuset
      [    0.000000] Initializing cgroup subsys cpu
      [    0.000000] Initializing cgroup subsys cpuacct
      [    0.000000] Linux version 3.10.0-327.28.2.el7_lustre.x86_64 (jenkins@onyx-1-sdh1-el7-x8664.onyx.hpdd.intel.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Sep 1 10:55:39 PDT 2016
      

      Not sure whether it is related to LU-8498.
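
      For reference: the nodemap_fs_init() message above says the config "file must be removed via ldiskfs". A hedged sketch of that cleanup, assuming the on-disk file is named "nodemap" in the target's root directory (run only while the target is unmounted from Lustre):

      mount -t ldiskfs /dev/sdb1 /mnt/tmp   # mount the OST backing filesystem directly
      rm /mnt/tmp/nodemap                   # remove the stale nodemap config file (filename assumed)
      umount /mnt/tmp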

      Attachments

        1. debug_log_mds.txt
          6 kB
        2. mgs.log
          442 kB
        3. oss.log
          648 kB

        Activity

          standan Saurabh Tandan (Inactive) added a comment - edited

          As the error message is expected, I am closing the ticket.

          standan Saurabh Tandan (Inactive) added a comment

          No, there are no functionality problems; the OSS mounts okay.

          kit.westneat Kit Westneat (Inactive) added a comment

          Actually, I think this is to be expected if the MGS is at 2.8. Does the OSS mount OK, or are there functionality problems?
          standan Saurabh Tandan (Inactive) added a comment - edited

          MDS debug_log file attached. I have the OSS debug_log file as well, but it is too big and is not getting attached. In case you want that too, please let me know and I will send it to you some other way.

          kit.westneat Kit Westneat (Inactive) added a comment

          Hi Saurabh,

          Ah, these look like the dmesg logs. Do you have the Lustre debug logs, i.e. the logs generated by the lctl debug_kernel command? I'll need the trace and info log levels enabled in order to see what's going on.

          Thanks,
          Kit
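
          For reference, a minimal sketch of collecting such logs (the output path is arbitrary; debug=-1, mentioned in the comment below, enables all levels at once):

          lctl set_param debug=+trace           # add the trace debug level
          lctl set_param debug=+info            # add the info debug level
          # ... reproduce the failing mount ...
          lctl debug_kernel /tmp/oss-debug.log  # dump the kernel debug buffer to a file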

          standan Saurabh Tandan (Inactive) added a comment - edited

          Hi Kit,
          I have attached the log files for both the MGS and the OSS above. I also have the system set up currently. Please let me know in case you need any more information.
          Thanks!

          kit.westneat Kit Westneat (Inactive) added a comment

          Hi Saurabh,

          Sorry for the delay in responding. Do you have the -1 debug logs (or trace and info) from the MGS and the OSS? I'm not sure why it'd be returning an error.

          Thanks,
          Kit

          standan Saurabh Tandan (Inactive) added a comment

          Hi Kit,
          I tried the testing with the patch mentioned above. The mount worked and the system did not restart this time. But I could see a LustreError message in the logs while the OST was mounting. Is there any extra work needed for this?

          [root@onyx-26 ~]# mount -t lustre -o acl,user_xattr /dev/sdb1 /mnt/ost0
          mount.lustre: increased /sys/block/sdb/queue/max_sectors_kb from 512 to 16384
          mount.lustre: change scheduler of /sys/block/sdb/queue/scheduler from cfq to deadline
          [ 2836.318943] libcfs: module verification failed: signature and/or required key missing - tainting kernel
          [ 2836.333593] LNet: HW CPU cores: 32, npartitions: 4
          [ 2836.343150] alg: No test for adler32 (adler32-zlib)
          [ 2836.348967] alg: No test for crc32 (crc32-table)
          [ 2844.384607] Lustre: Lustre: Build Version: 2.8.57_22_g5cb1549
          [ 2844.422845] LNet: Added LNI 10.2.4.56@tcp [8/256/0/180]
          [ 2844.428845] LNet: Accept secure, port 988
          [ 2844.498034] LDISKFS-fs (sdb1): file extents enabled, maximum tree depth=5
          [ 2844.525873] LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: acl,user_xattr,,errors=remount-ro,no_mbcache
          [ 2844.883460] LustreError: 38233:0:(mgc_request.c:253:do_config_log_add()) MGC10.2.4.47@tcp: failed processing log, type 4: rc = -22
          [ 2845.382949] Lustre: lustre-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
          [root@onyx-26 ~]# [ 2852.091748] Lustre: lustre-OST0000: Will be in recovery for at least 2:30, or until 3 clients reconnect
          [ 2852.102484] Lustre: lustre-OST0000: Connection restored to b0ab0605-5282-cb64-ddd3-483f2393ac20 (at 10.2.4.36@tcp)
          [ 2853.948292] Lustre: lustre-OST0000: Connection restored to lustre-MDT0000-mdtlov_UUID (at 10.2.4.47@tcp)
          [ 2895.399155] Lustre: lustre-OST0000: Connection restored to 15ce59bd-a3c6-167b-84dd-730a88c0fe5f (at 10.2.4.37@tcp)
          [ 2895.801444] Lustre: lustre-OST0000: Recovery over after 0:44, of 3 clients 3 recovered and 0 were evicted.
          [ 2895.830113] Lustre: lustre-OST0000: deleting orphan objects from 0x0:4 to 0x0:33

          Thanks!

          standan Saurabh Tandan (Inactive) added a comment

          I will try it out with this patch.

          kit.westneat Kit Westneat (Inactive) added a comment

          Hi Peter,

          This looks like a dupe of the second issue in LU-8508: https://jira.hpdd.intel.com/browse/LU-8508?focusedCommentId=162247&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-162247

          Is it possible to test with this patch? http://review.whamcloud.com/#/c/22004/

          Thanks,
          Kit
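
          For reference, a sketch of fetching that change from Gerrit for a test build (the patch-set suffix /1 is an assumption; the URL layout follows the standard Whamcloud Gerrit workflow):

          git clone git://git.whamcloud.com/fs/lustre-release.git
          cd lustre-release
          git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/04/22004/1   # change 22004, patch set 1 (assumed)
          git checkout FETCH_HEAD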

          People

            Assignee: kit.westneat Kit Westneat (Inactive)
            Reporter: standan Saurabh Tandan (Inactive)
            Votes: 0
            Watchers: 4
