LU-2888

After downgrade from 2.4 to 2.1.4, hit (osd_handler.c:2343:osd_index_try()) ASSERTION( dt_object_exists(dt) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.6
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Labels: None
    • Environment:
      before upgrade, server and client: 2.1.4 RHEL6
      after upgrade, server and client: lustre-master build# 1270 RHEL6
    • Severity: 3
    • 6970

    Description

      Here is what I did:
      1. format the system as 2.1.4 and then upgrade to 2.4: success.
      2. shut down the filesystem and disable quota.
      3. downgrade the system to 2.1.4 again; when mounting the MDS, hit the following errors (the flow is sketched after this list).
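
      A minimal sketch of the flow above, assuming a single MDS device /dev/sdb1 mounted at /mnt/mds1; the exact commands and package steps are not in this ticket:

      # Rough reproduction sketch, not the exact commands from the test run.
      # Device name, mount point and the quota-disable parameter are assumptions.
      mkfs.lustre --fsname=lustre --mgs --mdt /dev/sdb1   # step 1: format under 2.1.4
      mount -t lustre /dev/sdb1 /mnt/mds1
      # ... upgrade server/client packages to lustre-master build# 1270 (2.4),
      #     remount and verify: this direction works ...
      lctl conf_param lustre.quota.mdt=none               # step 2: disable quota (2.4-style syntax, assumed)
      umount /mnt/mds1                                    #         and shut the filesystem down
      # ... downgrade the packages back to 2.1.4 ...
      mount -t lustre /dev/sdb1 /mnt/mds1                 # step 3: the MDS mount hits the osd_index_try() LBUG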

      Here is the MDS console output:

      Lustre: DEBUG MARKER: == upgrade-downgrade End == 18:53:45 (1362020025)
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
      Lustre: MGS MGS started
      Lustre: 7888:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 7306ea48-8511-52b2-40cf-6424fc417e41@0@lo t0 exp (null) cur 1362020029 last 0
      Lustre: MGC10.10.4.132@tcp: Reactivating import
      Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs.
      Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
      Lustre: Setting parameter lustre-clilov.lov.stripesize in log lustre-client
      Lustre: Enabling ACL
      Lustre: Enabling user_xattr
      LustreError: 7901:0:(osd_handler.c:2343:osd_index_try()) ASSERTION( dt_object_exists(dt) ) failed: 
      LustreError: 7901:0:(osd_handler.c:2343:osd_index_try()) LBUG
      Pid: 7901, comm: llog_process_th
      
      Call Trace:
       [<ffffffffa03797f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0379e07>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa0d6bd74>] osd_index_try+0x84/0x540 [osd_ldiskfs]
       [<ffffffffa04c1dfe>] dt_try_as_dir+0x3e/0x60 [obdclass]
       [<ffffffffa0c5eb3a>] orph_index_init+0x6a/0x1e0 [mdd]
       [<ffffffffa0c6ec45>] mdd_prepare+0x1d5/0x640 [mdd]
       [<ffffffffa0ccd23c>] ? mdt_process_config+0x6c/0x1030 [mdt]
       [<ffffffffa0da0499>] cmm_prepare+0x39/0xe0 [cmm]
       [<ffffffffa0ccfd7d>] mdt_device_alloc+0xe0d/0x2190 [mdt]
       [<ffffffffa04bdeff>] ? keys_fill+0x6f/0x1a0 [obdclass]
       [<ffffffffa04a2c87>] obd_setup+0x1d7/0x2f0 [obdclass]
       [<ffffffffa048ef3b>] ? class_new_export+0x72b/0x960 [obdclass]
       [<ffffffffa04a2fa8>] class_setup+0x208/0x890 [obdclass]
       [<ffffffffa04aac6c>] class_process_config+0xc3c/0x1c30 [obdclass]
       [<ffffffffa037a993>] ? cfs_alloc+0x63/0x90 [libcfs]
       [<ffffffffa04a5813>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
       [<ffffffffa04acd0b>] class_config_llog_handler+0x9bb/0x1610 [obdclass]
       [<ffffffffa0637e3b>] ? llog_client_next_block+0x1db/0x4b0 [ptlrpc]
       [<ffffffffa0478098>] llog_process_thread+0x888/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      
      Kernel panic - not syncing: LBUG
      Pid: 7901, comm: llog_process_th Not tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1
      Call Trace:
      
       [<ffffffff814fdcba>] ? panic+0xa0/0x168
       [<ffffffffa0379e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa0d6bd74>] ? osd_index_try+0x84/0x540 [osd_ldiskfs]
       [<ffffffffa04c1dfe>] ? dt_try_as_dir+0x3e/0x60 [obdclass]
       [<ffffffffa0c5eb3a>] ? orph_index_init+0x6a/0x1e0 [mdd]
       [<ffffffffa0c6ec45>] ? mdd_prepare+0x1d5/0x640 [mdd]
       [<ffffffffa0ccd23c>] ? mdt_process_config+0x6c/0x1030 [mdt]
       [<ffffffffa0da0499>] ? cmm_prepare+0x39/0xe0 [cmm]
       [<ffffffffa0ccfd7d>] ? mdt_device_alloc+0xe0d/0x2190 [mdt]
       [<ffffffffa04bdeff>] ? keys_fill+0x6f/0x1a0 [obdclass]
       [<ffffffffa04a2c87>] ? obd_setup+0x1d7/0x2f0 [obdclass]
       [<ffffffffa048ef3b>] ? class_new_export+0x72b/0x960 [obdclass]
       [<ffffffffa04a2fa8>] ? class_setup+0x208/0x890 [obdclass]
       [<ffffffffa04aac6c>] ? class_process_config+0xc3c/0x1c30 [obdclass]
       [<ffffffffa037a993>] ? cfs_alloc+0x63/0x90 [libcfs]
       [<ffffffffa04a5813>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
       [<ffffffffa04acd0b>] ? class_config_llog_handler+0x9bb/0x1610 [obdclass]
       [<ffffffffa0637e3b>] ? llog_client_next_block+0x1db/0x4b0 [ptlrpc]
       [<ffffffffa0478098>] ? llog_process_thread+0x888/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c14a>] ? child_rip+0xa/0x20
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Initializing cgroup subsys cpuset
      Initializing cgroup subsys cpu
      

    Attachments

    Issue Links

    Activity

            [LU-2888] After downgrade from 2.4 to 2.1.4, hit (osd_handler.c:2343:osd_index_try()) ASSERTION( dt_object_exists(dt) ) failed
            bobijam Zhenyu Xu added a comment -

            Filed a ticket (LU-3141) for the 'empty file' issue.

            bobijam Zhenyu Xu added a comment -

            Strangely, I don't even need the master version: with b2_1 alone, format & mount & copy files & umount & mount it again, and all files become size 0.

            BTW, I'm using the current b2_1 Lustre code with the linux-2.6.32-279.22.1.el6 kernel.


            bzzz Alex Zhuravlev added a comment -

            IIRC, Li Wei did put a function to create such an image in conf-sanity.sh.

            bobijam Zhenyu Xu added a comment - - edited

            Is it possible that the disk2_1-ldiskfs.tar.bz2 image was created with an outdated but compatible b2_1 version? I just tested with the current b2_1 and master branches; the procedure is as follows (a rough sketch follows the list):

            1. create a Lustre filesystem with the b2_1 version
            2. copy /etc/* to this filesystem
            3. umount it
            4. mount the filesystem with the master version: succeeded
            5. 'ls -l' the filesystem: all files are there, but all are size 0 with no content.
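
            A rough sketch of that cross-version check (not from the ticket; the NID, mount points and file set are placeholders), along the lines of the md5sum verification mentioned in the next comment:

            # Sketch only; NIDs, devices and the file set are placeholders,
            # not taken from the actual test run.
            # (b2_1 servers: MGS/MDT and OST(s) formatted and mounted as usual)
            mount -t lustre $MGSNID:/lustre /mnt/lustre           # b2_1 client
            cp -a /etc/* /mnt/lustre/
            (cd /mnt/lustre && find . -type f | xargs md5sum > /tmp/before.md5)
            umount /mnt/lustre
            # ... stop the servers, switch the packages to the master branch, restart ...
            mount -t lustre $MGSNID:/lustre /mnt/lustre
            ls -l /mnt/lustre                                     # files present but size 0
            (cd /mnt/lustre && md5sum -c /tmp/before.md5)         # fails: content lost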


            bzzz Alex Zhuravlev added a comment -

            This is rather strange... I do remember Li Wei improved conf-sanity test 32 to verify actual data with md5sum.


            adilger Andreas Dilger added a comment -

            Bobijam, that sounds like a very critical problem. Does conf-sanity test 32 not detect this problem during the 2.1 to 2.4 upgrade? Is that problem repeatable?

            Please file a separate bug for that problem and make it a 2.4 blocker until it is better understood and fixed.

            bobijam Zhenyu Xu added a comment -

            Actually, I found that when a 2.1-formatted disk is upgraded to 2.4, all files' sizes become 0 and their content is lost.

            bobijam Zhenyu Xu added a comment -

            This log shows that the 2.4 MDT start cannot find the old llog objects and creates new ones in the 2.4 format (using llog_osd_ops), which the 2.1 code (using llog_lvfs_ops) cannot recognise.

            yujian Jian Yu added a comment -

            So is my reading right that the test actually failed because all OSCs are in an inactive state, so it won't be possible to create any new files on such a filesystem?

            Right, touching a new file on a Lustre client failed as follows:

            # touch /mnt/lustre/file
            touch: cannot touch `/mnt/lustre/file': Input/output error
            

            I do not see any attempt by the test to actually create anything post-downgrade, so perhaps it's a case we are missing?

            The upgrade/downgrade testing needs to be performed as per the wiki page https://wiki.hpdd.intel.com/display/ENG/Upgrade+and+Downgrade+Testing. As we can see there, the data creation/verification steps are included in the post-downgrade phases. However, these test cases are not currently covered by a single test script. The script which detected the issue in this ticket was upgrade-downgrade.sh, which covered extra quota and OST pool testing along the upgrade/downgrade path.

            The issue is that while running this script for downgrade testing, the extra quota testing was disabled (due to the new way of setting quotas on the master branch), so only OST pool testing was covered along the downgrade path, and it only verified that the existing files/directories could be accessed and that the striping info was correct; it did not create new files.

            So, in order to cover all of the test cases in the wiki page, we need to improve upgrade-downgrade.sh to make the quota code work on the master branch, and we also need to run the other two test scripts, {clean,rolling}-upgrade-downgrade.sh, until the test cases they cover are added into upgrade-downgrade.sh. The kind of post-downgrade create/verify step that is missing is sketched below.
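
            A sketch of such a post-downgrade create/verify step (the function name and paths are hypothetical, not existing code in upgrade-downgrade.sh):

            # Hypothetical post-downgrade check; names and paths are illustrative.
            post_downgrade_write_check() {
                local dir=/mnt/lustre/post-downgrade.$$
                mkdir -p $dir || return 1
                # creating a new file exercises the MDT/OSC state after downgrade;
                # in this ticket it failed with EIO because all OSCs were inactive
                dd if=/dev/urandom of=$dir/newfile bs=1M count=4 || return 1
                md5sum $dir/newfile > $dir/newfile.md5
                sync
                md5sum -c $dir/newfile.md5 || return 1
            }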

            bobijam Zhenyu Xu added a comment -

            Without the patches that introduce on-disk changes being ported backward, the upgrade-then-downgrade test will be a headache.

            bobijam Zhenyu Xu added a comment -

            I don't know about the test case, but the latest error has something to do with the CATALOGS file changing in 2.4.

            The CATALOGS file written by 2.1 is as follows (the logid is i_ino + __u64 0x0 + i_generation; a decoding sketch under this layout follows the dumps):

            # od -x /mnt/mds1/CATALOGS
              0000000 0021 0000 0000 0000 0000 0000 0000 0000
              0000020 2a9d 1d4f 0000 0000 0000 0000 0000 0000
              0000040 0022 0000 0000 0000 0000 0000 0000 0000
              0000060 2a9e 1d4f 0000 0000 0000 0000 0000 0000
              0000100 0023 0000 0000 0000 0000 0000 0000 0000
              0000120 2a9f 1d4f 0000 0000 0000 0000 0000 0000
              0000140

            After 2.4 has mounted it, the CATALOGS logid entries change to:

            # od -x /mnt/mds1/CATALOGS
              0000000 0002 0000 0000 0000 0001 0000 0000 0000
              0000020 0000 0000 0000 0000 0000 0000 0000 0000
              0000040 0004 0000 0000 0000 0001 0000 0000 0000
              0000060 0000 0000 0000 0000 0000 0000 0000 0000
              0000100 0006 0000 0000 0000 0001 0000 0000 0000
              0000120 0000 0000 0000 0000 0000 0000 0000 0000
              0000140
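
            A small decoding sketch under the 2.1 interpretation above (assumed layout per 32-byte entry: 64-bit oid at offset 0, 64-bit zero at offset 8, 32-bit generation at offset 16); this is illustrative, not a tool shipped with Lustre:

            # Illustrative decoder for CATALOGS entries under the 2.1 layout.
            decode_catalogs_2_1() {
                local file=$1 off=0 size
                size=$(stat -c %s "$file")
                while [ "$off" -lt "$size" ]; do
                    set -- $(od -A n -t x8 -j "$off" -N 16 "$file")        # oid, ogr
                    local oid=$1 ogr=$2
                    set -- $(od -A n -t x4 -j $((off + 16)) -N 4 "$file")  # generation
                    printf 'entry@%-3d oid(i_ino)=0x%s ogr=0x%s ogen(i_generation)=0x%s\n' \
                           "$off" "$oid" "$ogr" "$1"
                    off=$((off + 32))
                done
            }
            # e.g. decode_catalogs_2_1 /mnt/mds1/CATALOGS
            # On the 2.1 dump above this yields oid 0x21/0x22/0x23 with generations
            # 0x1d4f2a9d..0x1d4f2a9f; after the 2.4 mount the same offsets decode as
            # FID-like values (oid 2/4/6, sequence 1), which 2.1 cannot map back to inodes.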

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: