Lustre / LU-2888

After downgrade from 2.4 to 2.1.4, hit (osd_handler.c:2343:osd_index_try()) ASSERTION( dt_object_exists(dt) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.6
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Labels: None
    • Environment: before upgrade, server and client: 2.1.4 RHEL6;
      after upgrade, server and client: lustre-master build #1270 RHEL6
    • Severity: 3
    • 6970

    Description

      Here is what I did:
      1. Formatted the system as 2.1.4 and then upgraded to 2.4; this succeeded.
      2. Shut down the filesystem and disabled quota.
      3. Downgraded the system to 2.1.4 again; when mounting the MDS, hit the following errors.

      Here is the console of MDS:

      Lustre: DEBUG MARKER: == upgrade-downgrade End == 18:53:45 (1362020025)
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
      Lustre: MGS MGS started
      Lustre: 7888:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 7306ea48-8511-52b2-40cf-6424fc417e41@0@lo t0 exp (null) cur 1362020029 last 0
      Lustre: MGC10.10.4.132@tcp: Reactivating import
      Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs.
      Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
      Lustre: Setting parameter lustre-clilov.lov.stripesize in log lustre-client
      Lustre: Enabling ACL
      Lustre: Enabling user_xattr
      LustreError: 7901:0:(osd_handler.c:2343:osd_index_try()) ASSERTION( dt_object_exists(dt) ) failed: 
      LustreError: 7901:0:(osd_handler.c:2343:osd_index_try()) LBUG
      Pid: 7901, comm: llog_process_th
      
      Call Trace:
       [<ffffffffa03797f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0379e07>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa0d6bd74>] osd_index_try+0x84/0x540 [osd_ldiskfs]
       [<ffffffffa04c1dfe>] dt_try_as_dir+0x3e/0x60 [obdclass]
       [<ffffffffa0c5eb3a>] orph_index_init+0x6a/0x1e0 [mdd]
       [<ffffffffa0c6ec45>] mdd_prepare+0x1d5/0x640 [mdd]
       [<ffffffffa0ccd23c>] ? mdt_process_config+0x6c/0x1030 [mdt]
       [<ffffffffa0da0499>] cmm_prepare+0x39/0xe0 [cmm]
       [<ffffffffa0ccfd7d>] mdt_device_alloc+0xe0d/0x2190 [mdt]
       [<ffffffffa04bdeff>] ? keys_fill+0x6f/0x1a0 [obdclass]
       [<ffffffffa04a2c87>] obd_setup+0x1d7/0x2f0 [obdclass]
       [<ffffffffa048ef3b>] ? class_new_export+0x72b/0x960 [obdclass]
       [<ffffffffa04a2fa8>] class_setup+0x208/0x890 [obdclass]
       [<ffffffffa04aac6c>] class_process_config+0xc3c/0x1c30 [obdclass]
       [<ffffffffa037a993>] ? cfs_alloc+0x63/0x90 [libcfs]
       [<ffffffffa04a5813>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
       [<ffffffffa04acd0b>] class_config_llog_handler+0x9bb/0x1610 [obdclass]
       [<ffffffffa0637e3b>] ? llog_client_next_block+0x1db/0x4b0 [ptlrpc]
       [<ffffffffa0478098>] llog_process_thread+0x888/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      
      Kernel panic - not syncing: LBUG
      Pid: 7901, comm: llog_process_th Not tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1
      Call Trace:
      
       [<ffffffff814fdcba>] ? panic+0xa0/0x168
       [<ffffffffa0379e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa0d6bd74>] ? osd_index_try+0x84/0x540 [osd_ldiskfs]
       [<ffffffffa04c1dfe>] ? dt_try_as_dir+0x3e/0x60 [obdclass]
       [<ffffffffa0c5eb3a>] ? orph_index_init+0x6a/0x1e0 [mdd]
       [<ffffffffa0c6ec45>] ? mdd_prepare+0x1d5/0x640 [mdd]
       [<ffffffffa0ccd23c>] ? mdt_process_config+0x6c/0x1030 [mdt]
       [<ffffffffa0da0499>] ? cmm_prepare+0x39/0xe0 [cmm]
       [<ffffffffa0ccfd7d>] ? mdt_device_alloc+0xe0d/0x2190 [mdt]
       [<ffffffffa04bdeff>] ? keys_fill+0x6f/0x1a0 [obdclass]
       [<ffffffffa04a2c87>] ? obd_setup+0x1d7/0x2f0 [obdclass]
       [<ffffffffa048ef3b>] ? class_new_export+0x72b/0x960 [obdclass]
       [<ffffffffa04a2fa8>] ? class_setup+0x208/0x890 [obdclass]
       [<ffffffffa04aac6c>] ? class_process_config+0xc3c/0x1c30 [obdclass]
       [<ffffffffa037a993>] ? cfs_alloc+0x63/0x90 [libcfs]
       [<ffffffffa04a5813>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
       [<ffffffffa04acd0b>] ? class_config_llog_handler+0x9bb/0x1610 [obdclass]
       [<ffffffffa0637e3b>] ? llog_client_next_block+0x1db/0x4b0 [ptlrpc]
       [<ffffffffa0478098>] ? llog_process_thread+0x888/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c14a>] ? child_rip+0xa/0x20
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Initializing cgroup subsys cpuset
      Initializing cgroup subsys cpu
      

    Attachments

    Issue Links

    Activity


            jlevi Jodi Levi (Inactive) added a comment -

            Reducing the 2.4 blocker priority, but keeping this open until the fix lands on b2_1.
            bobijam Zhenyu Xu added a comment -

            http://review.whamcloud.com/#change,5731 hasn't landed on b2_1 yet; it needs to land before this can be claimed as fixed.


            jlevi Jodi Levi (Inactive) added a comment -

            Landed for 2.4.
            jlevi Jodi Levi (Inactive) added a comment - - edited

            Changes 6034 and 6037 have been merged into http://review.whamcloud.com/#change,6044

            bobijam Zhenyu Xu added a comment -

            yes, with http://review.whamcloud.com/5731 on b2_1 and http://review.whamcloud.com/6034 and http://review.whamcloud.com/6037 on master, the downgrade and upgrade test passed with no noise.

            di.wang Di Wang added a comment -

            http://review.whamcloud.com/#change,6037 Bobi: please check this one. Thanks!

            di.wang Di Wang added a comment -

            Hmm, the problem is that in 2.1, we define the lsm/lmm like this:

            struct lov_mds_md_v1 {            /* LOV EA mds/wire data (little-endian) */
                    __u32 lmm_magic;          /* magic number = LOV_MAGIC_V1 */
                    __u32 lmm_pattern;        /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
                    __u64 lmm_object_id;      /* LOV object ID */
                    __u64 lmm_object_seq;     /* LOV object seq number */
                    __u32 lmm_stripe_size;    /* size of stripe in bytes */
                    __u32 lmm_stripe_count;   /* num stripes in use for this object */
                    struct lov_ost_data_v1 lmm_objects[0]; /* per-stripe data */
            };        
            

            But lmm_object_seq/lmm_object_id actually holds a normal MDT FID, i.e. lmm_object_id/lmm_object_seq will be f_oid and a normal sequence number, and when unpacking the lmm to an lsm on 2.4, it will use ostid_le_to_cpu():

            static inline void ostid_le_to_cpu(struct ost_id *src_oi,
                                               struct ost_id *dst_oi)
            {
                    if (fid_seq_is_mdt0(ostid_seq(src_oi))) {
                            dst_oi->oi.oi_id = le64_to_cpu(src_oi->oi.oi_id);
                            dst_oi->oi.oi_seq = le64_to_cpu(src_oi->oi.oi_seq);
                    } else {
                            fid_le_to_cpu(&dst_oi->oi_fid, &src_oi->oi_fid);
                    }
            }
            

            And it treats the ostid as a normal FID, which causes the problem.

            Sigh, it seems we have no better way to convert this special ostid to the real FID.

            bobijam Zhenyu Xu added a comment -

            Here is an example of such an ostid, extracted from lov_merge_lvb_kms():

            LustreError: 8277:0:(lustre_idl.h:705:ostid_to_fid()) bad MDT0 id, 0x51:1024 ost_idx:0
            LustreError: 8277:0:(lustre_idl.h:706:ostid_to_fid()) 0x51:0x200000400

            These log messages are printed by the following code:

                                    CERROR("bad MDT0 id, "DOSTID" ost_idx:%u\n",
                                            POSTID(ostid), ost_idx);
                                    CERROR(LPX64":"LPX64"\n", ostid->oi.oi_id, ostid->oi.oi_seq);
            
            bobijam Zhenyu Xu added a comment - - edited

            http://review.whamcloud.com/6034 does not solve the issue. The error message is not from llog handling; it's from stat/ls on a file created by the 2.1 system.

            Attached is the -1 log.

            di.wang Di Wang added a comment -

            Bobi: Could you please try this patch? http://review.whamcloud.com/#change,6034 It seems lmm_oi is being overwritten by some other thread. As Andreas said, it might be related to ostid_to_fid for the llog object, so this patch uses oi_id/oi_seq directly to identify the log object, avoiding the ostid_to_fid conversion. Could you please try a few times? I guess this problem cannot be reproduced often; at least I cannot reproduce it locally. Thanks.


            The pre-LU-2684 CATALOGS file looks like:

            000000 000000000000000a 0000000000000001
            000010 0000000000000000 0000000000000000
            000020 000000000000000c 0000000000000001
            000030 0000000000000000 0000000000000000
            000040 000000000000000e 0000000000000001
            000050 0000000000000000 0000000000000000
            

            so indeed this is an unintentional compatibility breakage with the new code.

            adilger Andreas Dilger added a comment -

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 14
