
[LU-2888] After downgrade from 2.4 to 2.1.4, hit (osd_handler.c:2343:osd_index_try()) ASSERTION( dt_object_exists(dt) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.6
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Labels: None
    • Environment: before upgrade, server and client: 2.1.4 RHEL6
      after upgrade, server and client: lustre-master build# 1270 RHEL6
    • Severity: 3
    • Rank (Obsolete): 6970

    Description

      Here is what I did:
      1. format the system as 2.1.4 and then upgrade to 2.4; this succeeded.
      2. shut down the filesystem and disable quota
      3. downgrade the system to 2.1.4 again; when mounting the MDS, hit the following errors

      Here is the console of MDS:

      Lustre: DEBUG MARKER: == upgrade-downgrade End == 18:53:45 (1362020025)
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
      Lustre: MGS MGS started
      Lustre: 7888:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 7306ea48-8511-52b2-40cf-6424fc417e41@0@lo t0 exp (null) cur 1362020029 last 0
      Lustre: MGC10.10.4.132@tcp: Reactivating import
      Lustre: MGS: Logs for fs lustre were removed by user request.  All servers must be restarted in order to regenerate the logs.
      Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
      Lustre: Setting parameter lustre-clilov.lov.stripesize in log lustre-client
      Lustre: Enabling ACL
      Lustre: Enabling user_xattr
      LustreError: 7901:0:(osd_handler.c:2343:osd_index_try()) ASSERTION( dt_object_exists(dt) ) failed: 
      LustreError: 7901:0:(osd_handler.c:2343:osd_index_try()) LBUG
      Pid: 7901, comm: llog_process_th
      
      Call Trace:
       [<ffffffffa03797f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0379e07>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa0d6bd74>] osd_index_try+0x84/0x540 [osd_ldiskfs]
       [<ffffffffa04c1dfe>] dt_try_as_dir+0x3e/0x60 [obdclass]
       [<ffffffffa0c5eb3a>] orph_index_init+0x6a/0x1e0 [mdd]
       [<ffffffffa0c6ec45>] mdd_prepare+0x1d5/0x640 [mdd]
       [<ffffffffa0ccd23c>] ? mdt_process_config+0x6c/0x1030 [mdt]
       [<ffffffffa0da0499>] cmm_prepare+0x39/0xe0 [cmm]
       [<ffffffffa0ccfd7d>] mdt_device_alloc+0xe0d/0x2190 [mdt]
       [<ffffffffa04bdeff>] ? keys_fill+0x6f/0x1a0 [obdclass]
       [<ffffffffa04a2c87>] obd_setup+0x1d7/0x2f0 [obdclass]
       [<ffffffffa048ef3b>] ? class_new_export+0x72b/0x960 [obdclass]
       [<ffffffffa04a2fa8>] class_setup+0x208/0x890 [obdclass]
       [<ffffffffa04aac6c>] class_process_config+0xc3c/0x1c30 [obdclass]
       [<ffffffffa037a993>] ? cfs_alloc+0x63/0x90 [libcfs]
       [<ffffffffa04a5813>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
       [<ffffffffa04acd0b>] class_config_llog_handler+0x9bb/0x1610 [obdclass]
       [<ffffffffa0637e3b>] ? llog_client_next_block+0x1db/0x4b0 [ptlrpc]
       [<ffffffffa0478098>] llog_process_thread+0x888/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      
      Kernel panic - not syncing: LBUG
      Pid: 7901, comm: llog_process_th Not tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1
      Call Trace:
      
       [<ffffffff814fdcba>] ? panic+0xa0/0x168
       [<ffffffffa0379e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa0d6bd74>] ? osd_index_try+0x84/0x540 [osd_ldiskfs]
       [<ffffffffa04c1dfe>] ? dt_try_as_dir+0x3e/0x60 [obdclass]
       [<ffffffffa0c5eb3a>] ? orph_index_init+0x6a/0x1e0 [mdd]
       [<ffffffffa0c6ec45>] ? mdd_prepare+0x1d5/0x640 [mdd]
       [<ffffffffa0ccd23c>] ? mdt_process_config+0x6c/0x1030 [mdt]
       [<ffffffffa0da0499>] ? cmm_prepare+0x39/0xe0 [cmm]
       [<ffffffffa0ccfd7d>] ? mdt_device_alloc+0xe0d/0x2190 [mdt]
       [<ffffffffa04bdeff>] ? keys_fill+0x6f/0x1a0 [obdclass]
       [<ffffffffa04a2c87>] ? obd_setup+0x1d7/0x2f0 [obdclass]
       [<ffffffffa048ef3b>] ? class_new_export+0x72b/0x960 [obdclass]
       [<ffffffffa04a2fa8>] ? class_setup+0x208/0x890 [obdclass]
       [<ffffffffa04aac6c>] ? class_process_config+0xc3c/0x1c30 [obdclass]
       [<ffffffffa037a993>] ? cfs_alloc+0x63/0x90 [libcfs]
       [<ffffffffa04a5813>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
       [<ffffffffa04acd0b>] ? class_config_llog_handler+0x9bb/0x1610 [obdclass]
       [<ffffffffa0637e3b>] ? llog_client_next_block+0x1db/0x4b0 [ptlrpc]
       [<ffffffffa0478098>] ? llog_process_thread+0x888/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c14a>] ? child_rip+0xa/0x20
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffffa0477810>] ? llog_process_thread+0x0/0xd00 [obdclass]
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Initializing cgroup subsys cpuset
      Initializing cgroup subsys cpu
      

    Attachments

    Issue Links

    Activity
            [LU-2888] After downgrade from 2.4 to 2.1.4, hit (osd_handler.c:2343:osd_index_try()) ASSERTION( dt_object_exists(dt) ) failed
            bobijam Zhenyu Xu added a comment -

            Here is an example of an ostid, extracted from lov_merge_lvb_kms():

            LustreError: 8277:0:(lustre_idl.h:705:ostid_to_fid()) bad MDT0 id, 0x51:1024 ost_idx:0
            LustreError: 8277:0:(lustre_idl.h:706:ostid_to_fid()) 0x51:0x200000400

            These log messages are produced by the following code:

            CERROR("bad MDT0 id, "DOSTID" ost_idx:%u\n",
                   POSTID(ostid), ost_idx);
            CERROR(LPX64":"LPX64"\n", ostid->oi.oi_id, ostid->oi.oi_seq);
            
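            To make the mismatch concrete, here is a minimal userspace sketch (stand-in struct layouts assumed from the lustre_idl.h of that era, little-endian x86, not Lustre code): the first print mirrors the raw-words CERROR above, and reinterpreting the same 16 bytes as a FID yields exactly the 0x51 and 1024 values shown in the "bad MDT0 id" line, hinting at an id/seq vs. FID layout confusion.

            #include <stdio.h>
            #include <string.h>
            #include <inttypes.h>

            /* stand-in layouts; the real structs live in lustre_idl.h */
            struct ostid_sketch { uint64_t oi_id, oi_seq; };
            struct fid_sketch   { uint64_t f_seq; uint32_t f_oid, f_ver; };

            int main(void)
            {
                    /* raw words from the second CERROR line above */
                    struct ostid_sketch oi = { .oi_id = 0x51,
                                               .oi_seq = 0x200000400ULL };
                    struct fid_sketch fid;

                    /* mirrors CERROR(LPX64":"LPX64"\n", ...): 0x51:0x200000400 */
                    printf("%#" PRIx64 ":%#" PRIx64 "\n", oi.oi_id, oi.oi_seq);

                    /* the same 16 bytes read as a FID give seq=0x51, oid=1024,
                     * matching the "bad MDT0 id, 0x51:1024" line above */
                    memcpy(&fid, &oi, sizeof(fid));
                    printf("%#" PRIx64 ":%" PRIu32 "\n", fid.f_seq, fid.f_oid);
                    return 0;
            }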
            bobijam Zhenyu Xu added a comment - - edited

            http://review.whamcloud.com/6034 does not solve the issue. The error message is not from llog handling; it's from stat/ls on a file created by the 2.1 system.

            Attached is the -1 (full debug) log.

            di.wang Di Wang added a comment -

            Bobi: Could you please try this patch? http://review.whamcloud.com/#change,6034 It seems lmm_oi is being overwritten by some other thread. As Andreas said, it might be related to ostid_to_fid for the llog object, so this patch uses oi_id/oi_seq directly to identify the log object, avoiding the ostid_to_fid conversion. Could you please try a few times? I guess this problem cannot be reproduced often; at least I cannot reproduce it locally. Thanks.
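            A minimal sketch of the approach Di describes (illustrative names, not the actual 6034 patch): identify the log object by comparing its raw id/seq words directly, so nothing round-trips through ostid_to_fid()/fid_to_ostid().

            #include <stdint.h>

            /* stand-in for the kernel struct; the real one is in lustre_idl.h */
            struct ostid_sketch { uint64_t oi_id, oi_seq; };

            /* compare the raw words of two llog object ids; no FID
             * conversion is involved, so pre-2.4 ids cannot be mangled */
            static int llog_ostid_equal(const struct ostid_sketch *a,
                                        const struct ostid_sketch *b)
            {
                    return a->oi_id == b->oi_id && a->oi_seq == b->oi_seq;
            }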


            adilger Andreas Dilger added a comment -

            The pre-LU-2684 CATALOGS file looks like:

            000000 000000000000000a 0000000000000001
            000010 0000000000000000 0000000000000000
            000020 000000000000000c 0000000000000001
            000030 0000000000000000 0000000000000000
            000040 000000000000000e 0000000000000001
            000050 0000000000000000 0000000000000000
            

            so indeed this is an unintentional compatibility breakage with the new code.


            adilger Andreas Dilger added a comment -

            Only the latest issue may be related to LU-2684, not the original one.


            adilger Andreas Dilger added a comment -

            Di, it appears that this might relate to http://review.whamcloud.com/5820 incorrectly converting struct llog_logid to use struct ost_id. The fields in that structure were already used a bit strangely, and I think that the ostid_to_fid() and fid_to_ostid() macros may be mangling the fields in this structure, since they were never really used as OST id/seq, but rather MDS inode/0/generation.

            On 1.8 the CATALOGS file looks like oid=ino/ogr=0/ogen=gen:

            0000000 0000000000000021 0000000000000000
            0000020 000000001d4f2a9d 0000000000000000
            0000040 0000000000000022 0000000000000000
            0000060 000000001d4f2a9e 0000000000000000
            0000100 0000000000000023 0000000000000000
            0000120 000000001d4f2a9f 0000000000000000
            

            On 2.1 (upgraded from 1.8, if it matters) the CATALOGS file looks like oid=ino/oseq=1/ogen=gen:

            000000 0000000000020002 0000000000000001
            000010 00000000ec467621 0000000000000000
            000020 0000000000020003 0000000000000001
            000030 00000000ec467622 0000000000000000
            000040 0000000000020004 0000000000000001
            000050 00000000ec467623 0000000000000000
            000060 0000000000020005 0000000000000001
            000070 0000000095595788 0000000000000000
            000080 000000000002005b 0000000000000001
            000090 000000001ecc9141 0000000000000000
            

            On 2.4 (new) the CATALOGS file looks like (f_seq=1/f_oid=oid/ogen=0):

            000000 0000000000000001 0000000000000002
            000010 0000000000000000 0000000000000000
            000020 0000000000000001 0000000000000004
            000030 0000000000000000 0000000000000000
            000040 0000000000000001 0000000000000006
            000050 0000000000000000 0000000000000000
            

            I suspect that the latter change is unintentional, since I think it breaks the on-disk and on-wire llog_logid protocol. I'm just running another test with a pre-6794d7654b4c459519a9e6d85ed439c8c594c2e7 build to see what the CATALOGS file looks like.
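            For reference, here is a small userspace decoder for those dumps (a sketch, assuming the pre-2.4 little-endian record layout the dumps imply: 32-byte records of u64 oid, u64 seq, u32 gen, 12 bytes of padding; not Lustre code):

            #include <stdio.h>
            #include <stdint.h>

            /* assumed on-disk CATALOGS record: struct llog_logid plus padding */
            struct catrec_sketch {
                    uint64_t lgl_oid;   /* 1.8/2.1: MDS inode number; the new
                                         * 2.4 code writes f_seq here instead */
                    uint64_t lgl_oseq;  /* 1.8: 0 ("ogr"); 2.1: 1; 2.4: f_oid */
                    uint32_t lgl_ogen;  /* 1.8/2.1: inode generation; 2.4: 0 */
                    uint32_t pad[3];
            };

            int main(int argc, char **argv)
            {
                    FILE *f = fopen(argc > 1 ? argv[1] : "CATALOGS", "rb");
                    struct catrec_sketch r;

                    if (f == NULL)
                            return 1;
                    while (fread(&r, sizeof(r), 1, f) == 1)
                            printf("oid %#jx seq %#jx gen %#x\n",
                                   (uintmax_t)r.lgl_oid, (uintmax_t)r.lgl_oseq,
                                   (unsigned)r.lgl_ogen);
                    fclose(f);
                    return 0;
            }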

            Di, could you please work up a patch that reverts the changes to llog_logid and its usage, and we can see if that fixes this latest issue.

            bobijam Zhenyu Xu added a comment -

            Just updating with what I found. My local test shows the error message is from lov_attr_get_raid0():

            00020000:00000001:0.0:1365706828.914809:0:18347:0:(lov_object.c:420:lov_attr_get_raid0()) Process entered
            00020000:00020000:0.0:1365706828.914809:0:18347:0:(lustre_idl.h:705:ostid_to_fid()) bad MDT0 id, 0x54:1024 ost_idx:0
            00020000:00000002:0.0:1365706828.914810:0:18347:0:(lov_merge.c:76:lov_merge_lvb_kms()) MDT FID [0x0:0x0:0x0] initial value: s=0 m=9223372036854775808 a=9223372036854775808 c=9223372036854775808 b=0
            00020000:00000001:0.0:1365706828.914811:0:18347:0:(lov_offset.c:59:lov_stripe_size()) Process entered
            00020000:00000001:0.0:1365706828.914811:0:18347:0:(lov_offset.c:74:lov_stripe_size()) Process leaving (rc=370 : 370 : 172)
            00020000:00000001:0.0:1365706828.914812:0:18347:0:(lov_offset.c:59:lov_stripe_size()) Process entered
            00020000:00000001:0.0:1365706828.914812:0:18347:0:(lov_offset.c:74:lov_stripe_size()) Process leaving (rc=370 : 370 : 172)
            00020000:00000002:0.0:1365706828.914813:0:18347:0:(lov_merge.c:110:lov_merge_lvb_kms()) MDT FID [0x0:0x0:0x0] on OST[1]: s=370 m=1365706747 a=1365706747 c=1365706747 b=8
            00020000:00000001:0.0:1365706828.914814:0:18347:0:(lov_merge.c:119:lov_merge_lvb_kms()) Process leaving (rc=0 : 0 : 0)
            00000020:00000001:0.0:1365706828.914814:0:18347:0:(cl_object.c:1019:cl_lvb2attr()) Process entered
            00000020:00000001:0.0:1365706828.914814:0:18347:0:(cl_object.c:1025:cl_lvb2attr()) Process leaving
            00020000:00000001:0.0:1365706828.914815:0:18347:0:(lov_object.c:476:lov_attr_get_raid0()) Process leaving (rc=0 : 0 : 0)
            

            Meanwhile, in lov_merge_lvb_kms(), the stack variable 'fid' has no effect other than being displayed in debug messages.

            WangDi, need your input.
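            To illustrate why that may be harmless here, a sketch of the control flow described above (stand-in types and a placeholder conversion, not the actual lov_merge_lvb_kms() source): the FID exists only to be printed, so when the conversion fails it stays zeroed and the size/mtime merging is unaffected, which matches the "MDT FID [0x0:0x0:0x0]" lines in the log.

            #include <stdio.h>
            #include <stdint.h>

            struct ostid_sketch { uint64_t oi_id, oi_seq; };
            struct fid_sketch   { uint64_t f_seq; uint32_t f_oid, f_ver; };

            /* placeholder conversion that fails the way ostid_to_fid()
             * does in the log above, leaving *fid untouched */
            static int ostid_to_fid_sketch(struct fid_sketch *fid,
                                           const struct ostid_sketch *oi)
            {
                    fprintf(stderr, "bad MDT0 id, %#jx:%ju ost_idx:0\n",
                            (uintmax_t)oi->oi_id, (uintmax_t)oi->oi_seq);
                    return -1;
            }

            int main(void)
            {
                    struct ostid_sketch oi = { .oi_id = 0x54, .oi_seq = 0x400 };
                    struct fid_sketch fid = { 0 };

                    /* the conversion result feeds only this debug print;
                     * on failure the FID stays all-zero */
                    ostid_to_fid_sketch(&fid, &oi);
                    printf("MDT FID [0x%jx:0x%x:0x%x] initial value ...\n",
                           (uintmax_t)fid.f_seq, fid.f_oid, fid.f_ver);
                    /* ... the actual kms/size merging never reads 'fid' ... */
                    return 0;
            }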

            bobijam Zhenyu Xu added a comment -

            Wang Di,

            The upgrade test shows these messages on client side

            Maloo report: https://maloo.whamcloud.com/test_sessions/ef5c83fe-a2be-11e2-81ba-52540035b04c
            
            I found the following error messages in the console log of client wtm-81 after upgrading:
            
            LustreError: 33998:0:(lustre_idl.h:705:ostid_to_fid()) bad MDT0 id, 0x6dec:1024 ost_idx:0
            LustreError: 33998:0:(lustre_idl.h:705:ostid_to_fid()) bad MDT0 id, 0x6dec:1024 ost_idx:0
            LustreError: 33998:0:(lustre_idl.h:705:ostid_to_fid()) bad MDT0 id, 0x6d78:1024 ost_idx:0
            LustreError: 33998:0:(lustre_idl.h:705:ostid_to_fid()) Skipped 31 previous similar messages
            LustreError: 33998:0:(lustre_idl.h:705:ostid_to_fid()) bad MDT0 id, 0x6d6d:1024 ost_idx:0
            

            I noticed that it's from your recent patch. What could the issue be? Is it potentially harmful?

            bobijam Zhenyu Xu added a comment -

            It turns out that the 'empty file' issue is not a bug: I was using llmountcleanup.sh, which calls 'umount -f', and that does not flush dirty data.

            I'll push another patch which can pass the upgrade & downgrade test in my VM environment.

            bobijam Zhenyu Xu added a comment -

            Fired a ticket (LU-3141) for the 'empty file' issue.

            bobijam Zhenyu Xu added a comment -

            Strangely, I don't even need the master version: with b2_1 alone, after format & mount & copy files & umount & mount again, all files become size 0.

            BTW, I'm using the current b2_1 Lustre code with the linux-2.6.32-279.22.1.el6 kernel code.


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: