
PFL component instantiation is not replayed properly

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.7
    • Affects Version/s: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      While investigating LU-10961 I have found that component instantiation is not replayed. Test showing the problem:

      test_132a() {
      	[ $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.9.90) ] &&
      		skip "Do not support PFL files before 2.10"
      
      	$LFS setstripe -E 1M -c 1 -E EOF -c 2 $DIR/$tfile
      	replay_barrier $SINGLEMDS
      	# writing past the end of the first component triggers instantiation of the next one
      	dd if=/dev/urandom of=$DIR/$tfile bs=1M count=1 seek=1 ||
      		error "dd to $DIR/$tfile failed"
      
      	cksum=$(md5sum $DIR/$tfile | awk '{print $1}')
      	$LFS getstripe -I2 $DIR/$tfile | grep -q lmm_objects ||
      		error "Component #1 was not instantiated"
      
      	fail $SINGLEMDS
      
      	cksum2=$(md5sum $DIR/$tfile | awk '{print $1}')
      	if [ "$cksum" != "$cksum2" ] ; then
      		error_noexit "New checksum $cksum2 does not match original $cksum"
      	fi
      	$LFS getstripe -I2 $DIR/$tfile | grep -q lmm_objects ||
      		error "Component #1 instantiation was not replayed"
      }
      run_test 132a "PFL new component instantiate replay"
      

      The test double-checks the result: it compares checksums before and after the failover, and it verifies that the next component has lmm_objects assigned. Both checks fail on master.

          Activity

            [LU-11158] PFL component instantiation is not replayed properly

            Jinshan Jinshan Xiong added a comment:

            Yes, a layout intent usually exists in the file's layout, but there are also cases where the file has only a partially defined layout.

            The philosophy behind the design is that the MDS should decide what layout it will allocate and how many components it should instantiate, so the client technically doesn't know the actual EA size. Does this make sense to you?
            tappro Mikhail Pershin added a comment (edited):

            On the other hand, I wonder why the client can't supply the correct EA size when updating the layout. It knows the size, doesn't it? I mean, the reply buffer on the client side can be allocated with the proper size.

            tappro Mikhail Pershin added a comment:

            IIRC, mdt_lvbo_fill() may skip fetching the EA for a reason like this: "we can do nothing here, so let's report the new EA size back and there will be a separate getxattr RPC". That does not work for RPCs that have to be replayed, though.

            Jinshan Jinshan Xiong added a comment:

            bobi - there was some good discussion about this problem before, but I don't remember the ticket number. We should probably pick it up and implement the complete solution I proposed there. The real problem here is that 'mdt_max_mdsize' only grows to the largest mdsize the MDS has seen so far (IIRC), which causes problems for layout writes when the system has just started.

            Jinshan Jinshan Xiong added a comment:

            Quoting Andreas Dilger: "This write intent RPC may be sent weeks or years after the initial open, so it doesn't make sense to modify the layout stored with the initial open (which may not even exist on this same client). We need to replay each of the RPCs that caused the new component to be initialized, whichever client sent it. This is similar to one client getting the open replay transno even though many clients tried to create the same file."

            If an open RPC has already been committed, should we just ignore the layout in the replay RPC? In other words, the replay would just restore the open context on the MDS side.

            RPCs that modify layout components are just regular REINT RPCs. Once they are committed they are gone, and there is no need to replay them.
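The rule discussed in this comment can be modelled roughly as follows. This is a sketch with hypothetical names (`struct replay_req`, `needs_replay()`), not Lustre's actual ptlrpc code: it assumes the client resends a request after a server restart only if its transno is newer than the last committed transno, and that a still-open open request is always resent so the MDS can restore the open context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a request kept on a client's replay list. */
struct replay_req {
	uint64_t transno;	/* transaction number assigned by the server */
	int is_open;		/* nonzero for an open whose file is still open */
};

/*
 * Return 1 if the request should be resent after a server restart.
 * Assumption: committed requests are dropped, except opens, which are
 * kept so that the open context can be restored on the server.
 */
static int needs_replay(const struct replay_req *req, uint64_t last_committed)
{
	assert(req != NULL);

	if (req->is_open)
		return 1;

	return req->transno > last_committed;
}
```

In this model an uncommitted write intent (transno above the last committed one) is replayed as its own request, while a committed one is simply dropped; only the open survives commit.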
            bobijam Zhenyu Xu added a comment:

            I am not sure whether it is correct for mdt_lvbo_fill() to return 0 when the LVBO buffer is smaller than the required EA size.
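The concern can be illustrated with a small model. This is only a sketch: `fill_lvb()` and `struct lvb_reply` are hypothetical names and do not reproduce the real mdt_lvbo_fill() code. It shows how returning 0 for a too-small buffer produces a reply that carries no layout at all, while still recording the size that would have been needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of what ends up in the reply. */
struct lvb_reply {
	size_t needed;	/* EA size the server wanted to return */
	size_t filled;	/* bytes actually copied into the reply buffer */
};

/*
 * Copy the EA into the reply buffer if it fits.
 * Returns the number of bytes copied, or 0 when the buffer is too small.
 * Note that 0 is not an error code here: the caller cannot tell
 * "nothing to return" apart from "buffer too small" without
 * looking at rep->needed.
 */
static size_t fill_lvb(void *buf, size_t buflen, const void *ea, size_t ealen,
		       struct lvb_reply *rep)
{
	assert(rep != NULL);

	rep->needed = ealen;
	rep->filled = 0;

	if (buflen < ealen)
		return 0;

	memcpy(buf, ea, ealen);
	rep->filled = ealen;
	return ealen;
}
```

With a 240-byte buffer and a 264-byte EA, the sizes reported elsewhere in this ticket, the function returns 0 and the reply saved for replay would not contain the new layout.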
            bobijam Zhenyu Xu added a comment (edited):

            I find that mdt_lvbo_fill() complains that the lvblen (240) is too small to hold the EA (size 264). It looks like mdt_intent_layout() has not made RMF_DLM_LVB big enough.

            mdt_intent_layout()
                    if (mdt_object_exists(obj) && !mdt_object_remote(obj)) {
                            /* if layout is going to be changed don't use the current EA
                             * size but the maximum one. That buffer will be shrinked
                             * to the actual size in req_capsule_shrink() before reply.
                             */
                            if (layout.mlc_opc == MD_LAYOUT_WRITE) {
                                    layout_size = info->mti_mdt->mdt_max_mdsize;
                            } else {
                                    layout_size = mdt_attr_get_eabuf_size(info, obj);
                                    if (layout_size < 0)
                                            GOTO(out_obj, rc = layout_size);
            
                                    if (layout_size > info->mti_mdt->mdt_max_mdsize)
                                            info->mti_mdt->mdt_max_mdsize = layout_size;
                            }
                    }
            

            So I tried to change the default mdt_max_mdsize to a bigger size, and the test passed.

            diff --git a/lustre/include/uapi/linux/lustre/lustre_idl.h b/lustre/include/uapi/linux/lustre/lustre_idl.h
            index 7999816676..20d13cb4f6 100644
            --- a/lustre/include/uapi/linux/lustre/lustre_idl.h
            +++ b/lustre/include/uapi/linux/lustre/lustre_idl.h
            @@ -1117,7 +1117,7 @@ struct lov_mds_md_v1 {            /* LOV EA mds/wire data (little-endian) */
                    struct lov_ost_data_v1 lmm_objects[0]; /* per-stripe data */
             };
             
            -#define MAX_MD_SIZE (sizeof(struct lov_mds_md) + 4 * sizeof(struct lov_ost_data))
            +#define MAX_MD_SIZE (sizeof(struct lov_comp_md_v1) + 4 *               \
            +                       (sizeof(struct lov_comp_md_entry_v1) +          \
            +                        (sizeof(struct lov_mds_md) + 4 *               \
            +                         sizeof(struct lov_ost_data))))
             #define MIN_MD_SIZE (sizeof(struct lov_mds_md) + 1 * sizeof(struct lov_ost_data))
             
             /* This is the default MDT reply size allocated, should the striping be bigger, 
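The 264-byte EA is consistent with a two-component file like the one in the reproducer. Below is a rough sketch of the arithmetic. The byte sizes are assumptions for illustration (32 bytes for struct lov_comp_md_v1 and struct lov_mds_md, 48 bytes for struct lov_comp_md_entry_v1, 24 bytes for struct lov_ost_data); they are not taken from this ticket and should be checked against lustre_idl.h. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes in bytes; hypothetical constants, verify against lustre_idl.h. */
enum {
	COMP_HDR_SIZE   = 32,	/* struct lov_comp_md_v1 */
	COMP_ENTRY_SIZE = 48,	/* struct lov_comp_md_entry_v1 */
	LOV_HDR_SIZE    = 32,	/* struct lov_mds_md */
	OST_DATA_SIZE   = 24,	/* struct lov_ost_data */
};

/* Size of one plain (non-composite) layout with the given stripe count. */
static size_t plain_md_size(unsigned int stripes)
{
	return LOV_HDR_SIZE + (size_t)stripes * OST_DATA_SIZE;
}

/* Size of a composite layout; stripes[i] is the stripe count of component i. */
static size_t comp_md_size(const unsigned int *stripes, unsigned int ncomp)
{
	size_t size = COMP_HDR_SIZE;
	unsigned int i;

	assert(stripes != NULL || ncomp == 0);

	for (i = 0; i < ncomp; i++)
		size += COMP_ENTRY_SIZE + plain_md_size(stripes[i]);

	return size;
}

/* MAX_MD_SIZE as defined before the diff above: one plain 4-stripe layout. */
static size_t old_max_md_size(void)
{
	return plain_md_size(4);
}

/* MAX_MD_SIZE as defined after the diff above: 4 components of 4 stripes. */
static size_t new_max_md_size(void)
{
	return COMP_HDR_SIZE + 4 * (COMP_ENTRY_SIZE + plain_md_size(4));
}
```

Under these assumed sizes, a file with a 1-stripe and a 2-stripe component needs 32 + 2 * 48 + 56 + 80 = 264 bytes. That is more than the 240-byte lvblen and more than the old MAX_MD_SIZE of 128 bytes, but it fits within the enlarged MAX_MD_SIZE of 736 bytes.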

            tappro Mikhail Pershin added a comment:

            As far as I can see, the write intent has a transno and is replayed. There is also code to support replay in MDT/MDD/LOD, but it seems to be broken somewhere in the middle. I am checking the whole code path right now.

            adilger Andreas Dilger added a comment:

            It seems likely that only the original open replay RPC is being sent, and it contains only the first component of the layout (which is always initialized at open). The write intent RPC that causes the later components to be initialized is not being replayed at all.

            This write intent RPC may be sent weeks or years after the initial open, so it doesn't make sense to modify the layout stored with the initial open (which may not even exist on this same client). We need to replay each of the RPCs that caused the new component to be initialized, whichever client sent it. This is similar to one client getting the open replay transno even though many clients tried to create the same file.

            tappro Mikhail Pershin added a comment:

            I suspect that the replay itself is done as intended, but the replay data is wrong and the new layout is being overwritten with old data. It seems that the replay data contains the old layout instead of the new one. At least, I can see in the logs that the layout on the client was changed to an older generation.
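The suspected symptom can be modelled in a few lines. This is a hypothetical sketch, not Lustre code: `struct layout_state`, `replay_overwrite()` and `layout_is_stale()` are made-up names, the generation is treated as a simple counter, and wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal description of a file layout. */
struct layout_state {
	uint32_t gen;		/* layout generation, assumed to grow on change */
	uint32_t instantiated;	/* number of instantiated components */
};

/* Naive replay: the layout carried in the replay data replaces the current one. */
static void replay_overwrite(struct layout_state *cur,
			     const struct layout_state *replayed)
{
	assert(cur != NULL && replayed != NULL);
	*cur = *replayed;
}

/* Return 1 if the replayed layout is older than the current one. */
static int layout_is_stale(const struct layout_state *cur,
			   const struct layout_state *replayed)
{
	assert(cur != NULL && replayed != NULL);
	return replayed->gen < cur->gen;
}
```

If the current layout has two instantiated components and the replay data still carries the earlier one-component layout, `layout_is_stale()` reports the mismatch, and an unconditional overwrite takes the file back to the older generation, which matches what the logs show.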

            tappro Mikhail Pershin added a comment:

            This is just a reproducer I added to replay-single.sh locally; I didn't push it to Gerrit.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: tappro Mikhail Pershin
              Votes: 0
              Watchers: 8