[LU-11158] PFL component instantiation is not replayed properly - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.12.0, Lustre 2.10.7
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

While investigating ~~LU-10961~~, I found that PFL component instantiation is not replayed. The following test shows the problem:

test_132a() {
	[ $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.9.90) ] &&
		skip "Do not support PFL files before 2.10"

	$LFS setstripe -E 1M -c 1 -E EOF -c 2 $DIR/$tfile
	replay_barrier $SINGLEMDS
	# writing past the first component's extent should instantiate the next component
	dd if=/dev/urandom of=$DIR/$tfile bs=1M count=1 seek=1 ||
		error "dd to $DIR/$tfile failed"

	cksum=$(md5sum $DIR/$tfile | awk '{print $1}')
	$LFS getstripe -I2 $DIR/$tfile | grep -q lmm_objects ||
		error "Component #1 was not instantiated"

	fail $SINGLEMDS

	cksum2=$(md5sum $DIR/$tfile | awk '{print $1}')
	if [ "$cksum" != "$cksum2" ]; then
		error_noexit "New checksum $cksum2 does not match original $cksum"
	fi
	$LFS getstripe -I2 $DIR/$tfile | grep -q lmm_objects ||
		error "Component #1 instantiation was not replayed"
}
run_test 132a "PFL new component instantiate replay"

The result is checked twice: by comparing checksums, and by verifying that the next component has lmm_objects assigned. Both checks fail on master.

Issue Links

is blocking

LU-10961 Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4

Resolved

is related to

LU-10686 sanity-pfl test 9 fails with “[0x100010000:0x6025:0x0] != “

Resolved

Activity


Jinshan Xiong added a comment - 19/Jul/18 8:20 PM

Yes, a layout intent usually exists in the file's layout, but there are also cases where the file has only a partially defined layout.

The philosophy behind the design is that the MDS should decide what layout it will allocate and how many components it should instantiate, so the client technically doesn't know the actual EA size. Does this make sense to you?


Mikhail Pershin added a comment - 19/Jul/18 7:17 PM - edited

On the other hand, I wonder why the client can't supply the correct EA size when updating the layout. It knows the size, doesn't it? I mean, the reply buffer on the client side can be allocated with the proper size.


Mikhail Pershin added a comment - 19/Jul/18 7:15 PM

IIRC, mdt_lvbo_fill() may skip fetching the EA in exactly this kind of case: "we can do nothing here, let's report the new EA size back and there will be a separate getxattr RPC". That does not work for RPCs that have to be replayed, though.
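
The two possible behaviours for an undersized reply buffer can be contrasted with a small toy model. This is only a sketch: `toy_fill_lenient()`, `toy_fill_strict()` and `struct toy_fill` are hypothetical names invented for illustration and are not Lustre functions, and the sketch makes no claim about which behaviour the eventual fix adopted. The lenient variant returns 0 and only reports the needed size (relying on a follow-up fetch), while the strict variant reports the short buffer as an error.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names exist in Lustre. */
struct toy_fill {
	int    rc;     /* >0: bytes copied, 0: nothing copied, <0: -errno */
	size_t needed; /* EA size the caller would have needed */
};

/* Lenient: a short buffer is not an error; the caller must refetch later. */
static struct toy_fill toy_fill_lenient(void *buf, size_t buflen,
					const void *ea, size_t ealen)
{
	struct toy_fill res = { .rc = 0, .needed = ealen };

	if (buflen >= ealen) {
		memcpy(buf, ea, ealen);
		res.rc = (int)ealen;
	}
	return res;
}

/* Strict: a short buffer is reported to the caller as -ERANGE. */
static struct toy_fill toy_fill_strict(void *buf, size_t buflen,
				       const void *ea, size_t ealen)
{
	struct toy_fill res = { .rc = -ERANGE, .needed = ealen };

	if (buflen >= ealen) {
		memcpy(buf, ea, ealen);
		res.rc = (int)ealen;
	}
	return res;
}
```

With a 240-byte buffer and a 264-byte EA, the lenient variant succeeds with nothing copied, which is harmless only if a later request fetches the layout. A replayed RPC has no such follow-up, which is the concern raised in this comment.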


Jinshan Xiong added a comment - 19/Jul/18 5:37 PM

bobi - there was some good discussion about this problem before, but I don't remember the ticket number. We should probably pick it up and complete the solution I proposed there. The real problem here is that 'mdt_max_mdsize' keeps growing to the largest mdsize seen so far on the MDS (IIRC), which causes problems for layout writes when the system starts.
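
The start-up problem described here can be illustrated with a toy high-water-mark model. This is a sketch under stated assumptions, not Lustre code: `struct toy_mdt` and the `toy_mdt_*()` helpers are hypothetical, and the model assumes the maximum is held only in memory, so a restart puts it back to the compiled-in default. It mirrors the pattern in the mdt_intent_layout() excerpt quoted in this ticket, where layout writes size their buffer from the current maximum while other operations raise it.

```c
#include <stddef.h>

/* Toy model only: a stand-in for the server's in-memory high-water mark. */
struct toy_mdt {
	size_t max_mdsize;
};

/* (Re)start: the high-water mark falls back to the compiled-in default. */
static void toy_mdt_start(struct toy_mdt *mdt, size_t default_max)
{
	mdt->max_mdsize = default_max;
}

/* A non-write layout request raises the mark if it sees a larger EA. */
static void toy_mdt_observe(struct toy_mdt *mdt, size_t ea_size)
{
	if (ea_size > mdt->max_mdsize)
		mdt->max_mdsize = ea_size;
}

/* A layout write sizes its reply buffer from the current mark. */
static size_t toy_mdt_write_buflen(const struct toy_mdt *mdt)
{
	return mdt->max_mdsize;
}
```

In this model, a layout write that arrives right after a restart (for example during replay) gets a buffer sized from the default, regardless of how large the mark had grown before the restart.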


Jinshan Xiong added a comment - 19/Jul/18 5:32 PM

> This write intent RPC may be sent weeks or years after the initial open, so it doesn't make sense to modify the layout stored with the initial open (which may not even exist on this same client). We need to replay each of the RPCs that caused the new component to be initialized, whichever client sent it. This is similar to one client getting the open replay transno even though many clients tried to create the same file.

If an open RPC has been committed, should we just ignore the layout in the replay RPC? In other words, the replay would just restore the open context on the MDS side.

RPCs that modify layout components are just regular REINT RPCs. Once they are committed they are gone; there is no need to replay them.
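
The rule proposed in this comment can be written down as a small decision function. This is a toy sketch: `struct toy_req`, `enum toy_action` and `toy_classify()` are hypothetical names, and the model assumes the usual convention that a request counts as committed once its transno is at or below the last committed transno reported by the server. It encodes the proposal as stated here, not the behaviour of any particular Lustre release.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: these types do not exist in Lustre. */
struct toy_req {
	uint64_t transno; /* transaction number assigned by the server */
	bool     is_open; /* true for an open request, false for other REINT */
};

enum toy_action {
	TOY_DROP,              /* committed, nothing to resend */
	TOY_REPLAY,            /* not committed, replay in full */
	TOY_RESTORE_OPEN_ONLY, /* committed open: restore open context only */
};

/* Decide what to do with a saved request after the server restarts. */
static enum toy_action toy_classify(const struct toy_req *req,
				    uint64_t last_committed)
{
	bool committed = req->transno <= last_committed;

	if (!committed)
		return TOY_REPLAY;
	return req->is_open ? TOY_RESTORE_OPEN_ONLY : TOY_DROP;
}
```

Under this rule an uncommitted layout-modifying request is replayed like any other REINT, and a committed open never rewrites the layout.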


Zhenyu Xu added a comment - 19/Jul/18 2:43 PM

I don't know whether it is right or not that mdt_lvbo_fill() returns 0 when the LVBO buffer is smaller than the necessary EA size.


Zhenyu Xu added a comment - 19/Jul/18 2:15 PM - edited

I find that mdt_lvbo_fill() complains that the lvblen (240) is too small to hold the EA (264 bytes). It looks like mdt_intent_layout() hasn't made RMF_DLM_LVB big enough.

mdt_intent_layout()

        if (mdt_object_exists(obj) && !mdt_object_remote(obj)) {
                /* if layout is going to be changed don't use the current EA
                 * size but the maximum one. That buffer will be shrinked
                 * to the actual size in req_capsule_shrink() before reply.
                 */
                if (layout.mlc_opc == MD_LAYOUT_WRITE) {
                        layout_size = info->mti_mdt->mdt_max_mdsize;
                } else {
                        layout_size = mdt_attr_get_eabuf_size(info, obj);
                        if (layout_size < 0)
                                GOTO(out_obj, rc = layout_size);

                        if (layout_size > info->mti_mdt->mdt_max_mdsize)
                                info->mti_mdt->mdt_max_mdsize = layout_size;
                }
        }

So I tried to change the default mdt_max_mdsize to a bigger size, and the test passed.

diff --git a/lustre/include/uapi/linux/lustre/lustre_idl.h b/lustre/include/uapi/linux/lustre/lustre_idl.h
index 7999816676..20d13cb4f6 100644
--- a/lustre/include/uapi/linux/lustre/lustre_idl.h
+++ b/lustre/include/uapi/linux/lustre/lustre_idl.h
@@ -1117,7 +1117,7 @@ struct lov_mds_md_v1 {            /* LOV EA mds/wire data (little-endian) */
        struct lov_ost_data_v1 lmm_objects[0]; /* per-stripe data */
 };
 
-#define MAX_MD_SIZE (sizeof(struct lov_mds_md) + 4 * sizeof(struct lov_ost_data))
+#define MAX_MD_SIZE (sizeof(struct lov_comp_md_v1) + 4 *               \
+                       (sizeof(struct lov_comp_md_entry_v1) +          \
+                        (sizeof(struct lov_mds_md) + 4 *               \
+                         sizeof(struct lov_ost_data))))
 #define MIN_MD_SIZE (sizeof(struct lov_mds_md) + 1 * sizeof(struct lov_ost_data))
 
 /* This is the default MDT reply size allocated, should the striping be bigger,
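
As a rough cross-check of the sizes in this comment, the arithmetic can be sketched as follows. The byte sizes in the enum are assumptions about the on-disk structures (32 bytes for struct lov_mds_md_v1, 24 for struct lov_ost_data_v1, 32 for struct lov_comp_md_v1, 48 for struct lov_comp_md_entry_v1) and should be checked against lustre_idl.h for the release in question. The helper names are hypothetical, and the sketch also assumes that an uninstantiated component still carries a lov_mds_md header with no objects.

```c
#include <stddef.h>

/* Assumed structure sizes in bytes; verify against lustre_idl.h. */
enum {
	SZ_LOV_MDS_MD    = 32, /* struct lov_mds_md_v1 header */
	SZ_LOV_OST_DATA  = 24, /* struct lov_ost_data_v1, one per stripe */
	SZ_COMP_MD       = 32, /* struct lov_comp_md_v1 header */
	SZ_COMP_MD_ENTRY = 48, /* struct lov_comp_md_entry_v1, one per component */
};

/* Size of one plain layout blob with the given number of stripe objects. */
static size_t plain_ea_size(unsigned int stripes)
{
	return SZ_LOV_MDS_MD + (size_t)stripes * SZ_LOV_OST_DATA;
}

/*
 * Size of a composite layout: header, one entry per component, then one
 * plain blob per component (stripes[i] == 0 means not instantiated).
 */
static size_t comp_ea_size(const unsigned int *stripes, unsigned int ncomp)
{
	size_t size = SZ_COMP_MD + (size_t)ncomp * SZ_COMP_MD_ENTRY;

	for (unsigned int i = 0; i < ncomp; i++)
		size += plain_ea_size(stripes[i]);
	return size;
}

/* MAX_MD_SIZE before the patch: one plain layout with 4 stripes. */
static size_t old_max_md_size(void)
{
	return plain_ea_size(4);
}

/* MAX_MD_SIZE after the patch: 4 components of 4 stripes each. */
static size_t new_max_md_size(void)
{
	return SZ_COMP_MD + 4 * (SZ_COMP_MD_ENTRY + plain_ea_size(4));
}
```

Under these assumed sizes, the test file's layout (a 1-stripe component plus an instantiated 2-stripe component) comes to 264 bytes, which matches the EA size reported above and exceeds the 240-byte buffer. The patched MAX_MD_SIZE evaluates to 736 bytes, against 128 bytes for the old definition.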


Mikhail Pershin added a comment - 19/Jul/18 2:13 PM

As far as I can see, the write intent has a transno and is replayed. There is also code to support replay in MDT/MDD/LOD, but it seems to be broken somewhere in the middle. I am checking the whole code path right now.


Andreas Dilger added a comment - 19/Jul/18 7:01 AM

It seems likely that only the original open replay RPC is being sent, and it contains only the first component of the layout (which is always initialized at open). The write intent RPC that causes the later components to be initialized is not being replayed at all.

This write intent RPC may be sent weeks or years after the initial open, so it doesn't make sense to modify the layout stored with the initial open (which may not even exist on this same client). We need to replay each of the RPCs that caused the new component to be initialized, whichever client sent it. This is similar to one client getting the open replay transno even though many clients tried to create the same file.


Mikhail Pershin added a comment - 19/Jul/18 6:01 AM

I suspect that the replay itself is performed as intended, but the replay data is wrong and the new layout is being overwritten with old data. It seems that the replay data contains the old layout instead of the new one. At least, I see in the logs that the layout on the client was changed to the older generation.


Mikhail Pershin added a comment - 19/Jul/18 5:00 AM

This is just a reproducer I've added to replay-single.sh locally; I didn't push it to Gerrit.


People

Assignee: Zhenyu Xu

Reporter: Mikhail Pershin

Votes: 0

Watchers: 8

Dates

Created: 18/Jul/18 11:39 PM

Updated: 27/Feb/19 2:06 PM

Resolved: 29/Oct/18 4:14 PM