[LU-11158] PFL component instantiation is not replayed properly Created: 18/Jul/18  Updated: 27/Feb/19  Resolved: 29/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0
Fix Version/s: Lustre 2.12.0, Lustre 2.10.7

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocker
is blocking LU-10961 Clients hang after failovers. LustreE... Resolved
Related
is related to LU-10686 sanity-pfl test 9 fails with “[0x1000... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While investigating LU-10961 I found that component instantiation is not replayed. A test showing the problem:

test_132a() {
	[ $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.9.90) ] &&
		skip "Do not support PFL files before 2.10"

	$LFS setstripe -E 1M -c 1 -E EOF -c 2 $DIR/$tfile
	replay_barrier $SINGLEMDS
	# writing past the first component's size causes instantiation of the next component
	dd if=/dev/urandom of=$DIR/$tfile bs=1M count=1 seek=1 ||
		error "dd to $DIR/$tfile failed"

	cksum=$(md5sum $DIR/$tfile | awk '{print $1}')
	$LFS getstripe -I2 $DIR/$tfile | grep -q lmm_objects ||
		error "Component #1 was not instantiated"

	fail $SINGLEMDS

	cksum2=$(md5sum $DIR/$tfile | awk '{print $1}')
	if [ $cksum != $cksum2 ] ; then
		error_noexit "New checksum $cksum2 does not match original $cksum"
	fi
	$LFS getstripe -I2 $DIR/$tfile | grep -q lmm_objects ||
		error "Component #1 instantiation was not replayed"
}
run_test 132a "PFL new component instantiate replay"

The problem is double-checked here: with checksums, and by verifying that the next component has lmm_objects assigned. Both checks fail in master.



 Comments   
Comment by Andreas Dilger [ 19/Jul/18 ]

It sounds like the component instantiation RPC needs to be assigned a transno and saved on the client for replay.

Comment by Andreas Dilger [ 19/Jul/18 ]

Mike, I didn't find test_132a in any test script. Was this added in a patch, or is it a test you wrote to reproduce this problem?

Comment by Mikhail Pershin [ 19/Jul/18 ]

This is just a reproducer I added to replay-single.sh locally; I didn't push it to Gerrit.

Comment by Mikhail Pershin [ 19/Jul/18 ]

I suspect the replay itself is done as intended, but the replay data is wrong and the new layout is being overwritten with old data. It seems the replay data contains the old layout instead of the new one. At least I can see in the logs that the layout on the client was changed back to an older generation.

Comment by Andreas Dilger [ 19/Jul/18 ]

It seems likely that only the original open replay RPC is being sent, and it contains only the first component of the layout (which is always instantiated at open). The write intent RPC that causes the later components to be initialized is not being replayed at all.

This write intent RPC may be sent weeks or years after the initial open, so it doesn't make sense to modify the layout stored with the initial open (which may not even exist on this same client). We need to replay each of the RPCs that caused the new component to be initialized, whichever client sent it. This is similar to one client getting the open replay transno even though many clients tried to create the same file.

Comment by Mikhail Pershin [ 19/Jul/18 ]

As I see it, the write intent has a transno and is replayed, and there is also code to support replay in MDT/MDD/LOD, but it seems to be broken somewhere in the middle. I am checking the whole code path right now.

Comment by Zhenyu Xu [ 19/Jul/18 ]

I found that mdt_lvbo_fill() complains that the lvblen (240) is too small to hold the EA (sized 264). It looks like mdt_intent_layout() hasn't sized RMF_DLM_LVB big enough.

mdt_intent_layout()
        if (mdt_object_exists(obj) && !mdt_object_remote(obj)) {
                /* if layout is going to be changed don't use the current EA
                 * size but the maximum one. That buffer will be shrinked
                 * to the actual size in req_capsule_shrink() before reply.
                 */
                if (layout.mlc_opc == MD_LAYOUT_WRITE) {
                        layout_size = info->mti_mdt->mdt_max_mdsize;
                } else {
                        layout_size = mdt_attr_get_eabuf_size(info, obj);
                        if (layout_size < 0)
                                GOTO(out_obj, rc = layout_size);

                        if (layout_size > info->mti_mdt->mdt_max_mdsize)
                                info->mti_mdt->mdt_max_mdsize = layout_size;
                }
        }

So I tried to change the default mdt_max_mdsize to a bigger size, and the test passed.

diff --git a/lustre/include/uapi/linux/lustre/lustre_idl.h b/lustre/include/uapi/linux/lustre/lustre_idl.h
index 7999816676..20d13cb4f6 100644
--- a/lustre/include/uapi/linux/lustre/lustre_idl.h
+++ b/lustre/include/uapi/linux/lustre/lustre_idl.h
@@ -1117,7 +1117,10 @@ struct lov_mds_md_v1 {            /* LOV EA mds/wire data (little-endian) */
        struct lov_ost_data_v1 lmm_objects[0]; /* per-stripe data */
 };
 
-#define MAX_MD_SIZE (sizeof(struct lov_mds_md) + 4 * sizeof(struct lov_ost_data))
+#define MAX_MD_SIZE (sizeof(struct lov_comp_md_v1) + 4 *               \
+                       (sizeof(struct lov_comp_md_entry_v1) +          \
+                        (sizeof(struct lov_mds_md) + 4 *               \
+                         sizeof(struct lov_ost_data))))
 #define MIN_MD_SIZE (sizeof(struct lov_mds_md) + 1 * sizeof(struct lov_ost_data))
 
 /* This is the default MDT reply size allocated, should the striping be bigger, 
Comment by Zhenyu Xu [ 19/Jul/18 ]

I don't know whether it is right or not that mdt_lvbo_fill() returns 0 when the LVBO buffer is smaller than the necessary EA size.

Comment by Jinshan Xiong [ 19/Jul/18 ]

This write intent RPC may be sent weeks or years after the initial open, so it doesn't make sense to modify the layout stored with the initial open (which may not even exist on this same client). We need to replay each of the RPCs that caused the new component to be initialized, whichever client sent it. This is similar to one client getting the open replay transno even though many clients tried to create the same file.

 

If an open RPC has been committed, should we just ignore the layout in the replay RPC? In other words, it would just restore the open context on the MDS side.

RPCs that modify layout components are just regular REINT RPCs. Once they are committed, they are gone; there is no need to replay them.

Comment by Jinshan Xiong [ 19/Jul/18 ]

Bobi - there was some good discussion about this problem before, but I don't remember the ticket number. We should probably pick it up and complete the solution I proposed there. The real problem here is that 'mdt_max_mdsize' keeps increasing to the current maximum mdsize it sees on the MDS (IIRC), which causes problems for layout writes when the system starts.

Comment by Mikhail Pershin [ 19/Jul/18 ]

IIRC, mdt_lvbo_fill() may skip fetching the EA for exactly this kind of reason - something like "we can do nothing here, let's report the new EA size back and a separate getxattr RPC will follow". That does not work for RPCs that need to be replayed, though.

Comment by Mikhail Pershin [ 19/Jul/18 ]

On the other hand, I wonder why the client can't supply the correct EA size when updating the layout? It knows the size, doesn't it? I mean, the reply buffer on the client side could be allocated with the proper size.

Comment by Jinshan Xiong [ 19/Jul/18 ]

Yes, there usually is a layout intent in the file's layout, but there are also cases where the file has only a partially defined layout.

 

The philosophy behind the design is that the MDS should decide what layout it will allocate and how many components it should instantiate, so the client technically doesn't know the actual EA size. Does this make sense to you?

Comment by Mikhail Pershin [ 19/Jul/18 ]

Do you mean, for example, that a new component may have a defined size but no stripe count, etc., and the MDS will complete that and provide the final layout? Yes, that makes sense. I think we have a couple of options here. First, we can try to grow the reply buffer for modification cases so the layout will fit into it. The second option is quite non-trivial, but still: what if the MDS returned not the whole layout in the reply but just the new component data? Considering that we hold an EX lock and that instantiating a new component doesn't change earlier components, that should work and would require a smaller reply. Just thoughts; I am probably missing something here.

Comment by Mikhail Pershin [ 19/Jul/18 ]

Also, speaking about the decision on the client side: while the MDS creates the layout, the client can still predict its size quite accurately if the number of stripes is known, because the layout size depends mostly on that. If it is not specified, the client could allocate a large reply buffer for some number of stripes, and the MDS could take that into account while creating the new component.

I wasn't participating in the discussion you mentioned; maybe there is a good solution already.

Comment by Gerrit Updater [ 20/Jul/18 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/32847
Subject: LU-11158 mdt: grow lvb buffer to hold layout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ef68d2e72f7fad7049594d14a78dda143fc0f736

Comment by Gerrit Updater [ 29/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32847/
Subject: LU-11158 mdt: grow lvb buffer to hold layout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e5abcf83c0575b8a79594c1eb9ea727739d91522

Comment by Peter Jones [ 29/Oct/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 16/Jan/19 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/34049
Subject: LU-11158 mdt: grow lvb buffer to hold layout
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: ec5ce00e9c1d95a178a9ea5bf6cd2b26e0e28837

Comment by Gerrit Updater [ 15/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34049/
Subject: LU-11158 mdt: grow lvb buffer to hold layout
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: a1d1006a5e2bd7ba3dd9096107c456b353a3eeb0

Generated at Sat Feb 10 02:41:26 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.