Details
Bug
Resolution: Fixed
Critical
Lustre 2.12.1
None
configuration with router, MDS_FS_MKFS_OPTS="-O large_xattr", sanity-dom/sanityn
3
9223372036854775807
Description
During sanity-dom testing the following issue appeared:
[ 1644.726837] LNetError: 3137:0:(lib-move.c:4143:lnet_parse()) 192.168.8.1@tcp, src 192.168.8.1@tcp: bad PUT payload 1051832 (1048576 max expected)
I added some debugging to capture vmcores from the sender.
Here is the analysis from crash:
md = {
start = 0xffff880098200100,
length = 1051832,
threshold = 0,
max_size = -1742733056,
options = 0,
user_ptr = 0xffff880098200000,
eq_handle = {
cookie = 23
},
bulk_handle = {
cookie = 0
}
},
msg_niov = 1,
msg_iov = 0xffff88009995aba0,
msg_kiov = 0x0,
ffff880135be9800
rc_fmt = 0xffffffffc095d080 <RQF_LDLM_INTENT_OPEN>,
static const struct req_msg_field *ldlm_intent_open_server[] = {
&RMF_PTLRPC_BODY,
&RMF_DLM_REP,
&RMF_MDT_BODY,
&RMF_MDT_MD,
&RMF_ACL,
&RMF_CAPA1,
&RMF_CAPA2,
&RMF_NIOBUF_INLINE,
};
rc_area = {{4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}, {184, 112, 216, 2432, 260, 0, 0, 1048592, 4294967295, 4294967295}}
}
crash> p 184+112+216+2432+260+1048592
$15 = 1051796
The DOM size during open was 1 MB, so the total length of the LNet request was 1051796 bytes, which does not fit within the LNET_MTU limit (the router expects at most 1048576 bytes). That is why the router reports the error.
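The arithmetic from the crash session can be reproduced with a small standalone check. This is only a sketch: the helper names (rc_area_total, payload_excess) are made up for illustration and are not Lustre functions, the limit is the "1048576 max expected" value taken from the router message, and treating 4294967295 as "slot unset" is an assumption about how rc_area is initialized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Limit reported by the router in the lnet_parse() error message. */
#define MAX_PUT_PAYLOAD 1048576u

/* Assumption: slots holding 4294967295 in the dump are unset. */
#define RC_AREA_UNSET UINT32_MAX

/* Second row of rc_area from the crash dump (reply buffer sizes). */
const uint32_t reply_area[10] = {
	184, 112, 216, 2432, 260, 0, 0, 1048592, 4294967295u, 4294967295u
};

/* Sum the buffer lengths of one rc_area row, skipping unset slots. */
uint64_t rc_area_total(const uint32_t *area, size_t count)
{
	uint64_t total = 0;

	assert(area != NULL);
	for (size_t i = 0; i < count; i++)
		if (area[i] != RC_AREA_UNSET)
			total += area[i];
	return total;
}

/* Bytes by which a payload exceeds the limit, or 0 if it fits. */
uint64_t payload_excess(uint64_t payload)
{
	return payload > MAX_PUT_PAYLOAD ? payload - MAX_PUT_PAYLOAD : 0;
}
```

Calling rc_area_total(reply_area, 10) gives the same 1051796 as the crash "p" command, and payload_excess() on that value gives 3220 bytes over the limit. For the md length of 1051832 reported in the error, the excess is 3256 bytes.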
This means we cannot handle a DOM file with a 1 MB stripe size at the LNet layer. I think it is probably also a problem for PFL when the first stripe is located on the MDS.
The workaround for sanity-dom testing is to decrease DOM_SIZE in sanity-dom.sh.
The MDS should also limit this size to prevent such misbehavior.
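To illustrate what such a limit could look like, here is a rough sketch and not the actual MDT code. The function clamp_inline_len and both macros are hypothetical. The overhead value is simply the md length from the dump minus the 1 MB of data (1051832 - 1048576), which assumes the inline data was exactly 1048576 bytes; the real overhead depends on the other reply buffers and would have to be computed per request.

```c
#include <assert.h>
#include <stdint.h>

/* Limit reported by the router in the lnet_parse() error message. */
#define MAX_PUT_PAYLOAD 1048576u

/* Assumption: non-data bytes observed in this one dump (3256). */
#define OBSERVED_OVERHEAD (1051832u - 1048576u)

/*
 * Hypothetical helper: return the largest inline data length, not
 * exceeding 'requested', such that overhead + data still fits in
 * 'limit'. Returns 0 when the overhead alone fills the limit.
 */
uint32_t clamp_inline_len(uint32_t requested, uint32_t overhead,
			  uint32_t limit)
{
	uint32_t room;

	assert(limit > 0);
	if (overhead >= limit)
		return 0;
	room = limit - overhead;
	return requested < room ? requested : room;
}
```

With the numbers from this ticket, clamp_inline_len(1048576, OBSERVED_OVERHEAD, MAX_PUT_PAYLOAD) returns 1045320, so a 1 MB DOM read would have to be trimmed by 3256 bytes to pass the router, while smaller requests pass through unchanged.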
I've assigned this to Mikhail, though I'm not sure that is the right assignment.