Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.12.1
-
None
-
configuration with router, MDS_FS_MKFS_OPTS="-O large_xattr", sanity-dom/sanityn
-
3
Description
During sanity-dom testing the following issue appeared:
[ 1644.726837] LNetError: 3137:0:(lib-move.c:4143:lnet_parse()) 192.168.8.1@tcp, src 192.168.8.1@tcp: bad PUT payload 1051832 (1048576 max expected)
I added some debug code to capture a vmcore from the sender.
Here is the analysis from crash:
md = {
  start = 0xffff880098200100,
  length = 1051832,
  threshold = 0,
  max_size = -1742733056,
  options = 0,
  user_ptr = 0xffff880098200000,
  eq_handle = { cookie = 23 },
  bulk_handle = { cookie = 0 }
},
msg_niov = 1,
msg_iov = 0xffff88009995aba0,
msg_kiov = 0x0,

ffff880135be9800
rc_fmt = 0xffffffffc095d080 <RQF_LDLM_INTENT_OPEN>,

static const struct req_msg_field *ldlm_intent_open_server[] = {
        &RMF_PTLRPC_BODY,
        &RMF_DLM_REP,
        &RMF_MDT_BODY,
        &RMF_MDT_MD,
        &RMF_ACL,
        &RMF_CAPA1,
        &RMF_CAPA2,
        &RMF_NIOBUF_INLINE,
};

rc_area = {{4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295},
           {184, 112, 216, 2432, 260, 0, 0, 1048592, 4294967295, 4294967295}}
}

crash> p 184+112+216+2432+260+1048592
$15 = 1051796
The DOM size during open was 1 MiB, so the reply buffers add up to 1051796 bytes, which does not fit within the LNET_MTU limit (1048576 bytes). That is why the router reports the error.
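The size check can be reproduced with a short sketch. It assumes that 4294967295 (0xffffffff) in rc_area marks an unset slot, as the crash sum above also skips those entries, and it takes LNET_MTU as 1 << 20 to match the "1048576 max expected" value in the log. The logged payload (1051832) is slightly larger than this sum; the sketch does not model that difference.

```python
# Reply buffer sizes copied from the second row of rc_area in the crash dump.
UNUSED = 0xFFFFFFFF      # assumption: 4294967295 marks an unset slot
LNET_MTU = 1 << 20       # 1048576, the "max expected" value in the log

reply_sizes = [184, 112, 216, 2432, 260, 0, 0, 1048592, UNUSED, UNUSED]


def total_reply_size(sizes):
    """Sum the buffer sizes, skipping unset slots."""
    return sum(s for s in sizes if s != UNUSED)


total = total_reply_size(reply_sizes)
overshoot = total - LNET_MTU

print("total reply size:", total)
print("exceeds LNET_MTU by:", overshoot)
```

Under these assumptions the 1 MiB inline buffer alone nearly fills the limit, so any additional reply fields push the message over it.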
This means a DOM file with a 1 MiB stripe size cannot be handled at the LNet layer. It is probably also a problem for PFL layouts whose first stripe is located on the MDS.
The workaround for sanity-dom testing is to decrease DOM_SIZE in sanity-dom.sh.
The MDS should also limit this size to prevent such misbehavior.
I've assigned this to Mikhail, though I'm not sure that is the right choice.
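One way to picture such a limit is a clamp on the inline data buffer. This is only an illustrative sketch, not the actual MDS change: the helper name clamp_inline_size is hypothetical, the other buffer sizes are the ones from the crash dump, and message header and padding overhead are ignored.

```python
LNET_MTU = 1 << 20  # 1048576, the "max expected" value in the log


def clamp_inline_size(requested, other_buffers, limit=LNET_MTU):
    """Hypothetical helper: largest inline size that keeps the reply within limit."""
    room = limit - sum(other_buffers)
    return max(0, min(requested, room))


# Sizes of the non-inline reply buffers from the crash dump.
other = [184, 112, 216, 2432, 260]

print(clamp_inline_size(1048592, other))  # request from the failing open
print(clamp_inline_size(65536, other))    # a small request passes unchanged
```

A real fix would have to account for the full message overhead, which this sketch leaves out.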