I tested this bug again on 2.10.1-RC1 (with no patches applied).
It looks like network I/O on mdt1 timed out. Could you verify that the network on mdt1 is working correctly?
=> All osp states look normal:
[mdt1 server]
./osp/jlustre-MDT0000-osp-MDT0001/state:current_state: FULL
./osp/jlustre-MDT0000-osp-MDT0001/import: state: FULL
./osp/jlustre-OST0000-osc-MDT0001/state:current_state: FULL
./osp/jlustre-OST0000-osc-MDT0001/import: state: FULL
[mdt0 server]
./osp/jlustre-MDT0001-osp-MDT0000/state:current_state: FULL
./osp/jlustre-MDT0001-osp-MDT0000/import: state: FULL
./osp/jlustre-OST0000-osc-MDT0000/state:current_state: FULL
./osp/jlustre-OST0000-osc-MDT0000/import: state: FULL
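A listing like the one above can be checked mechanically instead of by eye. The following is a minimal sketch, assuming the listing is available as plain text in the same layout (one entry per line, with the state as the last token); the function name `non_full_entries` is hypothetical and not part of any Lustre tool.

```python
def non_full_entries(listing):
    """Return the lines of a state listing whose reported state is not FULL."""
    bad = []
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and host headers such as "[mdt1 server]".
        if not line or line.startswith("["):
            continue
        # Assumption: the state is the last whitespace-separated token.
        if line.split()[-1] != "FULL":
            bad.append(line)
    return bad


# Sample input copied from the mdt1 listing in this comment.
listing = """\
[mdt1 server]
./osp/jlustre-MDT0000-osp-MDT0001/state:current_state: FULL
./osp/jlustre-MDT0000-osp-MDT0001/import: state: FULL
./osp/jlustre-OST0000-osc-MDT0001/state:current_state: FULL
./osp/jlustre-OST0000-osc-MDT0001/import: state: FULL
"""

print(non_full_entries(listing))
```

An empty list means every import in the listing reported FULL, which matches what is shown here for both servers.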
[/var/log/messages on mdt0 server]
Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
Sep 19 02:50:14 ossb2 kernel: LNetError: 21764:0:(o2iblnd_cb.c:1554:kiblnd_send()) Can't setup GET sink for 172.20.110.209@o2ib: -22
Sep 19 02:50:14 ossb2 kernel: LustreError: 21764:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff88086ea2e400
Sep 19 02:51:54 ossb2 kernel: LustreError: 21764:0:(ldlm_lib.c:3237:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff880457208c50 x1578948605516272/t0(0) o1000->jlustre-MDT0001-mdtlov_UUID@172.20.110.209@o2ib:210/0 lens 376/0 e 4 to 0 dl 1505803920 ref 1 fl Interpret:/0/ffffffff rc 0/-1
[/var/log/messages on mdt1 server]
Sep 19 14:51:22 ossb1 kernel: LustreError: 11-0: jlustre-MDT0000-osp-MDT0001: operation out_update to node 172.20.110.210@o2ib failed: rc = -110
Sep 19 14:51:22 ossb1 kernel: LustreError: 31069:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)#012 req@ffff8807d3aa7800 x1578948605516272/t0(0) o1000->jlustre-MDT0000-osp-MDT0001@172.20.110.210@o2ib:24/4 lens 376/192 e 4 to 0 dl 1505803889 ref 2 fl Interpret:ReM/0/0 rc -110/-110
Sep 19 14:51:24 ossb1 kernel: LustreError: 30780:0:(llog_cat.c:773:llog_cat_cancel_records()) jlustre-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog
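For reading the logs above: the negative numbers (-22, -5, -110) are Linux errno values returned as `-errno` by kernel code. The small table below is a sketch that hardcodes the standard Linux values for just these three codes; the helper name `describe_rc` is hypothetical.

```python
# Linux errno values for the return codes that appear in the logs above.
# Hardcoded on purpose: errno numbering differs between operating systems.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    22: ("EINVAL", "Invalid argument"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def describe_rc(rc):
    """Map a kernel-style negative return code to its errno name and text."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"rc = {rc}: {name} ({text})"


for rc in (-22, -5, -110):
    print(describe_rc(rc))
```

Read this way, the sequence is: the FMR mapping on mdt0 fails with EINVAL, the bulk callback reports EIO, and the out_update request from mdt1 then ends with ETIMEDOUT.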
Good news - thanks