Dec 20 07:02:45 mds14 kernel: Lustre: 73277:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257025/real 1545257025] req@ffff8f966e2fce00 x1620318878189712/t0(0) o101->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:24/4 lens 328/344 e 0 to 1 dl 1545257036 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:02:45 mds14 kernel: Lustre: 73277:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message Dec 20 07:02:45 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:02:47 mds14 kernel: Lustre: 71209:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257025/real 1545257025] req@ffff8f91c3cc2d00 x1620318878165152/t0(0) o103->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257036 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:02:50 mds14 kernel: Lustre: 71252:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257029/real 1545257029] req@ffff8f9784084200 x1620318878486384/t0(0) o41->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:24/4 lens 224/368 e 0 to 1 dl 1545257040 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:02:50 mds14 kernel: Lustre: 71252:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2143 previous similar messages Dec 20 07:03:10 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:03:19 mds14 kernel: Lustre: 71253:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257025/real 1545257025] req@ffff8f9784081e00 x1620318878476560/t0(0) o400->scratch0-MDT0000-lwp-MDT0001@10.0.11.225@o2ib10:12/10 lens 224/224 e 0 to 1 dl 1545257069 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Dec 20 07:03:19 mds14 kernel: Lustre: 71253:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 963 previous similar messages Dec 20 07:03:19 mds14 kernel: Lustre: scratch0-MDT0000-lwp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:03:20 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:03:33 mds14 kernel: LustreError: 71182:0:(events.c:450:server_bulk_callback()) event type 5, status -61, desc ffff8f8c46b58000 Dec 20 07:03:33 mds14 kernel: LustreError: 71182:0:(events.c:450:server_bulk_callback()) event type 3, status -61, desc ffff8f8c46b58000 Dec 20 07:03:33 mds14 kernel: LustreError: 74104:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8f97d6d6a050 x1620318877676672/t0(0) o1000->scratch0-MDT0000-mdtlov_UUID@10.0.11.225@o2ib10:99/0 lens 368/0 e 0 to 0 dl 1545257089 ref 1 fl Interpret:/0/ffffffff rc 0/-1 Dec 20 07:03:45 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:03:52 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:03:52 mds14 kernel: Lustre: scratch0-MDT0001: Connection restored to 10.0.11.225@o2ib10 (at 10.0.11.225@o2ib10) Dec 20 07:03:52 mds14 kernel: Lustre: Skipped 31 previous similar messages Dec 20 07:04:05 mds14 kernel: Lustre: 72413:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257102/real 1545257102] req@ffff8f98d2a31e00 x1620318883868224/t0(0) o1000->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:24/4 lens 368/4320 e 0 to 1 dl 1545257115 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:04:05 mds14 kernel: Lustre: 72413:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Dec 20 07:04:05 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:04:16 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection restored to 10.0.11.225@o2ib10 (at 10.0.11.225@o2ib10) Dec 20 07:04:16 mds14 kernel: Lustre: Skipped 1 previous similar message Dec 20 07:04:16 mds14 kernel: Lustre: 71207:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1545257126/real 1545257126] req@ffff8f9682ef1e00 x1620318878125680/t0(0) o103->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257143 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1 Dec 20 07:04:16 mds14 kernel: Lustre: 71207:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 816 previous similar messages Dec 20 07:04:16 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:04:16 mds14 kernel: LNetError: 71167:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0) Dec 20 07:04:26 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:04:37 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:04:37 mds14 kernel: Lustre: scratch0-MDT0001: Connection restored to 10.0.11.225@o2ib10 (at 10.0.11.225@o2ib10) Dec 20 07:04:37 mds14 kernel: Lustre: Skipped 5 previous similar messages Dec 20 07:04:41 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:04:41 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (3): c: 0, oc: 1, rc: 63 Dec 20 07:04:45 mds14 kernel: Lustre: 71207:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1545257144/real 0] req@ffff8f97b63eda00 x1620318887556432/t0(0) o103->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257155 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:04:45 mds14 kernel: Lustre: 71207:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 44 previous similar messages Dec 20 07:04:45 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:04:45 mds14 kernel: Lustre: Skipped 1 previous similar message Dec 20 07:04:47 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:04:47 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 2 previous similar messages Dec 20 07:04:47 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (6): c: 0, oc: 0, rc: 63 Dec 20 07:04:47 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 2 previous similar messages Dec 20 07:04:47 mds14 kernel: LNet: 73058:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:04:47 mds14 kernel: LNet: 73058:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:04:54 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds Dec 20 07:04:54 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (7): c: 0, oc: 0, rc: 63 Dec 20 07:05:00 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:05:00 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (6): c: 0, oc: 0, rc: 63 Dec 20 07:05:00 mds14 kernel: LNet: 73058:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:05:00 mds14 kernel: LNet: 73058:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:05:06 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:05:06 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 2 previous similar messages Dec 20 07:05:06 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (6): c: 0, oc: 0, rc: 63 Dec 20 07:05:06 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 2 previous similar messages Dec 20 07:05:06 mds14 kernel: LNet: 73058:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:05:06 mds14 kernel: LNet: 73058:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:05:18 mds14 kernel: Lustre: 72413:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1545257140/real 0] req@ffff8f989fe9b900 x1620318887181088/t0(0) o1000->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:24/4 lens 4072/4320 e 0 to 1 dl 1545257188 ref 3 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:05:18 mds14 kernel: Lustre: 72413:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 4994 previous similar messages Dec 20 07:05:19 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:05:19 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 4 previous similar messages Dec 20 07:05:19 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (6): c: 0, oc: 0, rc: 63 Dec 20 07:05:19 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 4 previous similar messages Dec 20 07:05:19 mds14 kernel: LNet: 73059:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:05:19 mds14 kernel: LNet: 73059:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:05:25 mds14 kernel: Lustre: scratch0-MDT0000-lwp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:05:25 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:05:30 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:05:55 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:06:07 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:06:07 mds14 kernel: Lustre: scratch0-MDT0001: Connection restored to 10.0.11.225@o2ib10 (at 10.0.11.225@o2ib10) Dec 20 07:06:23 mds14 kernel: Lustre: 71207:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257237/real 1545257237] req@ffff8f9662671200 x1620318887468816/t0(0) o103->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257254 ref 1 fl Rpc:X/2/ffffffff rc 0/-1 Dec 20 07:06:23 mds14 kernel: Lustre: 71207:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Dec 20 07:06:23 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:06:48 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:06:48 mds14 kernel: LustreError: Skipped 1 previous similar message Dec 20 07:07:13 mds14 kernel: LNet: 73059:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:07:13 mds14 kernel: LNet: 73059:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 7 previous similar messages Dec 20 07:07:13 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection restored to 10.0.11.225@o2ib10 (at 10.0.11.225@o2ib10) Dec 20 07:07:13 mds14 kernel: Lustre: Skipped 1 previous similar message Dec 20 07:07:33 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:07:39 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:07:39 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 6 previous similar messages Dec 20 07:07:39 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (1): c: 0, oc: 0, rc: 63 Dec 20 07:07:39 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 6 previous similar messages Dec 20 07:07:45 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:07:45 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:08:30 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:08:30 mds14 kernel: Lustre: Skipped 1 previous similar message Dec 20 07:10:01 mds14 systemd[1]: Created slice User Slice of root. Dec 20 07:10:01 mds14 systemd[1]: Starting User Slice of root. Dec 20 07:10:01 mds14 systemd[1]: Started Session 324 of user root. Dec 20 07:10:01 mds14 systemd[1]: Starting Session 324 of user root. Dec 20 07:10:01 mds14 systemd[1]: Removed slice User Slice of root. Dec 20 07:10:01 mds14 systemd[1]: Stopping User Slice of root. Dec 20 07:08:59 mds14 kernel: Lustre: 71210:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257385/real 1545257385] req@ffff8f973dde2100 x1620318896264768/t0(0) o103->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257409 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:08:59 mds14 kernel: Lustre: 71210:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1038 previous similar messages Dec 20 07:08:59 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:09:01 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:09:26 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:09:26 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:09:26 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:09:51 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection restored to 10.0.11.225@o2ib10 (at 10.0.11.225@o2ib10) Dec 20 07:09:51 mds14 kernel: Lustre: Skipped 5 previous similar messages Dec 20 07:09:57 mds14 kernel: LNetError: 71167:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0) Dec 20 07:09:57 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:10:09 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:10:09 mds14 kernel: Lustre: Skipped 1 previous similar message Dec 20 07:10:18 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:10:18 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 4 previous similar messages Dec 20 07:10:18 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (5): c: 0, oc: 0, rc: 63 Dec 20 07:10:18 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 4 previous similar messages Dec 20 07:10:27 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:10:36 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:10:36 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 7 previous similar messages Dec 20 07:11:10 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:11:10 mds14 kernel: Lustre: Skipped 2 previous similar messages Dec 20 07:11:23 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds Dec 20 07:11:23 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 23 previous similar messages Dec 20 07:11:23 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (7): c: 0, oc: 0, rc: 63 Dec 20 07:11:23 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 23 previous similar messages Dec 20 07:11:47 mds14 kernel: LNet: 71183:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.225@o2ib10 Dec 20 07:12:07 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:13:29 mds14 kernel: Lustre: 71208:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257673/real 1545257673] req@ffff8f97cefa5700 x1620318897841024/t0(0) o103->scratch0-MDT0000-osp-MDT0001@10.0.11.225@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257680 ref 1 fl Rpc:X/2/ffffffff rc 0/-1 Dec 20 07:13:29 mds14 kernel: Lustre: 71208:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 12694 previous similar messages Dec 20 07:13:29 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:13:29 mds14 kernel: Lustre: Skipped 1 previous similar message Dec 20 07:14:43 mds14 systemd[1]: Created slice User Slice of root. Dec 20 07:14:43 mds14 systemd[1]: Starting User Slice of root. Dec 20 07:14:43 mds14 systemd-logind[2963]: New session 325 of user root. Dec 20 07:14:43 mds14 systemd[1]: Started Session 325 of user root. Dec 20 07:14:43 mds14 systemd[1]: Starting Session 325 of user root. Dec 20 07:13:36 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:13:36 mds14 kernel: LustreError: 72238:0:(mgc_request.c:596:do_requeue()) failed processing log: -5 Dec 20 07:14:27 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection restored to 10.0.11.225@o2ib10 (at 10.0.11.225@o2ib10) Dec 20 07:14:27 mds14 kernel: Lustre: Skipped 9 previous similar messages Dec 20 07:14:57 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:15:14 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds Dec 20 07:15:14 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 11 previous similar messages Dec 20 07:15:14 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (4): c: 0, oc: 0, rc: 63 Dec 20 07:15:14 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 11 previous similar messages Dec 20 07:15:20 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:15:20 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 31 previous similar messages Dec 20 07:15:33 mds14 kernel: LNet: 71183:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.225@o2ib10 Dec 20 07:15:58 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:15:58 mds14 kernel: LustreError: Skipped 2 previous similar messages Dec 20 07:16:14 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:16:18 mds14 kernel: LustreError: 72238:0:(mgc_request.c:596:do_requeue()) failed processing log: -5 Dec 20 07:17:43 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:17:44 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:17:44 mds14 kernel: Lustre: Skipped 4 previous similar messages Dec 20 07:20:01 mds14 systemd[1]: Started Session 326 of user root. Dec 20 07:20:01 mds14 systemd[1]: Starting Session 326 of user root. Dec 20 07:19:11 mds14 kernel: Lustre: scratch0-MDT0000-lwp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:19:11 mds14 kernel: Lustre: Skipped 5 previous similar messages Dec 20 07:20:33 mds14 systemd-logind[2963]: Removed session 325. Dec 20 07:20:33 mds14 systemd[1]: Removed slice User Slice of root. Dec 20 07:20:33 mds14 systemd[1]: Stopping User Slice of root. Dec 20 07:21:25 mds14 systemd[1]: Created slice User Slice of root. Dec 20 07:21:25 mds14 systemd[1]: Starting User Slice of root. Dec 20 07:21:25 mds14 systemd-logind[2963]: New session 327 of user root. Dec 20 07:21:25 mds14 systemd[1]: Started Session 327 of user root. Dec 20 07:21:25 mds14 systemd[1]: Starting Session 327 of user root. Dec 20 07:20:33 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.225@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:20:33 mds14 kernel: LNet: 74627:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 35 previous similar messages Dec 20 07:20:38 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:20:38 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 33 previous similar messages Dec 20 07:20:38 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (5): c: 0, oc: 63, rc: 63 Dec 20 07:20:38 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 33 previous similar messages Dec 20 07:20:40 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:21:51 mds14 systemd-logind[2963]: Removed session 327. Dec 20 07:21:51 mds14 systemd[1]: Removed slice User Slice of root. Dec 20 07:21:51 mds14 systemd[1]: Stopping User Slice of root. Dec 20 07:20:53 mds14 kernel: LustreError: 72339:0:(ldlm_lib.c:3258:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff8f97d6d6b050 x1620318905781376/t0(0) o1000->scratch0-MDT0000-mdtlov_UUID@10.0.11.225@o2ib10:421/0 lens 368/0 e 0 to 0 dl 1545258166 ref 1 fl Interpret:/0/ffffffff rc 0/-1 Dec 20 07:21:29 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:21:29 mds14 kernel: LustreError: Skipped 2 previous similar messages Dec 20 07:23:08 mds14 systemd[1]: Created slice User Slice of root. Dec 20 07:23:08 mds14 systemd[1]: Starting User Slice of root. Dec 20 07:23:08 mds14 systemd-logind[2963]: New session 328 of user root. Dec 20 07:23:08 mds14 systemd[1]: Started Session 328 of user root. Dec 20 07:23:08 mds14 systemd[1]: Starting Session 328 of user root. Dec 20 07:22:12 mds14 kernel: Lustre: 71252:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257717/real 1545257717] req@ffff8f982bb36c00 x1620318897956784/t0(0) o400->scratch0-MDT0000-lwp-MDT0001@10.0.11.225@o2ib10:12/10 lens 224/224 e 0 to 1 dl 1545258202 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Dec 20 07:22:12 mds14 kernel: Lustre: 71252:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 11594 previous similar messages Dec 20 07:23:39 mds14 kernel: Lustre: scratch0-MDT0001: Received new LWP connection from 10.0.11.225@o2ib10, removing former export from same NID Dec 20 07:23:39 mds14 kernel: Lustre: Skipped 6 previous similar messages Dec 20 07:23:39 mds14 kernel: Lustre: scratch0-MDT0001: Connection restored to 10.0.11.225@o2ib10 (at 10.0.11.225@o2ib10) Dec 20 07:23:39 mds14 kernel: Lustre: Skipped 26 previous similar messages Dec 20 07:26:30 mds14 systemd-logind[2963]: Removed session 328. Dec 20 07:26:30 mds14 systemd[1]: Removed slice User Slice of root. Dec 20 07:26:30 mds14 systemd[1]: Stopping User Slice of root. Dec 20 07:26:11 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:26:11 mds14 kernel: LustreError: Skipped 1 previous similar message Dec 20 07:26:15 mds14 kernel: LNet: 71182:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.225@o2ib10 Dec 20 07:30:01 mds14 systemd[1]: Created slice User Slice of root. Dec 20 07:30:01 mds14 systemd[1]: Starting User Slice of root. Dec 20 07:30:01 mds14 systemd[1]: Started Session 329 of user root. Dec 20 07:30:01 mds14 systemd[1]: Starting Session 329 of user root. Dec 20 07:30:01 mds14 systemd[1]: Removed slice User Slice of root. Dec 20 07:30:01 mds14 systemd[1]: Stopping User Slice of root. Dec 20 07:29:47 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:29:47 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 25 previous similar messages Dec 20 07:29:47 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.225@o2ib10 (1): c: 0, oc: 0, rc: 63 Dec 20 07:29:47 mds14 kernel: LNetError: 71167:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 25 previous similar messages Dec 20 07:30:08 mds14 kernel: Lustre: scratch0-MDT0000-osp-MDT0001: Connection to scratch0-MDT0000 (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:30:08 mds14 kernel: Lustre: Skipped 9 previous similar messages Dec 20 07:30:31 mds14 kernel: LustreError: 166-1: MGC10.0.11.225@o2ib10: Connection to MGS (at 10.0.11.225@o2ib10) was lost; in progress operations using this service will fail Dec 20 07:30:56 mds14 kernel: LustreError: 137-5: scratch0-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:30:56 mds14 kernel: LustreError: Skipped 9 previous similar messages Dec 20 07:32:54 mds14 systemd[1]: Created slice User Slice of root. Dec 20 07:32:54 mds14 systemd[1]: Starting User Slice of root. Dec 20 07:32:54 mds14 systemd-logind[2963]: New session 330 of user root. Dec 20 07:32:54 mds14 systemd[1]: Started Session 330 of user root. Dec 20 07:32:54 mds14 systemd[1]: Starting Session 330 of user root.