Dec 20 07:02:18 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:02:18 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (6): c: 0, oc: 0, rc: 63 Dec 20 07:02:18 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0000: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10) Dec 20 07:02:19 mds13 kernel: Lustre: Skipped 29 previous similar messages Dec 20 07:02:19 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1545257025/real 0] req@ffff91e9f7715700 x1620318877676672/t0(0) o1000->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:24/4 lens 368/4320 e 0 to 1 dl 1545257036 ref 3 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:02:19 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message Dec 20 07:02:19 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:02:24 mds13 kernel: Lustre: 72025:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1545257029/real 0] req@ffff91d59caeb300 x1620318877862928/t0(0) o103->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257040 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:02:24 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:02:24 mds13 kernel: LNetError: 
71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 3 previous similar messages Dec 20 07:02:24 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (5): c: 0, oc: 0, rc: 63 Dec 20 07:02:24 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 3 previous similar messages Dec 20 07:02:31 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds Dec 20 07:02:31 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 2 previous similar messages Dec 20 07:02:31 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (7): c: 0, oc: 0, rc: 63 Dec 20 07:02:31 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 2 previous similar messages Dec 20 07:02:32 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:02:32 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:02:37 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:02:37 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (6): c: 0, oc: 0, rc: 63 Dec 20 07:02:43 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:02:43 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (6): c: 0, oc: 0, rc: 63 Dec 20 07:02:43 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:02:43 mds13 kernel: LNet: 
67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:02:53 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:02:53 mds13 kernel: Lustre: scratch0-MDT0000: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10) Dec 20 07:02:54 mds13 kernel: Lustre: MGS: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:02:54 mds13 kernel: LNet: 71968:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10 Dec 20 07:02:56 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:02:56 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 4 previous similar messages Dec 20 07:02:56 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (3): c: 0, oc: 1, rc: 63 Dec 20 07:02:56 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 4 previous similar messages Dec 20 07:03:00 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:03:00 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:03:12 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:03:12 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 2 previous similar messages Dec 20 07:03:12 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (6): c: 0, oc: 0, rc: 63 Dec 20 07:03:12 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 2 previous similar messages Dec 20 07:03:13 mds13 kernel: LNet: 
67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:03:13 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:03:19 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:03:25 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:03:25 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 20 07:03:35 mds13 kernel: LustreError: 137-5: scratch0-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:03:39 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:03:39 mds13 kernel: Lustre: scratch0-MDT0000: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10) Dec 20 07:03:39 mds13 kernel: Lustre: Skipped 2 previous similar messages Dec 20 07:03:40 mds13 kernel: LustreError: 75275:0:(ldlm_lib.c:3258:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff91ea283ae450 x1620318883868224/t0(0) o1000->scratch0-MDT0001-mdtlov_UUID@10.0.11.226@o2ib10:143/0 lens 368/0 e 1 to 0 dl 1545257133 ref 1 fl Interpret:/0/ffffffff rc 0/-1 Dec 20 07:03:49 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:03:49 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 11 previous similar messages Dec 20 07:03:49 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (6): c: 0, oc: 0, rc: 63 Dec 20 07:03:49 mds13 kernel: LNetError: 
71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 11 previous similar messages Dec 20 07:03:50 mds13 kernel: LNetError: 71965:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) Dec 20 07:03:50 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:03:50 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 7 previous similar messages Dec 20 07:03:50 mds13 kernel: LustreError: 71965:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff91e8f7eb0000 Dec 20 07:03:50 mds13 kernel: Lustre: MGS: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:04:11 mds13 kernel: Lustre: 72025:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257141/real 1545257141] req@ffff91eb70e46f00 x1620318886559904/t0(0) o41->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:24/4 lens 224/368 e 0 to 1 dl 1545257148 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:04:11 mds13 kernel: Lustre: 72025:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1099 previous similar messages Dec 20 07:04:11 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:04:15 mds13 kernel: Lustre: 72012:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257145/real 1545257145] req@ffff91e9303b5400 x1620318886701872/t0(0) o103->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257152 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:04:15 mds13 kernel: Lustre: 72012:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2115 previous similar messages Dec 20 07:04:20 mds13 kernel: Lustre: 
72012:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257145/real 1545257145] req@ffff91eafc87ce00 x1620318886701968/t0(0) o103->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257155 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:04:20 mds13 kernel: Lustre: 72012:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Dec 20 07:04:36 mds13 kernel: LustreError: 137-5: scratch0-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:04:58 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10 Dec 20 07:04:59 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:04:59 mds13 kernel: Lustre: Skipped 3 previous similar messages Dec 20 07:04:59 mds13 kernel: Lustre: scratch0-MDT0000: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10) Dec 20 07:04:59 mds13 kernel: Lustre: Skipped 5 previous similar messages Dec 20 07:05:05 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds Dec 20 07:05:05 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 2 previous similar messages Dec 20 07:05:05 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (1): c: 0, oc: 1, rc: 63 Dec 20 07:05:05 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 2 previous similar messages Dec 20 07:05:10 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:05:10 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 3 previous similar messages Dec 
20 07:05:51 mds13 kernel: LustreError: 137-5: scratch0-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:05:54 mds13 kernel: Lustre: MGS: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:05:54 mds13 kernel: Lustre: Skipped 3 previous similar messages Dec 20 07:06:41 mds13 kernel: LustreError: 137-5: scratch0-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:07:07 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10) Dec 20 07:07:07 mds13 kernel: Lustre: Skipped 8 previous similar messages Dec 20 07:07:14 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257323/real 1545257323] req@ffff91e133edb600 x1620318895271824/t0(0) o1000->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:24/4 lens 4072/4320 e 0 to 1 dl 1545257330 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:07:14 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 748 previous similar messages Dec 20 07:07:14 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:07:24 mds13 kernel: LNet: Service thread pid 73522 was inactive for 200.49s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Dec 20 07:07:24 mds13 kernel: Pid: 73522, comm: mdt01_000 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018 Dec 20 07:07:24 mds13 kernel: Call Trace: Dec 20 07:07:24 mds13 kernel: [] top_trans_wait_result+0xa6/0x155 [ptlrpc] Dec 20 07:07:24 mds13 kernel: [] top_trans_stop+0x4e9/0xa70 [ptlrpc] Dec 20 07:07:24 mds13 kernel: [] lod_trans_stop+0x25c/0x340 [lod] Dec 20 07:07:24 mds13 kernel: [] mdd_trans_stop+0x2e/0x174 [mdd] Dec 20 07:07:24 mds13 kernel: [] mdd_create+0x1151/0x1440 [mdd] Dec 20 07:07:24 mds13 kernel: [] mdt_create+0xb54/0x1090 [mdt] Dec 20 07:07:24 mds13 kernel: [] mdt_reint_create+0x16b/0x360 [mdt] Dec 20 07:07:24 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Dec 20 07:07:24 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Dec 20 07:07:24 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt] Dec 20 07:07:24 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Dec 20 07:07:24 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Dec 20 07:07:24 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] Dec 20 07:07:24 mds13 kernel: [] kthread+0xd1/0xe0 Dec 20 07:07:24 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21 Dec 20 07:07:24 mds13 kernel: [] 0xffffffffffffffff Dec 20 07:07:24 mds13 kernel: LustreError: dumping log to /tmp/lustre-log.1545257341.73522 Dec 20 07:07:26 mds13 kernel: LNet: Service thread pid 74420 was inactive for 202.03s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Dec 20 07:07:26 mds13 kernel: Pid: 74420, comm: mdt09_006 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018 Dec 20 07:07:26 mds13 kernel: Call Trace: Dec 20 07:07:26 mds13 kernel: [] top_trans_wait_result+0xa6/0x155 [ptlrpc] Dec 20 07:07:26 mds13 kernel: [] top_trans_stop+0x4e9/0xa70 [ptlrpc] Dec 20 07:07:26 mds13 kernel: [] lod_trans_stop+0x25c/0x340 [lod] Dec 20 07:07:26 mds13 kernel: [] mdd_trans_stop+0x2e/0x174 [mdd] Dec 20 07:07:26 mds13 kernel: [] mdd_create+0x1151/0x1440 [mdd] Dec 20 07:07:26 mds13 kernel: [] mdt_create+0xb54/0x1090 [mdt] Dec 20 07:07:26 mds13 kernel: [] mdt_reint_create+0x16b/0x360 [mdt] Dec 20 07:07:26 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Dec 20 07:07:26 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Dec 20 07:07:26 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt] Dec 20 07:07:26 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Dec 20 07:07:26 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Dec 20 07:07:26 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] Dec 20 07:07:26 mds13 kernel: [] kthread+0xd1/0xe0 Dec 20 07:07:26 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21 Dec 20 07:07:26 mds13 kernel: [] 0xffffffffffffffff Dec 20 07:07:39 mds13 kernel: LustreError: 137-5: scratch0-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:08:00 mds13 kernel: LNet: Service thread pid 73863 was inactive for 236.28s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Dec 20 07:08:00 mds13 kernel: Pid: 73863, comm: mdt07_004 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018 Dec 20 07:08:00 mds13 kernel: Call Trace: Dec 20 07:08:00 mds13 kernel: [] top_trans_wait_result+0xa6/0x155 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] top_trans_stop+0x4e9/0xa70 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] lod_trans_stop+0x25c/0x340 [lod] Dec 20 07:08:00 mds13 kernel: [] mdd_trans_stop+0x2e/0x174 [mdd] Dec 20 07:08:00 mds13 kernel: [] mdd_create+0x1151/0x1440 [mdd] Dec 20 07:08:00 mds13 kernel: [] mdt_create+0xb54/0x1090 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint_create+0x16b/0x360 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt] Dec 20 07:08:00 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] kthread+0xd1/0xe0 Dec 20 07:08:00 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21 Dec 20 07:08:00 mds13 kernel: [] 0xffffffffffffffff Dec 20 07:08:00 mds13 kernel: LustreError: dumping log to /tmp/lustre-log.1545257377.73863 Dec 20 07:08:00 mds13 kernel: Pid: 74444, comm: mdt02_010 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018 Dec 20 07:08:00 mds13 kernel: Call Trace: Dec 20 07:08:00 mds13 kernel: [] top_trans_wait_result+0xa6/0x155 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] top_trans_stop+0x4e9/0xa70 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] lod_trans_stop+0x25c/0x340 [lod] Dec 20 07:08:00 mds13 kernel: [] mdd_trans_stop+0x2e/0x174 [mdd] Dec 20 07:08:00 mds13 kernel: [] mdd_create+0x1151/0x1440 [mdd] Dec 20 07:08:00 mds13 kernel: [] mdt_create+0xb54/0x1090 [mdt] Dec 20 07:08:00 mds13 kernel: [] 
mdt_reint_create+0x16b/0x360 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt] Dec 20 07:08:00 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] kthread+0xd1/0xe0 Dec 20 07:08:00 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21 Dec 20 07:08:00 mds13 kernel: [] 0xffffffffffffffff Dec 20 07:08:00 mds13 kernel: Pid: 74441, comm: mdt03_009 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018 Dec 20 07:08:00 mds13 kernel: Call Trace: Dec 20 07:08:00 mds13 kernel: [] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] ptlrpc_queue_wait+0x83/0x230 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] osp_md_object_lock+0x162/0x2d0 [osp] Dec 20 07:08:00 mds13 kernel: [] lod_object_lock+0x77b/0x7b0 [lod] Dec 20 07:08:00 mds13 kernel: [] mdd_object_lock+0x3e/0xe0 [mdd] Dec 20 07:08:00 mds13 kernel: [] mdt_reint_striped_lock+0x29e/0x510 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_create+0xc8a/0x1090 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint_create+0x16b/0x360 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Dec 20 07:08:00 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt] Dec 20 07:08:00 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] Dec 20 07:08:00 mds13 kernel: [] kthread+0xd1/0xe0 Dec 20 07:08:00 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21 Dec 20 07:08:00 mds13 kernel: [] 
0xffffffffffffffff Dec 20 07:08:00 mds13 kernel: LNet: Service thread pid 74514 was inactive for 236.37s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Dec 20 07:08:04 mds13 kernel: LNet: Service thread pid 74441 completed after 239.79s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Dec 20 07:08:14 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:08:14 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 23 previous similar messages Dec 20 07:08:14 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (4): c: 0, oc: 1, rc: 63 Dec 20 07:08:14 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 23 previous similar messages Dec 20 07:08:16 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1545257386/real 0] req@ffff91ea137b9b00 x1620318896248384/t0(0) o1000->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:24/4 lens 368/4320 e 0 to 1 dl 1545257393 ref 3 fl Rpc:X/0/ffffffff rc 0/-1 Dec 20 07:08:16 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 551 previous similar messages Dec 20 07:08:16 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:08:20 mds13 kernel: LNet: 75931:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:08:20 mds13 kernel: LNet: 75931:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 19 previous similar messages Dec 20 07:10:01 mds13 systemd[1]: Started Session 330 of user root. 
Dec 20 07:10:01 mds13 systemd[1]: Starting Session 330 of user root. Dec 20 07:08:33 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:08:33 mds13 kernel: Lustre: Skipped 4 previous similar messages Dec 20 07:09:06 mds13 kernel: LustreError: 137-5: scratch0-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:09:41 mds13 kernel: Lustre: 72005:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1545257477/real 1545257478] req@ffff91de766b9200 x1620318897172000/t0(0) o103->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257484 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1 Dec 20 07:09:41 mds13 kernel: Lustre: 72005:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message Dec 20 07:09:41 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:09:53 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete Dec 20 07:10:19 mds13 kernel: LustreError: 137-5: scratch0-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. 
Dec 20 07:11:21 mds13 kernel: Lustre: MGS: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID Dec 20 07:11:21 mds13 kernel: Lustre: Skipped 5 previous similar messages Dec 20 07:11:26 mds13 kernel: LNet: 75931:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation Dec 20 07:11:26 mds13 kernel: LNet: 75931:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 15 previous similar messages Dec 20 07:13:05 mds13 systemd-logind[2938]: New session 331 of user root. Dec 20 07:13:05 mds13 systemd[1]: Started Session 331 of user root. Dec 20 07:13:05 mds13 systemd[1]: Starting Session 331 of user root. Dec 20 07:12:06 mds13 kernel: Lustre: MGS: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10) Dec 20 07:12:06 mds13 kernel: Lustre: Skipped 11 previous similar messages Dec 20 07:12:49 mds13 kernel: LustreError: 137-5: scratch0-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server. Dec 20 07:12:49 mds13 kernel: LustreError: Skipped 1 previous similar message Dec 20 07:13:01 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds Dec 20 07:13:01 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 43 previous similar messages Dec 20 07:13:01 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (0): c: 0, oc: 0, rc: 63 Dec 20 07:13:01 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 43 previous similar messages Dec 20 07:13:06 mds13 kernel: LNet: Service thread pid 74586 was inactive for 200.39s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Dec 20 07:13:06 mds13 kernel: LNet: Skipped 2 previous similar messages Dec 20 07:13:06 mds13 kernel: Pid: 74586, comm: mdt05_010 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018 Dec 20 07:13:06 mds13 kernel: Call Trace: Dec 20 07:13:06 mds13 kernel: [] top_trans_wait_result+0xa6/0x155 [ptlrpc] Dec 20 07:13:06 mds13 kernel: [] top_trans_stop+0x4e9/0xa70 [ptlrpc] Dec 20 07:13:06 mds13 kernel: [] lod_trans_stop+0x25c/0x340 [lod] Dec 20 07:13:06 mds13 kernel: [] mdd_trans_stop+0x2e/0x174 [mdd] Dec 20 07:13:06 mds13 kernel: [] mdd_create+0x1151/0x1440 [mdd] Dec 20 07:13:06 mds13 kernel: [] mdt_create+0xb54/0x1090 [mdt] Dec 20 07:13:06 mds13 kernel: [] mdt_reint_create+0x16b/0x360 [mdt] Dec 20 07:13:06 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Dec 20 07:13:06 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Dec 20 07:13:06 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt] Dec 20 07:13:06 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Dec 20 07:13:06 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Dec 20 07:13:06 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] Dec 20 07:13:06 mds13 kernel: [] kthread+0xd1/0xe0 Dec 20 07:13:06 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21 Dec 20 07:13:06 mds13 kernel: [] 0xffffffffffffffff Dec 20 07:13:06 mds13 kernel: LustreError: dumping log to /tmp/lustre-log.1545257683.74586 Dec 20 07:13:07 mds13 kernel: Pid: 74416, comm: mdt10_006 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018 Dec 20 07:13:07 mds13 kernel: Call Trace: Dec 20 07:13:07 mds13 kernel: [] top_trans_wait_result+0xa6/0x155 [ptlrpc] Dec 20 07:13:07 mds13 kernel: [] top_trans_stop+0x4e9/0xa70 [ptlrpc] Dec 20 07:13:07 mds13 kernel: [] lod_trans_stop+0x25c/0x340 [lod] Dec 20 07:13:07 mds13 kernel: [] mdd_trans_stop+0x2e/0x174 [mdd] Dec 20 07:13:07 mds13 kernel: [] mdd_create+0x1151/0x1440 [mdd] Dec 20 07:13:07 mds13 kernel: [] 
mdt_create+0xb54/0x1090 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_create+0x16b/0x360 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt]
Dec 20 07:13:07 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] kthread+0xd1/0xe0
Dec 20 07:13:07 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21
Dec 20 07:13:07 mds13 kernel: [] 0xffffffffffffffff
Dec 20 07:13:07 mds13 kernel: Pid: 74456, comm: mdt10_010 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018
Dec 20 07:13:07 mds13 kernel: Call Trace:
Dec 20 07:13:07 mds13 kernel: [] top_trans_wait_result+0xa6/0x155 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] top_trans_stop+0x4e9/0xa70 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] lod_trans_stop+0x25c/0x340 [lod]
Dec 20 07:13:07 mds13 kernel: [] mdd_trans_stop+0x2e/0x174 [mdd]
Dec 20 07:13:07 mds13 kernel: [] mdd_create+0x1151/0x1440 [mdd]
Dec 20 07:13:07 mds13 kernel: [] mdt_create+0xb54/0x1090 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_create+0x16b/0x360 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt]
Dec 20 07:13:07 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] kthread+0xd1/0xe0
Dec 20 07:13:07 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21
Dec 20 07:13:07 mds13 kernel: [] 0xffffffffffffffff
Dec 20 07:13:07 mds13 kernel: Pid: 74533, comm: mdt10_019 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018
Dec 20 07:13:07 mds13 kernel: Call Trace:
Dec 20 07:13:07 mds13 kernel: [] top_trans_wait_result+0xa6/0x155 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] top_trans_stop+0x4e9/0xa70 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] lod_trans_stop+0x25c/0x340 [lod]
Dec 20 07:13:07 mds13 kernel: [] mdd_trans_stop+0x2e/0x174 [mdd]
Dec 20 07:13:07 mds13 kernel: [] mdd_create+0x1151/0x1440 [mdd]
Dec 20 07:13:07 mds13 kernel: [] mdt_create+0xb54/0x1090 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_create+0x16b/0x360 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt]
Dec 20 07:13:07 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] kthread+0xd1/0xe0
Dec 20 07:13:07 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21
Dec 20 07:13:07 mds13 kernel: [] 0xffffffffffffffff
Dec 20 07:13:07 mds13 kernel: Pid: 74496, comm: mdt10_013 3.10.0-862.9.1.el7_lustre.ddn1.x86_64 #1 SMP Tue Sep 11 19:05:37 JST 2018
Dec 20 07:13:07 mds13 kernel: Call Trace:
Dec 20 07:13:07 mds13 kernel: [] top_trans_wait_result+0xa6/0x155 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] top_trans_stop+0x4e9/0xa70 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] lod_trans_stop+0x25c/0x340 [lod]
Dec 20 07:13:07 mds13 kernel: [] mdd_trans_stop+0x2e/0x174 [mdd]
Dec 20 07:13:07 mds13 kernel: [] mdd_create+0x1151/0x1440 [mdd]
Dec 20 07:13:07 mds13 kernel: [] mdt_create+0xb54/0x1090 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_create+0x16b/0x360 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_rec+0x83/0x210 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Dec 20 07:13:07 mds13 kernel: [] mdt_reint+0x67/0x140 [mdt]
Dec 20 07:13:07 mds13 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
Dec 20 07:13:07 mds13 kernel: [] kthread+0xd1/0xe0
Dec 20 07:13:07 mds13 kernel: [] ret_from_fork_nospec_begin+0x7/0x21
Dec 20 07:13:07 mds13 kernel: [] 0xffffffffffffffff
Dec 20 07:13:07 mds13 kernel: LNet: Service thread pid 74524 was inactive for 200.83s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Dec 20 07:13:07 mds13 kernel: LNet: Skipped 1 previous similar message
Dec 20 07:13:42 mds13 kernel: LNet: Service thread pid 74423 was inactive for 236.23s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Dec 20 07:13:42 mds13 kernel: LNet: Skipped 20 previous similar messages
Dec 20 07:13:42 mds13 kernel: LustreError: dumping log to /tmp/lustre-log.1545257719.74423
Dec 20 07:14:30 mds13 kernel: LNet: 71968:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
Dec 20 07:14:31 mds13 kernel: LNet: Service thread pid 74539 completed after 284.61s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Dec 20 07:14:31 mds13 kernel: LNet: Skipped 6 previous similar messages
Dec 20 07:14:43 mds13 kernel: Lustre: 74538:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257773/real 1545257773] req@ffff91de7ceaef00 x1620318898876832/t0(0) o101->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:24/4 lens 328/344 e 0 to 1 dl 1545257780 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Dec 20 07:14:43 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257773/real 1545257773] req@ffff91e8c791aa00 x1620318898876816/t0(0) o1000->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:24/4 lens 4072/4320 e 0 to 1 dl 1545257780 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Dec 20 07:14:43 mds13 kernel: Lustre: 73687:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 5127 previous similar messages
Dec 20 07:14:43 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Dec 20 07:15:06 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
Dec 20 07:15:40 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Dec 20 07:15:47 mds13 kernel: Lustre: MGS: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID
Dec 20 07:15:47 mds13 kernel: Lustre: Skipped 14 previous similar messages
Dec 20 07:17:17 mds13 kernel: Lustre: 72008:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545257923/real 1545257923] req@ffff91e8bf1c6600 x1620318903747472/t0(0) o103->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545257934 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Dec 20 07:17:17 mds13 kernel: Lustre: 72008:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2228 previous similar messages
Dec 20 07:17:17 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Dec 20 07:17:42 mds13 kernel: LustreError: 137-5: scratch0-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 20 07:17:42 mds13 kernel: LustreError: Skipped 3 previous similar messages
Dec 20 07:17:47 mds13 kernel: LNet: 71968:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
Dec 20 07:18:07 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.0.11.226@o2ib10 - queue depth reduced from 128 to 63 to allow for qp creation
Dec 20 07:18:07 mds13 kernel: LNet: 67774:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 51 previous similar messages
Dec 20 07:20:01 mds13 systemd[1]: Started Session 332 of user root.
Dec 20 07:20:01 mds13 systemd[1]: Starting Session 332 of user root.
Dec 20 07:20:40 mds13 systemd-logind[2938]: Removed session 331.
Dec 20 07:19:40 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Dec 20 07:19:40 mds13 kernel: Lustre: Skipped 1 previous similar message
Dec 20 07:20:06 mds13 kernel: LNet: 71965:0:(o2iblnd_cb.c:1484:kiblnd_reconnect_peer()) Abort reconnection of 10.0.11.226@o2ib10: connected
Dec 20 07:20:12 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
Dec 20 07:21:52 mds13 systemd-logind[2938]: New session 333 of user root.
Dec 20 07:21:52 mds13 systemd[1]: Started Session 333 of user root.
Dec 20 07:21:52 mds13 systemd[1]: Starting Session 333 of user root.
Dec 20 07:21:13 mds13 kernel: Lustre: MGS: Connection restored to 10.0.11.226@o2ib10 (at 10.0.11.226@o2ib10)
Dec 20 07:21:13 mds13 kernel: Lustre: Skipped 33 previous similar messages
Dec 20 07:21:38 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Dec 20 07:21:38 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Skipped 57 previous similar messages
Dec 20 07:21:38 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.11.226@o2ib10 (0): c: 0, oc: 0, rc: 63
Dec 20 07:21:38 mds13 kernel: LNetError: 71965:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Skipped 57 previous similar messages
Dec 20 07:22:53 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10
Dec 20 07:22:53 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) Skipped 1 previous similar message
Dec 20 07:24:13 mds13 kernel: Lustre: 72006:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1545258342/real 1545258342] req@ffff91e8d69a8000 x1620318906789504/t0(0) o103->scratch0-MDT0001-osp-MDT0000@10.0.11.226@o2ib10:17/18 lens 328/224 e 0 to 1 dl 1545258349 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Dec 20 07:24:13 mds13 kernel: Lustre: 72006:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 4136 previous similar messages
Dec 20 07:24:13 mds13 kernel: Lustre: scratch0-MDT0001-osp-MDT0000: Connection to scratch0-MDT0001 (at 10.0.11.226@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Dec 20 07:24:13 mds13 kernel: Lustre: Skipped 3 previous similar messages
Dec 20 07:24:23 mds13 kernel: Lustre: scratch0-MDT0000: Received new LWP connection from 10.0.11.226@o2ib10, removing former export from same NID
Dec 20 07:24:23 mds13 kernel: Lustre: Skipped 20 previous similar messages
Dec 20 07:25:48 mds13 kernel: LNet: 71969:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.0.11.226@o2ib10