-- Logs begin at Mon 2021-06-14 19:39:22 BST, end at Mon 2021-09-06 19:34:00 BST. --
Sep 06 00:00:24 csd3-mds2 kernel: Lustre: 5456:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-4876), not sending early reply
                                    req@ffff9f750bb65a00 x1709712460511104/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:534/0 lens 264/224 e 1 to 0 dl 1630882829 ref 2 fl Interpret:/2/0 rc 0/0
Sep 06 00:01:21 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2) reconnecting
Sep 06 00:01:21 csd3-mds2 kernel: Lustre: Skipped 400 previous similar messages
Sep 06 00:06:15 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f7f79b7c800, cur 1630883175 expire 1630883025 last 1630882948
Sep 06 00:06:20 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 00:06:20 csd3-mds2 kernel: LustreError: Skipped 91 previous similar messages
Sep 06 00:08:36 csd3-mds2 kernel: Lustre: 26803:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630883305/real 1630883305]  req@ffff9f8b66ffbf00 x1709807674770240/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630883316 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 00:09:25 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 00:09:25 csd3-mds2 kernel: Lustre: Skipped 250 previous similar messages
Sep 06 00:09:32 csd3-mds2 kernel: Lustre: 26881:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630883365/real 1630883365]  req@ffff9f8b49be9f80 x1709807674851456/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630883372 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 00:09:32 csd3-mds2 kernel: Lustre: 26881:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 00:11:22 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2) reconnecting
Sep 06 00:11:22 csd3-mds2 kernel: Lustre: Skipped 209 previous similar messages
Sep 06 00:11:58 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f88cb2f6400, cur 1630883518 expire 1630883368 last 1630883291
Sep 06 00:14:07 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f82390e5400, cur 1630883647 expire 1630883497 last 1630883420
Sep 06 00:16:21 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 00:16:21 csd3-mds2 kernel: LustreError: Skipped 120 previous similar messages
Sep 06 00:16:33 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 6 seconds
Sep 06 00:16:33 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 3, rc: 32
Sep 06 00:18:04 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) in 206 seconds. I think it's dead, and I am evicting it. exp ffff9fa665920c00, cur 1630883884 expire 1630883734 last 1630883678
Sep 06 00:19:26 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 00:19:26 csd3-mds2 kernel: Lustre: Skipped 180 previous similar messages
Sep 06 00:19:38 csd3-mds2 kernel: Lustre: 26841:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630883971/real 1630883971]  req@ffff9f74fe195e80 x1709807675711296/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630883978 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 00:21:04 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.43.240.198@tcp2  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f7e629fa1c0/0x41c5233e3043d8c3 lrc: 3/0,0 mode: PR/PR res: [0x200031ff7:0x82:0x0].0x0 bits 0x1b/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.43.240.198@tcp2 remote: 0xcbc3807f65793d17 expref: 27 pid: 26774 timeout: 379098 lvb_type: 0
Sep 06 00:21:28 csd3-mds2 kernel: Lustre: 26747:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 00:21:35 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 00:21:35 csd3-mds2 kernel: Lustre: Skipped 157 previous similar messages
Sep 06 00:22:41 csd3-mds2 kernel: Lustre: 26767:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630884154/real 1630884154]  req@ffff9f74ba0bec00 x1709807676007360/t0(0) o104->rds-d4-MDT0000@10.47.20.186@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630884161 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 00:25:21 csd3-mds2 kernel: LNet: Service thread pid 26747 was inactive for 232.79s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 00:25:21 csd3-mds2 kernel: Pid: 26747, comm: mdt01_012 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 00:25:21 csd3-mds2 kernel: Call Trace:
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffff9b597cb7>] call_rwsem_down_write_failed+0x17/0x30
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc0e9162f>] llog_cat_id2handle+0x7f/0x620 [obdclass]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc0e92718>] llog_cat_cancel_records+0x128/0x3d0 [obdclass]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc1701a14>] llog_changelog_cancel_cb+0x104/0x2a0 [mdd]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc0e92bf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc0e8f51e>] llog_cat_process_or_fork+0x17e/0x360 [obdclass]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc0e8f72e>] llog_cat_process+0x2e/0x30 [obdclass]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc1700a34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc1702e00>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc1705d63>] mdd_changelog_clear+0x653/0x7d0 [mdd]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc1708e43>] mdd_iocontrol+0x163/0x540 [mdd]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc158784c>] mdt_iocontrol+0x5ec/0xb00 [mdt]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc15881e4>] mdt_set_info+0x484/0x490 [mdt]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc11cb89a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc117073b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffc11740a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffff9b2c5da1>] kthread+0xd1/0xe0
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffff9b995df7>] ret_from_fork_nospec_end+0x0/0x39
Sep 06 00:25:21 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 00:25:21 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630884321.26747
Sep 06 00:26:22 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 00:26:22 csd3-mds2 kernel: LustreError: Skipped 102 previous similar messages
Sep 06 00:28:24 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) in 175 seconds. I think it's dead, and I am evicting it. exp ffff9fa340293c00, cur 1630884504 expire 1630884354 last 1630884329
Sep 06 00:29:27 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 00:29:27 csd3-mds2 kernel: Lustre: Skipped 86 previous similar messages
Sep 06 00:31:37 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client a2647704-cc85-a7e0-0bf2-95d98f0c7b96 (at 10.43.240.198@tcp2) reconnecting
Sep 06 00:31:37 csd3-mds2 kernel: Lustre: Skipped 126 previous similar messages
Sep 06 00:33:08 csd3-mds2 kernel: Lustre: 26770:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630884781/real 1630884781]  req@ffff9f772b85d580 x1709807676928512/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630884788 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 00:33:41 csd3-mds2 kernel: LustreError: 26747:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 00:33:41 csd3-mds2 kernel: LustreError: 26747:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 7482 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 00:33:41 csd3-mds2 kernel: LNet: Service thread pid 26747 completed after 732.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 00:33:41 csd3-mds2 kernel: Lustre: 26782:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 00:36:27 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 00:36:27 csd3-mds2 kernel: LustreError: Skipped 79 previous similar messages
Sep 06 00:37:01 csd3-mds2 kernel: LNet: Service thread pid 26782 was inactive for 200.67s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 00:37:01 csd3-mds2 kernel: Pid: 26782, comm: mdt00_032 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 00:37:01 csd3-mds2 kernel: Call Trace:
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffff9b597cb7>] call_rwsem_down_write_failed+0x17/0x30
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc0e9162f>] llog_cat_id2handle+0x7f/0x620 [obdclass]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc0e92718>] llog_cat_cancel_records+0x128/0x3d0 [obdclass]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc1701a14>] llog_changelog_cancel_cb+0x104/0x2a0 [mdd]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc0e92bf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc0e8f51e>] llog_cat_process_or_fork+0x17e/0x360 [obdclass]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc0e8f72e>] llog_cat_process+0x2e/0x30 [obdclass]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc1700a34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc1702e00>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc1705d63>] mdd_changelog_clear+0x653/0x7d0 [mdd]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc1708e43>] mdd_iocontrol+0x163/0x540 [mdd]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc158784c>] mdt_iocontrol+0x5ec/0xb00 [mdt]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc15881e4>] mdt_set_info+0x484/0x490 [mdt]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc11cb89a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc117073b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffc11740a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffff9b2c5da1>] kthread+0xd1/0xe0
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffff9b995df7>] ret_from_fork_nospec_end+0x0/0x39
Sep 06 00:37:01 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 00:37:01 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630885021.26782
Sep 06 00:39:28 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 00:39:28 csd3-mds2 kernel: Lustre: Skipped 173 previous similar messages
Sep 06 00:40:14 csd3-mds2 kernel: LustreError: 26862:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 00:40:14 csd3-mds2 kernel: LustreError: 26862:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 7494 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 00:40:14 csd3-mds2 kernel: Lustre: 26862:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (5481:2385s); client may timeout.  req@ffff9f750bb65a00 x1709712460511104/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:534/0 lens 264/192 e 1 to 0 dl 1630882829 ref 1 fl Complete:/2/0 rc -2/-2
Sep 06 00:40:14 csd3-mds2 kernel: LNet: Service thread pid 26862 completed after 7866.07s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 00:41:02 csd3-mds2 kernel: Lustre: 26781:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630885255/real 1630885255]  req@ffff9f8b59fd5e80 x1709807677597632/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630885262 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 00:41:37 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 00:41:37 csd3-mds2 kernel: Lustre: Skipped 155 previous similar messages
Sep 06 00:42:51 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 6 seconds
Sep 06 00:42:51 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 6, rc: 32
Sep 06 00:46:29 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 00:46:29 csd3-mds2 kernel: LustreError: Skipped 92 previous similar messages
Sep 06 00:49:30 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to f86335b5-e0af-3335-6f06-5b8b23fa282d (at 10.47.20.228@o2ib1)
Sep 06 00:49:30 csd3-mds2 kernel: Lustre: Skipped 225 previous similar messages
Sep 06 00:50:30 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 80560c5c-3c7e-04a3-df3d-a9dcfe515124 (at 10.43.240.199@tcp2) in 156 seconds. I think it's dead, and I am evicting it. exp ffff9fa679000c00, cur 1630885830 expire 1630885680 last 1630885674
Sep 06 00:51:41 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 00:51:41 csd3-mds2 kernel: Lustre: Skipped 218 previous similar messages
Sep 06 00:56:33 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 00:56:33 csd3-mds2 kernel: LustreError: Skipped 72 previous similar messages
Sep 06 00:59:31 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2)
Sep 06 00:59:31 csd3-mds2 kernel: Lustre: Skipped 349 previous similar messages
Sep 06 01:01:42 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 095f929a-6781-7d79-e2a0-8f721baaa6c8 (at 10.43.101.8@tcp2) reconnecting
Sep 06 01:01:42 csd3-mds2 kernel: Lustre: Skipped 377 previous similar messages
Sep 06 01:06:44 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 01:06:44 csd3-mds2 kernel: LustreError: Skipped 52 previous similar messages
Sep 06 01:09:34 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 01:09:34 csd3-mds2 kernel: Lustre: Skipped 296 previous similar messages
Sep 06 01:11:42 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) reconnecting
Sep 06 01:11:42 csd3-mds2 kernel: Lustre: Skipped 318 previous similar messages
Sep 06 01:16:47 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.8@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 01:16:47 csd3-mds2 kernel: LustreError: Skipped 83 previous similar messages
Sep 06 01:19:35 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 01:19:35 csd3-mds2 kernel: Lustre: Skipped 416 previous similar messages
Sep 06 01:19:45 csd3-mds2 kernel: Lustre: 26804:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630887578/real 1630887578]  req@ffff9f6a92ddb180 x1709807681481216/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630887585 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 01:19:45 csd3-mds2 kernel: Lustre: 26804:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Sep 06 01:21:45 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) reconnecting
Sep 06 01:21:45 csd3-mds2 kernel: Lustre: Skipped 434 previous similar messages
Sep 06 01:23:42 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) in 180 seconds. I think it's dead, and I am evicting it. exp ffff9f88c8f6a800, cur 1630887822 expire 1630887672 last 1630887642
Sep 06 01:24:28 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa33dd47000, cur 1630887868 expire 1630887718 last 1630887641
Sep 06 01:24:56 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 01:24:56 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 9, rc: 32
Sep 06 01:26:49 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 01:26:49 csd3-mds2 kernel: LustreError: Skipped 121 previous similar messages
Sep 06 01:29:36 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 6710a842-ab40-c94c-f724-b11fe70bf02d (at 10.47.0.211@o2ib1)
Sep 06 01:29:36 csd3-mds2 kernel: Lustre: Skipped 275 previous similar messages
Sep 06 01:30:46 csd3-mds2 kernel: Lustre: 26831:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630888239/real 1630888239]  req@ffff9f992d1cc800 x1709807682384640/t0(0) o104->rds-d4-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630888246 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 01:30:46 csd3-mds2 kernel: Lustre: 26831:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Sep 06 01:31:46 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2) reconnecting
Sep 06 01:31:46 csd3-mds2 kernel: Lustre: Skipped 239 previous similar messages
Sep 06 01:37:15 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 01:37:15 csd3-mds2 kernel: LustreError: Skipped 71 previous similar messages
Sep 06 01:39:37 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 01:39:37 csd3-mds2 kernel: Lustre: Skipped 337 previous similar messages
Sep 06 01:41:47 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 01:41:47 csd3-mds2 kernel: Lustre: Skipped 324 previous similar messages
Sep 06 01:43:02 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 01:43:02 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 9, rc: 32
Sep 06 01:43:57 csd3-mds2 kernel: Lustre: 26548:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630889030/real 1630889030]  req@ffff9f99f3de7080 x1709807683591296/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630889037 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 01:43:57 csd3-mds2 kernel: Lustre: 26548:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 01:44:05 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Sep 06 01:44:05 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 6, rc: 32
Sep 06 01:47:17 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 01:47:17 csd3-mds2 kernel: LustreError: Skipped 121 previous similar messages
Sep 06 01:48:56 csd3-mds2 kernel: Lustre: 26811:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630889329/real 1630889329]  req@ffff9f994fd00000 x1709807684067264/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630889336 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 01:48:56 csd3-mds2 kernel: Lustre: 26811:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 01:49:43 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 01:49:43 csd3-mds2 kernel: Lustre: Skipped 354 previous similar messages
Sep 06 01:51:51 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 01:51:51 csd3-mds2 kernel: Lustre: Skipped 321 previous similar messages
Sep 06 01:57:17 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 01:57:17 csd3-mds2 kernel: LustreError: Skipped 159 previous similar messages
Sep 06 01:57:19 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa6787ae400, cur 1630889839 expire 1630889689 last 1630889612
Sep 06 01:59:06 csd3-mds2 kernel: Lustre: 26687:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630889939/real 1630889939]  req@ffff9f747e17ba80 x1709807685069824/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630889946 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 01:59:06 csd3-mds2 kernel: Lustre: 26687:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 01:59:44 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 01:59:44 csd3-mds2 kernel: Lustre: Skipped 275 previous similar messages
Sep 06 02:01:51 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) reconnecting
Sep 06 02:01:51 csd3-mds2 kernel: Lustre: Skipped 327 previous similar messages
Sep 06 02:03:48 csd3-mds2 kernel: Lustre: 26755:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630890221/real 1630890221]  req@ffff9f746f7f5e80 x1709807685461952/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630890228 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 02:06:27 csd3-mds2 kernel: Lustre: 26780:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630890380/real 1630890380]  req@ffff9f74fe097980 x1709807685810048/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630890387 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 02:07:14 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds
Sep 06 02:07:14 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 15, rc: 32
Sep 06 02:07:20 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.47.20.154@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 02:07:20 csd3-mds2 kernel: LustreError: Skipped 131 previous similar messages
Sep 06 02:09:45 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 02:09:45 csd3-mds2 kernel: Lustre: Skipped 275 previous similar messages
Sep 06 02:11:53 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) reconnecting
Sep 06 02:11:53 csd3-mds2 kernel: Lustre: Skipped 266 previous similar messages
Sep 06 02:17:23 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.144.9.51@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 02:17:23 csd3-mds2 kernel: LustreError: Skipped 86 previous similar messages
Sep 06 02:19:49 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 02:19:49 csd3-mds2 kernel: Lustre: Skipped 371 previous similar messages
Sep 06 02:21:01 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client a2647704-cc85-a7e0-0bf2-95d98f0c7b96 (at 10.43.240.198@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f76c2bc5c00, cur 1630891261 expire 1630891111 last 1630891034
Sep 06 02:21:32 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 02:21:32 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 10, rc: 32
Sep 06 02:21:34 csd3-mds2 kernel: Lustre: 26841:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 02:21:57 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2) reconnecting
Sep 06 02:21:57 csd3-mds2 kernel: Lustre: Skipped 369 previous similar messages
Sep 06 02:22:22 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 02:22:22 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 13, rc: 32
Sep 06 02:24:55 csd3-mds2 kernel: LNet: Service thread pid 26841 was inactive for 200.45s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 02:24:55 csd3-mds2 kernel: Pid: 26841, comm: mdt00_058 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 02:24:55 csd3-mds2 kernel: Call Trace:
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffff9b597cb7>] call_rwsem_down_write_failed+0x17/0x30
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc0e9162f>] llog_cat_id2handle+0x7f/0x620 [obdclass]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc0e92718>] llog_cat_cancel_records+0x128/0x3d0 [obdclass]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc1701a14>] llog_changelog_cancel_cb+0x104/0x2a0 [mdd]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc0e92bf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc0e8f51e>] llog_cat_process_or_fork+0x17e/0x360 [obdclass]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc0e8f72e>] llog_cat_process+0x2e/0x30 [obdclass]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc1700a34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc1702e00>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc1705d63>] mdd_changelog_clear+0x653/0x7d0 [mdd]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc1708e43>] mdd_iocontrol+0x163/0x540 [mdd]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc158784c>] mdt_iocontrol+0x5ec/0xb00 [mdt]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc15881e4>] mdt_set_info+0x484/0x490 [mdt]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc11cb89a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc117073b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffc11740a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffff9b2c5da1>] kthread+0xd1/0xe0
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffff9b995df7>] ret_from_fork_nospec_end+0x0/0x39
Sep 06 02:24:55 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 02:24:55 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630891494.26841
Sep 06 02:27:27 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 02:27:27 csd3-mds2 kernel: LustreError: Skipped 187 previous similar messages
Sep 06 02:28:28 csd3-mds2 kernel: Lustre: 26768:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630891701/real 1630891701]  req@ffff9f98e3805100 x1709807687704000/t0(0) o104->rds-d4-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630891708 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 02:28:28 csd3-mds2 kernel: Lustre: 26768:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 02:28:39 csd3-mds2 kernel: Lustre: 13821:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-6298), not sending early reply
                                    req@ffff9f7470be7500 x1709712547701120/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:369/0 lens 264/224 e 0 to 0 dl 1630891724 ref 2 fl Interpret:/0/0 rc 0/0
Sep 06 02:29:51 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 02:29:51 csd3-mds2 kernel: Lustre: Skipped 290 previous similar messages
Sep 06 02:31:57 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) reconnecting
Sep 06 02:31:57 csd3-mds2 kernel: Lustre: Skipped 260 previous similar messages
Sep 06 02:32:52 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8b53fc8000, cur 1630891972 expire 1630891822 last 1630891745
Sep 06 02:35:32 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8b3ebcfc00, cur 1630892132 expire 1630891982 last 1630891905
Sep 06 02:37:32 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.144.9.51@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 02:37:32 csd3-mds2 kernel: LustreError: Skipped 131 previous similar messages
Sep 06 02:38:30 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa33c9dc000, cur 1630892310 expire 1630892160 last 1630892083
Sep 06 02:39:31 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8a431f4800, cur 1630892371 expire 1630892221 last 1630892144
Sep 06 02:39:55 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 02:39:55 csd3-mds2 kernel: Lustre: Skipped 353 previous similar messages
Sep 06 02:41:58 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2) reconnecting
Sep 06 02:41:58 csd3-mds2 kernel: Lustre: Skipped 382 previous similar messages
Sep 06 02:47:00 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 4 seconds
Sep 06 02:47:00 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 10, rc: 32
Sep 06 02:48:05 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 02:48:05 csd3-mds2 kernel: LustreError: Skipped 43 previous similar messages
Sep 06 02:49:59 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cad937f8-511a-db1a-ba4e-ef689cb6b6e3 (at 10.47.1.79@o2ib1)
Sep 06 02:49:59 csd3-mds2 kernel: Lustre: Skipped 316 previous similar messages
Sep 06 02:50:47 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 02:50:47 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 9, rc: 32
Sep 06 02:51:58 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 7fcd3028-4447-b1e6-b139-e4baffbb87b4 (at 10.47.20.73@o2ib1) reconnecting
Sep 06 02:51:58 csd3-mds2 kernel: Lustre: Skipped 288 previous similar messages
Sep 06 02:56:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 02:56:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 10, rc: 32
Sep 06 02:58:18 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.8@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 02:58:18 csd3-mds2 kernel: LustreError: Skipped 86 previous similar messages
Sep 06 02:59:30 csd3-mds2 kernel: Lustre: 13789:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630893563/real 1630893563]  req@ffff9f98c13c7500 x1709807690379008/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630893570 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 02:59:30 csd3-mds2 kernel: Lustre: 13789:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 03:00:06 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 6b0c7288-17e4-9094-4d86-7cb2d8391a64 (at 10.47.20.199@o2ib1)
Sep 06 03:00:06 csd3-mds2 kernel: Lustre: Skipped 401 previous similar messages
Sep 06 03:01:06 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Sep 06 03:01:06 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 3, rc: 32
Sep 06 03:01:59 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 03:01:59 csd3-mds2 kernel: Lustre: Skipped 402 previous similar messages
Sep 06 03:06:09 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 03:06:09 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 7, rc: 32
Sep 06 03:06:11 csd3-mds2 kernel: LustreError: 26782:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 03:06:11 csd3-mds2 kernel: LustreError: 26782:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 7847 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 03:06:11 csd3-mds2 kernel: Lustre: 26782:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6903:2247s); client may timeout.  req@ffff9f7470be7500 x1709712547701120/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:369/0 lens 264/192 e 0 to 0 dl 1630891724 ref 1 fl Complete:/0/0 rc -2/-2
Sep 06 03:06:11 csd3-mds2 kernel: LNet: Service thread pid 26782 completed after 9150.45s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 03:08:31 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 03:08:31 csd3-mds2 kernel: LustreError: Skipped 95 previous similar messages
Sep 06 03:10:08 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to  (at 10.43.101.8@tcp2)
Sep 06 03:10:08 csd3-mds2 kernel: Lustre: Skipped 248 previous similar messages
Sep 06 03:10:59 csd3-mds2 kernel: Lustre: 26763:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-2365), not sending early reply
                                    req@ffff9f73eacdd580 x1709712633557440/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:644/0 lens 264/224 e 1 to 0 dl 1630894264 ref 2 fl Interpret:/0/0 rc 0/0
Sep 06 03:12:00 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 268f7440-e2d4-ae66-f322-a2eeabb9a6ba (at 10.47.0.224@o2ib1) reconnecting
Sep 06 03:12:00 csd3-mds2 kernel: Lustre: Skipped 265 previous similar messages
Sep 06 03:13:18 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 7 seconds
Sep 06 03:13:18 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 2, rc: 32
Sep 06 03:18:33 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 03:18:33 csd3-mds2 kernel: LustreError: Skipped 161 previous similar messages
Sep 06 03:20:10 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to  (at 10.43.101.8@tcp2)
Sep 06 03:20:10 csd3-mds2 kernel: Lustre: Skipped 409 previous similar messages
Sep 06 03:21:20 csd3-mds2 kernel: Lustre: 26822:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630894873/real 1630894873]  req@ffff9f98ae5a0480 x1709807692513920/t0(0) o104->rds-d5-MDT0000@10.47.20.88@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630894880 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 03:21:28 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f88c8ee8c00, cur 1630894888 expire 1630894738 last 1630894661
Sep 06 03:22:00 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 03:22:00 csd3-mds2 kernel: Lustre: Skipped 417 previous similar messages
Sep 06 03:24:27 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Sep 06 03:24:27 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 3, rc: 32
Sep 06 03:26:25 csd3-mds2 kernel: Lustre: 26887:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630895174/real 1630895174]  req@ffff9f739007ec00 x1709807692950272/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630895185 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 03:26:36 csd3-mds2 kernel: Lustre: 26887:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630895185/real 1630895185]  req@ffff9f739007ec00 x1709807692950272/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630895196 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Sep 06 03:26:47 csd3-mds2 kernel: Lustre: 26887:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630895196/real 1630895196]  req@ffff9f739007ec00 x1709807692950272/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630895207 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Sep 06 03:28:34 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 03:28:34 csd3-mds2 kernel: LustreError: Skipped 147 previous similar messages
Sep 06 03:29:29 csd3-mds2 kernel: Lustre: 13798:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630895362/real 1630895362]  req@ffff9f76a9c91b00 x1709807693196288/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630895369 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 03:30:13 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 03:30:13 csd3-mds2 kernel: Lustre: Skipped 455 previous similar messages
Sep 06 03:30:46 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 5 seconds
Sep 06 03:30:46 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 0, rc: 32
Sep 06 03:31:09 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 107s: evicting client at 10.43.240.199@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f8183cfb3c0/0x41c5233e8305b0cf lrc: 3/0,0 mode: PR/PR res: [0x20001a08f:0x17ce8:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.43.240.199@tcp2 remote: 0x2b7e3391c0ecfe0f expref: 208 pid: 26784 timeout: 390503 lvb_type: 0
Sep 06 03:32:01 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 7f7f8f64-e51b-4483-adff-0e452ec6312a (at 10.47.20.128@o2ib1) reconnecting
Sep 06 03:32:01 csd3-mds2 kernel: Lustre: Skipped 474 previous similar messages
Sep 06 03:38:36 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 03:38:36 csd3-mds2 kernel: LustreError: Skipped 157 previous similar messages
Sep 06 03:40:16 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 03:40:16 csd3-mds2 kernel: Lustre: Skipped 470 previous similar messages
Sep 06 03:42:01 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) reconnecting
Sep 06 03:42:01 csd3-mds2 kernel: Lustre: Skipped 466 previous similar messages
Sep 06 03:42:54 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 095f929a-6781-7d79-e2a0-8f721baaa6c8 (at 10.43.101.8@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f88ca773800, cur 1630896174 expire 1630896024 last 1630895947
Sep 06 03:46:33 csd3-mds2 kernel: Lustre: 13786:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630896386/real 1630896386]  req@ffff9f988f8b7500 x1709807694691520/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630896393 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 03:48:53 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.8@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 03:48:53 csd3-mds2 kernel: LustreError: Skipped 112 previous similar messages
Sep 06 03:50:17 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 03:50:17 csd3-mds2 kernel: Lustre: Skipped 371 previous similar messages
Sep 06 03:52:02 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 2a214f50-ca96-fdac-9441-84929b0aeeea (at 10.47.1.71@o2ib1) reconnecting
Sep 06 03:52:02 csd3-mds2 kernel: Lustre: Skipped 322 previous similar messages
Sep 06 03:55:31 csd3-mds2 kernel: Lustre: 5453:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630896924/real 1630896924]  req@ffff9f74ba0be300 x1709807695451840/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630896931 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 03:59:05 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 03:59:05 csd3-mds2 kernel: LustreError: Skipped 116 previous similar messages
Sep 06 04:00:18 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 04:00:18 csd3-mds2 kernel: Lustre: Skipped 311 previous similar messages
Sep 06 04:02:04 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2) reconnecting
Sep 06 04:02:04 csd3-mds2 kernel: Lustre: Skipped 294 previous similar messages
Sep 06 04:04:38 csd3-mds2 kernel: Lustre: 26758:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630897471/real 1630897471]  req@ffff9f88ca268d80 x1709807696205184/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630897478 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 04:09:26 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 04:09:26 csd3-mds2 kernel: LustreError: Skipped 96 previous similar messages
Sep 06 04:10:21 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 04:10:21 csd3-mds2 kernel: Lustre: Skipped 228 previous similar messages
Sep 06 04:12:05 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2) reconnecting
Sep 06 04:12:05 csd3-mds2 kernel: Lustre: Skipped 269 previous similar messages
Sep 06 04:19:29 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 04:19:29 csd3-mds2 kernel: LustreError: Skipped 96 previous similar messages
Sep 06 04:19:45 csd3-mds2 kernel: Lustre: 13809:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630898378/real 1630898378]  req@ffff9f829f6ada00 x1709807697717056/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630898385 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 04:19:52 csd3-mds2 kernel: Lustre: 26810:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630898385/real 1630898385]  req@ffff9f9858ceb180 x1709807697730688/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630898392 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 04:20:21 csd3-mds2 kernel: Lustre: 13789:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630898414/real 1630898414]  req@ffff9f8297353180 x1709807697777088/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630898421 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 04:20:21 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2)
Sep 06 04:20:21 csd3-mds2 kernel: Lustre: Skipped 402 previous similar messages
Sep 06 04:20:28 csd3-mds2 kernel: Lustre: 13789:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630898421/real 1630898421]  req@ffff9f8297353180 x1709807697777088/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630898428 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Sep 06 04:21:24 csd3-mds2 kernel: Lustre: 5457:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630898477/real 1630898477]  req@ffff9f857ff73a80 x1709807697865536/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630898484 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 04:21:38 csd3-mds2 kernel: Lustre: 5457:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630898491/real 1630898491]  req@ffff9f857ff73a80 x1709807697865536/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630898498 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Sep 06 04:21:38 csd3-mds2 kernel: Lustre: 5457:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 04:22:17 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 04:22:17 csd3-mds2 kernel: Lustre: Skipped 392 previous similar messages
Sep 06 04:22:44 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 04:22:44 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 11, rc: 32
Sep 06 04:22:45 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.168@o2ib2: -125
Sep 06 04:27:39 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 095f929a-6781-7d79-e2a0-8f721baaa6c8 (at 10.43.101.8@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa678726000, cur 1630898859 expire 1630898709 last 1630898632
Sep 06 04:29:32 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 04:29:32 csd3-mds2 kernel: LustreError: Skipped 132 previous similar messages
Sep 06 04:30:23 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 04:30:23 csd3-mds2 kernel: Lustre: Skipped 216 previous similar messages
Sep 06 04:31:27 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 095f929a-6781-7d79-e2a0-8f721baaa6c8 (at 10.43.101.8@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f88c6c1c000, cur 1630899087 expire 1630898937 last 1630898860
Sep 06 04:32:12 csd3-mds2 kernel: Lustre: 13790:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630899125/real 1630899125]  req@ffff9fa4657bde80 x1709807698788096/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630899132 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 04:32:17 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2) reconnecting
Sep 06 04:32:17 csd3-mds2 kernel: Lustre: Skipped 225 previous similar messages
Sep 06 04:35:17 csd3-mds2 kernel: Lustre: 5451:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630899310/real 1630899310]  req@ffff9f763bf1ad00 x1709807699060160/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630899317 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 04:39:34 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.47.20.88@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 04:39:34 csd3-mds2 kernel: LustreError: Skipped 221 previous similar messages
Sep 06 04:40:24 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to  (at 10.43.101.8@tcp2)
Sep 06 04:40:24 csd3-mds2 kernel: Lustre: Skipped 220 previous similar messages
Sep 06 04:42:18 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 04:42:18 csd3-mds2 kernel: Lustre: Skipped 214 previous similar messages
Sep 06 04:49:34 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 04:49:34 csd3-mds2 kernel: LustreError: Skipped 184 previous similar messages
Sep 06 04:50:26 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 04:50:26 csd3-mds2 kernel: Lustre: Skipped 454 previous similar messages
Sep 06 04:52:20 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) reconnecting
Sep 06 04:52:20 csd3-mds2 kernel: Lustre: Skipped 473 previous similar messages
Sep 06 04:57:30 csd3-mds2 kernel: Lustre: 26745:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630900643/real 1630900643]  req@ffff9fa2bef8a880 x1709807700867328/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630900650 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 04:58:25 csd3-mds2 kernel: Lustre: 26745:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630900698/real 1630900698]  req@ffff9f986861ec00 x1709807700945216/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630900705 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 04:59:39 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 04:59:39 csd3-mds2 kernel: LustreError: Skipped 86 previous similar messages
Sep 06 05:00:36 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 05:00:36 csd3-mds2 kernel: Lustre: Skipped 373 previous similar messages
Sep 06 05:01:32 csd3-mds2 kernel: Lustre: 13784:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630900885/real 1630900885]  req@ffff9fa2bf409b00 x1709807701229504/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630900892 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 05:01:39 csd3-mds2 kernel: Lustre: 13784:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630900892/real 1630900892]  req@ffff9fa2bf409b00 x1709807701229504/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630900899 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 05:01:46 csd3-mds2 kernel: Lustre: 13821:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630900899/real 1630900899]  req@ffff9f72ce7c5580 x1709807701245184/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630900906 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 05:01:46 csd3-mds2 kernel: Lustre: 13821:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 05:02:21 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) reconnecting
Sep 06 05:02:21 csd3-mds2 kernel: Lustre: Skipped 318 previous similar messages
Sep 06 05:02:59 csd3-mds2 kernel: Lustre: 13803:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630900972/real 1630900972]  req@ffff9f72ed189f80 x1709807701361472/t0(0) o104->rds-d4-MDT0000@10.47.20.186@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630900979 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 05:09:59 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 05:09:59 csd3-mds2 kernel: LustreError: Skipped 46 previous similar messages
Sep 06 05:10:37 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to  (at 10.43.101.8@tcp2)
Sep 06 05:10:37 csd3-mds2 kernel: Lustre: Skipped 194 previous similar messages
Sep 06 05:12:21 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) reconnecting
Sep 06 05:12:21 csd3-mds2 kernel: Lustre: Skipped 228 previous similar messages
Sep 06 05:14:45 csd3-mds2 kernel: Lustre: 26784:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630901678/real 1630901678]  req@ffff9f98075dd580 x1709807702469376/t0(0) o104->rds-d4-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630901685 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 05:20:05 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 05:20:05 csd3-mds2 kernel: LustreError: Skipped 81 previous similar messages
Sep 06 05:20:46 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 05:20:46 csd3-mds2 kernel: Lustre: Skipped 409 previous similar messages
Sep 06 05:22:22 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) reconnecting
Sep 06 05:22:22 csd3-mds2 kernel: Lustre: Skipped 335 previous similar messages
Sep 06 05:30:06 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 05:30:06 csd3-mds2 kernel: LustreError: Skipped 146 previous similar messages
Sep 06 05:30:47 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to  (at 10.43.101.8@tcp2)
Sep 06 05:30:47 csd3-mds2 kernel: Lustre: Skipped 312 previous similar messages
Sep 06 05:32:24 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client d9f5d2f8-85b5-ba69-b1eb-d34c01481169 (at 10.43.101.13@tcp2) reconnecting
Sep 06 05:32:24 csd3-mds2 kernel: Lustre: Skipped 368 previous similar messages
Sep 06 05:35:07 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
Sep 06 05:35:07 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 13, rc: 32
Sep 06 05:35:08 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.162@o2ib2: -125
Sep 06 05:35:08 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Skipped 2 previous similar messages
Sep 06 05:39:34 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f970b492400, cur 1630903174 expire 1630903024 last 1630902947
Sep 06 05:40:10 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 05:40:10 csd3-mds2 kernel: LustreError: Skipped 160 previous similar messages
Sep 06 05:40:50 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 05:40:50 csd3-mds2 kernel: Lustre: Skipped 459 previous similar messages
Sep 06 05:42:25 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) reconnecting
Sep 06 05:42:25 csd3-mds2 kernel: Lustre: Skipped 444 previous similar messages
Sep 06 05:50:25 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 05:50:25 csd3-mds2 kernel: LustreError: Skipped 82 previous similar messages
Sep 06 05:50:52 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 05:50:52 csd3-mds2 kernel: Lustre: Skipped 448 previous similar messages
Sep 06 05:51:40 csd3-mds2 kernel: Lustre: 26811:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630903893/real 1630903893]  req@ffff9f6c8398b600 x1709807705668800/t0(0) o104->rds-d4-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630903900 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 05:52:25 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 05:52:25 csd3-mds2 kernel: Lustre: Skipped 461 previous similar messages
Sep 06 05:55:19 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 05:55:19 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 14, rc: 32
Sep 06 05:55:20 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.166@o2ib2: -125
Sep 06 05:56:07 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f88cab0d000, cur 1630904167 expire 1630904017 last 1630903940
Sep 06 05:56:10 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 05:56:10 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 5, rc: 32
Sep 06 05:56:10 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:1531:kiblnd_reconnect_peer()) Abort reconnection of 10.44.240.167@o2ib2: accepting
Sep 06 05:56:11 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.161@o2ib2: -125
Sep 06 05:56:11 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message
Sep 06 06:00:32 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 06:00:32 csd3-mds2 kernel: LustreError: Skipped 117 previous similar messages
Sep 06 06:00:52 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 06:00:52 csd3-mds2 kernel: Lustre: Skipped 329 previous similar messages
Sep 06 06:01:16 csd3-mds2 kernel: Lustre: 26771:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630904469/real 1630904469]  req@ffff9f97c9d99680 x1709807706469504/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630904476 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 06:01:23 csd3-mds2 kernel: Lustre: 26771:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630904476/real 1630904476]  req@ffff9f97c9d99680 x1709807706469504/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630904483 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Sep 06 06:01:30 csd3-mds2 kernel: Lustre: 26771:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630904483/real 1630904483]  req@ffff9f97c9d99680 x1709807706469504/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630904490 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Sep 06 06:01:37 csd3-mds2 kernel: Lustre: 26771:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630904490/real 1630904490]  req@ffff9f97c9d99680 x1709807706469504/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630904497 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Sep 06 06:02:26 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 06:02:26 csd3-mds2 kernel: Lustre: Skipped 279 previous similar messages
Sep 06 06:10:33 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.10.30@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 06:10:33 csd3-mds2 kernel: LustreError: Skipped 95 previous similar messages
Sep 06 06:10:55 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 06:10:55 csd3-mds2 kernel: Lustre: Skipped 326 previous similar messages
Sep 06 06:12:27 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 06:12:27 csd3-mds2 kernel: Lustre: Skipped 349 previous similar messages
Sep 06 06:20:35 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.47.20.130@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 06:20:35 csd3-mds2 kernel: LustreError: Skipped 75 previous similar messages
Sep 06 06:20:56 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 06:20:56 csd3-mds2 kernel: Lustre: Skipped 363 previous similar messages
Sep 06 06:22:28 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) reconnecting
Sep 06 06:22:28 csd3-mds2 kernel: Lustre: Skipped 361 previous similar messages
Sep 06 06:30:37 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 06:30:37 csd3-mds2 kernel: LustreError: Skipped 126 previous similar messages
Sep 06 06:30:56 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to  (at 10.43.101.8@tcp2)
Sep 06 06:30:56 csd3-mds2 kernel: Lustre: Skipped 202 previous similar messages
Sep 06 06:32:28 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client d9f5d2f8-85b5-ba69-b1eb-d34c01481169 (at 10.43.101.13@tcp2) reconnecting
Sep 06 06:32:28 csd3-mds2 kernel: Lustre: Skipped 189 previous similar messages
Sep 06 06:35:23 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8483d68800, cur 1630906523 expire 1630906373 last 1630906296
Sep 06 06:40:38 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 06:40:38 csd3-mds2 kernel: LustreError: Skipped 79 previous similar messages
Sep 06 06:41:03 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to  (at 10.43.101.8@tcp2)
Sep 06 06:41:03 csd3-mds2 kernel: Lustre: Skipped 273 previous similar messages
Sep 06 06:42:31 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) reconnecting
Sep 06 06:42:31 csd3-mds2 kernel: Lustre: Skipped 243 previous similar messages
Sep 06 06:48:15 csd3-mds2 kernel: Lustre: 5450:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630907288/real 1630907288]  req@ffff9f8064df0d80 x1709807710694912/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630907295 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 06:50:40 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 06:50:40 csd3-mds2 kernel: LustreError: Skipped 128 previous similar messages
Sep 06 06:51:12 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to  (at 10.43.101.8@tcp2)
Sep 06 06:51:12 csd3-mds2 kernel: Lustre: Skipped 206 previous similar messages
Sep 06 06:52:36 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) reconnecting
Sep 06 06:52:36 csd3-mds2 kernel: Lustre: Skipped 213 previous similar messages
Sep 06 06:52:42 csd3-mds2 kernel: Lustre: 13809:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630907555/real 1630907555]  req@ffff9f7265ff3f00 x1709807711095616/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630907562 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 06:57:30 csd3-mds2 kernel: Lustre: 26754:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630907843/real 1630907843]  req@ffff9f809ec43f00 x1709807711512384/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630907850 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 07:00:48 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 07:00:48 csd3-mds2 kernel: LustreError: Skipped 127 previous similar messages
Sep 06 07:01:13 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 07:01:13 csd3-mds2 kernel: Lustre: Skipped 213 previous similar messages
Sep 06 07:01:20 csd3-mds2 kernel: Lustre: 13818:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630908073/real 1630908073]  req@ffff9f80b54fda00 x1709807711810368/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630908080 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 07:01:49 csd3-mds2 kernel: Lustre: 26876:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630908102/real 1630908102]  req@ffff9f809ec1cc80 x1709807711856768/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630908109 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 07:02:39 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 07:02:39 csd3-mds2 kernel: Lustre: Skipped 238 previous similar messages
Sep 06 07:04:29 csd3-mds2 kernel: Lustre: 13821:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630908262/real 1630908262]  req@ffff9f72c573e300 x1709807712083456/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630908269 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 07:04:29 csd3-mds2 kernel: Lustre: 13821:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 07:11:00 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 07:11:00 csd3-mds2 kernel: LustreError: Skipped 24 previous similar messages
Sep 06 07:11:13 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 07:11:13 csd3-mds2 kernel: Lustre: Skipped 299 previous similar messages
Sep 06 07:11:22 csd3-mds2 kernel: Lustre: 26836:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630908675/real 1630908675]  req@ffff9f80ba7e8d80 x1709807712665664/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630908682 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 07:12:40 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) reconnecting
Sep 06 07:12:40 csd3-mds2 kernel: Lustre: Skipped 301 previous similar messages
Sep 06 07:21:14 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2)
Sep 06 07:21:14 csd3-mds2 kernel: Lustre: Skipped 528 previous similar messages
Sep 06 07:21:48 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 07:21:48 csd3-mds2 kernel: LustreError: Skipped 137 previous similar messages
Sep 06 07:22:40 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 07:22:40 csd3-mds2 kernel: Lustre: Skipped 576 previous similar messages
Sep 06 07:31:17 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 07:31:17 csd3-mds2 kernel: Lustre: Skipped 636 previous similar messages
Sep 06 07:32:13 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 07:32:13 csd3-mds2 kernel: LustreError: Skipped 55 previous similar messages
Sep 06 07:32:41 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2) reconnecting
Sep 06 07:32:41 csd3-mds2 kernel: Lustre: Skipped 590 previous similar messages
Sep 06 07:41:18 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 07:41:18 csd3-mds2 kernel: Lustre: Skipped 503 previous similar messages
Sep 06 07:42:33 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 07:42:33 csd3-mds2 kernel: LustreError: Skipped 96 previous similar messages
Sep 06 07:42:41 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 07:42:41 csd3-mds2 kernel: Lustre: Skipped 551 previous similar messages
Sep 06 07:51:18 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 07:51:18 csd3-mds2 kernel: Lustre: Skipped 539 previous similar messages
Sep 06 07:52:38 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 07:52:38 csd3-mds2 kernel: LustreError: Skipped 153 previous similar messages
Sep 06 07:52:43 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 07:52:43 csd3-mds2 kernel: Lustre: Skipped 515 previous similar messages
Sep 06 08:01:24 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 08:01:24 csd3-mds2 kernel: Lustre: Skipped 494 previous similar messages
Sep 06 08:02:39 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 08:02:39 csd3-mds2 kernel: LustreError: Skipped 125 previous similar messages
Sep 06 08:02:44 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2) reconnecting
Sep 06 08:02:44 csd3-mds2 kernel: Lustre: Skipped 457 previous similar messages
Sep 06 08:03:46 csd3-mds2 kernel: Lustre: 26725:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630911819/real 1630911819]  req@ffff9f74ba0bf980 x1709807716964224/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630911826 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 08:03:46 csd3-mds2 kernel: Lustre: 26725:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 08:07:54 csd3-mds2 kernel: Lustre: 26544:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630912067/real 1630912067]  req@ffff9f6c80a23600 x1709807717276672/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630912074 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 08:07:54 csd3-mds2 kernel: Lustre: 26544:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 08:11:25 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 08:11:25 csd3-mds2 kernel: Lustre: Skipped 242 previous similar messages
Sep 06 08:12:42 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 08:12:42 csd3-mds2 kernel: LustreError: Skipped 69 previous similar messages
Sep 06 08:12:43 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f76c3229400, cur 1630912363 expire 1630912213 last 1630912136
Sep 06 08:12:46 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 08:12:46 csd3-mds2 kernel: Lustre: Skipped 245 previous similar messages
Sep 06 08:12:57 csd3-mds2 kernel: Lustre: 26789:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630912370/real 1630912370]  req@ffff9f97a6846c00 x1709807717658816/t0(0) o104->rds-d5-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630912377 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 08:13:18 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) in 206 seconds. I think it's dead, and I am evicting it. exp ffff9fa2c73aac00, cur 1630912398 expire 1630912248 last 1630912192
Sep 06 08:13:18 csd3-mds2 kernel: Lustre: Skipped 1 previous similar message
Sep 06 08:17:36 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2) in 213 seconds. I think it's dead, and I am evicting it. exp ffff9f8913e7c000, cur 1630912656 expire 1630912506 last 1630912443
Sep 06 08:20:13 csd3-mds2 kernel: Lustre: 26796:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630912806/real 1630912806]  req@ffff9f6af2ee7080 x1709807719836928/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630912813 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 08:21:26 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 08:21:26 csd3-mds2 kernel: Lustre: Skipped 345 previous similar messages
Sep 06 08:21:46 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Sep 06 08:21:46 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 5, rc: 32
Sep 06 08:22:49 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 08:22:49 csd3-mds2 kernel: LustreError: Skipped 113 previous similar messages
Sep 06 08:22:51 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) reconnecting
Sep 06 08:22:51 csd3-mds2 kernel: Lustre: Skipped 375 previous similar messages
Sep 06 08:29:38 csd3-mds2 kernel: Lustre: 26813:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630913370/real 1630913370]  req@ffff9f6af146a400 x1709807720649536/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630913377 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 08:31:29 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 08:31:29 csd3-mds2 kernel: Lustre: Skipped 243 previous similar messages
Sep 06 08:32:52 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 08:32:52 csd3-mds2 kernel: Lustre: Skipped 219 previous similar messages
Sep 06 08:32:52 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 08:32:52 csd3-mds2 kernel: LustreError: Skipped 43 previous similar messages
Sep 06 08:33:53 csd3-mds2 kernel: Lustre: 26809:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630913626/real 1630913626]  req@ffff9f9c67b9bf00 x1709807720977408/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630913633 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 08:33:53 csd3-mds2 kernel: Lustre: 26809:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 08:35:03 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 095f929a-6781-7d79-e2a0-8f721baaa6c8 (at 10.43.101.8@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f88ca632800, cur 1630913703 expire 1630913553 last 1630913476
Sep 06 08:41:31 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 08:41:31 csd3-mds2 kernel: Lustre: Skipped 232 previous similar messages
Sep 06 08:42:53 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 08:42:53 csd3-mds2 kernel: LustreError: Skipped 165 previous similar messages
Sep 06 08:42:57 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) reconnecting
Sep 06 08:42:57 csd3-mds2 kernel: Lustre: Skipped 214 previous similar messages
Sep 06 08:50:41 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.43.240.198@tcp2  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9fa1b56ead00/0x41c5233eed35d9aa lrc: 3/0,0 mode: PR/PR res: [0x200011ce9:0xa:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.43.240.198@tcp2 remote: 0xcbc3807f65bb3259 expref: 1019 pid: 13832 timeout: 409675 lvb_type: 0
Sep 06 08:51:21 csd3-mds2 kernel: Lustre: 26865:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 08:51:36 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 08:51:36 csd3-mds2 kernel: Lustre: Skipped 255 previous similar messages
Sep 06 08:52:22 csd3-mds2 kernel: LustreError: 26841:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 08:52:22 csd3-mds2 kernel: LustreError: 26841:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8712 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 08:52:22 csd3-mds2 kernel: Lustre: 26841:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (23308:140s); client may timeout.  req@ffff9f73eacdd580 x1709712633557440/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:644/0 lens 264/192 e 1 to 0 dl 1630914602 ref 1 fl Complete:/0/0 rc -2/-2
Sep 06 08:52:22 csd3-mds2 kernel: LNet: Service thread pid 26841 completed after 23448.21s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 08:52:40 csd3-mds2 kernel: Lustre: 26802:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630914753/real 1630914753]  req@ffff9fa2c072de80 x1709807722689792/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630914760 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 08:52:40 csd3-mds2 kernel: Lustre: 26802:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Sep 06 08:53:02 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 08:53:02 csd3-mds2 kernel: Lustre: Skipped 233 previous similar messages
Sep 06 08:53:49 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 08:53:49 csd3-mds2 kernel: LustreError: Skipped 88 previous similar messages
Sep 06 08:54:14 csd3-mds2 kernel: Lustre: 13832:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630914847/real 1630914847]  req@ffff9f9c61071200 x1709807722809152/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630914854 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 08:54:14 csd3-mds2 kernel: Lustre: 13832:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep 06 08:54:42 csd3-mds2 kernel: LNet: Service thread pid 26865 was inactive for 200.44s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 08:54:42 csd3-mds2 kernel: Pid: 26865, comm: mdt01_076 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 08:54:42 csd3-mds2 kernel: Call Trace:
Sep 06 08:54:42 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 08:54:42 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630914882.26865
Sep 06 08:57:01 csd3-mds2 kernel: Lustre: 26769:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630915014/real 1630915014]  req@ffff9f979253c380 x1709807723019584/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630915021 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 09:01:14 csd3-mds2 kernel: Lustre: 26876:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630915267/real 1630915267]  req@ffff9f9c86f00900 x1709807723350784/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630915274 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 09:01:14 csd3-mds2 kernel: Lustre: 26876:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 09:01:40 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2)
Sep 06 09:01:40 csd3-mds2 kernel: Lustre: Skipped 74 previous similar messages
Sep 06 09:03:06 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client a2647704-cc85-a7e0-0bf2-95d98f0c7b96 (at 10.43.240.198@tcp2) reconnecting
Sep 06 09:03:06 csd3-mds2 kernel: Lustre: Skipped 92 previous similar messages
Sep 06 09:03:53 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 09:03:53 csd3-mds2 kernel: LustreError: Skipped 48 previous similar messages
Sep 06 09:06:05 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f7fadc2f400, cur 1630915565 expire 1630915415 last 1630915338
Sep 06 09:07:01 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f7f20a8c400, cur 1630915621 expire 1630915471 last 1630915394
Sep 06 09:07:01 csd3-mds2 kernel: Lustre: Skipped 1 previous similar message
Sep 06 09:07:57 csd3-mds2 kernel: Lustre: 5457:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630915670/real 1630915670]  req@ffff9f9fa74e2d00 x1709807723906752/t0(0) o104->rds-d4-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630915677 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 09:07:57 csd3-mds2 kernel: Lustre: 5457:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep 06 09:11:40 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 09:11:40 csd3-mds2 kernel: Lustre: Skipped 268 previous similar messages
Sep 06 09:13:07 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) reconnecting
Sep 06 09:13:07 csd3-mds2 kernel: Lustre: Skipped 340 previous similar messages
Sep 06 09:13:56 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 09:13:56 csd3-mds2 kernel: LustreError: Skipped 97 previous similar messages
Sep 06 09:13:59 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa33b6b3c00, cur 1630916039 expire 1630915889 last 1630915812
Sep 06 09:13:59 csd3-mds2 kernel: Lustre: Skipped 1 previous similar message
Sep 06 09:21:40 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 09:21:40 csd3-mds2 kernel: Lustre: Skipped 620 previous similar messages
Sep 06 09:23:07 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) reconnecting
Sep 06 09:23:07 csd3-mds2 kernel: Lustre: Skipped 552 previous similar messages
Sep 06 09:24:15 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 09:24:15 csd3-mds2 kernel: LustreError: Skipped 68 previous similar messages
Sep 06 09:26:50 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa538ab9c00, cur 1630916810 expire 1630916660 last 1630916583
Sep 06 09:28:50 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 095f929a-6781-7d79-e2a0-8f721baaa6c8 (at 10.43.101.8@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f825888e000, cur 1630916930 expire 1630916780 last 1630916703
Sep 06 09:31:41 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2)
Sep 06 09:31:41 csd3-mds2 kernel: Lustre: Skipped 410 previous similar messages
Sep 06 09:33:09 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 09:33:09 csd3-mds2 kernel: Lustre: Skipped 429 previous similar messages
Sep 06 09:34:24 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 09:34:24 csd3-mds2 kernel: LustreError: Skipped 73 previous similar messages
Sep 06 09:38:09 csd3-mds2 kernel: Lustre: 13812:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-2208), not sending early reply
                                    req@ffff9fa2b996ba80 x1709712920160064/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:469/0 lens 264/224 e 1 to 0 dl 1630917494 ref 2 fl Interpret:/0/0 rc 0/0
Sep 06 09:41:47 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 09:41:47 csd3-mds2 kernel: Lustre: Skipped 326 previous similar messages
Sep 06 09:42:16 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa96555ec00, cur 1630917736 expire 1630917586 last 1630917509
Sep 06 09:43:11 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 09:43:11 csd3-mds2 kernel: Lustre: Skipped 303 previous similar messages
Sep 06 09:44:27 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 09:44:27 csd3-mds2 kernel: LustreError: Skipped 135 previous similar messages
Sep 06 09:45:02 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client d9f5d2f8-85b5-ba69-b1eb-d34c01481169 (at 10.43.101.13@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8b5b3e1800, cur 1630917902 expire 1630917752 last 1630917675
Sep 06 09:51:49 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 09:51:49 csd3-mds2 kernel: Lustre: Skipped 307 previous similar messages
Sep 06 09:51:52 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f825b309400, cur 1630918312 expire 1630918162 last 1630918085
Sep 06 09:53:11 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) reconnecting
Sep 06 09:53:11 csd3-mds2 kernel: Lustre: Skipped 312 previous similar messages
Sep 06 09:54:42 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.8@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 09:54:42 csd3-mds2 kernel: LustreError: Skipped 91 previous similar messages
Sep 06 09:59:37 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: active_txs, 7 seconds
Sep 06 09:59:37 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 31, oc: 9, rc: 31
Sep 06 10:00:02 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: active_txs, 4 seconds
Sep 06 10:00:02 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 11, oc: 8, rc: 31
Sep 06 10:01:49 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 3313c416-e7de-40f4-b631-02f015084f52 (at 10.47.1.202@o2ib1)
Sep 06 10:01:49 csd3-mds2 kernel: Lustre: Skipped 598 previous similar messages
Sep 06 10:02:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 10:02:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (29): c: 0, oc: 5, rc: 32
Sep 06 10:03:11 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 6 seconds
Sep 06 10:03:11 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (41): c: 0, oc: 15, rc: 32
Sep 06 10:03:12 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 264885d1-4d30-1628-dc18-5127c0d24de6 (at 10.47.1.203@o2ib1) reconnecting
Sep 06 10:03:12 csd3-mds2 kernel: Lustre: Skipped 757 previous similar messages
Sep 06 10:03:24 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 10:03:24 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 10:03:24 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (54): c: 0, oc: 1, rc: 32
Sep 06 10:03:24 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 10:03:24 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 10.144.9.51@o2ib from 10.44.240.69@o2ib2
Sep 06 10:03:24 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2005:lnet_handle_find_routed_path()) Skipped 13 previous similar messages
Sep 06 10:03:24 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.144.9.51@o2ib: -113
Sep 06 10:05:21 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.144.9.51@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 10:05:21 csd3-mds2 kernel: LustreError: Skipped 1139 previous similar messages
Sep 06 10:06:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 10:06:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 8, rc: 32
Sep 06 10:09:16 csd3-mds2 kernel: Lustre: 5447:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630919349/real 1630919349]  req@ffff9f72bfbb8480 x1709807729705472/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630919356 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 10:09:16 csd3-mds2 kernel: Lustre: 5447:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Sep 06 10:09:46 csd3-mds2 kernel: LustreError: 26925:0:(ldlm_lib.c:3356:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff9f7216cf0480 x1708349218551360/t0(0) o37->060736dc-5ffe-ebfb-3563-c0b99fb1a67a@10.47.7.11@o2ib1:121/0 lens 448/440 e 1 to 0 dl 1630919411 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 10:11:45 csd3-mds2 kernel: Lustre: 26543:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630919498/real 1630919498]  req@ffff9f82e3780d80 x1709807729955968/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630919505 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 10:11:45 csd3-mds2 kernel: Lustre: 26543:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 10:11:50 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 264885d1-4d30-1628-dc18-5127c0d24de6 (at 10.47.1.203@o2ib1)
Sep 06 10:11:50 csd3-mds2 kernel: Lustre: Skipped 1486 previous similar messages
Sep 06 10:13:15 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 264885d1-4d30-1628-dc18-5127c0d24de6 (at 10.47.1.203@o2ib1) reconnecting
Sep 06 10:13:15 csd3-mds2 kernel: Lustre: Skipped 1322 previous similar messages
Sep 06 10:14:18 csd3-mds2 kernel: Lustre: 26867:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630919651/real 1630919651]  req@ffff9f7298fabf00 x1709807730159296/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630919658 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 10:14:18 csd3-mds2 kernel: Lustre: 26867:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 10:15:22 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.144.13.4@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 10:15:22 csd3-mds2 kernel: LustreError: Skipped 2231 previous similar messages
Sep 06 10:19:05 csd3-mds2 kernel: Lustre: 26820:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630919938/real 0]  req@ffff9f71c9570480 x1709807730816896/t0(0) o104->rds-d4-MDT0000@10.47.1.126@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630919945 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 10:19:05 csd3-mds2 kernel: Lustre: 26820:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
Sep 06 10:19:09 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.43.240.199@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f77a1f44900/0x41c5233f00b2790c lrc: 4/0,0 mode: PR/PR res: [0x200058057:0x65e8:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.43.240.199@tcp2 remote: 0x2b7e3391c1b9e4dc expref: 299461 pid: 13816 timeout: 414983 lvb_type: 0
Sep 06 10:19:10 csd3-mds2 kernel: LustreError: 26854:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f9759a3c380 x1709807730829120/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 10:19:10 csd3-mds2 kernel: LustreError: 26854:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
Sep 06 10:19:13 csd3-mds2 kernel: LustreError: 26802:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f9e5cb80900 x1709807730833472/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 10:19:14 csd3-mds2 kernel: LustreError: 26753:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f97484df500 x1709807730834816/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 10:19:14 csd3-mds2 kernel: LustreError: 26753:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Sep 06 10:20:49 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.43.240.199@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f88bf93cb40/0x41c5233f00afc94c lrc: 3/0,0 mode: PR/PR res: [0x200058057:0x61f8:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.43.240.199@tcp2 remote: 0x2b7e3391c1b9c769 expref: 26145 pid: 26798 timeout: 415083 lvb_type: 0
Sep 06 10:20:53 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.43.240.199@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9fa23cecaac0/0x41c5233f00b194f9 lrc: 3/0,0 mode: PR/PR res: [0x200058057:0x641b:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.43.240.199@tcp2 remote: 0x2b7e3391c1b9d7df expref: 20713 pid: 13816 timeout: 415087 lvb_type: 0
Sep 06 10:21:04 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8933959000, cur 1630920064 expire 1630919914 last 1630919837
Sep 06 10:21:04 csd3-mds2 kernel: Lustre: Skipped 1 previous similar message
Sep 06 10:21:30 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 4 seconds
Sep 06 10:21:30 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 1, rc: 32
Sep 06 10:21:53 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 10:21:53 csd3-mds2 kernel: Lustre: Skipped 654 previous similar messages
Sep 06 10:23:29 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 80560c5c-3c7e-04a3-df3d-a9dcfe515124 (at 10.43.240.199@tcp2) reconnecting
Sep 06 10:23:29 csd3-mds2 kernel: Lustre: Skipped 625 previous similar messages
Sep 06 10:26:02 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.8@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 10:26:02 csd3-mds2 kernel: LustreError: Skipped 45 previous similar messages
Sep 06 10:28:52 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: active_txs, 7 seconds
Sep 06 10:28:52 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 18, oc: 3, rc: 31
Sep 06 10:28:52 csd3-mds2 kernel: LustreError: 10887:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff9f88cb0a2c00
Sep 06 10:29:17 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.47.7.12@o2ib1  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f84d5c8b840/0x41c5233f063d4ed1 lrc: 3/0,0 mode: PR/PR res: [0x200000bd2:0x2:0x0].0x0 bits 0x13/0x0 rrc: 8 type: IBT flags: 0x60200400000020 nid: 10.47.7.12@o2ib1 remote: 0xa1b61a4c0a18136f expref: 65 pid: 26803 timeout: 415591 lvb_type: 0
Sep 06 10:29:17 csd3-mds2 kernel: LustreError: 26841:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) ### lock on destroyed export ffff9f8b26799800 ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f7baa80d680/0x41c5233f064e5fcd lrc: 3/0,0 mode: PR/PR res: [0x200000bd2:0x2:0x0].0x0 bits 0x13/0x0 rrc: 7 type: IBT flags: 0x50200000000000 nid: 10.47.7.12@o2ib1 remote: 0xa1b61a4c0a182cba expref: 61 pid: 26841 timeout: 0 lvb_type: 0
Sep 06 10:29:35 csd3-mds2 kernel: LustreError: 26550:0:(ldlm_lib.c:3346:target_bulk_io()) @@@ timeout on bulk READ after 100+1630504966s  req@ffff9f71cf4e5580 x1708357498630912/t0(0) o37->40a883ed-8fd4-6381-3ce6-dfd32eb9df03@10.47.7.9@o2ib1:535/0 lens 448/440 e 2 to 0 dl 1630920580 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 10:30:22 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.47.1.255@o2ib1  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f7d715f3180/0x41c5233f0637427a lrc: 3/0,0 mode: PR/PR res: [0x20005777b:0x1597:0x0].0x0 bits 0x13/0x0 rrc: 39 type: IBT flags: 0x60200400000020 nid: 10.47.1.255@o2ib1 remote: 0x1df4285111e1d60a expref: 15 pid: 13799 timeout: 415656 lvb_type: 0
Sep 06 10:30:45 csd3-mds2 kernel: Lustre: 26795:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630920638/real 1630920638]  req@ffff9f98dc70f980 x1709807731969920/t0(0) o104->rds-d4-MDT0000@10.47.2.43@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630920645 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 10:30:45 csd3-mds2 kernel: Lustre: 26795:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Sep 06 10:31:54 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to a6782f48-69a2-3388-872c-19a28f20914b (at 10.47.1.195@o2ib1)
Sep 06 10:31:54 csd3-mds2 kernel: Lustre: Skipped 152 previous similar messages
Sep 06 10:32:49 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa34034c000, cur 1630920769 expire 1630920619 last 1630920542
Sep 06 10:33:37 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 80560c5c-3c7e-04a3-df3d-a9dcfe515124 (at 10.43.240.199@tcp2) reconnecting
Sep 06 10:33:37 csd3-mds2 kernel: Lustre: Skipped 148 previous similar messages
Sep 06 10:36:24 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.47.1.255@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 10:36:24 csd3-mds2 kernel: LustreError: Skipped 44 previous similar messages
Sep 06 10:40:38 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 10:40:38 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 15, rc: 32
Sep 06 10:40:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds
Sep 06 10:40:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (1): c: 0, oc: 4, rc: 32
Sep 06 10:41:55 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 0dd1bf78-2b16-e051-a937-9565eb11ae3a (at 10.47.1.193@o2ib1)
Sep 06 10:41:55 csd3-mds2 kernel: Lustre: Skipped 384 previous similar messages
Sep 06 10:42:07 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
Sep 06 10:42:07 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 2, rc: 32
Sep 06 10:42:07 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:1531:kiblnd_reconnect_peer()) Abort reconnection of 10.44.240.167@o2ib2: accepting
Sep 06 10:42:43 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 53s: evicting client at 10.47.1.215@o2ib1  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f87d10b4900/0x41c5233f06fd41b2 lrc: 3/0,0 mode: PR/PR res: [0x200000bd2:0x2:0x0].0x0 bits 0x13/0x0 rrc: 55 type: IBT flags: 0x60200400000020 nid: 10.47.1.215@o2ib1 remote: 0x68910abc720e26cd expref: 9 pid: 26743 timeout: 416397 lvb_type: 0
Sep 06 10:42:43 csd3-mds2 kernel: Lustre: 13817:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630921356/real 1630921356]  req@ffff9f71f5701200 x1709807733520960/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630921363 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 10:42:43 csd3-mds2 kernel: Lustre: 13817:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Sep 06 10:43:41 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 5a52bf39-8f0f-0897-350e-4c943fbe9512 (at 10.47.1.206@o2ib1) reconnecting
Sep 06 10:43:41 csd3-mds2 kernel: Lustre: Skipped 424 previous similar messages
Sep 06 10:46:26 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 80560c5c-3c7e-04a3-df3d-a9dcfe515124 (at 10.43.240.199@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f76c322d800, cur 1630921586 expire 1630921436 last 1630921359
Sep 06 10:46:26 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 161s: evicting client at 10.47.1.215@o2ib1  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f78ac241440/0x41c5233f074a5872 lrc: 3/0,0 mode: PR/PR res: [0x20005777b:0x1597:0x0].0x0 bits 0x13/0x0 rrc: 95 type: IBT flags: 0x60200400000020 nid: 10.47.1.215@o2ib1 remote: 0x68910abc720e2728 expref: 12 pid: 26543 timeout: 416620 lvb_type: 0
Sep 06 10:46:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Sep 06 10:46:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 10:46:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 9, rc: 32
Sep 06 10:46:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 10:46:57 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.47.1.203@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 10:46:57 csd3-mds2 kernel: LustreError: Skipped 53 previous similar messages
Sep 06 10:47:48 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Sep 06 10:47:48 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 9, rc: 32
Sep 06 10:48:51 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Sep 06 10:48:51 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 8, rc: 32
Sep 06 10:48:51 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:1531:kiblnd_reconnect_peer()) Abort reconnection of 10.44.240.167@o2ib2: accepting
Sep 06 10:48:52 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.44.240.167@o2ib2: -125
Sep 06 10:48:52 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Skipped 10 previous similar messages
Sep 06 10:50:07 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 10:50:07 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 1, rc: 32
Sep 06 10:50:08 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.161@o2ib2: -125
Sep 06 10:50:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 5 seconds
Sep 06 10:50:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 9, rc: 32
Sep 06 10:52:01 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 5a52bf39-8f0f-0897-350e-4c943fbe9512 (at 10.47.1.206@o2ib1)
Sep 06 10:52:01 csd3-mds2 kernel: Lustre: Skipped 260 previous similar messages
Sep 06 10:52:49 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa66cb2cc00, cur 1630921969 expire 1630921819 last 1630921742
Sep 06 10:52:49 csd3-mds2 kernel: Lustre: Skipped 1 previous similar message
Sep 06 10:52:51 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Sep 06 10:52:51 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 2 previous similar messages
Sep 06 10:52:51 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 3, rc: 32
Sep 06 10:52:51 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 2 previous similar messages
Sep 06 10:52:52 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.44.240.167@o2ib2: -125
Sep 06 10:52:52 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Skipped 2 previous similar messages
Sep 06 10:53:41 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 8e606ec5-4efb-1eac-ed68-efa93f8ffaf8 (at 10.47.1.215@o2ib1) reconnecting
Sep 06 10:53:41 csd3-mds2 kernel: Lustre: Skipped 259 previous similar messages
Sep 06 10:54:41 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa976b8c000, cur 1630922081 expire 1630921931 last 1630921854
Sep 06 10:55:10 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 2 seconds
Sep 06 10:55:10 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 10:55:10 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 8, rc: 32
Sep 06 10:55:10 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 10:56:20 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f81acb01000, cur 1630922180 expire 1630922030 last 1630921953
Sep 06 10:56:20 csd3-mds2 kernel: Lustre: Skipped 1 previous similar message
Sep 06 10:57:04 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 10:57:04 csd3-mds2 kernel: LustreError: Skipped 47 previous similar messages
Sep 06 10:57:24 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client e407cb41-8659-f766-3dd8-353427d06dfc (at 10.47.1.255@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f89f5d26400, cur 1630922244 expire 1630922094 last 1630922017
Sep 06 10:57:25 csd3-mds2 kernel: Lustre: 26794:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630922234/real 1630922234]  req@ffff9f71bd6cbf00 x1709807735059200/t0(0) o104->rds-d4-MDT0000@10.47.1.255@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630922245 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 10:57:25 csd3-mds2 kernel: Lustre: 26794:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
Sep 06 10:57:39 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) in 201 seconds. I think it's dead, and I am evicting it. exp ffff9fa668fde000, cur 1630922259 expire 1630922109 last 1630922058
Sep 06 10:57:39 csd3-mds2 kernel: Lustre: Skipped 1 previous similar message
Sep 06 10:58:40 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) in 170 seconds. I think it's dead, and I am evicting it. exp ffff9f86d9ef8800, cur 1630922320 expire 1630922170 last 1630922150
Sep 06 10:59:27 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.47.1.215@o2ib1  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f78bbdd5680/0x41c5233f08503e72 lrc: 3/0,0 mode: PR/PR res: [0x200000bd2:0x2:0x0].0x0 bits 0x13/0x0 rrc: 7 type: IBT flags: 0x60200400000020 nid: 10.47.1.215@o2ib1 remote: 0x68910abc720e2878 expref: 7 pid: 13791 timeout: 417401 lvb_type: 0
Sep 06 11:00:10 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client a2647704-cc85-a7e0-0bf2-95d98f0c7b96 (at 10.43.240.198@tcp2) in 217 seconds. I think it's dead, and I am evicting it. exp ffff9fa340dc0400, cur 1630922410 expire 1630922260 last 1630922193
Sep 06 11:01:03 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 5 seconds
Sep 06 11:01:03 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Sep 06 11:01:03 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 1, rc: 32
Sep 06 11:01:03 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 3 previous similar messages
Sep 06 11:01:06 csd3-mds2 kernel: Lustre: 13795:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 11:02:01 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 264885d1-4d30-1628-dc18-5127c0d24de6 (at 10.47.1.203@o2ib1)
Sep 06 11:02:01 csd3-mds2 kernel: Lustre: Skipped 237 previous similar messages
Sep 06 11:02:13 csd3-mds2 kernel: LustreError: 26865:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 11:02:14 csd3-mds2 kernel: LustreError: 26865:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8840 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 11:02:14 csd3-mds2 kernel: Lustre: 26865:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (7585:268s); client may timeout.  req@ffff9fa2b996ba80 x1709712920160064/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:469/0 lens 264/192 e 1 to 0 dl 1630922266 ref 1 fl Complete:/0/0 rc -2/-2
Sep 06 11:02:14 csd3-mds2 kernel: LNet: Service thread pid 26865 completed after 7852.32s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 11:03:45 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 41be58d8-a59b-7f18-b2d9-59b6284ad92d (at 10.47.1.123@o2ib1) reconnecting
Sep 06 11:03:45 csd3-mds2 kernel: Lustre: Skipped 201 previous similar messages
Sep 06 11:04:20 csd3-mds2 kernel: LustreError: 27031:0:(ldlm_lib.c:3346:target_bulk_io()) @@@ timeout on bulk READ after 100+1630504966s  req@ffff9f6c84f15100 x1709732927168192/t0(0) o37->37f96c1e-b6c3-5759-ac7c-9836b7f67b6c@10.43.240.199@tcp2:356/0 lens 448/440 e 4 to 0 dl 1630922666 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 11:05:41 csd3-mds2 kernel: LNetError: 26760:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 10.47.7.14@o2ib1 from <?>
Sep 06 11:05:41 csd3-mds2 kernel: LNetError: 26760:0:(lib-move.c:2005:lnet_handle_find_routed_path()) Skipped 10 previous similar messages
Sep 06 11:05:41 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.43.22.3@tcp2: -113
Sep 06 11:05:42 csd3-mds2 kernel: LNetError: 26861:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 10.47.7.14@o2ib1 from <?>
Sep 06 11:05:42 csd3-mds2 kernel: LNetError: 26861:0:(lib-move.c:2005:lnet_handle_find_routed_path()) Skipped 174791 previous similar messages
Sep 06 11:05:43 csd3-mds2 kernel: LNetError: 26556:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 10.47.7.14@o2ib1 from <?>
Sep 06 11:05:43 csd3-mds2 kernel: LNetError: 26556:0:(lib-move.c:2005:lnet_handle_find_routed_path()) Skipped 420904 previous similar messages
Sep 06 11:05:45 csd3-mds2 kernel: LNetError: 26894:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 10.47.7.14@o2ib1 from <?>
Sep 06 11:05:45 csd3-mds2 kernel: LNetError: 26894:0:(lib-move.c:2005:lnet_handle_find_routed_path()) Skipped 1280384 previous similar messages
Sep 06 11:05:49 csd3-mds2 kernel: LNetError: 26861:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 10.47.7.14@o2ib1 from <?>
Sep 06 11:05:49 csd3-mds2 kernel: LNetError: 26861:0:(lib-move.c:2005:lnet_handle_find_routed_path()) Skipped 1715628 previous similar messages
Sep 06 11:05:53 csd3-mds2 kernel: LustreError: 26811:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 10.47.7.14@o2ib1) failed to reply to blocking AST (req@ffff9f71b4591f80 x1709807736116224 status 0 rc -110), evict it ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f78d375a400/0x41c5233f08b74297 lrc: 4/0,0 mode: PR/PR res: [0x2000354a1:0x14a1:0x0].0x0 bits 0x1b/0x0 rrc: 7 type: IBT flags: 0x60200400000020 nid: 10.47.7.14@o2ib1 remote: 0x719799ab203f8d5b expref: 4882 pid: 26830 timeout: 417886 lvb_type: 0
Sep 06 11:05:53 csd3-mds2 kernel: LustreError: 138-a: rds-d5-MDT0000: A client on nid 10.47.7.14@o2ib1 was evicted due to a lock blocking callback time out: rc -110
Sep 06 11:05:53 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 101s: evicting client at 10.47.7.14@o2ib1  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f825b17e880/0x41c5233f08b72ea1 lrc: 3/0,0 mode: PR/PR res: [0x2000354a1:0x147d:0x0].0x0 bits 0x1b/0x0 rrc: 9 type: IBT flags: 0x60200400000020 nid: 10.47.7.14@o2ib1 remote: 0x719799ab203f88ed expref: 4883 pid: 26830 timeout: 0 lvb_type: 0
Sep 06 11:05:53 csd3-mds2 kernel: LustreError: 26745:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f6c8dc2de80 x1709807736117184/t0(0) o104->rds-d5-MDT0000@10.47.7.14@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630922760 ref 1 fl Rpc:EeX/2/ffffffff rc -5/-1
Sep 06 11:05:53 csd3-mds2 kernel: LustreError: 26811:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) Skipped 21 previous similar messages
Sep 06 11:05:57 csd3-mds2 kernel: LNetError: 26803:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 10.47.1.60@o2ib1 from <?>
Sep 06 11:05:57 csd3-mds2 kernel: LNetError: 26803:0:(lib-move.c:2005:lnet_handle_find_routed_path()) Skipped 2378563 previous similar messages
Sep 06 11:06:12 csd3-mds2 kernel: LustreError: 26803:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 10.47.1.60@o2ib1) failed to reply to blocking AST (req@ffff9f97287e5100 x1709807736142400 status 0 rc -110), evict it ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f9a2c445200/0x41c5233f08e6607e lrc: 4/0,0 mode: PR/PR res: [0x20002e2dc:0x4a4e:0x0].0x0 bits 0x13/0x0 rrc: 9 type: IBT flags: 0x60200400000020 nid: 10.47.1.60@o2ib1 remote: 0xe09ca042a23314f6 expref: 26 pid: 26544 timeout: 417905 lvb_type: 0
Sep 06 11:06:12 csd3-mds2 kernel: LustreError: 138-a: rds-d5-MDT0000: A client on nid 10.47.1.60@o2ib1 was evicted due to a lock blocking callback time out: rc -110
Sep 06 11:06:12 csd3-mds2 kernel: LustreError: Skipped 20 previous similar messages
Sep 06 11:06:12 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 101s: evicting client at 10.47.1.60@o2ib1  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f9a2c445200/0x41c5233f08e6607e lrc: 3/0,0 mode: PR/PR res: [0x20002e2dc:0x4a4e:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.47.1.60@o2ib1 remote: 0xe09ca042a23314f6 expref: 27 pid: 26544 timeout: 0 lvb_type: 0
Sep 06 11:06:56 csd3-mds2 kernel: LNet: Service thread pid 13795 was inactive for 350.37s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 11:06:56 csd3-mds2 kernel: Pid: 13795, comm: mdt00_083 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 11:06:56 csd3-mds2 kernel: Call Trace:
Sep 06 11:06:56 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 11:06:56 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630922816.13795
Sep 06 11:07:09 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 9853e7b6-0493-bd2a-72ce-1959f9568dfc (at 10.43.240.201@tcp2) in 201 seconds. I think it's dead, and I am evicting it. exp ffff9f8b247c4800, cur 1630922829 expire 1630922679 last 1630922628
Sep 06 11:07:19 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.12.16@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 11:07:19 csd3-mds2 kernel: LustreError: Skipped 52 previous similar messages
Sep 06 11:10:13 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 43cf8b1e-4514-7dbc-0c89-8a2e477d7c94 (at 10.47.1.214@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8b29fabc00, cur 1630923013 expire 1630922863 last 1630922786
Sep 06 11:10:13 csd3-mds2 kernel: Lustre: Skipped 15 previous similar messages
Sep 06 11:12:27 csd3-mds2 kernel: Lustre: 5453:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630923136/real 1630923136]  req@ffff9f6ac91c1f80 x1709807738068608/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630923147 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 11:12:27 csd3-mds2 kernel: Lustre: 5453:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 7959592 previous similar messages
Sep 06 11:12:48 csd3-mds2 kernel: Lustre: 5451:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-102), not sending early reply
                                    req@ffff9f71b4587980 x1709713003260416/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:108/0 lens 264/224 e 7 to 0 dl 1630923173 ref 2 fl Interpret:/0/0 rc 0/0
Sep 06 11:12:54 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 11:12:54 csd3-mds2 kernel: Lustre: Skipped 2408 previous similar messages
Sep 06 11:13:03 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Sep 06 11:13:03 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 9 previous similar messages
Sep 06 11:13:03 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 7, rc: 32
Sep 06 11:13:03 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 9 previous similar messages
Sep 06 11:13:29 csd3-mds2 kernel: LustreError: 26929:0:(ldlm_lib.c:3346:target_bulk_io()) @@@ timeout on bulk READ after 100+1630504966s  req@ffff9f6ae25cba80 x1708998457388928/t0(0) o37->14a54d7c-a58c-8550-3033-ec1e78bfa255@10.43.101.29@tcp2:181/0 lens 448/440 e 0 to 0 dl 1630923246 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 11:13:46 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) reconnecting
Sep 06 11:13:46 csd3-mds2 kernel: Lustre: Skipped 2368 previous similar messages
Sep 06 11:17:59 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.144.9.51@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 11:17:59 csd3-mds2 kernel: LustreError: Skipped 47 previous similar messages
Sep 06 11:20:12 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:1531:kiblnd_reconnect_peer()) Abort reconnection of 10.44.240.168@o2ib2: accepting
Sep 06 11:22:31 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:1531:kiblnd_reconnect_peer()) Abort reconnection of 10.44.240.168@o2ib2: accepting
Sep 06 11:22:32 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.162@o2ib2: -125
Sep 06 11:22:32 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Skipped 14471 previous similar messages
Sep 06 11:22:58 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 340d26ba-2ddc-41b4-94af-df15b8eaf89d (at 10.47.1.123@o2ib1)
Sep 06 11:22:58 csd3-mds2 kernel: Lustre: Skipped 281 previous similar messages
Sep 06 11:23:09 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 11:23:09 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 9 previous similar messages
Sep 06 11:23:09 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 3, rc: 32
Sep 06 11:23:09 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 9 previous similar messages
Sep 06 11:23:47 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 05447bfd-85b7-a014-f89d-19c8fc135558 (at 10.144.13.2@o2ib) reconnecting
Sep 06 11:23:47 csd3-mds2 kernel: Lustre: Skipped 279 previous similar messages
Sep 06 11:29:33 csd3-mds2 kernel: Lustre: 13794:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630924166/real 1630924166]  req@ffff9f71b4594c80 x1709807743468032/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630924173 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 11:29:33 csd3-mds2 kernel: Lustre: 13794:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Sep 06 11:29:49 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 11:29:49 csd3-mds2 kernel: LustreError: Skipped 26 previous similar messages
Sep 06 11:33:04 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 94115ccb-6000-bb63-0d3f-df654e6c85dd (at 10.47.20.76@o2ib1)
Sep 06 11:33:04 csd3-mds2 kernel: Lustre: Skipped 638 previous similar messages
Sep 06 11:33:52 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client ce1bd5dc-b881-5e85-b03b-668b00b04a58 (at 10.47.21.98@o2ib1) reconnecting
Sep 06 11:33:52 csd3-mds2 kernel: Lustre: Skipped 648 previous similar messages
Sep 06 11:33:53 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: active_txs, 10 seconds
Sep 06 11:33:53 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 11:33:53 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 27, oc: 8, rc: 31
Sep 06 11:33:53 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 11:34:32 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.162@o2ib2: -125
Sep 06 11:39:53 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.9.56@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 11:39:53 csd3-mds2 kernel: LustreError: Skipped 11 previous similar messages
Sep 06 11:43:06 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 11:43:06 csd3-mds2 kernel: Lustre: Skipped 425 previous similar messages
Sep 06 11:43:55 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 11:43:55 csd3-mds2 kernel: Lustre: Skipped 437 previous similar messages
Sep 06 11:45:40 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 11:45:40 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Sep 06 11:45:40 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 12, rc: 32
Sep 06 11:45:40 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 3 previous similar messages
Sep 06 11:48:56 csd3-mds2 kernel: Lustre: 13800:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630925329/real 1630925329]  req@ffff9f7163575a00 x1709807746860032/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630925336 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 11:48:56 csd3-mds2 kernel: Lustre: 13800:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 11:50:24 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 11:50:24 csd3-mds2 kernel: LustreError: Skipped 18 previous similar messages
Sep 06 11:53:07 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 955e4d48-ef78-67a9-bcfc-ce66b7e784aa (at 10.47.0.219@o2ib1)
Sep 06 11:53:07 csd3-mds2 kernel: Lustre: Skipped 403 previous similar messages
Sep 06 11:54:01 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 11:54:01 csd3-mds2 kernel: Lustre: Skipped 390 previous similar messages
Sep 06 11:59:33 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
Sep 06 11:59:33 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Sep 06 11:59:33 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 2, rc: 32
Sep 06 11:59:33 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 3 previous similar messages
Sep 06 12:00:31 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 12:00:31 csd3-mds2 kernel: LustreError: Skipped 11 previous similar messages
Sep 06 12:03:07 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2)
Sep 06 12:03:07 csd3-mds2 kernel: Lustre: Skipped 197 previous similar messages
Sep 06 12:03:41 csd3-mds2 kernel: Lustre: 26749:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630926214/real 1630926214]  req@ffff9f77d8bb0900 x1709807749639744/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630926221 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 12:03:41 csd3-mds2 kernel: Lustre: 26749:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Sep 06 12:04:02 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 12:04:02 csd3-mds2 kernel: Lustre: Skipped 190 previous similar messages
Sep 06 12:04:48 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.43.240.198@tcp2  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f88c2c67cc0/0x41c5233f0fcfd552 lrc: 3/0,0 mode: PR/PR res: [0x20000c47a:0x41:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.43.240.198@tcp2 remote: 0xcbc3807f65cfe79e expref: 1023 pid: 26812 timeout: 421322 lvb_type: 0
Sep 06 12:04:48 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 1 previous similar message
Sep 06 12:10:20 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f848a17a000, cur 1630926620 expire 1630926470 last 1630926393
Sep 06 12:10:20 csd3-mds2 kernel: Lustre: Skipped 1 previous similar message
Sep 06 12:10:51 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 12:10:51 csd3-mds2 kernel: LustreError: Skipped 28 previous similar messages
Sep 06 12:12:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 9 seconds
Sep 06 12:12:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 12:12:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 11, rc: 32
Sep 06 12:12:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 12:13:11 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 12:13:11 csd3-mds2 kernel: Lustre: Skipped 62 previous similar messages
Sep 06 12:14:08 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client e1d8ec65-067d-1a45-40ad-7113a7e5147e (at 10.43.9.55@tcp2) reconnecting
Sep 06 12:14:08 csd3-mds2 kernel: Lustre: Skipped 57 previous similar messages
Sep 06 12:18:40 csd3-mds2 kernel: Lustre: 26816:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 12:21:01 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.13@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 12:21:01 csd3-mds2 kernel: LustreError: Skipped 44 previous similar messages
Sep 06 12:21:25 csd3-mds2 kernel: Lustre: 26880:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630927278/real 1630927278]  req@ffff9fa2b9968000 x1709807753821312/t0(0) o104->rds-d5-MDT0000@10.47.7.15@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630927285 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 12:21:25 csd3-mds2 kernel: Lustre: 26880:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 12:22:01 csd3-mds2 kernel: LNet: Service thread pid 26816 was inactive for 200.41s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 12:22:01 csd3-mds2 kernel: Pid: 26816, comm: mdt00_047 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 12:22:01 csd3-mds2 kernel: Call Trace:
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffff9b597cb7>] call_rwsem_down_write_failed+0x17/0x30
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc0e9162f>] llog_cat_id2handle+0x7f/0x620 [obdclass]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc0e92718>] llog_cat_cancel_records+0x128/0x3d0 [obdclass]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc1701a14>] llog_changelog_cancel_cb+0x104/0x2a0 [mdd]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc0e92bf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc0e8f51e>] llog_cat_process_or_fork+0x17e/0x360 [obdclass]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc0e8f72e>] llog_cat_process+0x2e/0x30 [obdclass]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc1700a34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc1702e00>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc1705d63>] mdd_changelog_clear+0x653/0x7d0 [mdd]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc1708e43>] mdd_iocontrol+0x163/0x540 [mdd]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc158784c>] mdt_iocontrol+0x5ec/0xb00 [mdt]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc15881e4>] mdt_set_info+0x484/0x490 [mdt]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc11cb89a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc117073b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffc11740a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffff9b2c5da1>] kthread+0xd1/0xe0
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffff9b995df7>] ret_from_fork_nospec_end+0x0/0x39
Sep 06 12:22:01 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 12:22:01 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630927321.26816
Sep 06 12:22:54 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 12:22:54 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 12:22:54 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 5, rc: 32
Sep 06 12:22:54 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 12:23:26 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 12:23:26 csd3-mds2 kernel: Lustre: Skipped 238 previous similar messages
Sep 06 12:24:00 csd3-mds2 kernel: LustreError: 13795:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 12:24:00 csd3-mds2 kernel: LustreError: 13795:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8848 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 12:24:00 csd3-mds2 kernel: Lustre: 13795:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (3865:1109s); client may timeout.  req@ffff9f71b4587980 x1709713003260416/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:108/0 lens 264/192 e 7 to 0 dl 1630926331 ref 1 fl Complete:/0/0 rc -2/-2
Sep 06 12:24:00 csd3-mds2 kernel: LNet: Service thread pid 13795 completed after 4974.44s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 12:24:14 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) reconnecting
Sep 06 12:24:14 csd3-mds2 kernel: Lustre: Skipped 241 previous similar messages
Sep 06 12:28:35 csd3-mds2 kernel: Lustre: 26836:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
                                    req@ffff9f9700bdf980 x1709713034464960/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:125/0 lens 264/224 e 7 to 0 dl 1630927720 ref 2 fl Interpret:/0/0 rc 0/0
Sep 06 12:28:35 csd3-mds2 kernel: LustreError: 10887:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff9fa3402dc800
Sep 06 12:29:14 csd3-mds2 kernel: LustreError: 30550:0:(ldlm_lib.c:3346:target_bulk_io()) @@@ timeout on bulk READ after 100+1630504966s  req@ffff9f9724c6d580 x1708350429192768/t0(0) o37->cbab95b5-6566-9faa-b9e5-991369d3c537@10.47.7.12@o2ib1:165/0 lens 448/440 e 4 to 0 dl 1630927760 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 12:31:04 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.8@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 12:31:04 csd3-mds2 kernel: LustreError: Skipped 17 previous similar messages
Sep 06 12:32:18 csd3-mds2 kernel: LustreError: 27041:0:(ldlm_lib.c:3356:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff9f96fa039b00 x1709607512662848/t0(0) o37->d44a63dc-890f-e18f-c89a-f9806ec6d24a@10.47.20.202@o2ib1:356/0 lens 448/440 e 1 to 0 dl 1630927951 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 12:33:28 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2)
Sep 06 12:33:28 csd3-mds2 kernel: Lustre: Skipped 204 previous similar messages
Sep 06 12:34:16 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) reconnecting
Sep 06 12:34:16 csd3-mds2 kernel: Lustre: Skipped 216 previous similar messages
Sep 06 12:36:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: active_txs, 7 seconds
Sep 06 12:36:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Sep 06 12:36:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 28, oc: 4, rc: 31
Sep 06 12:36:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 3 previous similar messages
Sep 06 12:41:28 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 12:41:28 csd3-mds2 kernel: LustreError: Skipped 15 previous similar messages
Sep 06 12:43:29 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 12:43:29 csd3-mds2 kernel: Lustre: Skipped 657 previous similar messages
Sep 06 12:44:17 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) reconnecting
Sep 06 12:44:17 csd3-mds2 kernel: Lustre: Skipped 682 previous similar messages
Sep 06 12:50:14 csd3-mds2 kernel: Lustre: 13793:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630929007/real 1630929007]  req@ffff9f7181790000 x1709807766569280/t0(0) o104->rds-d5-MDT0000@10.47.1.188@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630929014 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 12:50:14 csd3-mds2 kernel: Lustre: 13793:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 12:51:36 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 12:51:36 csd3-mds2 kernel: LustreError: Skipped 50 previous similar messages
Sep 06 12:53:30 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 12:53:30 csd3-mds2 kernel: Lustre: Skipped 905 previous similar messages
Sep 06 12:53:39 csd3-mds2 kernel: LustreError: 21300:0:(ldlm_lib.c:3356:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff9f713c3c6780 x1708354065066560/t0(0) o37->9e70101c-907f-e846-35e9-b135a17994c6@10.47.7.10@o2ib1:143/0 lens 448/440 e 1 to 0 dl 1630929248 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 12:54:17 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) reconnecting
Sep 06 12:54:17 csd3-mds2 kernel: Lustre: Skipped 927 previous similar messages
Sep 06 12:57:57 csd3-mds2 kernel: Lustre: 26859:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630929470/real 1630929470]  req@ffff9f96fa108480 x1709807769285568/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630929477 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 13:00:09 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 5 seconds
Sep 06 13:00:09 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 4, rc: 32
Sep 06 13:00:09 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:1531:kiblnd_reconnect_peer()) Abort reconnection of 10.44.240.166@o2ib2: accepting
Sep 06 13:02:03 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 13:02:03 csd3-mds2 kernel: LustreError: Skipped 19 previous similar messages
Sep 06 13:03:30 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 13:03:30 csd3-mds2 kernel: Lustre: Skipped 368 previous similar messages
Sep 06 13:04:07 csd3-mds2 kernel: Lustre: 13793:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630929840/real 1630929840]  req@ffff9f77cee32400 x1709807771489856/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630929847 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 13:04:20 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 095f929a-6781-7d79-e2a0-8f721baaa6c8 (at 10.43.101.8@tcp2) reconnecting
Sep 06 13:04:20 csd3-mds2 kernel: Lustre: Skipped 334 previous similar messages
Sep 06 13:10:38 csd3-mds2 kernel: Lustre: 13801:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630930231/real 1630930231]  req@ffff9f7197032d00 x1709807773700928/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630930238 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 13:10:38 csd3-mds2 kernel: Lustre: 13801:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 13:12:37 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 13:12:37 csd3-mds2 kernel: LustreError: Skipped 19 previous similar messages
Sep 06 13:12:51 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.43.240.199@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f9c283c4900/0x41c5233f1a27022f lrc: 3/0,0 mode: PR/PR res: [0x20000f8a0:0x19:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.43.240.199@tcp2 remote: 0x2b7e3391c1f666a1 expref: 48755 pid: 26876 timeout: 425405 lvb_type: 0
Sep 06 13:13:32 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 13:13:32 csd3-mds2 kernel: Lustre: Skipped 258 previous similar messages
Sep 06 13:14:21 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) reconnecting
Sep 06 13:14:21 csd3-mds2 kernel: Lustre: Skipped 235 previous similar messages
Sep 06 13:18:53 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:3429:kiblnd_check_conns()) Timed out tx for 10.44.240.167@o2ib2: 1 seconds
Sep 06 13:18:53 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:3429:kiblnd_check_conns()) Skipped 5 previous similar messages
Sep 06 13:19:56 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 6 seconds
Sep 06 13:19:56 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 2, rc: 32
Sep 06 13:19:56 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:3429:kiblnd_check_conns()) Timed out tx for 10.44.240.167@o2ib2: 12 seconds
Sep 06 13:20:59 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:3429:kiblnd_check_conns()) Timed out tx for 10.44.240.167@o2ib2: 12 seconds
Sep 06 13:21:01 csd3-mds2 kernel: Lustre: 26869:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630930854/real 1630930854]  req@ffff9f713079e300 x1709807777216064/t0(0) o104->rds-d4-MDT0000@10.47.0.172@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630930861 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 13:21:01 csd3-mds2 kernel: Lustre: 26869:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Sep 06 13:21:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 4 seconds
Sep 06 13:21:23 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.161@o2ib2 (0): c: 0, oc: 0, rc: 32
Sep 06 13:21:24 csd3-mds2 kernel: LustreError: 10887:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffff9fa665a33c00
Sep 06 13:21:49 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
Sep 06 13:21:49 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 13:21:49 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (54): c: 0, oc: 10, rc: 32
Sep 06 13:21:49 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 13:22:02 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:3429:kiblnd_check_conns()) Timed out tx for 10.44.240.167@o2ib2: 12 seconds
Sep 06 13:22:07 csd3-mds2 kernel: LustreError: 28645:0:(ldlm_lib.c:3346:target_bulk_io()) @@@ timeout on bulk READ after 100+1630504966s  req@ffff9f96dffbf080 x1708354087878464/t0(0) o37->9e70101c-907f-e846-35e9-b135a17994c6@10.47.7.10@o2ib1:318/0 lens 448/440 e 4 to 0 dl 1630930933 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 13:22:48 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.101.8@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 13:22:48 csd3-mds2 kernel: LustreError: Skipped 126 previous similar messages
Sep 06 13:22:53 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.44.240.167@o2ib2: -125
Sep 06 13:23:17 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 2 seconds
Sep 06 13:23:17 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.161@o2ib2 (0): c: 0, oc: 12, rc: 32
Sep 06 13:23:17 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.168@o2ib2: -125
Sep 06 13:23:32 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 0b0f7438-946d-9611-88a4-5cfc06d0c71f (at 10.43.12.38@tcp2)
Sep 06 13:23:32 csd3-mds2 kernel: Lustre: Skipped 511 previous similar messages
Sep 06 13:23:56 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:3429:kiblnd_check_conns()) Timed out tx for 10.44.240.167@o2ib2: 8 seconds
Sep 06 13:23:56 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:3429:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 13:24:21 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 05447bfd-85b7-a014-f89d-19c8fc135558 (at 10.144.13.2@o2ib) reconnecting
Sep 06 13:24:21 csd3-mds2 kernel: Lustre: Skipped 600 previous similar messages
Sep 06 13:24:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
Sep 06 13:24:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 2 previous similar messages
Sep 06 13:24:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 12, rc: 32
Sep 06 13:24:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 2 previous similar messages
Sep 06 13:24:59 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:1531:kiblnd_reconnect_peer()) Abort reconnection of 10.44.240.168@o2ib2: accepting
Sep 06 13:25:11 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.164@o2ib2: -125
Sep 06 13:28:08 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 13:28:08 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 13:28:08 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.167@o2ib2 (0): c: 0, oc: 1, rc: 32
Sep 06 13:28:08 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 13:31:06 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.162@o2ib2: -125
Sep 06 13:32:42 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa340198800, cur 1630931562 expire 1630931412 last 1630931335
Sep 06 13:32:57 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 13:32:57 csd3-mds2 kernel: LustreError: Skipped 56 previous similar messages
Sep 06 13:33:34 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 43cf8b1e-4514-7dbc-0c89-8a2e477d7c94 (at 10.47.1.214@o2ib1)
Sep 06 13:33:34 csd3-mds2 kernel: Lustre: Skipped 743 previous similar messages
Sep 06 13:34:14 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 2 seconds
Sep 06 13:34:14 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 6 previous similar messages
Sep 06 13:34:14 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.165@o2ib2 (0): c: 0, oc: 15, rc: 32
Sep 06 13:34:14 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 6 previous similar messages
Sep 06 13:34:15 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.168@o2ib2: -125
Sep 06 13:34:21 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client b1b00937-88ea-259e-18fa-fd936f5d4b9f (at 10.47.1.62@o2ib1) reconnecting
Sep 06 13:34:21 csd3-mds2 kernel: Lustre: Skipped 690 previous similar messages
Sep 06 13:39:22 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f76c2aca400, cur 1630931962 expire 1630931812 last 1630931735
Sep 06 13:42:51 csd3-mds2 kernel: Lustre: 13779:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630932164/real 1630932164]  req@ffff9f96beebb600 x1709807782942336/t0(0) o104->rds-d5-MDT0000@10.47.7.14@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630932171 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 13:42:51 csd3-mds2 kernel: Lustre: 13779:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Sep 06 13:43:32 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.47.7.16@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 13:43:32 csd3-mds2 kernel: LustreError: Skipped 22 previous similar messages
Sep 06 13:43:34 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to  (at 10.47.4.182@o2ib1)
Sep 06 13:43:34 csd3-mds2 kernel: Lustre: Skipped 3923 previous similar messages
Sep 06 13:44:23 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 13:44:23 csd3-mds2 kernel: Lustre: Skipped 4926 previous similar messages
Sep 06 13:50:28 csd3-mds2 kernel: Lustre: 26890:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630932621/real 1630932621]  req@ffff9f96bcaa2880 x1709807783767808/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630932628 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 13:50:28 csd3-mds2 kernel: Lustre: 26890:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Sep 06 13:51:55 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 7 seconds
Sep 06 13:51:55 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 4 previous similar messages
Sep 06 13:51:55 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 5, rc: 32
Sep 06 13:51:55 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 4 previous similar messages
Sep 06 13:53:34 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 13:53:34 csd3-mds2 kernel: Lustre: Skipped 1325 previous similar messages
Sep 06 13:53:41 csd3-mds2 kernel: Lustre: 26822:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630932814/real 1630932814]  req@ffff9f969a5dc380 x1709807785359808/t0(0) o104->rds-d4-MDT0000@10.47.7.15@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630932821 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 13:53:44 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 13:53:44 csd3-mds2 kernel: LustreError: Skipped 33 previous similar messages
Sep 06 13:54:23 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 13:54:23 csd3-mds2 kernel: Lustre: Skipped 314 previous similar messages
Sep 06 13:55:06 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 7608d123-0847-6d66-d5b4-61972d56a45d (at 10.47.1.119@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa340162000, cur 1630932906 expire 1630932756 last 1630932679
Sep 06 14:01:20 csd3-mds2 kernel: LustreError: 28637:0:(ldlm_lib.c:3356:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff9f96c3388000 x1709733079758400/t0(0) o37->37f96c1e-b6c3-5759-ac7c-9836b7f67b6c@10.43.240.199@tcp2:483/0 lens 448/440 e 0 to 0 dl 1630933363 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 14:01:43 csd3-mds2 kernel: LustreError: 27018:0:(ldlm_lib.c:3356:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff9fa3539e0d80 x1709733079758400/t0(0) o37->37f96c1e-b6c3-5759-ac7c-9836b7f67b6c@10.43.240.199@tcp2:506/0 lens 448/440 e 0 to 0 dl 1630933386 ref 1 fl Interpret:/2/0 rc 0/0
Sep 06 14:02:53 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 87fde65e-9030-a11a-44b1-c34c78c04074 (at 10.47.21.115@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8b257a7800, cur 1630933373 expire 1630933223 last 1630933146
Sep 06 14:02:53 csd3-mds2 kernel: Lustre: Skipped 2 previous similar messages
Sep 06 14:03:36 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 14:03:36 csd3-mds2 kernel: Lustre: Skipped 350 previous similar messages
Sep 06 14:03:51 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 14:03:51 csd3-mds2 kernel: LustreError: Skipped 21 previous similar messages
Sep 06 14:04:23 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) reconnecting
Sep 06 14:04:23 csd3-mds2 kernel: Lustre: Skipped 326 previous similar messages
Sep 06 14:08:26 csd3-mds2 kernel: Lustre: 26877:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630933699/real 0]  req@ffff9f9701cc8000 x1709807789082304/t0(0) o104->rds-d5-MDT0000@10.44.74.7@o2ib2:15/16 lens 296/224 e 0 to 1 dl 1630933706 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 14:08:26 csd3-mds2 kernel: Lustre: 26877:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 14:10:49 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.44.74.7@o2ib2  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f86ace121c0/0x41c5233f29f2c89d lrc: 4/0,0 mode: PR/PR res: [0x200035d49:0x7a:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x60200400000020 nid: 10.44.74.7@o2ib2 remote: 0xa1f7419ae9f2da19 expref: 490 pid: 13783 timeout: 428883 lvb_type: 0
Sep 06 14:10:59 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client dde96a1f-a57c-8f25-b1aa-d399e76a0fd5 (at 10.44.74.7@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8b26b88c00, cur 1630933859 expire 1630933709 last 1630933632
Sep 06 14:10:59 csd3-mds2 kernel: Lustre: Skipped 1 previous similar message
Sep 06 14:11:42 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 9 seconds
Sep 06 14:11:42 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 14:11:42 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 7, rc: 32
Sep 06 14:11:42 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 14:13:51 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 7716a572-1017-63d0-e252-85d659ca6b11 (at 10.47.20.215@o2ib1)
Sep 06 14:13:51 csd3-mds2 kernel: Lustre: Skipped 302 previous similar messages
Sep 06 14:13:56 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 14:13:56 csd3-mds2 kernel: LustreError: Skipped 93 previous similar messages
Sep 06 14:14:30 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 79dc30e5-f2a5-48d8-4c37-052e8aa1237a (at 10.47.7.16@o2ib1) reconnecting
Sep 06 14:14:30 csd3-mds2 kernel: Lustre: Skipped 271 previous similar messages
Sep 06 14:21:51 csd3-mds2 kernel: Lustre: 26864:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630934504/real 1630934504]  req@ffff9f70fdfd8000 x1709807790810752/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630934511 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 14:21:51 csd3-mds2 kernel: Lustre: 26864:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 14:23:54 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 14:23:54 csd3-mds2 kernel: Lustre: Skipped 130 previous similar messages
Sep 06 14:24:28 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 14:24:28 csd3-mds2 kernel: LustreError: Skipped 45 previous similar messages
Sep 06 14:24:36 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) reconnecting
Sep 06 14:24:36 csd3-mds2 kernel: Lustre: Skipped 144 previous similar messages
Sep 06 14:27:19 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 9edce92a-f569-5092-28fe-4a4a42e8df28 (at 10.47.20.121@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa963140c00, cur 1630934839 expire 1630934689 last 1630934612
Sep 06 14:27:34 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client d3271d43-864d-6ac6-f291-743bdff4d50b (at 10.47.20.121@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8b28395000, cur 1630934854 expire 1630934704 last 1630934627
Sep 06 14:34:00 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 14:34:00 csd3-mds2 kernel: Lustre: Skipped 176 previous similar messages
Sep 06 14:34:41 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.11@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 14:34:41 csd3-mds2 kernel: LustreError: Skipped 28 previous similar messages
Sep 06 14:34:42 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 14:34:42 csd3-mds2 kernel: Lustre: Skipped 157 previous similar messages
Sep 06 14:39:03 csd3-mds2 kernel: Lustre: 26763:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630935536/real 1630935536]  req@ffff9f7184f34380 x1709807793876416/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630935543 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 14:39:03 csd3-mds2 kernel: Lustre: 26763:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 83 previous similar messages
Sep 06 14:40:44 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 9 seconds
Sep 06 14:40:44 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 14:40:44 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 0, rc: 32
Sep 06 14:40:44 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 14:42:33 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.162@o2ib2: -125
Sep 06 14:42:33 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Skipped 1 previous similar message
Sep 06 14:42:37 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 14:42:37 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 13, rc: 32
Sep 06 14:44:04 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to  (at 10.43.101.8@tcp2)
Sep 06 14:44:04 csd3-mds2 kernel: Lustre: Skipped 377 previous similar messages
Sep 06 14:44:45 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) reconnecting
Sep 06 14:44:45 csd3-mds2 kernel: Lustre: Skipped 389 previous similar messages
Sep 06 14:45:14 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.144.13.1@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 14:45:14 csd3-mds2 kernel: LustreError: Skipped 24 previous similar messages
Sep 06 14:51:06 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client f20e88c9-0659-d8d9-9248-8b829fc3efaa (at 10.47.20.103@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f88cb2f0000, cur 1630936266 expire 1630936116 last 1630936039
Sep 06 14:54:06 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 670f00bd-32a8-f847-98cc-0e42e1c73c8b (at 10.144.9.51@o2ib)
Sep 06 14:54:06 csd3-mds2 kernel: Lustre: Skipped 227 previous similar messages
Sep 06 14:54:45 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 5fdd1d4e-ac17-6f7e-1f4d-a3152cfd03cd (at 10.144.13.2@o2ib) reconnecting
Sep 06 14:54:45 csd3-mds2 kernel: Lustre: Skipped 230 previous similar messages
Sep 06 14:56:01 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 14:56:01 csd3-mds2 kernel: LustreError: Skipped 14 previous similar messages
Sep 06 15:03:20 csd3-mds2 kernel: Lustre: 26830:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630936993/real 1630936993]  req@ffff9f96b699cc80 x1709807796970624/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630937000 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 15:04:09 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 15:04:09 csd3-mds2 kernel: Lustre: Skipped 107 previous similar messages
Sep 06 15:04:49 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 1c93f741-7be7-e062-0d9e-8b4fc7ec1a51 (at 10.43.101.25@tcp2) reconnecting
Sep 06 15:04:49 csd3-mds2 kernel: Lustre: Skipped 85 previous similar messages
Sep 06 15:12:07 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 15:12:07 csd3-mds2 kernel: LustreError: Skipped 13 previous similar messages
Sep 06 15:13:48 csd3-mds2 kernel: LustreError: 27041:0:(ldlm_lib.c:3356:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff9f70a2543180 x1708586035602432/t0(0) o37->79dc30e5-f2a5-48d8-4c37-052e8aa1237a@10.47.7.16@o2ib1:242/0 lens 448/440 e 1 to 0 dl 1630937652 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 15:14:12 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 670f00bd-32a8-f847-98cc-0e42e1c73c8b (at 10.144.9.51@o2ib)
Sep 06 15:14:12 csd3-mds2 kernel: Lustre: Skipped 86 previous similar messages
Sep 06 15:19:51 csd3-mds2 kernel: Lustre: 26750:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630937984/real 1630937984]  req@ffff9f70a2f1e780 x1709807819222208/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630937991 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 15:19:51 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 9af4d05c-7cd2-c58b-c661-a38c25400279 (at 10.47.20.108@o2ib1) reconnecting
Sep 06 15:19:51 csd3-mds2 kernel: Lustre: Skipped 86 previous similar messages
Sep 06 15:20:01 csd3-mds2 kernel: Lustre: 13812:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630937994/real 1630937994]  req@ffff9f70a2c79680 x1709807819504192/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630938001 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 15:20:01 csd3-mds2 kernel: Lustre: 13812:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
Sep 06 15:20:48 csd3-mds2 kernel: Lustre: 26736:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630938041/real 1630938041]  req@ffff9f70a3017500 x1709807821253312/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630938048 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 15:20:48 csd3-mds2 kernel: Lustre: 26736:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 15:21:32 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 15:21:32 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.161@o2ib2 (0): c: 0, oc: 2, rc: 32
Sep 06 15:22:19 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 15:22:19 csd3-mds2 kernel: LustreError: Skipped 12 previous similar messages
Sep 06 15:23:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 5 seconds
Sep 06 15:23:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 15:23:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.163@o2ib2 (0): c: 0, oc: 3, rc: 32
Sep 06 15:23:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 15:23:39 csd3-mds2 kernel: LNet: 10887:0:(o2iblnd_cb.c:1531:kiblnd_reconnect_peer()) Abort reconnection of 10.44.240.163@o2ib2: accepting
Sep 06 15:23:40 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.161@o2ib2: -125
Sep 06 15:24:24 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 4fb3a128-58bc-bb93-bf99-e4ceb09609ac (at 10.47.1.197@o2ib1)
Sep 06 15:24:24 csd3-mds2 kernel: Lustre: Skipped 112 previous similar messages
Sep 06 15:24:55 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
Sep 06 15:24:55 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.163@o2ib2 (0): c: 0, oc: 8, rc: 32
Sep 06 15:25:14 csd3-mds2 kernel: Lustre: 26773:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630938307/real 1630938307]  req@ffff9f70a2f17500 x1709807822516928/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630938314 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 15:25:14 csd3-mds2 kernel: Lustre: 26773:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Sep 06 15:27:01 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 15:27:01 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Sep 06 15:27:01 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.161@o2ib2 (0): c: 0, oc: 4, rc: 32
Sep 06 15:27:01 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 3 previous similar messages
Sep 06 15:27:08 csd3-mds2 kernel: Lustre: 13808:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630938421/real 1630938421]  req@ffff9f9700fef500 x1709807822993280/t0(0) o104->rds-d5-MDT0000@10.47.7.12@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630938428 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 15:28:43 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.162@o2ib2: -125
Sep 06 15:29:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Sep 06 15:29:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 4 previous similar messages
Sep 06 15:29:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.161@o2ib2 (0): c: 0, oc: 4, rc: 32
Sep 06 15:29:45 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 4 previous similar messages
Sep 06 15:29:52 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client fbad91c6-70d6-cb8f-cb62-2201412d8c05 (at 10.47.7.13@o2ib1) reconnecting
Sep 06 15:29:52 csd3-mds2 kernel: Lustre: Skipped 352 previous similar messages
Sep 06 15:29:52 csd3-mds2 kernel: Lustre: 26867:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630938585/real 1630938585]  req@ffff9f70a7802400 x1709807823512640/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630938592 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 15:29:52 csd3-mds2 kernel: Lustre: 26867:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 15:30:49 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.161@o2ib2: -125
Sep 06 15:32:33 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.47.2.12@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 15:32:33 csd3-mds2 kernel: LustreError: Skipped 43 previous similar messages
Sep 06 15:34:25 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to c4745c03-3bc6-a19f-6660-d4a18cd366ff (at 10.43.12.41@tcp2)
Sep 06 15:34:25 csd3-mds2 kernel: Lustre: Skipped 662 previous similar messages
Sep 06 15:35:13 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Sep 06 15:35:13 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 9 previous similar messages
Sep 06 15:35:13 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.161@o2ib2 (0): c: 0, oc: 12, rc: 32
Sep 06 15:35:13 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 9 previous similar messages
Sep 06 15:35:14 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.44.240.161@o2ib2: -125
Sep 06 15:37:14 csd3-mds2 kernel: Lustre: 26752:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630939027/real 1630939027]  req@ffff9f70a2fcda00 x1709807824719616/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630939034 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 15:37:14 csd3-mds2 kernel: Lustre: 26752:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 15:38:10 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.164@o2ib2: -125
Sep 06 15:39:54 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client eeb2e887-99cb-8127-ef31-4e157a4c38c3 (at 10.47.20.203@o2ib1) reconnecting
Sep 06 15:39:54 csd3-mds2 kernel: Lustre: Skipped 713 previous similar messages
Sep 06 15:40:54 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.166@o2ib2: -125
Sep 06 15:42:35 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.47.2.0@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 15:42:35 csd3-mds2 kernel: LustreError: Skipped 69 previous similar messages
Sep 06 15:44:30 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 87875363-4047-a3f8-c5f1-97649c52ee60 (at 10.47.1.37@o2ib1)
Sep 06 15:44:30 csd3-mds2 kernel: Lustre: Skipped 6961 previous similar messages
Sep 06 15:47:34 csd3-mds2 kernel: Lustre: 26837:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630939647/real 1630939647]  req@ffff9f964697ad00 x1709807829021952/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630939654 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 15:47:34 csd3-mds2 kernel: Lustre: 26837:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep 06 15:47:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 2 seconds
Sep 06 15:47:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 10 previous similar messages
Sep 06 15:47:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 12, rc: 32
Sep 06 15:47:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 10 previous similar messages
Sep 06 15:47:39 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.168@o2ib2: -125
Sep 06 15:49:58 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 76e771d0-fee9-b941-8ec1-3c4f6301b7b7 (at 10.43.101.11@tcp2) reconnecting
Sep 06 15:49:58 csd3-mds2 kernel: Lustre: Skipped 7540 previous similar messages
Sep 06 15:53:01 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.144.9.51@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 15:53:01 csd3-mds2 kernel: LustreError: Skipped 25 previous similar messages
Sep 06 15:53:11 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f88e6694c00, cur 1630939991 expire 1630939841 last 1630939764
Sep 06 15:53:11 csd3-mds2 kernel: Lustre: Skipped 3 previous similar messages
Sep 06 15:54:23 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.164@o2ib2: -125
Sep 06 15:54:30 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to ece69002-3124-4164-23e6-512378580b54 (at 10.43.101.17@tcp2)
Sep 06 15:54:30 csd3-mds2 kernel: Lustre: Skipped 1010 previous similar messages
Sep 06 16:00:05 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2) reconnecting
Sep 06 16:00:05 csd3-mds2 kernel: Lustre: Skipped 244 previous similar messages
Sep 06 16:00:07 csd3-mds2 kernel: Lustre: 26786:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630940394/real 1630940394]  req@ffff9f96233ecc80 x1709807830499648/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630940407 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 16:00:07 csd3-mds2 kernel: Lustre: 26786:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 116 previous similar messages
Sep 06 16:01:40 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa9928a4800, cur 1630940500 expire 1630940350 last 1630940273
Sep 06 16:03:23 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 16:03:23 csd3-mds2 kernel: LustreError: Skipped 14 previous similar messages
Sep 06 16:04:34 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 16:04:34 csd3-mds2 kernel: Lustre: Skipped 226 previous similar messages
Sep 06 16:10:07 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 16:10:07 csd3-mds2 kernel: Lustre: Skipped 182 previous similar messages
Sep 06 16:10:14 csd3-mds2 kernel: Lustre: 26817:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630941007/real 1630941007]  req@ffff9f709637bf00 x1709807831975808/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630941014 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 16:10:14 csd3-mds2 kernel: Lustre: 26817:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 85 previous similar messages
Sep 06 16:13:27 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 16:13:27 csd3-mds2 kernel: LustreError: Skipped 38 previous similar messages
Sep 06 16:14:38 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 16:14:38 csd3-mds2 kernel: Lustre: Skipped 143 previous similar messages
Sep 06 16:16:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 2 seconds
Sep 06 16:16:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 16:16:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.161@o2ib2 (0): c: 0, oc: 9, rc: 32
Sep 06 16:16:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 16:16:17 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.166@o2ib2: -125
Sep 06 16:17:43 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds
Sep 06 16:17:43 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Sep 06 16:17:43 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.163@o2ib2 (0): c: 0, oc: 12, rc: 32
Sep 06 16:17:43 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 3 previous similar messages
Sep 06 16:17:54 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 134s: evicting client at 10.43.240.199@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9fa2b29e58c0/0x41c5233f4dc9c6fd lrc: 3/0,0 mode: PR/PR res: [0x200059fd0:0x6618:0x0].0x0 bits 0x1b/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.43.240.199@tcp2 remote: 0x2b7e3391c3236d3f expref: 27013 pid: 26886 timeout: 436508 lvb_type: 0
Sep 06 16:17:54 csd3-mds2 kernel: LustreError: 26868:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f96bae83f00 x1709807832852224/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 16:17:54 csd3-mds2 kernel: LustreError: 26868:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 37 previous similar messages
Sep 06 16:18:01 csd3-mds2 kernel: LustreError: 26780:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f96bddf2400 x1709807832863424/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 16:18:01 csd3-mds2 kernel: LustreError: 26780:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Sep 06 16:18:03 csd3-mds2 kernel: LustreError: 26725:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f70a292da00 x1709807832871616/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 16:18:47 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 121s: evicting client at 10.43.240.198@tcp2  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f71856c4b40/0x41c5233f578e236e lrc: 3/0,0 mode: PR/PR res: [0x2000299ac:0x2f:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.43.240.198@tcp2 remote: 0xcbc3807f66ca4072 expref: 1008 pid: 26873 timeout: 436561 lvb_type: 0
Sep 06 16:19:37 csd3-mds2 kernel: Lustre: 26865:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 16:20:12 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client b950907d-7004-57d6-8fc1-5951d55ecc49 (at 10.47.1.49@o2ib1) reconnecting
Sep 06 16:20:12 csd3-mds2 kernel: Lustre: Skipped 260 previous similar messages
Sep 06 16:20:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 12 seconds
Sep 06 16:20:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
Sep 06 16:20:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.161@o2ib2 (0): c: 0, oc: 5, rc: 32
Sep 06 16:20:15 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 3 previous similar messages
Sep 06 16:20:16 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.168@o2ib2: -125
Sep 06 16:22:21 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9fa665920c00, cur 1630941741 expire 1630941591 last 1630941514
Sep 06 16:22:54 csd3-mds2 kernel: Lustre: 26772:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630941767/real 1630941767]  req@ffff9f9637a2d100 x1709807833323328/t0(0) o104->rds-d5-MDT0000@10.47.0.177@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630941774 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 16:22:54 csd3-mds2 kernel: Lustre: 26772:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
Sep 06 16:23:38 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.47.7.16@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 16:23:38 csd3-mds2 kernel: LustreError: Skipped 155 previous similar messages
Sep 06 16:23:44 csd3-mds2 kernel: LustreError: 26865:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 16:23:44 csd3-mds2 kernel: LustreError: 26865:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8894 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 16:24:39 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 16:24:39 csd3-mds2 kernel: Lustre: Skipped 328 previous similar messages
Sep 06 16:24:39 csd3-mds2 kernel: Lustre: 26579:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 16:24:54 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.162@o2ib2: -125
Sep 06 16:26:07 csd3-mds2 kernel: LustreError: 26816:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 16:26:08 csd3-mds2 kernel: LustreError: 26816:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8894 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 16:26:08 csd3-mds2 kernel: LNet: Service thread pid 26816 completed after 14847.25s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 16:26:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Sep 06 16:26:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 10 previous similar messages
Sep 06 16:26:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.161@o2ib2 (0): c: 0, oc: 13, rc: 32
Sep 06 16:26:59 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 10 previous similar messages
Sep 06 16:27:55 csd3-mds2 kernel: LNetError: 21882:0:(o2iblnd_cb.c:2995:kiblnd_rejected()) 10.44.240.163@o2ib2 rejected: o2iblnd fatal error
Sep 06 16:27:58 csd3-mds2 kernel: LNetError: 21882:0:(o2iblnd_cb.c:2995:kiblnd_rejected()) 10.44.240.163@o2ib2 rejected: o2iblnd fatal error
Sep 06 16:27:59 csd3-mds2 kernel: LNetError: 21882:0:(o2iblnd_cb.c:2995:kiblnd_rejected()) 10.44.240.163@o2ib2 rejected: o2iblnd fatal error
Sep 06 16:27:59 csd3-mds2 kernel: LNet: Service thread pid 26579 was inactive for 200.36s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 16:27:59 csd3-mds2 kernel: Pid: 26579, comm: mdt00_003 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 16:27:59 csd3-mds2 kernel: Call Trace:
Sep 06 16:27:59 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 16:27:59 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630942079.26579
Sep 06 16:28:01 csd3-mds2 kernel: LNetError: 21882:0:(o2iblnd_cb.c:2995:kiblnd_rejected()) 10.44.240.163@o2ib2 rejected: o2iblnd fatal error
Sep 06 16:28:01 csd3-mds2 kernel: LNetError: 21882:0:(o2iblnd_cb.c:2995:kiblnd_rejected()) Skipped 1 previous similar message
Sep 06 16:28:23 csd3-mds2 kernel: LNetError: 12738:0:(o2iblnd_cb.c:2995:kiblnd_rejected()) 10.44.240.161@o2ib2 rejected: o2iblnd fatal error
Sep 06 16:29:06 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.47.0.226@o2ib1  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f8380490240/0x41c5233f59bdd55d lrc: 3/0,0 mode: PR/PR res: [0x20001d422:0x1:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.47.0.226@o2ib1 remote: 0x190a8328406e1f5c expref: 2494 pid: 13823 timeout: 437180 lvb_type: 0
Sep 06 16:29:56 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.47.20.112@o2ib1  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f711bf96d00/0x41c5233f59bda2d5 lrc: 3/0,0 mode: PR/PR res: [0x20001d422:0x1:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.47.20.112@o2ib1 remote: 0xa143ad318e640e0 expref: 98 pid: 13823 timeout: 437230 lvb_type: 0
Sep 06 16:29:56 csd3-mds2 kernel: LustreError: 26817:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) ### lock on destroyed export ffff9f8b7cff9400 ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f814dc83a80/0x41c5233f59cc28df lrc: 3/0,0 mode: PR/PR res: [0x20001d422:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x50200000000000 nid: 10.47.20.112@o2ib1 remote: 0xa143ad318e643b8 expref: 86 pid: 26817 timeout: 0 lvb_type: 0
Sep 06 16:29:56 csd3-mds2 kernel: LustreError: 26817:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) Skipped 1 previous similar message
Sep 06 16:30:04 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) in 209 seconds. I think it's dead, and I am evicting it. exp ffff9f8b413ca000, cur 1630942204 expire 1630942054 last 1630941995
Sep 06 16:30:13 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 16:30:13 csd3-mds2 kernel: Lustre: Skipped 347 previous similar messages
Sep 06 16:33:49 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 16:33:49 csd3-mds2 kernel: LustreError: Skipped 31 previous similar messages
Sep 06 16:34:39 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 16:34:39 csd3-mds2 kernel: Lustre: Skipped 2370 previous similar messages
Sep 06 16:36:12 csd3-mds2 kernel: Lustre: 26792:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-93), not sending early reply
                                    req@ffff9f70c0d2b600 x1709713587200960/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:637/0 lens 264/224 e 0 to 0 dl 1630942577 ref 2 fl Interpret:/0/0 rc 0/0
Sep 06 16:37:24 csd3-mds2 kernel: Lustre: 26853:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630942633/real 1630942633]  req@ffff9f964dd83a80 x1709807834734592/t0(0) o104->rds-d4-MDT0000@10.43.102.60@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630942644 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 16:37:24 csd3-mds2 kernel: Lustre: 26853:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
Sep 06 16:39:54 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 161s: evicting client at 10.43.102.60@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f9697cee400/0x41c5233f59f82ac5 lrc: 3/0,0 mode: PR/PR res: [0x20000ed71:0x14:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.43.102.60@tcp2 remote: 0xfebf0d60e3b48143 expref: 6 pid: 26891 timeout: 437828 lvb_type: 0
Sep 06 16:40:13 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 16:40:13 csd3-mds2 kernel: Lustre: Skipped 2292 previous similar messages
Sep 06 16:44:03 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.144.9.51@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 16:44:03 csd3-mds2 kernel: LustreError: Skipped 23 previous similar messages
Sep 06 16:44:39 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to af13c194-c17f-0051-c8e6-6e8f3abbd2a9 (at 10.144.13.4@o2ib)
Sep 06 16:44:39 csd3-mds2 kernel: Lustre: Skipped 246 previous similar messages
Sep 06 16:46:46 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
Sep 06 16:46:46 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 16:46:46 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 3, rc: 32
Sep 06 16:46:46 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 16:46:47 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.164@o2ib2: -125
Sep 06 16:50:13 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client d9f5d2f8-85b5-ba69-b1eb-d34c01481169 (at 10.43.101.13@tcp2) reconnecting
Sep 06 16:50:13 csd3-mds2 kernel: Lustre: Skipped 321 previous similar messages
Sep 06 16:54:09 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 16:54:09 csd3-mds2 kernel: LustreError: Skipped 39 previous similar messages
Sep 06 16:54:41 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 16:54:41 csd3-mds2 kernel: Lustre: Skipped 659 previous similar messages
Sep 06 17:00:11 csd3-mds2 kernel: LustreError: 26991:0:(ldlm_lib.c:3356:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff9f96b4abc800 x1709713751097600/t0(0) o37->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:583/0 lens 448/440 e 1 to 0 dl 1630944033 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 17:00:13 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client d9f5d2f8-85b5-ba69-b1eb-d34c01481169 (at 10.43.101.13@tcp2) reconnecting
Sep 06 17:00:13 csd3-mds2 kernel: Lustre: Skipped 971 previous similar messages
Sep 06 17:00:19 csd3-mds2 kernel: LustreError: 30550:0:(ldlm_lib.c:3356:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff9f96ba155580 x1709733632077568/t0(0) o37->37f96c1e-b6c3-5759-ac7c-9836b7f67b6c@10.43.240.199@tcp2:593/0 lens 448/440 e 1 to 0 dl 1630944043 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 17:02:14 csd3-mds2 kernel: Lustre: 5447:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630944127/real 1630944127]  req@ffff9f7066ee0480 x1709807837589696/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630944134 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 17:04:37 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 17:04:37 csd3-mds2 kernel: LustreError: Skipped 24 previous similar messages
Sep 06 17:04:42 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2)
Sep 06 17:04:42 csd3-mds2 kernel: Lustre: Skipped 692 previous similar messages
Sep 06 17:07:56 csd3-mds2 kernel: Lustre: 13816:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630944469/real 1630944469]  req@ffff9f70a2d1da00 x1709807838144768/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630944476 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 17:07:56 csd3-mds2 kernel: Lustre: 13816:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 17:09:12 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.43.240.198@tcp2  ns: mdt-rds-d5-MDT0000_UUID lock: ffff9f8b597f4fc0/0x41c5233f624630fe lrc: 3/0,0 mode: PR/PR res: [0x200017a37:0x3:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.43.240.198@tcp2 remote: 0xcbc3807f671d1c32 expref: 1015 pid: 13785 timeout: 439586 lvb_type: 0
Sep 06 17:10:27 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client d677b15c-e7cf-613d-b835-c7a809ad02bc (at 10.47.1.154@o2ib1) reconnecting
Sep 06 17:10:27 csd3-mds2 kernel: Lustre: Skipped 214 previous similar messages
Sep 06 17:10:58 csd3-mds2 kernel: Lustre: 13823:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630944651/real 1630944651]  req@ffff9f7069311680 x1709807838573824/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630944658 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 17:11:00 csd3-mds2 kernel: Lustre: 26868:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 17:11:14 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f6c37fbc400, cur 1630944674 expire 1630944524 last 1630944447
Sep 06 17:14:21 csd3-mds2 kernel: LNet: Service thread pid 26868 was inactive for 200.57s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 17:14:21 csd3-mds2 kernel: Pid: 26868, comm: mdt01_078 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 17:14:21 csd3-mds2 kernel: Call Trace:
Sep 06 17:14:21 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 17:14:21 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630944861.26868
Sep 06 17:14:43 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 17:14:43 csd3-mds2 kernel: Lustre: Skipped 103 previous similar messages
Sep 06 17:15:02 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 17:15:02 csd3-mds2 kernel: LustreError: Skipped 10 previous similar messages
Sep 06 17:15:05 csd3-mds2 kernel: LustreError: 26868:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 17:15:05 csd3-mds2 kernel: LustreError: 26868:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8960 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 17:15:05 csd3-mds2 kernel: LNet: Service thread pid 26868 completed after 244.39s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 17:18:23 csd3-mds2 kernel: Lustre: 26805:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630945096/real 1630945096]  req@ffff9f96b084ec00 x1709807839633728/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630945103 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 17:18:23 csd3-mds2 kernel: Lustre: 26805:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 06 17:20:06 csd3-mds2 kernel: Lustre: 26837:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 17:20:29 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client c60dbf50-8771-ed3d-f24e-08033bd1f204 (at 10.47.0.221@o2ib1) reconnecting
Sep 06 17:20:29 csd3-mds2 kernel: Lustre: Skipped 113 previous similar messages
Sep 06 17:20:50 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.43.240.199@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9fa9137c98c0/0x41c5233f67ede51f lrc: 3/0,0 mode: PR/PR res: [0x200059f77:0xddaa:0x0].0x0 bits 0x1b/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.43.240.199@tcp2 remote: 0x2b7e3391c4183cc4 expref: 47789 pid: 13789 timeout: 440284 lvb_type: 0
Sep 06 17:20:50 csd3-mds2 kernel: LustreError: 5453:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f706c23f980 x1709807839881920/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 17:20:51 csd3-mds2 kernel: LustreError: 13816:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f705f7dc380 x1709807839883392/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 17:20:51 csd3-mds2 kernel: LustreError: 13816:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 9 previous similar messages
Sep 06 17:20:52 csd3-mds2 kernel: LustreError: 26721:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f70726e7980 x1709807839884608/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 17:20:54 csd3-mds2 kernel: LustreError: 26804:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f7061b09b00 x1709807839887552/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 17:20:54 csd3-mds2 kernel: LustreError: 26804:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
Sep 06 17:22:06 csd3-mds2 kernel: LustreError: 26837:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 17:22:06 csd3-mds2 kernel: LustreError: 26837:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8973 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 17:22:06 csd3-mds2 kernel: Lustre: 26837:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 17:24:57 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client bdf71171-00c6-07bb-54f8-93c09931f553 (at 10.43.102.60@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f8b4e793400, cur 1630945497 expire 1630945347 last 1630945270
Sep 06 17:25:05 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 17:25:05 csd3-mds2 kernel: Lustre: Skipped 101 previous similar messages
Sep 06 17:25:33 csd3-mds2 kernel: LNet: Service thread pid 26837 was inactive for 206.56s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 17:25:33 csd3-mds2 kernel: Pid: 26837, comm: mdt01_058 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 17:25:33 csd3-mds2 kernel: Call Trace:
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffff9b597cb7>] call_rwsem_down_write_failed+0x17/0x30
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc0e9162f>] llog_cat_id2handle+0x7f/0x620 [obdclass]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc0e92718>] llog_cat_cancel_records+0x128/0x3d0 [obdclass]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc1701a14>] llog_changelog_cancel_cb+0x104/0x2a0 [mdd]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc0e92bf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc0e8f51e>] llog_cat_process_or_fork+0x17e/0x360 [obdclass]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc0e8f72e>] llog_cat_process+0x2e/0x30 [obdclass]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc1700a34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc1702e00>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc1705d63>] mdd_changelog_clear+0x653/0x7d0 [mdd]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc1708e43>] mdd_iocontrol+0x163/0x540 [mdd]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc158784c>] mdt_iocontrol+0x5ec/0xb00 [mdt]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc15881e4>] mdt_set_info+0x484/0x490 [mdt]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc11cb89a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc117073b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffc11740a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffff9b2c5da1>] kthread+0xd1/0xe0
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffff9b995df7>] ret_from_fork_nospec_end+0x0/0x39
Sep 06 17:25:33 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 17:25:33 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630945533.26837
Sep 06 17:25:39 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 17:25:39 csd3-mds2 kernel: LustreError: Skipped 30 previous similar messages
Sep 06 17:26:45 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.43.240.199@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f9700a03cc0/0x41c5233f690f1e87 lrc: 3/0,0 mode: PR/PR res: [0x200059fe9:0xa5e:0x0].0x0 bits 0x1b/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.43.240.199@tcp2 remote: 0x2b7e3391c4184713 expref: 24 pid: 26787 timeout: 440639 lvb_type: 0
Sep 06 17:26:45 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 1 previous similar message
Sep 06 17:27:24 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: haven't heard from client 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f6ae5e73400, cur 1630945644 expire 1630945494 last 1630945417
Sep 06 17:28:26 csd3-mds2 kernel: LustreError: 26837:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 17:28:26 csd3-mds2 kernel: LustreError: 26837:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8978 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 17:28:26 csd3-mds2 kernel: LNet: Service thread pid 26837 completed after 379.63s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 17:28:26 csd3-mds2 kernel: Lustre: 13781:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 17:28:26 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 4 seconds
Sep 06 17:28:26 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 17:28:26 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 14, rc: 32
Sep 06 17:28:26 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 17:28:27 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.164@o2ib2: -125
Sep 06 17:30:36 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 17:30:36 csd3-mds2 kernel: Lustre: Skipped 97 previous similar messages
Sep 06 17:35:07 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to c35cd430-b170-12bc-341f-0c48e0d92c00 (at 10.47.1.66@o2ib1)
Sep 06 17:35:07 csd3-mds2 kernel: Lustre: Skipped 97 previous similar messages
Sep 06 17:37:00 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 17:37:00 csd3-mds2 kernel: LustreError: Skipped 26 previous similar messages
Sep 06 17:38:19 csd3-mds2 kernel: LustreError: 13781:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 17:38:19 csd3-mds2 kernel: LustreError: 13781:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8985 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 17:38:36 csd3-mds2 kernel: Lustre: 26882:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 17:41:03 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 90b997bb-9bcd-d766-7ff1-37c1f740bb44 (at 10.47.0.226@o2ib1) reconnecting
Sep 06 17:41:03 csd3-mds2 kernel: Lustre: Skipped 147 previous similar messages
Sep 06 17:41:26 csd3-mds2 kernel: LustreError: 26882:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 17:41:26 csd3-mds2 kernel: LustreError: 26882:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8987 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 17:41:26 csd3-mds2 kernel: Lustre: 26843:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 17:42:19 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 3 seconds
Sep 06 17:42:19 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Sep 06 17:42:19 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 10, rc: 32
Sep 06 17:42:19 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 06 17:42:20 csd3-mds2 kernel: LNetError: 10874:0:(lib-move.c:2961:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.44.240.164@o2ib2: -125
Sep 06 17:45:13 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2)
Sep 06 17:45:13 csd3-mds2 kernel: Lustre: Skipped 141 previous similar messages
Sep 06 17:47:02 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.101.8@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 17:47:02 csd3-mds2 kernel: LustreError: Skipped 45 previous similar messages
Sep 06 17:47:40 csd3-mds2 kernel: LustreError: 26843:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 17:47:40 csd3-mds2 kernel: LustreError: 26843:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8989 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 17:47:40 csd3-mds2 kernel: Lustre: 26758:0:(llog_cat.c:899:llog_cat_process_or_fork()) rds-d5-MDD0000: catlog [0x7:0xa:0x0] crosses index zero
Sep 06 17:48:32 csd3-mds2 kernel: Lustre: 5451:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630946905/real 0]  req@ffff9f705fb3d100 x1709807842907456/t0(0) o104->rds-d4-MDT0000@10.47.2.89@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630946912 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 17:48:32 csd3-mds2 kernel: Lustre: 5451:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 67 previous similar messages
Sep 06 17:51:08 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client a2647704-cc85-a7e0-0bf2-95d98f0c7b96 (at 10.43.240.198@tcp2) reconnecting
Sep 06 17:51:08 csd3-mds2 kernel: Lustre: Skipped 58 previous similar messages
Sep 06 17:53:22 csd3-mds2 kernel: LNet: Service thread pid 26758 was inactive for 341.20s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Sep 06 17:53:22 csd3-mds2 kernel: Pid: 26758, comm: mdt00_020 3.10.0-1160.25.1.el7_lustre.x86_64 #1 SMP Wed Jul 7 09:59:46 UTC 2021
Sep 06 17:53:22 csd3-mds2 kernel: Call Trace:
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffff9b597cb7>] call_rwsem_down_write_failed+0x17/0x30
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc0e9162f>] llog_cat_id2handle+0x7f/0x620 [obdclass]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc0e92718>] llog_cat_cancel_records+0x128/0x3d0 [obdclass]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc1701a14>] llog_changelog_cancel_cb+0x104/0x2a0 [mdd]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc0e92bf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc0e8c5af>] llog_process_thread+0x85f/0x1a10 [obdclass]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc0e8d81c>] llog_process_or_fork+0xbc/0x450 [obdclass]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc0e8f51e>] llog_cat_process_or_fork+0x17e/0x360 [obdclass]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc0e8f72e>] llog_cat_process+0x2e/0x30 [obdclass]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc1700a34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc1702e00>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc1705d63>] mdd_changelog_clear+0x653/0x7d0 [mdd]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc1708e43>] mdd_iocontrol+0x163/0x540 [mdd]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc158784c>] mdt_iocontrol+0x5ec/0xb00 [mdt]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc15881e4>] mdt_set_info+0x484/0x490 [mdt]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc11cb89a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc117073b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffc11740a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffff9b2c5da1>] kthread+0xd1/0xe0
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffff9b995df7>] ret_from_fork_nospec_end+0x0/0x39
Sep 06 17:53:22 csd3-mds2 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Sep 06 17:53:22 csd3-mds2 kernel: LustreError: dumping log to /tmp/lustre-log.1630947202.26758
Sep 06 17:55:14 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 17:55:14 csd3-mds2 kernel: Lustre: Skipped 54 previous similar messages
Sep 06 17:57:28 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.47.4.194@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 17:57:28 csd3-mds2 kernel: LustreError: Skipped 54 previous similar messages
Sep 06 17:58:54 csd3-mds2 kernel: LustreError: 26579:0:(llog_cat.c:767:llog_cat_cancel_records()) rds-d5-MDD0000: fail to cancel 1 of 1 llog-records: rc = -2
Sep 06 17:58:54 csd3-mds2 kernel: LustreError: 26579:0:(mdd_device.c:374:llog_changelog_cancel()) rds-d5-MDD0000: cancel idx 8997 of catalog [0x7:0xa:0x0]: rc = -2
Sep 06 17:58:54 csd3-mds2 kernel: Lustre: 26579:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (2720:2935s); client may timeout.  req@ffff9f70c0d2b600 x1709713587200960/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:637/0 lens 264/192 e 0 to 0 dl 1630944599 ref 1 fl Complete:/0/0 rc -2/-2
Sep 06 17:58:54 csd3-mds2 kernel: LNet: Service thread pid 26579 completed after 5655.68s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 06 17:59:01 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client d9f5d2f8-85b5-ba69-b1eb-d34c01481169 (at 10.43.101.13@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f7ee9ef2800, cur 1630947541 expire 1630947391 last 1630947314
Sep 06 18:01:16 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2) reconnecting
Sep 06 18:01:16 csd3-mds2 kernel: Lustre: Skipped 176 previous similar messages
Sep 06 18:03:45 csd3-mds2 kernel: Lustre: 26765:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630947818/real 1630947818]  req@ffff9f96bb343a80 x1709807844461888/t0(0) o104->rds-d4-MDT0000@10.47.20.109@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630947825 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 18:04:13 csd3-mds2 kernel: Lustre: 5455:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630947846/real 1630947846]  req@ffff9f965cae4c80 x1709807844509952/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630947853 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 18:04:13 csd3-mds2 kernel: Lustre: 5455:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Sep 06 18:05:17 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to eb4a3da9-f737-13ee-67d9-f09705e128d7 (at 10.43.101.13@tcp2)
Sep 06 18:05:17 csd3-mds2 kernel: Lustre: Skipped 214 previous similar messages
Sep 06 18:06:48 csd3-mds2 kernel: Lustre: 26821:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630948001/real 1630948001]  req@ffff9f9690318d80 x1709807844811136/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630948008 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 18:08:52 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 18:08:52 csd3-mds2 kernel: LustreError: Skipped 30 previous similar messages
Sep 06 18:11:17 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2) reconnecting
Sep 06 18:11:17 csd3-mds2 kernel: Lustre: Skipped 146 previous similar messages
Sep 06 18:12:55 csd3-mds2 kernel: Lustre: 13798:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630948368/real 1630948368]  req@ffff9f96140f9680 x1709807845393024/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630948375 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 18:14:30 csd3-mds2 kernel: Lustre: 13793:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-1010), not sending early reply
                                    req@ffff9f705f14b180 x1709713789529728/t0(0) o46->a2647704-cc85-a7e0-0bf2-95d98f0c7b96@10.43.240.198@tcp2:495/0 lens 264/224 e 0 to 0 dl 1630948475 ref 2 fl Interpret:/0/0 rc 0/0
Sep 06 18:15:18 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 18:15:18 csd3-mds2 kernel: Lustre: Skipped 137 previous similar messages
Sep 06 18:16:25 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 4 seconds
Sep 06 18:16:25 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 6, rc: 32
Sep 06 18:17:27 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: active_txs, 10 seconds
Sep 06 18:17:27 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 20, oc: 10, rc: 31
Sep 06 18:18:53 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.226@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 18:18:53 csd3-mds2 kernel: LustreError: Skipped 11 previous similar messages
Sep 06 18:21:32 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2) reconnecting
Sep 06 18:21:32 csd3-mds2 kernel: Lustre: Skipped 135 previous similar messages
Sep 06 18:21:55 csd3-mds2 kernel: Lustre: 26547:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630948906/real 1630948906]  req@ffff9f96111ed100 x1709807858207616/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630948915 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 18:21:55 csd3-mds2 kernel: Lustre: 26547:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 06 18:22:18 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Sep 06 18:22:18 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.168@o2ib2 (0): c: 0, oc: 13, rc: 32
Sep 06 18:24:33 csd3-mds2 kernel: Lustre: 13828:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630949064/real 1630949064]  req@ffff9f704c2a7980 x1709807858557440/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630949073 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 18:24:33 csd3-mds2 kernel: Lustre: 26787:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630949064/real 1630949064]  req@ffff9f704c9d0900 x1709807858553664/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630949073 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 18:24:33 csd3-mds2 kernel: Lustre: 26753:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630949064/real 1630949064]  req@ffff9f704c2a7080 x1709807858568448/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630949073 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 18:24:33 csd3-mds2 kernel: Lustre: 26787:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 24 previous similar messages
Sep 06 18:24:33 csd3-mds2 kernel: Lustre: 26753:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 24 previous similar messages
Sep 06 18:25:34 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 18:25:34 csd3-mds2 kernel: Lustre: Skipped 158 previous similar messages
Sep 06 18:28:24 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 5 seconds
Sep 06 18:28:24 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 10, rc: 32
Sep 06 18:29:12 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 18:29:12 csd3-mds2 kernel: LustreError: Skipped 31 previous similar messages
Sep 06 18:31:46 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 095f929a-6781-7d79-e2a0-8f721baaa6c8 (at 10.43.101.8@tcp2) reconnecting
Sep 06 18:31:46 csd3-mds2 kernel: Lustre: Skipped 203 previous similar messages
Sep 06 18:34:46 csd3-mds2 kernel: Lustre: 26843:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630949671/real 1630949671]  req@ffff9f96a03e9b00 x1709807859589056/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630949686 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 18:34:46 csd3-mds2 kernel: Lustre: 26843:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 172 previous similar messages
Sep 06 18:35:43 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to 37f96c1e-b6c3-5759-ac7c-9836b7f67b6c (at 10.43.240.199@tcp2)
Sep 06 18:35:43 csd3-mds2 kernel: Lustre: Skipped 184 previous similar messages
Sep 06 18:39:05 csd3-mds2 kernel: LustreError: 26539:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.43.240.199@tcp2  ns: mdt-rds-d4-MDT0000_UUID lock: ffff9f969b265680/0x41c5233f868c0ba3 lrc: 3/0,0 mode: PR/PR res: [0x20005803e:0x12dfb:0x0].0x0 bits 0x1b/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.43.240.199@tcp2 remote: 0x2b7e3391c5b06d65 expref: 41862 pid: 26877 timeout: 444979 lvb_type: 0
Sep 06 18:39:05 csd3-mds2 kernel: LustreError: 26548:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f704db80000 x1709807860053632/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 18:39:05 csd3-mds2 kernel: LustreError: 26548:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
Sep 06 18:39:07 csd3-mds2 kernel: LustreError: 26809:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f96266d9680 x1709807860057216/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 18:39:07 csd3-mds2 kernel: LustreError: 26809:0:(client.c:1210:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Sep 06 18:39:11 csd3-mds2 kernel: LustreError: 26872:0:(client.c:1210:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9f70679b9f80 x1709807860062784/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Sep 06 18:39:14 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 18:39:14 csd3-mds2 kernel: LustreError: Skipped 27 previous similar messages
Sep 06 18:41:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3383:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
Sep 06 18:41:39 csd3-mds2 kernel: LNetError: 10887:0:(o2iblnd_cb.c:3458:kiblnd_check_conns()) Timed out RDMA with 10.44.240.166@o2ib2 (0): c: 0, oc: 11, rc: 32
Sep 06 18:41:55 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 3a8863ee-ca52-d4f6-9cc2-92c2940094a5 (at 10.43.101.8@tcp2) reconnecting
Sep 06 18:41:55 csd3-mds2 kernel: Lustre: Skipped 131 previous similar messages
Sep 06 18:45:50 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 18:45:50 csd3-mds2 kernel: Lustre: Skipped 112 previous similar messages
Sep 06 18:48:08 csd3-mds2 kernel: Lustre: 26768:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630950481/real 1630950481]  req@ffff9f96c0e4a880 x1709807862459136/t0(0) o104->rds-d4-MDT0000@10.47.7.9@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1630950488 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 18:48:08 csd3-mds2 kernel: Lustre: 26768:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 25 previous similar messages
Sep 06 18:49:19 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.102.60@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 18:49:19 csd3-mds2 kernel: LustreError: Skipped 32 previous similar messages
Sep 06 18:52:08 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client fe5b1992-e80a-f120-3ac4-ff8c9d1c852c (at 10.47.1.122@o2ib1) reconnecting
Sep 06 18:52:08 csd3-mds2 kernel: Lustre: Skipped 162 previous similar messages
Sep 06 18:56:03 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to 0006b28d-ea4f-596b-ee3d-9aa6083e4de0 (at 10.43.101.11@tcp2)
Sep 06 18:56:03 csd3-mds2 kernel: Lustre: Skipped 184 previous similar messages
Sep 06 18:59:24 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.240.199@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 18:59:24 csd3-mds2 kernel: LustreError: Skipped 27 previous similar messages
Sep 06 19:01:38 csd3-mds2 kernel: LustreError: 27049:0:(ldlm_lib.c:3356:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff9f7a573ab850 x1709733981200768/t0(0) o37->37f96c1e-b6c3-5759-ac7c-9836b7f67b6c@10.43.240.199@tcp2:326/0 lens 448/440 e 0 to 0 dl 1630951326 ref 1 fl Interpret:/0/0 rc 0/0
Sep 06 19:02:08 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client 757e46ad-31d8-d19c-62a1-cdf2d42a3e85 (at 10.43.102.60@tcp2) reconnecting
Sep 06 19:02:08 csd3-mds2 kernel: Lustre: Skipped 170 previous similar messages
Sep 06 19:03:11 csd3-mds2 kernel: Lustre: 13785:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630951384/real 1630951384]  req@ffff9f964428f080 x1709807863949696/t0(0) o104->rds-d5-MDT0000@10.43.240.198@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630951391 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 19:03:18 csd3-mds2 kernel: LustreError: 27011:0:(ldlm_lib.c:3346:target_bulk_io()) @@@ timeout on bulk READ after 100+1630504967s  req@ffff9f9680c5a880 x1709733981200768/t0(0) o37->37f96c1e-b6c3-5759-ac7c-9836b7f67b6c@10.43.240.199@tcp2:423/0 lens 448/440 e 2 to 0 dl 1630951423 ref 1 fl Interpret:/2/0 rc 0/0
Sep 06 19:06:05 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 19:06:05 csd3-mds2 kernel: Lustre: Skipped 259 previous similar messages
Sep 06 19:09:34 csd3-mds2 kernel: LustreError: 137-5: rds-d2-MDT0000_UUID: not available for connect from 10.43.22.3@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 19:09:34 csd3-mds2 kernel: LustreError: Skipped 44 previous similar messages
Sep 06 19:12:09 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client e0943ae8-1154-10f8-2692-a56587d60b7b (at 10.43.240.198@tcp2) reconnecting
Sep 06 19:12:09 csd3-mds2 kernel: Lustre: Skipped 226 previous similar messages
Sep 06 19:16:17 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Connection restored to cb798898-4507-0606-6393-d61ebc2d4576 (at 10.43.102.226@tcp2)
Sep 06 19:16:17 csd3-mds2 kernel: Lustre: Skipped 154 previous similar messages
Sep 06 19:20:02 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.43.240.198@tcp2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 19:20:02 csd3-mds2 kernel: LustreError: Skipped 33 previous similar messages
Sep 06 19:20:56 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: haven't heard from client 567cbfa3-3d7e-2d56-7f47-7fb5b70c53de (at 10.43.102.226@tcp2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9f96ad0aec00, cur 1630952456 expire 1630952306 last 1630952229
Sep 06 19:22:09 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Client e18a5844-89f5-8801-7b4c-48a6c526a063 (at 10.47.20.109@o2ib1) reconnecting
Sep 06 19:22:09 csd3-mds2 kernel: Lustre: Skipped 196 previous similar messages
Sep 06 19:26:22 csd3-mds2 kernel: Lustre: rds-d4-MDT0000: Connection restored to e18a5844-89f5-8801-7b4c-48a6c526a063 (at 10.47.20.109@o2ib1)
Sep 06 19:26:22 csd3-mds2 kernel: Lustre: Skipped 243 previous similar messages
Sep 06 19:26:22 csd3-mds2 kernel: Lustre: 26851:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630952773/real 1630952773]  req@ffff9f7021d33600 x1709807867584256/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630952782 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
Sep 06 19:26:22 csd3-mds2 kernel: Lustre: 26851:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Sep 06 19:27:37 csd3-mds2 kernel: Lustre: 26796:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630952848/real 1630952848]  req@ffff9f6ffc450480 x1709807867710784/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630952857 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Sep 06 19:27:37 csd3-mds2 kernel: Lustre: 26796:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 116 previous similar messages
Sep 06 19:30:18 csd3-mds2 kernel: LustreError: 137-5: rds-d3-MDT0000_UUID: not available for connect from 10.47.7.16@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 06 19:30:18 csd3-mds2 kernel: LustreError: Skipped 29 previous similar messages
Sep 06 19:30:30 csd3-mds2 kernel: Lustre: 26544:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630953021/real 1630953021]  req@ffff9f7028f5f980 x1709807868032896/t0(0) o104->rds-d4-MDT0000@10.43.240.199@tcp2:15/16 lens 296/224 e 0 to 1 dl 1630953030 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 06 19:30:30 csd3-mds2 kernel: Lustre: 26544:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
Sep 06 19:32:12 csd3-mds2 kernel: Lustre: rds-d5-MDT0000: Client 80560c5c-3c7e-04a3-df3d-a9dcfe515124 (at 10.43.240.199@tcp2) reconnecting
Sep 06 19:32:12 csd3-mds2 kernel: Lustre: Skipped 202 previous similar messages
