[root@r3lead ~]# grep r3i0n11 /var/log/messages
Apr 22 04:50:17 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1398900886109999 sent from scratch1-OST0088-osc-ffff880330775400 to NID 10.174.31.213@o2ib 827s ago has timed out (827s prior to deadline).
Apr 22 04:50:17 r3i0n11 kernel: req@ffff88061e2dc000 x1398900886109999/t0 o3->scratch1-OST0088_UUID@10.174.31.213@o2ib:6/4 lens 448/592 e 1 to 1 dl 1335070217 ref 2 fl Rpc:/2/0 rc -11/0
Apr 22 04:50:17 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 35 previous similar messages
Apr 22 04:50:17 r3i0n11 kernel: Lustre: scratch1-OST0088-osc-ffff880330775400: Connection to service scratch1-OST0088 via nid 10.174.31.213@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Apr 22 04:50:17 r3i0n11 kernel: Lustre: Skipped 17 previous similar messages
Apr 22 04:50:17 r3i0n11 kernel: LustreError: 11-0: an error occurred while communicating with 10.174.31.213@o2ib. The ost_connect operation failed with -16
Apr 22 04:50:17 r3i0n11 kernel: LustreError: Skipped 1 previous similar message
Apr 22 04:50:24 r3i0n11 kernel: Lustre: scratch1-OST0088-osc-ffff880330775400: Connection restored to service scratch1-OST0088 using nid 10.174.31.213@o2ib.
Apr 22 04:50:24 r3i0n11 kernel: Lustre: Skipped 18 previous similar messages
Apr 22 04:50:59 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) scratch1-OST0084-osc-ffff880330775400: tried all connections, increasing latency to 2s
Apr 22 04:50:59 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) Skipped 1 previous similar message
Apr 22 04:51:15 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) scratch1-OST0084-osc-ffff880330775400: tried all connections, increasing latency to 3s
Apr 22 04:51:15 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) Skipped 5 previous similar messages
Apr 22 04:51:29 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2912:kiblnd_check_txs()) Timed out tx: active_txs, 6 seconds
Apr 22 04:51:29 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2975:kiblnd_check_conns()) Timed out RDMA with 10.174.31.213@o2ib (5)
Apr 22 04:51:29 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff880583bf8000
Apr 22 04:51:33 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) scratch1-OST0084-osc-ffff880330775400: tried all connections, increasing latency to 4s
Apr 22 04:51:33 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) Skipped 5 previous similar messages
Apr 22 04:52:22 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) scratch1-OST0085-osc-ffff880330775400: tried all connections, increasing latency to 7s
Apr 22 04:52:22 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) Skipped 15 previous similar messages
Apr 22 04:54:38 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2912:kiblnd_check_txs()) Timed out tx: active_txs, 3 seconds
Apr 22 04:54:38 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2975:kiblnd_check_conns()) Timed out RDMA with 10.174.31.213@o2ib (11)
Apr 22 04:54:38 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8804ec692000
Apr 22 04:54:38 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8805e9888000
Apr 22 04:54:38 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8803a8c52000
Apr 22 04:54:38 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8805319a8000
Apr 22 04:54:38 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff880583bf8000
Apr 22 04:55:26 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) scratch1-OST0085-osc-ffff880330775400: tried all connections, increasing latency to 25s
Apr 22 04:55:26 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) Skipped 10 previous similar messages
Apr 22 04:59:41 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2912:kiblnd_check_txs()) Timed out tx: tx_queue, 2 seconds
Apr 22 04:59:41 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2975:kiblnd_check_conns()) Timed out RDMA with 10.174.31.213@o2ib (12)
Apr 22 04:59:41 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff880583bf8000
Apr 22 04:59:41 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8805319a8000
Apr 22 04:59:41 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88060a9de000
Apr 22 05:00:18 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1398900886131852 sent from scratch1-OST0084-osc-ffff880330775400 to NID 10.174.31.214@o2ib 65s ago has timed out (65s prior to deadline).
Apr 22 05:00:18 r3i0n11 kernel: req@ffff88061e795000 x1398900886131852/t0 o400->scratch1-OST0084_UUID@10.174.31.214@o2ib:28/4 lens 192/384 e 0 to 1 dl 1335070818 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 22 05:00:18 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 135 previous similar messages
Apr 22 05:00:30 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) scratch1-OST0084-osc-ffff880330775400: tried all connections, increasing latency to 25s
Apr 22 05:00:30 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) Skipped 10 previous similar messages
Apr 22 05:00:30 r3i0n11 kernel: Lustre: scratch1-OST008d-osc-ffff880330775400: Connection restored to service scratch1-OST008d using nid 10.174.31.213@o2ib.
Apr 22 05:00:30 r3i0n11 kernel: Lustre: Skipped 28 previous similar messages
Apr 22 05:02:23 r3i0n11 kernel: LustreError: 167-0: This client was evicted by scratch1-OST0088; in progress operations using this service will fail.
Apr 22 05:02:23 r3i0n11 kernel: LustreError: 29026:0:(rw.c:1341:ll_issue_page_read()) page ffffea0012dba478 map ffff8805e997a6f0 index 18432 flags c0000000000821 count 5 priv ffff8803fd898990: read queue failed: rc -5
Apr 22 05:02:23 r3i0n11 kernel: LustreError: 29222:0:(ldlm_resource.c:519:ldlm_namespace_cleanup()) Namespace scratch1-OST0088-osc-ffff880330775400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Apr 22 05:02:23 r3i0n11 kernel: LustreError: 29222:0:(ldlm_resource.c:524:ldlm_namespace_cleanup()) Resource: ffff8805681f3c80 (32701761/0/0/0) (rc: 1)
Apr 22 05:03:56 r3i0n11 kernel: Lustre: scratch1-OST008c-osc-ffff880330775400: Connection to service scratch1-OST008c via nid 10.174.31.213@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Apr 22 05:03:56 r3i0n11 kernel: Lustre: Skipped 39 previous similar messages
Apr 22 05:03:58 r3i0n11 kernel: LustreError: 11-0: an error occurred while communicating with 10.174.31.213@o2ib. The ost_connect operation failed with -16
Apr 22 05:03:58 r3i0n11 kernel: LustreError: Skipped 2 previous similar messages
Apr 22 05:08:12 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8805ac2c6000
Apr 22 05:08:12 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff880584a1a000
Apr 22 05:08:12 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8805d2164000
Apr 22 05:08:12 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88039f532000
Apr 22 05:08:12 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8804f259e000
Apr 22 05:08:12 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8803cbc46000
Apr 22 05:08:12 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff880583f72000
Apr 22 05:09:14 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) scratch1-OST008a-osc-ffff880330775400: tried all connections, increasing latency to 25s
Apr 22 05:09:14 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) Skipped 13 previous similar messages
Apr 22 05:10:13 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2912:kiblnd_check_txs()) Timed out tx: active_txs, 4 seconds
Apr 22 05:10:13 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2975:kiblnd_check_conns()) Timed out RDMA with 10.174.31.213@o2ib (24)
Apr 22 05:10:13 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8805cc55a000
Apr 22 05:10:13 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8805d2164000
Apr 22 05:10:32 r3i0n11 kernel: Lustre: scratch1-OST008b-osc-ffff880330775400: Connection restored to service scratch1-OST008b using nid 10.174.31.213@o2ib.
Apr 22 05:10:32 r3i0n11 kernel: Lustre: Skipped 22 previous similar messages
Apr 22 05:12:13 r3i0n11 kernel: LustreError: 167-0: This client was evicted by scratch1-OST0088; in progress operations using this service will fail.
Apr 22 05:22:20 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1398900886160020 sent from scratch1-OST008e-osc-ffff880330775400 to NID 10.174.31.213@o2ib 782s ago has timed out (782s prior to deadline).
Apr 22 05:22:20 r3i0n11 kernel: req@ffff880630922000 x1398900886160020/t0 o3->scratch1-OST008e_UUID@10.174.31.213@o2ib:6/4 lens 448/592 e 0 to 1 dl 1335072140 ref 2 fl Rpc:/2/0 rc -11/0
Apr 22 05:22:20 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 56 previous similar messages
Apr 22 05:22:20 r3i0n11 kernel: Lustre: scratch1-OST008e-osc-ffff880330775400: Connection to service scratch1-OST008e via nid 10.174.31.213@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Apr 22 05:22:20 r3i0n11 kernel: Lustre: Skipped 18 previous similar messages
Apr 22 05:22:20 r3i0n11 kernel: Lustre: scratch1-OST008e-osc-ffff880330775400: Connection restored to service scratch1-OST008e using nid 10.174.31.213@o2ib.
Apr 22 05:22:20 r3i0n11 kernel: Lustre: Skipped 6 previous similar messages
Apr 22 05:23:04 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) scratch1-OST0084-osc-ffff880330775400: tried all connections, increasing latency to 2s
Apr 22 05:23:04 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) Skipped 8 previous similar messages
Apr 22 05:23:34 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff880583f72000
Apr 22 05:23:34 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8803cbc46000
Apr 22 05:23:34 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff880584a1a000
Apr 22 05:23:34 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8804f268c000
Apr 22 05:23:34 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8805d2164000
Apr 22 05:23:34 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88039f532000
Apr 22 05:24:11 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1398900886224580 sent from scratch1-OST0084-osc-ffff880330775400 to NID 10.174.31.213@o2ib 23s ago has timed out (23s prior to deadline).
Apr 22 05:24:11 r3i0n11 kernel: req@ffff88062c7db400 x1398900886224580/t0 o400->scratch1-OST0084_UUID@10.174.31.213@o2ib:28/4 lens 192/384 e 0 to 1 dl 1335072251 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 22 05:24:11 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 48 previous similar messages
Apr 22 05:24:31 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2912:kiblnd_check_txs()) Timed out tx: active_txs, 2 seconds
Apr 22 05:24:31 r3i0n11 kernel: LustreError: 4557:0:(o2iblnd_cb.c:2975:kiblnd_check_conns()) Timed out RDMA with 10.174.31.213@o2ib (15)
Apr 22 05:24:31 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8804f268c000
Apr 22 05:24:31 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8805d2164000
Apr 22 05:24:31 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88039f532000
Apr 22 05:26:58 r3i0n11 MPI[29260]: mpirun interrupted - MPI daemon terminating job#012
Apr 22 05:27:09 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1398900886227412 sent from scratch1-OST0086-osc-ffff880330775400 to NID 10.174.31.213@o2ib 26s ago has timed out (25s prior to deadline).
Apr 22 05:27:09 r3i0n11 kernel: req@ffff8804522acc00 x1398900886227412/t0 o400->scratch1-OST0086_UUID@10.174.31.213@o2ib:28/4 lens 192/384 e 0 to 1 dl 1335072428 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 22 05:27:09 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 24 previous similar messages
Apr 22 05:27:59 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88039f532000
Apr 22 05:27:59 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff880584a1a000
Apr 22 05:27:59 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8804f268c000
Apr 22 05:27:59 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff880583f72000
Apr 22 05:30:56 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88039f532000
Apr 22 05:30:56 r3i0n11 kernel: LustreError: 11-0: an error occurred while communicating with 10.174.31.213@o2ib. The ost_connect operation failed with -16
Apr 22 05:30:56 r3i0n11 kernel: LustreError: Skipped 2 previous similar messages
Apr 22 05:32:41 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1398900886234457 sent from scratch1-OST0086-osc-ffff880330775400 to NID 10.174.31.213@o2ib 33s ago has timed out (33s prior to deadline).
Apr 22 05:32:41 r3i0n11 kernel: req@ffff8805e9858800 x1398900886234457/t0 o400->scratch1-OST0086_UUID@10.174.31.213@o2ib:28/4 lens 192/384 e 0 to 1 dl 1335072761 ref 1 fl Rpc:N/0/0 rc 0/0
Apr 22 05:32:41 r3i0n11 kernel: Lustre: 4558:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 71 previous similar messages
Apr 22 05:32:41 r3i0n11 kernel: Lustre: scratch1-OST0086-osc-ffff880330775400: Connection to service scratch1-OST0086 via nid 10.174.31.213@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Apr 22 05:32:41 r3i0n11 kernel: Lustre: Skipped 44 previous similar messages
Apr 22 05:32:49 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88039f532000
Apr 22 05:33:07 r3i0n11 kernel: Lustre: scratch1-OST0084-osc-ffff880330775400: Connection restored to service scratch1-OST0084 using nid 10.174.31.213@o2ib.
Apr 22 05:33:07 r3i0n11 kernel: Lustre: Skipped 44 previous similar messages
Apr 22 05:33:11 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) scratch1-OST0086-osc-ffff880330775400: tried all connections, increasing latency to 10s
Apr 22 05:33:11 r3i0n11 kernel: Lustre: 4560:0:(import.c:517:import_select_connection()) Skipped 70 previous similar messages
Apr 22 05:35:46 r3i0n11 kernel: LustreError: 4557:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8803cbc46000
Apr 22 05:35:46 r3i0n11 kernel: LustreError: 11-0: an error occurred while communicating with 10.174.31.213@o2ib. The ost_connect operation failed with -16
Apr 22 05:35:46 r3i0n11 kernel: LustreError: Skipped 1 previous similar message
Apr 22 06:15:25 r3i0n11 PROLOGUE: Starting prologue for job 972394.bqs1.zeus.fairmont.rdhpcs.noaa.gov
Apr 22 06:42:45 r3i0n11 EPILOGUE: Command Line: 972394.bqs1.zeus.fairmont.rdhpcs.noaa.gov Xiaqiong.Zhou ensemble uy2011092100178 29972 neednodes=1,nodes=1,procs=8,walltime=01:30:00 cput=00:13:44,mem=4516368kb,vmem=8435920kb,walltime=00:27:20 batch gefs 0
Apr 22 06:42:45 r3i0n11 EPILOGUE: Job 972394.bqs1.zeus.fairmont.rdhpcs.noaa.gov finished for user Xiaqiong.Zhou queue batch with exit code 0
Apr 22 06:42:45 r3i0n11 EPILOGUE: Checking OOM ....
Apr 22 06:42:46 r3i0n11 EPILOGUE: No OOM errors or BUG found. Executing normal cleanup.
Apr 22 06:42:46 r3i0n11 EPILOGUE: spawn(): launching: ssh -o StrictHostKeyChecking=no -o ConnectTimeout=6 r3i0n11 /var/spool/torque/mom_priv/chk_node.pl -s Elog
Apr 22 06:42:46 r3i0n11 EPILOGUE: Finished: Job 972394.bqs1.zeus.fairmont.rdhpcs.noaa.gov finished for user Xiaqiong.Zhou queue batch with exit code 0
Apr 22 08:08:45 r3i0n11 MPI[13163]: mpirun interrupted - MPI daemon terminating job#012
Apr 22 08:39:18 r3i0n11 MPI[13345]: mpirun interrupted - MPI daemon terminating job#012
Apr 22 09:33:00 r3i0n11 MPI[13907]: mpirun interrupted - MPI daemon terminating job#012
Apr 22 12:24:24 r3i0n11 MPI[14722]: mpirun interrupted - MPI daemon terminating job#012
Apr 22 14:02:50 r3i0n11 MPI[7722]: mpirun interrupted - MPI daemon terminating job#012
Apr 22 15:22:32 r3i0n11 MPI[11745]: mpirun interrupted - MPI daemon terminating job#012
Apr 22 15:56:07 r3i0n11 MPI[12284]: mpirun interrupted - MPI daemon terminating job#012
Apr 22 22:15:03 r3i0n11 MPI[18209]: mpirun interrupted - MPI daemon terminating job#012
Apr 22 22:20:34 r3i0n11 MPI[18321]: mpirun interrupted - MPI daemon terminating job#012
Apr 23 01:30:43 r3i0n11 MPI[21005]: mpirun interrupted - MPI daemon terminating job#012
Apr 23 01:37:45 r3i0n11 MPI[21193]: mpirun interrupted - MPI daemon terminating job#012
Apr 23 01:43:38 r3i0n11 MPI[21299]: mpirun interrupted - MPI daemon terminating job#012
Apr 23 07:32:41 r3i0n11 ntpd[3156]: no servers reachable
Apr 23 07:36:59 r3i0n11 ntpd[3156]: synchronized to 192.168.159.1, stratum 3
Apr 24 20:17:00 r3i0n11 /usr/sbin/arrayd[1892]: REQUEST REMEXT(19) from root@fe6.zeus.fairmont.rdhpcs.noaa.gov exec by user root, cmd='cd '/root'; exec '/usr/mpi/mpitests_mpt/tests/IMB-2.3/IMB-MPI1''
Apr 24 21:27:25 r3i0n11 MPI[1892]: mpirun interrupted - MPI daemon terminating job#012