Also, during the last reboot on 12 April 2014, we observed the following error messages, in case they can be related to this issue:
Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248434/real 1397248435] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248441 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Apr 12 02:03:55 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248435/real
Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248437/real 1397248437] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248444 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 73468 previous similar messages
Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248439/real 1397248439] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248446 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 162386 previous similar messages
Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248443/real 1397248443] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248450 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:04:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 296701 previous similar messages
1397248467] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248474 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:04:27 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 1301167 previous similar messages
Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248499/real 1397248499] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248506 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:04:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 2602376 previous similar messages
Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1397248563/real 1397248563] req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248570 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:06:03 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:ptlrpc_expire_one_request()) Skipped 5199842 previous similar messages
Apr 12 02:07:14 homeoss1 kernel: Lustre: Service thread pid 7390 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 12 02:07:14 homeoss1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Apr 12 02:07:14 homeoss1 kernel:
Apr 12 02:07:14 homeoss1 kernel: LustreError: dumping log to /tmp/lustre-log.1397248634.7390
Apr 12 02:07:25 homeoss1 kernel: Lustre: home-OST0003: haven't heard from client c0bee620-b606-8adf-dadd-0d895330a3fd (at 10.2.1.252@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88090ff17000, cur 1397248645 expire 1397248495 last 1397248418
Apr 12 02:07:25 homeoss1 kernel: LustreError: 7390:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8804e2e86c00 x1464903367127640/t0(0) o106->home-OST0005@10.2.1.252@o2ib:15/16 lens 296/232 e 0 to 1 dl 1397248652 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 12 02:07:25 homeoss1 kernel: LustreError: 138-a: home-OST0005: A client on nid 10.2.1.252@o2ib was evicted due to a lock glimpse callback time out: rc -4
Apr 12 02:07:25 homeoss1 kernel: Lustre: Service thread pid 7390 completed after 210.90s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Apr 12 02:28:07 homeoss1 crmd: [5195]: notice: run_graph: Transition 256 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-253.bz2): Complete
Apr 12 02:28:07 homeoss1 crmd: [5195]: info: te_graph_trigger: Transition 256 is now complete
Apr 12 02:28:07 homeoss1 crmd: [5195]: info: notify_crmd: Transition 256 status: done - <null>
Apr 12 02:28:07 homeoss1 crmd: [5195]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 12 02:28:07 homeoss1 crmd: [5195]: info: do_state_transition: Starting PEngine Recheck Timer
Apr 12 02:31:51 homeoss1 cib: [5191]: info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min
Apr 12 02:38:25 homeoss1 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Apr 12 02:38:25 homeoss1 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="7609" x-info="http://www.rsyslog.com"] (re)start
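
To give a sense of how quickly the same request was being retried, the "Skipped N previous similar messages" counters can be pulled out of the log with a small script. This is only a rough sketch: the function name `skipped_counts` and the regular expression are my own, and it assumes the syslog line layout shown in the excerpt above. It does not interpret the counters, it only lists them next to their timestamps.

```python
import re

# Matches syslog lines of the form seen above, for example:
#   Apr 12 02:03:57 homeoss1 kernel: Lustre: ... Skipped 73468 previous similar messages
# The layout (timestamp, host, "kernel:") is assumed from the excerpt.
SKIPPED_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*"
    r"Skipped (?P<count>\d+) previous similar messages"
)


def skipped_counts(lines):
    """Return (syslog timestamp, skipped count) pairs for matching lines."""
    pairs = []
    for line in lines:
        match = SKIPPED_RE.search(line)
        if match:
            pairs.append((match.group("stamp"), int(match.group("count"))))
    return pairs


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "Apr 12 02:03:57 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:"
    "ptlrpc_expire_one_request()) Skipped 73468 previous similar messages",
    "Apr 12 02:03:59 homeoss1 kernel: Lustre: 7390:0:(client.c:1788:"
    "ptlrpc_expire_one_request()) Skipped 162386 previous similar messages",
]

for stamp, count in skipped_counts(sample):
    print(stamp, count)
```

To run it against the full log, pass an open file object for the saved syslog instead of `sample`; lines that do not match the pattern are ignored.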