Hm, this is a strange message on the servers:
Jan 15 19:13:24 nbp2-oss18 kernel: LustreError: 13505:0:(events.c:452:server_bulk_callback()) event type 5, status -103, desc ffff881b1a708000
Jan 15 19:13:24 nbp2-oss18 kernel: Lustre: nbp2-OST0075: Bulk IO read error with 9a6a5394-9d0c-107d-b924-82de647f4613 (at 10.151.27.95@o2ib), client will retry: rc -110
So something is causing those bulk transfers to get aborted. rc -110 is ETIMEDOUT; -103 is ECONNABORTED (connection aborted).
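For reference, here is a minimal user-space sketch (not from the ticket) that decodes the two errno values seen above; ETIMEDOUT (110) and ECONNABORTED (103) are standard Linux errno constants, and Lustre/LNet report them negated:

#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void)
{
	/* rc from the bulk IO error and status from the LNet event callback */
	int codes[] = { -110 /* -ETIMEDOUT */, -103 /* -ECONNABORTED */ };

	for (unsigned int i = 0; i < sizeof(codes) / sizeof(codes[0]); i++)
		printf("%d -> %s\n", codes[i], strerror(-codes[i]));

	return 0;
}

On a Linux box this prints "Connection timed out" and "Software caused connection abort" respectively.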
And then this:
Jan 15 19:13:24 nbp2-oss18 kernel: Lustre: 21888:0:(service.c:2050:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (192:4547s); client may timeout. req@ffff881b110a2400 x1522760681448372/t0(0) o3->9a6a5394-9d0c-107d-b924-82de647f4613@10.151.27.95@o2ib:0/0 lens 488/432 e 0 to 0 dl 1452909414 ref 1 fl Complete:/0/ffffffff rc 0/-1
So this means server threads are spending a lot of time processing these requests. Opcode o3 is READ, so all those bulk timeouts are probably causing the read RPCs to fail, and to take a long time doing so. When that happens, the client whose request got stuck like that would complain about server unresponsiveness and keep reconnecting.
So the root cause is somewhere in the bulk IO errors.
In the log attached to this ticket we can see stuff like:
00000800:00000100:4.0F:1452910553.468033:0:2802:0:(o2iblnd_cb.c:2903:kiblnd_cm_callback()) 10.151.54.85@o2ib: UNREACHABLE -110
00000800:00000100:4.0:1452910553.563037:0:2802:0:(o2iblnd_cb.c:2903:kiblnd_cm_callback()) 10.151.0.196@o2ib: UNREACHABLE -110
00000800:00000100:4.0:1452910553.563045:0:2802:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 10.151.0.196@o2ib: connection failed
Assuming this is one of the clients, I imagine you are just having some sort of network problem where some of the messages cannot get through?
I checked my original patch, and it seems I forgot to call set_current_state() before schedule_timeout(), so the call can't really help: the current thread stays runnable and never actually sleeps. I have updated the patch uploaded by Amir (http://review.whamcloud.com/#/c/16470/), and I also ported it to 2_5_fe (http://review.whamcloud.com/18026)
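For context, a minimal sketch of why the call order matters (this is the generic kernel pattern, not the actual patch, and short_retry_delay() is just a placeholder name): without set_current_state() the task is still TASK_RUNNING, so schedule_timeout() returns immediately and the intended delay is a no-op.

#include <linux/sched.h>
#include <linux/jiffies.h>

static void short_retry_delay(void)
{
	/* Broken: the task is still TASK_RUNNING, so schedule_timeout()
	 * returns right away and no delay actually happens:
	 *
	 *	schedule_timeout(HZ);
	 */

	/* Correct: mark the task as sleeping first, then schedule. */
	set_current_state(TASK_UNINTERRUPTIBLE);
	schedule_timeout(HZ);	/* sleep for roughly one second */
}

The schedule_timeout_interruptible()/schedule_timeout_uninterruptible() helpers combine the two steps and avoid this class of mistake.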