Console [vesta-mds1] log at 2013-05-21 20:00:00 PDT.
Console [vesta-mds1] log at 2013-05-21 21:00:00 PDT.
Console [vesta-mds1] log at 2013-05-21 22:00:00 PDT.
Console [vesta-mds1] log at 2013-05-21 23:00:00 PDT.
Console [vesta-mds1] log at 2013-05-22 00:00:00 PDT.
2013-05-22 00:00:04 Lustre: fsv-MDT0000: Client 35e5ce04-5d7c-3947-0ac7-6d990e7e62e7 (at 172.16.66.31@tcp) reconnecting
2013-05-22 00:00:04 Lustre: Skipped 1 previous similar message
2013-05-22 00:00:04 Lustre: fsv-MDT0000: Client 35e5ce04-5d7c-3947-0ac7-6d990e7e62e7 (at 172.16.66.31@tcp) refused reconnection, still busy with 1 active RPCs
2013-05-22 00:00:05 LustreError: 18414:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff88082ccbf000 x1435205784177962/t0(0) o37->35e5ce04-5d7c-3947-0ac7-6d990e7e62e7@172.16.66.31@tcp:0/0 lens 416/440 e 0 to 0 dl 1369206065 ref 1 fl Interpret:/0/0 rc 0/0
2013-05-22 00:00:29 Lustre: fsv-MDT0000: Client 35e5ce04-5d7c-3947-0ac7-6d990e7e62e7 (at 172.16.66.31@tcp) reconnecting
2013-05-22 00:02:09 LustreError: 19172:0:(ldlm_lib.c:2709:target_bulk_io()) @@@ timeout on bulk PUT after 100+0s req@ffff8800709f7800 x1435205784183926/t0(0) o37->35e5ce04-5d7c-3947-0ac7-6d990e7e62e7@172.16.66.31@tcp:0/0 lens 416/440 e 0 to 0 dl 1369206129 ref 1 fl Interpret:/2/0 rc 0/0
2013-05-22 00:02:09 LustreError: 19026:0:(ldlm_lib.c:2709:target_bulk_io()) @@@ timeout on bulk PUT after 100+0s req@ffff8800af7ec400 x1435205784183935/t0(0) o37->35e5ce04-5d7c-3947-0ac7-6d990e7e62e7@172.16.66.31@tcp:0/0 lens 416/440 e 0 to 0 dl 1369206129 ref 1 fl Interpret:/0/0 rc 0/0
2013-05-22 00:02:19 Lustre: MGS: Client c75fed24-e951-aa5b-9041-1bb9e7a43c46 (at 172.16.66.31@tcp) reconnecting
2013-05-22 00:03:23 LNetError: 16666:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 9 seconds
2013-05-22 00:03:23 LNetError: 16666:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 172.20.5.106@o2ib500 (0): c: 0, oc: 0, rc: 8
2013-05-22 00:03:53 LustreError: 18442:0:(pack_generic.c:770:lustre_msg_string()) can't unpack short string in msg ffffc900e4c4bf10 buffer[5] len 5: strlen 0
2013-05-22 00:03:53 LustreError: 18442:0:(layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `name' (5 of 6) in format `LDLM_INTENT_GETATTR': 5 vs. 0 (client)
2013-05-22 00:03:53 req@ffff880e88598000 x1435305191414308/t0(0) o101->95c187f7-33eb-883d-787c-93301dc64ea5@172.20.16.10@o2ib500:0/0 lens 576/3304 e 0 to 0 dl 1369206294 ref 1 fl Interpret:/0/ffffffff rc 0/-1
2013-05-22 00:03:59 Lustre: fsv-MDT0000: Client 35e5ce04-5d7c-3947-0ac7-6d990e7e62e7 (at 172.16.66.31@tcp) reconnecting
2013-05-22 00:05:20 Lustre: MGS: Client c75fed24-e951-aa5b-9041-1bb9e7a43c46 (at 172.16.66.31@tcp) reconnecting
2013-05-22 00:05:20 Lustre: Skipped 2 previous similar messages
2013-05-22 00:09:06 LNet: Service thread pid 19045 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2013-05-22 00:09:06 Pid: 19045, comm: mdt_rdpg02_013
2013-05-22 00:09:06
2013-05-22 00:09:06 Call Trace:
2013-05-22 00:09:06 [] ? lock_timer_base+0x3c/0x70
2013-05-22 00:09:06 [] schedule_timeout+0x192/0x2e0
2013-05-22 00:09:06 [] ? process_timeout+0x0/0x10
2013-05-22 00:09:06 [] cfs_waitq_timedwait+0x11/0x20 [libcfs]
2013-05-22 00:09:06 [] target_bulk_io+0x3b8/0x910 [ptlrpc]
2013-05-22 00:09:06 [] ? default_wake_function+0x0/0x20
2013-05-22 00:09:06 [] ? __ptlrpc_prep_bulk_page+0x68/0x170 [ptlrpc]
2013-05-22 00:09:06 [] ? mdd_dir_page_build+0x0/0x210 [mdd]
2013-05-22 00:09:06 [] mdt_sendpage+0x10d/0x240 [mdt]
2013-05-22 00:09:06 [] mdt_readpage+0x497/0x960 [mdt]
2013-05-22 00:09:06 [] mdt_handle_common+0x648/0x1660 [mdt]
2013-05-22 00:09:06 [] mds_readpage_handle+0x15/0x20 [mdt]
2013-05-22 00:09:06 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
2013-05-22 00:09:06 [] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-05-22 00:09:06 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2013-05-22 00:09:06 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-05-22 00:09:06 [] ? __wake_up+0x53/0x70
2013-05-22 00:09:06 [] ptlrpc_main+0xb75/0x1870 [ptlrpc]
2013-05-22 00:09:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-05-22 00:09:06 [] child_rip+0xa/0x20
2013-05-22 00:09:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-05-22 00:09:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-05-22 00:09:06 [] ? child_rip+0x0/0x20
2013-05-22 00:09:06
2013-05-22 00:09:06 LustreError: dumping log to /tmp/lustre-log.1369206546.19045
2013-05-22 00:09:56 Lustre: fsv-MDT0000: Client 35e5ce04-5d7c-3947-0ac7-6d990e7e62e7 (at 172.16.66.31@tcp) reconnecting
2013-05-22 00:09:56 Lustre: Skipped 1 previous similar message
2013-05-22 00:09:56 Lustre: fsv-MDT0000: Client 35e5ce04-5d7c-3947-0ac7-6d990e7e62e7 (at 172.16.66.31@tcp) refused reconnection, still busy with 1 active RPCs
2013-05-22 00:09:56 LustreError: 19045:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff88050fda4c00 x1435205784229566/t0(0) o37->35e5ce04-5d7c-3947-0ac7-6d990e7e62e7@172.16.66.31@tcp:0/0 lens 416/440 e 1 to 0 dl 1369206616 ref 1 fl Interpret:/0/0 rc 0/0
2013-05-22 00:09:56 LNet: Service thread pid 19045 completed after 250.05s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
2013-05-22 00:12:07 Lustre: fsv-MDT0000: Client 35e5ce04-5d7c-3947-0ac7-6d990e7e62e7 (at 172.16.66.31@tcp) reconnecting
2013-05-22 00:12:07 Lustre: Skipped 1 previous similar message
2013-05-22 00:12:07 Lustre: fsv-MDT0000: Client 35e5ce04-5d7c-3947-0ac7-6d990e7e62e7 (at 172.16.66.31@tcp) refused reconnection, still busy with 1 active RPCs
2013-05-22 00:12:07 LustreError: 19045:0:(ldlm_lib.c:2718:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff88101af4d850 x1435205784276370/t0(0) o37->35e5ce04-5d7c-3947-0ac7-6d990e7e62e7@172.16.66.31@tcp:0/0 lens 416/440 e 0 to 0 dl 1369206938 ref 1 fl Interpret:/0/0 rc 0/0
2013-05-22 00:13:29 LNetError: 16666:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
2013-05-22 00:13:29 LNetError: 16666:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 172.20.5.106@o2ib500 (0): c: 0, oc: 1, rc: 8
2013-05-22 00:14:25 LustreError: 19044:0:(mdt_open.c:1098:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:
2013-05-22 00:14:25 LustreError: 19044:0:(mdt_open.c:1098:mdt_reconstruct_open()) LBUG