Console [vesta-mds1] log at 2013-05-19 00:00:00 PDT.
2013-05-19 00:29:39 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) reconnecting
2013-05-19 00:29:39 Lustre: Skipped 13 previous similar messages
2013-05-19 00:29:39 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 00:29:39 Lustre: Skipped 1 previous similar message
2013-05-19 00:30:02 Lustre: fsv-MDT0000: Client 78bdbeed-caca-62c1-546a-c30d23dc899c (at 172.20.17.46@o2ib500) reconnecting
2013-05-19 00:30:02 Lustre: fsv-MDT0000: Client 78bdbeed-caca-62c1-546a-c30d23dc899c (at 172.20.17.46@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 00:30:27 Lustre: fsv-MDT0000: Client 78bdbeed-caca-62c1-546a-c30d23dc899c (at 172.20.17.46@o2ib500) reconnecting
2013-05-19 00:30:27 Lustre: Skipped 1 previous similar message
2013-05-19 00:30:53 LustreError: 19245:0:(pack_generic.c:770:lustre_msg_string()) can't unpack short string in msg ffffc900b7e13b18 buffer[5] len 114: strlen 0
2013-05-19 00:30:53 LustreError: 19245:0:(layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `name' (5 of 6) in format `LDLM_INTENT_GETATTR': 114 vs. 0 (client)
2013-05-19 00:30:53 req@ffff8802ab573000 x1435204922666104/t0(0) o101->9f5c433f-56fd-fbb5-3d26-a182144680ca@172.20.16.11@o2ib500:0/0 lens 688/3304 e 0 to 0 dl 1368948899 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Console [vesta-mds1] log at 2013-05-19 01:00:00 PDT.
Console [vesta-mds1] log at 2013-05-19 02:00:00 PDT.
2013-05-19 02:18:25 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) reconnecting 2013-05-19 02:18:25 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 02:18:50 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) reconnecting 2013-05-19 02:18:50 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 02:19:12 LNet: Service thread pid 19260 was inactive for 258.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 02:19:12 Pid: 19260, comm: mdt00_018 2013-05-19 02:19:12 2013-05-19 02:19:12 Call Trace: 2013-05-19 02:19:12 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 02:19:12 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 02:19:12 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 02:19:12 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 02:19:12 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 02:19:12 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 02:19:12 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 02:19:12 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 02:19:12 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 02:19:12 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 02:19:12 [] mdd_unlink+0x40e/0xe20 [mdd] 2013-05-19 02:19:12 [] mdo_unlink+0x18/0x50 [mdt] 2013-05-19 02:19:12 [] mdt_reint_unlink+0x739/0xfd0 [mdt] 2013-05-19 02:19:12 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 02:19:12 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 02:19:12 [] mdt_reint+0x44/0xe0 [mdt] 2013-05-19 02:19:12 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 02:19:12 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 02:19:12 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 02:19:12 [] ? 
cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 02:19:12 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 02:19:12 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 02:19:12 [] ? __wake_up+0x53/0x70 2013-05-19 02:19:12 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 02:19:12 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 02:19:12 [] child_rip+0xa/0x20 2013-05-19 02:19:12 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 02:19:12 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 02:19:12 [] ? child_rip+0x0/0x20 2013-05-19 02:19:12 2013-05-19 02:19:12 LustreError: dumping log to /tmp/lustre-log.1368955152.19260 2013-05-19 02:19:13 LNet: Service thread pid 19260 completed after 258.63s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 02:19:15 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) reconnecting 2013-05-19 02:45:48 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) reconnecting 2013-05-19 02:45:48 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 02:45:53 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) reconnecting 2013-05-19 02:45:53 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 02:46:13 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) reconnecting 2013-05-19 02:46:17 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 02:46:42 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) reconnecting 2013-05-19 02:46:42 Lustre: Skipped 1 previous similar message 2013-05-19 02:46:42 
Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 02:47:07 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) reconnecting Console [vesta-mds1] log at 2013-05-19 03:00:00 PDT. 2013-05-19 03:08:37 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) reconnecting 2013-05-19 03:08:37 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 03:09:02 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) reconnecting 2013-05-19 03:09:12 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) reconnecting 2013-05-19 03:09:12 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 03:09:37 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) reconnecting 2013-05-19 03:09:37 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 03:09:55 LNet: Service thread pid 19280 was inactive for 524.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 03:09:55 Pid: 19280, comm: mdt03_024 2013-05-19 03:09:55 2013-05-19 03:09:55 Call Trace: 2013-05-19 03:09:55 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 03:09:55 [] ? wake_up_process+0x15/0x20 2013-05-19 03:09:55 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 03:09:55 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 03:09:55 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 03:09:55 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 03:09:55 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 03:09:55 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 03:09:55 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 03:09:55 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 03:09:55 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 03:09:55 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 03:09:55 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 03:09:55 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 03:09:55 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 03:09:55 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 03:09:55 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 03:09:55 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 03:09:55 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 03:09:55 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 03:09:55 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 03:09:55 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 03:09:55 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 03:09:55 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 03:09:55 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 03:09:55 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 03:09:55 [] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 03:09:55 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 03:09:55 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 03:09:55 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 03:09:55 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 03:09:55 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 03:09:55 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 03:09:55 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 03:09:55 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 03:09:55 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 03:09:55 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 03:09:55 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 03:09:55 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 03:09:55 [] ? default_wake_function+0x0/0x20 2013-05-19 03:09:55 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 03:09:55 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 03:09:55 [] child_rip+0xa/0x20 2013-05-19 03:09:55 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 03:09:55 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 03:09:55 [] ? child_rip+0x0/0x20 2013-05-19 03:09:55 2013-05-19 03:09:55 LustreError: dumping log to /tmp/lustre-log.1368958195.19280 2013-05-19 03:10:02 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 03:10:16 LNet: Service thread pid 19389 was inactive for 304.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 03:10:16 Pid: 19389, comm: mdt_rdpg01_008 2013-05-19 03:10:16 2013-05-19 03:10:16 Call Trace: 2013-05-19 03:10:16 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 03:10:16 [] ? wake_up_process+0x15/0x20 2013-05-19 03:10:16 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 03:10:16 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 03:10:16 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 03:10:16 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 03:10:16 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 03:10:16 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 03:10:16 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 03:10:16 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 03:10:16 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 03:10:16 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 03:10:16 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 03:10:16 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 03:10:16 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 03:10:16 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 03:10:16 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 03:10:16 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 03:10:16 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 03:10:16 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 03:10:16 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 03:10:16 [] ? __wake_up+0x53/0x70 2013-05-19 03:10:16 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 03:10:16 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 03:10:16 [] child_rip+0xa/0x20 2013-05-19 03:10:16 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 03:10:16 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 03:10:16 [] ? 
child_rip+0x0/0x20 2013-05-19 03:10:16 2013-05-19 03:10:16 LustreError: dumping log to /tmp/lustre-log.1368958216.19389 2013-05-19 03:10:27 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) reconnecting 2013-05-19 03:10:27 Lustre: Skipped 1 previous similar message 2013-05-19 03:10:27 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 03:10:52 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 03:11:17 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 03:11:42 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) reconnecting 2013-05-19 03:11:42 Lustre: Skipped 2 previous similar messages 2013-05-19 03:12:07 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 03:12:07 Lustre: Skipped 1 previous similar message 2013-05-19 03:12:15 Lustre: 19024:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-64), not sending early reply 2013-05-19 03:12:15 req@ffff880bb630a000 x1434323172404415/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 1 to 0 dl 1368958345 ref 2 fl Interpret:/0/0 rc 0/0 2013-05-19 03:12:43 LNet: Service thread pid 19389 completed after 451.42s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 03:13:18 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 03:13:18 Lustre: Skipped 3 previous similar messages
2013-05-19 03:14:08 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) reconnecting
2013-05-19 03:14:08 Lustre: Skipped 8 previous similar messages
2013-05-19 03:15:33 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 03:15:33 Lustre: Skipped 3 previous similar messages
2013-05-19 03:16:26 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 4470s: evicting client at 172.20.5.108@o2ib500 ns: mdt-ffff88078ca74000 lock: ffff880bd1112e00/0x48db03b06958d372 lrc: 3/0,0 mode: PR/PR res: 8589945364/130387 bits 0x13 rrc: 4 type: IBT flags: 0x200000000020 nid: 172.20.5.108@o2ib500 remote: 0x94f247df1467a5fe expref: 43881 pid: 19280 timeout: 4348762199 lvb_type: 0
2013-05-19 03:17:40 Lustre: 19280:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (674:315s); client may timeout. req@ffff880bb630a000 x1434323172404415/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 1 to 0 dl 1368958345 ref 1 fl Complete:/0/0 rc 0/0
2013-05-19 03:17:40 LNet: Service thread pid 19280 completed after 989.03s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Console [vesta-mds1] log at 2013-05-19 04:00:00 PDT.
2013-05-19 04:16:17 Lustre: fsv-MDT0000: Client ae1bebd4-b855-945f-584a-0d1cbd554898 (at 172.20.17.29@o2ib500) reconnecting 2013-05-19 04:16:17 Lustre: Skipped 5 previous similar messages 2013-05-19 04:16:17 Lustre: fsv-MDT0000: Client ae1bebd4-b855-945f-584a-0d1cbd554898 (at 172.20.17.29@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 04:16:17 Lustre: Skipped 2 previous similar messages 2013-05-19 04:17:01 LNet: Service thread pid 19956 was inactive for 248.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 04:17:01 Pid: 19956, comm: mdt02_062 2013-05-19 04:17:01 2013-05-19 04:17:01 Call Trace: 2013-05-19 04:17:01 [] schedule_timeout+0x192/0x2e0 2013-05-19 04:17:01 [] ? process_timeout+0x0/0x10 2013-05-19 04:17:01 [] cfs_waitq_timedwait+0x11/0x20 [libcfs] 2013-05-19 04:17:01 [] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] 2013-05-19 04:17:01 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 04:17:01 [] ? default_wake_function+0x0/0x20 2013-05-19 04:17:01 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 04:17:01 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 04:17:01 [] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc] 2013-05-19 04:17:01 [] mdt_reint_rename+0x214/0x1b10 [mdt] 2013-05-19 04:17:01 [] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc] 2013-05-19 04:17:01 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 04:17:01 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 04:17:01 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 04:17:01 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 04:17:01 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 04:17:01 [] ? __req_capsule_get+0x166/0x700 [ptlrpc] 2013-05-19 04:17:01 [] ? 
lu_ucred+0x20/0x30 [obdclass] 2013-05-19 04:17:01 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 04:17:01 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 04:17:01 [] mdt_reint+0x44/0xe0 [mdt] 2013-05-19 04:17:01 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 04:17:01 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 04:17:01 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 04:17:01 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 04:17:01 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 04:17:01 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 04:17:01 [] ? default_wake_function+0x0/0x20 2013-05-19 04:17:01 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 04:17:01 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 04:17:01 [] child_rip+0xa/0x20 2013-05-19 04:17:01 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 04:17:01 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 04:17:01 [] ? child_rip+0x0/0x20 2013-05-19 04:17:01 2013-05-19 04:17:01 LustreError: dumping log to /tmp/lustre-log.1368962221.19956 2013-05-19 04:17:02 LNet: Service thread pid 19293 was inactive for 248.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 04:17:02 Pid: 19293, comm: mdt02_029 2013-05-19 04:17:02 2013-05-19 04:17:02 Call Trace: 2013-05-19 04:17:02 [] schedule_timeout+0x192/0x2e0 2013-05-19 04:17:02 [] ? process_timeout+0x0/0x10 2013-05-19 04:17:02 [] cfs_waitq_timedwait+0x11/0x20 [libcfs] 2013-05-19 04:17:02 [] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] 2013-05-19 04:17:02 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 04:17:02 [] ? default_wake_function+0x0/0x20 2013-05-19 04:17:02 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 04:17:02 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 04:17:02 [] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc] 2013-05-19 04:17:02 [] mdt_reint_rename+0x214/0x1b10 [mdt] 2013-05-19 04:17:02 [] ? 
ldlm_blocking_ast+0x0/0x180 [ptlrpc] 2013-05-19 04:17:02 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 04:17:02 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 04:17:02 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 04:17:02 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 04:17:02 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 04:17:02 [] ? __req_capsule_get+0x166/0x700 [ptlrpc] 2013-05-19 04:17:02 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 04:17:02 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 04:17:02 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 04:17:02 [] mdt_reint+0x44/0xe0 [mdt] 2013-05-19 04:17:02 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 04:17:02 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 04:17:02 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 04:17:02 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 04:17:02 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 04:17:02 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 04:17:02 [] ? __wake_up+0x53/0x70 2013-05-19 04:17:02 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 04:17:02 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 04:17:02 [] child_rip+0xa/0x20 2013-05-19 04:17:02 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 04:17:02 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 04:17:02 [] ? child_rip+0x0/0x20 2013-05-19 04:17:02 2013-05-19 04:17:02 LustreError: dumping log to /tmp/lustre-log.1368962222.19293 2013-05-19 04:17:06 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) reconnecting 2013-05-19 04:17:06 Lustre: Skipped 8 previous similar messages 2013-05-19 04:17:06 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 04:17:06 Lustre: Skipped 7 previous similar messages 2013-05-19 04:17:17 LNet: Service thread pid 19956 completed after 264.04s. 
This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 04:17:17 LNet: Skipped 1 previous similar message 2013-05-19 04:27:24 Lustre: fsv-MDT0000: Client b1cc8a79-fc54-6daf-240e-142cdd81e769 (at 172.20.17.17@o2ib500) reconnecting 2013-05-19 04:27:24 Lustre: Skipped 11 previous similar messages 2013-05-19 04:27:24 Lustre: fsv-MDT0000: Client b1cc8a79-fc54-6daf-240e-142cdd81e769 (at 172.20.17.17@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 04:27:24 Lustre: Skipped 5 previous similar messages 2013-05-19 04:34:15 Lustre: fsv-MDT0000: Client 40d970fd-fb56-40b5-1331-282547979508 (at 172.20.17.28@o2ib500) reconnecting 2013-05-19 04:34:15 Lustre: Skipped 3 previous similar messages 2013-05-19 04:34:15 Lustre: fsv-MDT0000: Client 40d970fd-fb56-40b5-1331-282547979508 (at 172.20.17.28@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 04:34:15 Lustre: Skipped 1 previous similar message 2013-05-19 04:48:08 Lustre: fsv-MDT0000: Client 40d970fd-fb56-40b5-1331-282547979508 (at 172.20.17.28@o2ib500) reconnecting 2013-05-19 04:48:08 Lustre: Skipped 1 previous similar message 2013-05-19 04:48:08 Lustre: fsv-MDT0000: Client 40d970fd-fb56-40b5-1331-282547979508 (at 172.20.17.28@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 04:49:55 LNet: Service thread pid 18392 was inactive for 238.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 04:49:55 Pid: 18392, comm: mdt_rdpg02_000 2013-05-19 04:49:55 2013-05-19 04:49:55 Call Trace: 2013-05-19 04:49:55 [] ? __mutex_lock_slowpath+0x70/0x180 2013-05-19 04:49:55 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 04:49:55 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 04:49:55 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 04:49:55 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 04:49:55 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 04:49:55 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 04:49:55 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 04:49:55 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 04:49:55 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 04:49:55 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 04:49:55 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 04:49:55 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 04:49:55 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 04:49:55 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 04:49:55 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 04:49:55 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 04:49:55 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 04:49:55 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 04:49:55 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 04:49:55 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 04:49:55 [] ? __wake_up+0x53/0x70 2013-05-19 04:49:55 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 04:49:55 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 04:49:55 [] child_rip+0xa/0x20 2013-05-19 04:49:55 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 04:49:55 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 04:49:55 [] ? child_rip+0x0/0x20 2013-05-19 04:49:55 2013-05-19 04:49:55 LustreError: dumping log to /tmp/lustre-log.1368964195.18392 2013-05-19 04:50:34 LNet: Service thread pid 18392 completed after 277.16s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Console [vesta-mds1] log at 2013-05-19 05:00:00 PDT. 2013-05-19 05:01:05 LNet: Service thread pid 19317 was inactive for 576.00s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: 2013-05-19 05:01:05 Pid: 19317, comm: mdt03_032 2013-05-19 05:01:05 2013-05-19 05:01:05 Call Trace: 2013-05-19 05:01:05 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 05:01:05 [] ? wake_up_process+0x15/0x20 2013-05-19 05:01:05 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 05:01:05 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 05:01:05 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 05:01:05 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 05:01:05 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 05:01:05 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 05:01:05 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 05:01:05 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 05:01:05 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 05:01:05 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 05:01:05 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 05:01:05 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 05:01:05 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 05:01:05 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 05:01:05 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 05:01:05 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 05:01:05 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 05:01:05 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 05:01:05 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 05:01:05 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 05:01:05 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 05:01:05 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 05:01:05 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 05:01:05 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 05:01:05 [] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 05:01:05 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 05:01:05 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 05:01:06 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 05:01:06 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 05:01:06 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 05:01:06 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 05:01:06 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 05:01:06 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 05:01:06 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 05:01:06 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 05:01:06 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 05:01:06 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 05:01:06 [] ? __wake_up+0x53/0x70 2013-05-19 05:01:06 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 05:01:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:01:06 [] child_rip+0xa/0x20 2013-05-19 05:01:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:01:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:01:06 [] ? 
child_rip+0x0/0x20 2013-05-19 05:01:06 2013-05-19 05:01:06 LustreError: dumping log to /tmp/lustre-log.1368964866.19317 2013-05-19 05:03:39 Lustre: 19285:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-130), not sending early reply 2013-05-19 05:03:39 req@ffff880fff21e800 x1434323172709123/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 1 to 0 dl 1368965029 ref 2 fl Interpret:/0/0 rc 0/0 2013-05-19 05:04:25 Lustre: fsv-MDT0000: Client 412727fc-9d8f-c0e3-b97c-26e6f8acc04d (at 172.20.17.120@o2ib500) reconnecting 2013-05-19 05:04:25 Lustre: Skipped 26 previous similar messages 2013-05-19 05:04:25 Lustre: fsv-MDT0000: Client 412727fc-9d8f-c0e3-b97c-26e6f8acc04d (at 172.20.17.120@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 05:04:25 Lustre: Skipped 18 previous similar messages 2013-05-19 05:10:04 Lustre: 19317:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (740:375s); client may timeout. req@ffff880fff21e800 x1434323172709123/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 1 to 0 dl 1368965029 ref 1 fl Complete:/0/0 rc 0/0 2013-05-19 05:10:04 LNet: Service thread pid 19317 completed after 1114.65s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 05:30:44 Lustre: fsv-MDT0000: Client 94c13b4f-cd98-a67c-5c82-e2d28e81cf1e (at 172.20.17.10@o2ib500) reconnecting 2013-05-19 05:30:44 Lustre: Skipped 12 previous similar messages 2013-05-19 05:30:44 Lustre: fsv-MDT0000: Client 94c13b4f-cd98-a67c-5c82-e2d28e81cf1e (at 172.20.17.10@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 05:30:44 Lustre: Skipped 10 previous similar messages 2013-05-19 05:31:28 LNet: Service thread pid 19020 was inactive for 246.00s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: 2013-05-19 05:31:28 Pid: 19020, comm: mdt_rdpg03_005 2013-05-19 05:31:28 2013-05-19 05:31:28 Call Trace: 2013-05-19 05:31:28 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 05:31:28 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 05:31:28 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 05:31:28 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 05:31:28 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 05:31:28 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 05:31:28 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 05:31:28 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 05:31:28 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 05:31:28 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 05:31:28 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 05:31:28 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 05:31:28 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 05:31:28 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 05:31:28 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 05:31:28 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 05:31:28 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 05:31:28 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 05:31:28 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 05:31:28 [] ? __wake_up+0x53/0x70 2013-05-19 05:31:28 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 05:31:28 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:31:28 [] child_rip+0xa/0x20 2013-05-19 05:31:28 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:31:28 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:31:28 [] ? 
child_rip+0x0/0x20 2013-05-19 05:31:28 2013-05-19 05:31:28 LustreError: dumping log to /tmp/lustre-log.1368966688.19020 2013-05-19 05:32:06 Lustre: fsv-MDT0000: Client 3431632c-e0d6-6ed4-a40c-a2220b281792 (at 172.20.17.177@o2ib500) reconnecting 2013-05-19 05:32:06 Lustre: Skipped 10 previous similar messages 2013-05-19 05:32:06 Lustre: fsv-MDT0000: Client 3431632c-e0d6-6ed4-a40c-a2220b281792 (at 172.20.17.177@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 05:32:06 Lustre: Skipped 9 previous similar messages 2013-05-19 05:32:38 LNet: Service thread pid 19020 completed after 316.11s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 05:34:38 Lustre: fsv-MDT0000: Client 1a793117-025e-e79b-b9a4-3baf3d388d0b (at 172.20.17.53@o2ib500) reconnecting 2013-05-19 05:34:38 Lustre: Skipped 12 previous similar messages 2013-05-19 05:34:38 Lustre: fsv-MDT0000: Client 1a793117-025e-e79b-b9a4-3baf3d388d0b (at 172.20.17.53@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 05:34:38 Lustre: Skipped 7 previous similar messages 2013-05-19 05:35:02 LNet: Service thread pid 19365 was inactive for 336.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 05:35:02 Pid: 19365, comm: mdt01_046 2013-05-19 05:35:02 2013-05-19 05:35:02 Call Trace: 2013-05-19 05:35:02 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 05:35:02 [] ? enqueue_task_fair+0xfb/0x100 2013-05-19 05:35:02 [] ? __mutex_lock_slowpath+0x70/0x180 2013-05-19 05:35:02 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 05:35:02 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 05:35:02 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 05:35:02 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 05:35:02 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 05:35:02 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 05:35:02 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 05:35:02 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 05:35:02 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 05:35:02 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 05:35:02 [] mdd_create+0x929/0x1770 [mdd] 2013-05-19 05:35:02 [] ? lod_index_lookup+0x0/0x30 [lod] 2013-05-19 05:35:02 [] mdt_reint_open+0x1422/0x2120 [mdt] 2013-05-19 05:35:02 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 05:35:02 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 05:35:02 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 05:35:02 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 05:35:02 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 05:35:02 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 05:35:02 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 05:35:02 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 05:35:02 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 05:35:02 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 05:35:02 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 05:35:02 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 05:35:02 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 05:35:02 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 05:35:02 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 05:35:02 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 05:35:02 [] ? __wake_up+0x53/0x70 2013-05-19 05:35:02 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 05:35:02 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:35:02 [] child_rip+0xa/0x20 2013-05-19 05:35:02 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:35:02 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:35:02 [] ? 
child_rip+0x0/0x20 2013-05-19 05:35:02 2013-05-19 05:35:02 LustreError: dumping log to /tmp/lustre-log.1368966902.19365 2013-05-19 05:36:11 LNet: Service thread pid 18391 was inactive for 290.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 05:36:11 Pid: 18391, comm: mdt_rdpg01_001 2013-05-19 05:36:11 2013-05-19 05:36:11 Call Trace: 2013-05-19 05:36:11 [] ? __mutex_lock_slowpath+0x70/0x180 2013-05-19 05:36:11 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 05:36:11 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 05:36:11 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 05:36:11 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 05:36:11 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 05:36:11 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 05:36:11 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 05:36:11 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 05:36:11 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 05:36:11 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 05:36:11 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 05:36:11 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 05:36:11 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 05:36:11 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 05:36:11 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 05:36:11 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 05:36:11 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 05:36:11 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 05:36:11 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 05:36:11 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 05:36:11 [] ? __wake_up+0x53/0x70 2013-05-19 05:36:11 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 05:36:11 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:36:11 [] child_rip+0xa/0x20 2013-05-19 05:36:11 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:36:11 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:36:11 [] ? 
child_rip+0x0/0x20 2013-05-19 05:36:11 2013-05-19 05:36:11 LustreError: dumping log to /tmp/lustre-log.1368966971.18391 2013-05-19 05:36:35 LNet: Service thread pid 19365 completed after 428.87s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 05:37:42 LNet: Service thread pid 18391 completed after 381.27s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 05:42:55 Lustre: fsv-MDT0000: Client ac2f1f2a-9d4f-f50a-ed2d-9a176885ef63 (at 172.20.17.24@o2ib500) reconnecting 2013-05-19 05:42:55 Lustre: Skipped 25 previous similar messages 2013-05-19 05:42:55 Lustre: fsv-MDT0000: Client ac2f1f2a-9d4f-f50a-ed2d-9a176885ef63 (at 172.20.17.24@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 05:42:55 Lustre: Skipped 19 previous similar messages 2013-05-19 05:46:48 LNet: Service thread pid 19024 was inactive for 674.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 05:46:48 Pid: 19024, comm: mdt03_006 2013-05-19 05:46:48 2013-05-19 05:46:48 Call Trace: 2013-05-19 05:46:48 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 05:46:48 [] ? wake_up_process+0x15/0x20 2013-05-19 05:46:48 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 05:46:48 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 05:46:48 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 05:46:48 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 05:46:48 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 05:46:48 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 05:46:48 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 05:46:48 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 05:46:48 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 05:46:48 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 05:46:48 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 05:46:48 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 05:46:48 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 05:46:48 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 05:46:48 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 05:46:48 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 05:46:48 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 05:46:48 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 05:46:48 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 05:46:48 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 05:46:48 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 05:46:48 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 05:46:48 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 05:46:48 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 05:46:48 [] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 05:46:48 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 05:46:48 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 05:46:48 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 05:46:48 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 05:46:48 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 05:46:48 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 05:46:48 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 05:46:48 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 05:46:48 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 05:46:48 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 05:46:48 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 05:46:48 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 05:46:48 [] ? __wake_up+0x53/0x70 2013-05-19 05:46:48 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 05:46:48 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:46:48 [] child_rip+0xa/0x20 2013-05-19 05:46:48 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:46:48 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:46:48 [] ? child_rip+0x0/0x20 2013-05-19 05:46:48 2013-05-19 05:46:48 LustreError: dumping log to /tmp/lustre-log.1368967608.19024 2013-05-19 05:49:46 Lustre: 20160:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-252), not sending early reply 2013-05-19 05:49:46 req@ffff880fea058800 x1434323172783401/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 1 to 0 dl 1368967796 ref 2 fl Interpret:/0/0 rc 0/0 2013-05-19 05:51:29 LNet: Service thread pid 19479 was inactive for 328.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 05:51:29 Pid: 19479, comm: mdt_rdpg00_009 2013-05-19 05:51:29 2013-05-19 05:51:29 Call Trace: 2013-05-19 05:51:29 [] ? 
prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 05:51:29 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 05:51:29 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 05:51:29 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 05:51:29 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 05:51:29 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 05:51:29 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 05:51:29 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 05:51:29 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 05:51:29 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 05:51:29 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 05:51:29 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 05:51:29 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 05:51:29 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 05:51:29 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 05:51:30 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 05:51:30 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 05:51:30 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 05:51:30 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 05:51:30 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 05:51:30 [] ? __wake_up+0x53/0x70 2013-05-19 05:51:30 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 05:51:30 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:51:30 [] child_rip+0xa/0x20 2013-05-19 05:51:30 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:51:30 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 05:51:30 [] ? child_rip+0x0/0x20 2013-05-19 05:51:30 2013-05-19 05:51:30 LustreError: dumping log to /tmp/lustre-log.1368967890.19479 2013-05-19 05:52:03 LNet: Service thread pid 19479 completed after 361.96s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 05:53:02 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) reconnecting 2013-05-19 05:53:02 Lustre: Skipped 11 previous similar messages 2013-05-19 05:53:02 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 05:53:02 Lustre: Skipped 8 previous similar messages 2013-05-19 05:53:11 Lustre: 19024:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (862:195s); client may timeout. req@ffff880fea058800 x1434323172783401/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 1 to 0 dl 1368967796 ref 1 fl Complete:/0/0 rc 0/0 2013-05-19 05:53:11 LNet: Service thread pid 19024 completed after 1056.90s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Console [vesta-mds1] log at 2013-05-19 06:00:00 PDT. 2013-05-19 06:10:22 Lustre: fsv-MDT0000: Client 9ca92f80-1431-bf08-73e5-4e4d1365a110 (at 172.20.17.16@o2ib500) reconnecting 2013-05-19 06:10:22 Lustre: Skipped 1 previous similar message 2013-05-19 06:10:22 Lustre: fsv-MDT0000: Client 9ca92f80-1431-bf08-73e5-4e4d1365a110 (at 172.20.17.16@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 06:15:32 Lustre: lock timed out (enqueued at 1368969132, 200s ago) 2013-05-19 06:15:32 Lustre: Skipped 1 previous similar message 2013-05-19 06:16:09 Lustre: lock timed out (enqueued at 1368969169, 200s ago) 2013-05-19 06:16:09 Lustre: Skipped 1 previous similar message 2013-05-19 06:16:40 Lustre: lock timed out (enqueued at 1368969200, 200s ago) 2013-05-19 06:17:10 Lustre: lock timed out (enqueued at 1368969230, 200s ago) 2013-05-19 06:17:10 Lustre: Skipped 1 previous similar message 2013-05-19 06:22:54 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) reconnecting 2013-05-19 06:22:54 Lustre: 
Skipped 10 previous similar messages 2013-05-19 06:22:54 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 06:22:54 Lustre: Skipped 6 previous similar messages 2013-05-19 06:23:41 LNet: Service thread pid 18990 was inactive for 258.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 06:23:41 Pid: 18990, comm: mdt_rdpg00_002 2013-05-19 06:23:41 2013-05-19 06:23:41 Call Trace: 2013-05-19 06:23:41 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 06:23:41 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 06:23:41 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 06:23:41 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 06:23:41 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 06:23:41 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 06:23:41 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 06:23:41 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 06:23:41 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 06:23:41 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 06:23:41 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 06:23:41 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 06:23:41 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 06:23:41 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 06:23:41 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 06:23:41 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 06:23:41 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 06:23:41 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 06:23:41 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 06:23:41 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 06:23:41 [] ? __wake_up+0x53/0x70 2013-05-19 06:23:41 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 06:23:41 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:23:41 [] child_rip+0xa/0x20 2013-05-19 06:23:41 [] ? 
ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:23:41 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:23:41 [] ? child_rip+0x0/0x20 2013-05-19 06:23:41 2013-05-19 06:23:41 LustreError: dumping log to /tmp/lustre-log.1368969821.18990 2013-05-19 06:23:51 Lustre: lock timed out (enqueued at 1368969631, 200s ago) 2013-05-19 06:23:54 LNet: Service thread pid 18990 completed after 271.39s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 06:24:32 LNet: Service thread pid 19382 was inactive for 294.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 06:24:32 Pid: 19382, comm: mdt_rdpg01_007 2013-05-19 06:24:32 2013-05-19 06:24:32 Call Trace: 2013-05-19 06:24:32 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 06:24:32 [] ? wake_up_process+0x15/0x20 2013-05-19 06:24:32 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 06:24:32 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 06:24:32 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 06:24:32 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 06:24:32 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 06:24:32 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 06:24:32 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 06:24:32 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 06:24:32 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 06:24:32 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 06:24:32 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 06:24:32 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 06:24:32 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 06:24:32 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 06:24:32 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 06:24:32 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 06:24:32 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 06:24:32 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 06:24:32 [] ? 
ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 06:24:32 [] ? default_wake_function+0x0/0x20 2013-05-19 06:24:32 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 06:24:32 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:24:32 [] child_rip+0xa/0x20 2013-05-19 06:24:32 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:24:32 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:24:32 [] ? child_rip+0x0/0x20 2013-05-19 06:24:32 2013-05-19 06:24:32 LustreError: dumping log to /tmp/lustre-log.1368969872.19382 2013-05-19 06:25:43 LNet: Service thread pid 19382 completed after 365.76s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 06:40:00 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) reconnecting 2013-05-19 06:40:00 Lustre: Skipped 21 previous similar messages 2013-05-19 06:40:00 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 06:40:00 Lustre: Skipped 15 previous similar messages 2013-05-19 06:40:40 LNet: Service thread pid 19011 was inactive for 236.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 06:40:40 Pid: 19011, comm: mdt_rdpg00_004 2013-05-19 06:40:40 2013-05-19 06:40:40 Call Trace: 2013-05-19 06:40:40 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 06:40:40 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 06:40:40 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 06:40:40 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 06:40:40 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 06:40:40 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 06:40:40 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 06:40:40 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 06:40:40 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 06:40:40 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 06:40:40 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 06:40:40 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 06:40:40 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 06:40:40 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 06:40:40 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 06:40:40 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 06:40:40 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 06:40:40 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 06:40:40 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 06:40:40 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 06:40:40 [] ? __wake_up+0x53/0x70 2013-05-19 06:40:40 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 06:40:40 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:40:40 [] child_rip+0xa/0x20 2013-05-19 06:40:40 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:40:40 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:40:40 [] ? child_rip+0x0/0x20 2013-05-19 06:40:40 2013-05-19 06:40:40 LustreError: dumping log to /tmp/lustre-log.1368970840.19011 2013-05-19 06:40:41 LNet: Service thread pid 19011 completed after 237.91s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 06:41:49 LNet: Service thread pid 19213 was inactive for 280.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 06:41:49 Pid: 19213, comm: mdt_rdpg02_005 2013-05-19 06:41:49 2013-05-19 06:41:49 Call Trace: 2013-05-19 06:41:49 [] ? 
prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 06:41:49 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 06:41:49 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 06:41:49 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 06:41:49 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 06:41:49 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 06:41:49 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 06:41:49 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 06:41:49 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 06:41:49 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 06:41:49 [] mdd_attr_set+0x4a3/0x1390 [mdd] 2013-05-19 06:41:49 [] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc] 2013-05-19 06:41:49 [] mdt_mfd_close+0x502/0x6e0 [mdt] 2013-05-19 06:41:49 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 06:41:49 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 06:41:49 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 06:41:49 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 06:41:49 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 06:41:49 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 06:41:49 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 06:41:49 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 06:41:49 [] ? __wake_up+0x53/0x70 2013-05-19 06:41:49 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 06:41:49 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:41:49 [] child_rip+0xa/0x20 2013-05-19 06:41:49 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:41:49 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:41:49 [] ? child_rip+0x0/0x20 2013-05-19 06:41:49 2013-05-19 06:41:49 LustreError: dumping log to /tmp/lustre-log.1368970909.19213 2013-05-19 06:42:12 LNet: Service thread pid 19213 completed after 302.64s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 06:46:36 LNet: Service thread pid 19967 was inactive for 344.00s. 
The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 06:46:36 Pid: 19967, comm: mdt02_073 2013-05-19 06:46:36 2013-05-19 06:46:36 Call Trace: 2013-05-19 06:46:36 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 06:46:36 [] ? wake_up_process+0x15/0x20 2013-05-19 06:46:36 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 06:46:36 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 06:46:36 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 06:46:36 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 06:46:36 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 06:46:36 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 06:46:36 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 06:46:36 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 06:46:36 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 06:46:36 [] mdd_rename+0x4ae/0x2330 [mdd] 2013-05-19 06:46:36 [] ? lu_object_put+0x1ce/0x330 [obdclass] 2013-05-19 06:46:36 [] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs] 2013-05-19 06:46:36 [] mdt_reint_rename+0x13d5/0x1b10 [mdt] 2013-05-19 06:46:36 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 06:46:36 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 06:46:36 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 06:46:36 [] mdt_reint+0x44/0xe0 [mdt] 2013-05-19 06:46:36 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 06:46:36 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 06:46:36 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 06:46:36 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 06:46:36 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 06:46:36 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 06:46:36 [] ? __wake_up+0x53/0x70 2013-05-19 06:46:36 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 06:46:36 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:46:36 [] child_rip+0xa/0x20 2013-05-19 06:46:36 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:46:36 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 06:46:36 [] ? 
child_rip+0x0/0x20 2013-05-19 06:46:36 2013-05-19 06:46:36 LustreError: dumping log to /tmp/lustre-log.1368971196.19967 2013-05-19 06:47:02 LNet: Service thread pid 19967 completed after 369.90s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Console [vesta-mds1] log at 2013-05-19 07:00:00 PDT. 2013-05-19 07:05:55 LNet: Service thread pid 19309 was inactive for 566.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 07:05:55 Pid: 19309, comm: mdt03_030 2013-05-19 07:05:55 2013-05-19 07:05:55 Call Trace: 2013-05-19 07:05:55 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 07:05:55 [] ? wake_up_process+0x15/0x20 2013-05-19 07:05:55 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 07:05:55 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 07:05:55 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 07:05:55 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 07:05:55 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 07:05:55 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 07:05:55 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 07:05:55 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 07:05:55 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 07:05:55 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 07:05:55 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 07:05:55 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 07:05:55 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 07:05:55 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 07:05:55 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 07:05:55 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 07:05:55 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 07:05:55 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 07:05:55 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 07:05:55 [] ? 
llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 07:05:55 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 07:05:55 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 07:05:55 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 07:05:55 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 07:05:55 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 07:05:55 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 07:05:55 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 07:05:55 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 07:05:55 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 07:05:55 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 07:05:55 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 07:05:55 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 07:05:55 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 07:05:55 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 07:05:55 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 07:05:55 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 07:05:55 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 07:05:55 [] ? __wake_up+0x53/0x70 2013-05-19 07:05:55 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 07:05:55 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 07:05:55 [] child_rip+0xa/0x20 2013-05-19 07:05:55 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 07:05:55 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 07:05:55 [] ? 
child_rip+0x0/0x20 2013-05-19 07:05:55 2013-05-19 07:05:55 LustreError: dumping log to /tmp/lustre-log.1368972355.19309 2013-05-19 07:08:25 Lustre: 19024:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-116), not sending early reply 2013-05-19 07:08:25 req@ffff880b4020f000 x1434323172889080/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 1 to 0 dl 1368972515 ref 2 fl Interpret:/0/0 rc 0/0 2013-05-19 07:09:49 Lustre: fsv-MDT0000: Client a2ee9720-9752-3971-2112-55ef1cacfea0 (at 172.20.17.15@o2ib500) reconnecting 2013-05-19 07:09:50 Lustre: Skipped 33 previous similar messages 2013-05-19 07:09:50 Lustre: fsv-MDT0000: Client a2ee9720-9752-3971-2112-55ef1cacfea0 (at 172.20.17.15@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 07:09:50 Lustre: Skipped 24 previous similar messages 2013-05-19 07:10:07 Lustre: 19309:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (726:92s); client may timeout. req@ffff880b4020f000 x1434323172889080/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 1 to 0 dl 1368972515 ref 1 fl Complete:/0/0 rc 0/0 2013-05-19 07:10:07 LNet: Service thread pid 19309 completed after 817.49s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 07:25:21 Lustre: fsv-MDT0000: Client b9b7fd3d-ecaf-02f7-e349-b730aeb76248 (at 172.20.17.5@o2ib500) reconnecting
2013-05-19 07:25:21 Lustre: Skipped 1 previous similar message
2013-05-19 07:25:21 Lustre: fsv-MDT0000: Client b9b7fd3d-ecaf-02f7-e349-b730aeb76248 (at 172.20.17.5@o2ib500) refused reconnection, still busy with 2 active RPCs
2013-05-19 07:25:46 Lustre: fsv-MDT0000: Client b9b7fd3d-ecaf-02f7-e349-b730aeb76248 (at 172.20.17.5@o2ib500) reconnecting
2013-05-19 07:25:46 Lustre: fsv-MDT0000: Client b9b7fd3d-ecaf-02f7-e349-b730aeb76248 (at 172.20.17.5@o2ib500) refused reconnection, still busy with 2 active RPCs
2013-05-19 07:26:11 Lustre: fsv-MDT0000: Client b9b7fd3d-ecaf-02f7-e349-b730aeb76248 (at 172.20.17.5@o2ib500) reconnecting
2013-05-19 07:26:11 Lustre: fsv-MDT0000: Client b9b7fd3d-ecaf-02f7-e349-b730aeb76248 (at 172.20.17.5@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 07:26:51 Lustre: fsv-MDT0000: Client a2ee9720-9752-3971-2112-55ef1cacfea0 (at 172.20.17.15@o2ib500) reconnecting
2013-05-19 07:26:51 Lustre: Skipped 2 previous similar messages
2013-05-19 07:44:28 Lustre: fsv-MDT0000: Client 7b5e1211-3724-daaa-ce62-edcbd062b2ae (at 172.20.17.27@o2ib500) reconnecting
2013-05-19 07:44:28 Lustre: fsv-MDT0000: Client f6bdb649-0807-346d-2047-6641007ec46d (at 172.20.17.11@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 07:44:28 Lustre: Skipped 2 previous similar messages
2013-05-19 07:44:28 Lustre: Skipped 2 previous similar messages
2013-05-19 07:44:49 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) reconnecting
2013-05-19 07:44:49 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 07:44:49 Lustre: Skipped 1 previous similar message
2013-05-19 07:44:58 Lustre: fsv-MDT0000: Client ac2f1f2a-9d4f-f50a-ed2d-9a176885ef63 (at 172.20.17.24@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 07:44:58 Lustre: Skipped 3 previous similar messages
2013-05-19 07:45:14 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) reconnecting
2013-05-19 07:45:14 Lustre: Skipped 6 previous similar messages
2013-05-19 07:51:25 Lustre: fsv-MDT0000: Client 94c13b4f-cd98-a67c-5c82-e2d28e81cf1e (at 172.20.17.10@o2ib500) reconnecting
2013-05-19 07:51:25 Lustre: Skipped 6 previous similar messages
2013-05-19 07:51:25 Lustre: fsv-MDT0000: Client 94c13b4f-cd98-a67c-5c82-e2d28e81cf1e (at 172.20.17.10@o2ib500) refused reconnection, still busy with 2 active RPCs
2013-05-19 07:51:25 Lustre: Skipped 2 previous similar messages
2013-05-19 07:52:15 Lustre: fsv-MDT0000: Client 94c13b4f-cd98-a67c-5c82-e2d28e81cf1e (at 172.20.17.10@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 07:52:15 Lustre: Skipped 3 previous similar messages
2013-05-19 07:52:41 Lustre: fsv-MDT0000: Client 262d17ec-e707-46b5-4cdb-401369638252 (at 172.20.17.32@o2ib500) reconnecting
2013-05-19 07:52:41 Lustre: Skipped 7 previous similar messages
2013-05-19 07:57:13 Lustre: fsv-MDT0000: Client 78bdbeed-caca-62c1-546a-c30d23dc899c (at 172.20.17.46@o2ib500) reconnecting
2013-05-19 07:57:13 Lustre: Skipped 1 previous similar message
2013-05-19 07:57:13 Lustre: fsv-MDT0000: Client 78bdbeed-caca-62c1-546a-c30d23dc899c (at 172.20.17.46@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 07:57:13 Lustre: Skipped 2 previous similar messages
Console [vesta-mds1] log at 2013-05-19 08:00:00 PDT.
2013-05-19 08:04:40 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) reconnecting 2013-05-19 08:04:40 Lustre: Skipped 5 previous similar messages 2013-05-19 08:04:40 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 08:04:40 Lustre: Skipped 3 previous similar messages 2013-05-19 08:27:39 LNet: Service thread pid 19358 was inactive for 482.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 08:27:39 Pid: 19358, comm: mdt03_044 2013-05-19 08:27:39 2013-05-19 08:27:39 Call Trace: 2013-05-19 08:27:39 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 08:27:39 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 08:27:39 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 08:27:39 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 08:27:39 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 08:27:39 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 08:27:39 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 08:27:39 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 08:27:39 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 08:27:39 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 08:27:39 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 08:27:39 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 08:27:39 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 08:27:39 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 08:27:39 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 08:27:39 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 08:27:39 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 08:27:39 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 08:27:39 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 08:27:39 [] ? 
llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 08:27:39 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 08:27:39 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 08:27:39 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 08:27:39 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 08:27:39 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 08:27:39 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 08:27:39 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 08:27:39 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 08:27:39 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 08:27:39 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 08:27:39 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 08:27:39 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 08:27:39 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 08:27:39 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 08:27:39 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 08:27:39 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 08:27:39 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 08:27:39 [] ? __wake_up+0x53/0x70 2013-05-19 08:27:39 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 08:27:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 08:27:39 [] child_rip+0xa/0x20 2013-05-19 08:27:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 08:27:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 08:27:39 [] ? 
child_rip+0x0/0x20 2013-05-19 08:27:39 2013-05-19 08:27:39 LustreError: dumping log to /tmp/lustre-log.1368977259.19358 2013-05-19 08:29:49 Lustre: 19259:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-12), not sending early reply 2013-05-19 08:29:49 req@ffff880b6b07e000 x1434323173103402/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 1 to 0 dl 1368977399 ref 2 fl Interpret:/0/0 rc 0/0 2013-05-19 08:32:10 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) reconnecting 2013-05-19 08:32:10 Lustre: Skipped 5 previous similar messages 2013-05-19 08:32:10 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 08:32:10 Lustre: Skipped 3 previous similar messages 2013-05-19 08:33:00 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 08:33:00 Lustre: Skipped 1 previous similar message 2013-05-19 08:33:14 Lustre: 19358:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (622:195s); client may timeout. req@ffff880b6b07e000 x1434323173103402/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 1 to 0 dl 1368977399 ref 1 fl Complete:/0/0 rc 0/0 2013-05-19 08:33:14 LNet: Service thread pid 19358 completed after 817.58s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 08:33:33 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) reconnecting 2013-05-19 08:33:33 Lustre: Skipped 4 previous similar messages 2013-05-19 08:39:23 Lustre: fsv-MDT0000: Client 9ca92f80-1431-bf08-73e5-4e4d1365a110 (at 172.20.17.16@o2ib500) reconnecting 2013-05-19 08:39:23 Lustre: fsv-MDT0000: Client 9ca92f80-1431-bf08-73e5-4e4d1365a110 (at 172.20.17.16@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 08:39:23 Lustre: Skipped 1 previous similar message 2013-05-19 08:47:20 Lustre: lock timed out (enqueued at 1368978240, 200s ago) 2013-05-19 08:47:42 Lustre: lock timed out (enqueued at 1368978240, 222s ago) 2013-05-19 08:48:06 Lustre: lock timed out (enqueued at 1368978264, 222s ago) Console [vesta-mds1] log at 2013-05-19 09:00:00 PDT. 2013-05-19 09:07:33 LNet: Service thread pid 19479 was inactive for 316.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 09:07:33 Pid: 19479, comm: mdt_rdpg00_009 2013-05-19 09:07:33 2013-05-19 09:07:33 Call Trace: 2013-05-19 09:07:33 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 09:07:33 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 09:07:33 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 09:07:33 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 09:07:33 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 09:07:33 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 09:07:33 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 09:07:33 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 09:07:33 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 09:07:33 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 09:07:33 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 09:07:33 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 09:07:33 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 09:07:33 [] ? 
lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 09:07:33 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 09:07:33 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 09:07:33 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 09:07:33 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 09:07:33 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 09:07:33 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 09:07:33 [] ? __wake_up+0x53/0x70 2013-05-19 09:07:33 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 09:07:33 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 09:07:33 [] child_rip+0xa/0x20 2013-05-19 09:07:33 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 09:07:33 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 09:07:33 [] ? child_rip+0x0/0x20 2013-05-19 09:07:33 2013-05-19 09:07:33 LustreError: dumping log to /tmp/lustre-log.1368979653.19479 2013-05-19 09:07:57 LNet: Service thread pid 19479 completed after 339.36s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 09:08:13 Lustre: fsv-MDT0000: Client ac10a1e3-445c-4d46-e60d-ecaafab054ac (at 172.20.17.48@o2ib500) reconnecting 2013-05-19 09:08:13 Lustre: Skipped 3 previous similar messages 2013-05-19 09:08:13 Lustre: fsv-MDT0000: Client ac10a1e3-445c-4d46-e60d-ecaafab054ac (at 172.20.17.48@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 09:08:13 Lustre: Skipped 2 previous similar messages 2013-05-19 09:20:52 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) reconnecting 2013-05-19 09:20:52 Lustre: Skipped 1 previous similar message 2013-05-19 09:20:52 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 09:21:17 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) reconnecting 2013-05-19 09:27:06 Lustre: lock timed out (enqueued at 1368980626, 200s ago) 2013-05-19 09:28:23 Lustre: fsv-MDT0000: Client ac2f1f2a-9d4f-f50a-ed2d-9a176885ef63 (at 172.20.17.24@o2ib500) reconnecting 2013-05-19 09:28:23 Lustre: fsv-MDT0000: Client ac2f1f2a-9d4f-f50a-ed2d-9a176885ef63 (at 172.20.17.24@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 09:28:48 Lustre: lock timed out (enqueued at 1368980728, 200s ago) 2013-05-19 09:28:48 Lustre: fsv-MDT0000: Client ac2f1f2a-9d4f-f50a-ed2d-9a176885ef63 (at 172.20.17.24@o2ib500) reconnecting 2013-05-19 09:28:48 Lustre: fsv-MDT0000: Client ac2f1f2a-9d4f-f50a-ed2d-9a176885ef63 (at 172.20.17.24@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 09:50:44 Lustre: lock timed out (enqueued at 1368982044, 200s ago) Console [vesta-mds1] log at 2013-05-19 10:00:00 PDT. 
2013-05-19 10:18:42 Lustre: fsv-MDT0000: Client 56c76c66-c0d9-60eb-497d-80fd89bb8e1b (at 172.20.17.47@o2ib500) reconnecting 2013-05-19 10:18:42 Lustre: Skipped 3 previous similar messages 2013-05-19 10:18:42 Lustre: fsv-MDT0000: Client 56c76c66-c0d9-60eb-497d-80fd89bb8e1b (at 172.20.17.47@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 10:18:42 Lustre: Skipped 1 previous similar message 2013-05-19 10:19:07 Lustre: fsv-MDT0000: Client 56c76c66-c0d9-60eb-497d-80fd89bb8e1b (at 172.20.17.47@o2ib500) reconnecting 2013-05-19 10:23:33 Lustre: lock timed out (enqueued at 1368984013, 200s ago) 2013-05-19 10:24:10 LNet: Service thread pid 19355 was inactive for 664.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 10:24:10 Pid: 19355, comm: mdt03_043 2013-05-19 10:24:10 2013-05-19 10:24:10 Call Trace: 2013-05-19 10:24:10 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 10:24:10 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 10:24:10 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 10:24:10 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 10:24:10 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 10:24:10 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 10:24:10 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 10:24:10 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 10:24:10 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 10:24:10 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 10:24:10 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 10:24:10 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 10:24:10 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 10:24:10 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 10:24:10 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 10:24:10 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 10:24:10 [] ? 
llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 10:24:10 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 10:24:10 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 10:24:10 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 10:24:10 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 10:24:10 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 10:24:10 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 10:24:10 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 10:24:10 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 10:24:10 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 10:24:10 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 10:24:10 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 10:24:10 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 10:24:10 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 10:24:10 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 10:24:10 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 10:24:10 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 10:24:10 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 10:24:10 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 10:24:10 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 10:24:10 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 10:24:10 [] ? __wake_up+0x53/0x70 2013-05-19 10:24:10 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 10:24:10 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 10:24:10 [] child_rip+0xa/0x20 2013-05-19 10:24:10 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 10:24:10 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 10:24:10 [] ? child_rip+0x0/0x20 2013-05-19 10:24:10 2013-05-19 10:24:10 LustreError: dumping log to /tmp/lustre-log.1368984250.19355 2013-05-19 10:26:11 LNet: Service thread pid 19355 completed after 785.50s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 10:46:39 LNet: Service thread pid 21574 was inactive for 504.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 10:46:39 Pid: 21574, comm: mdt03_064 2013-05-19 10:46:39 2013-05-19 10:46:39 Call Trace: 2013-05-19 10:46:39 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 10:46:39 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 10:46:39 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 10:46:39 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 10:46:39 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 10:46:39 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 10:46:39 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 10:46:39 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 10:46:39 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 10:46:39 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 10:46:39 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 10:46:39 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 10:46:39 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 10:46:39 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 10:46:39 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 10:46:39 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 10:46:39 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 10:46:39 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 10:46:39 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 10:46:39 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 10:46:39 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 10:46:39 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 10:46:39 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 10:46:39 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 10:46:39 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 10:46:39 [] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 10:46:39 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 10:46:39 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 10:46:39 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 10:46:39 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 10:46:39 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 10:46:39 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 10:46:39 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 10:46:39 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 10:46:39 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 10:46:39 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 10:46:39 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 10:46:39 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 10:46:39 [] ? __wake_up+0x53/0x70 2013-05-19 10:46:39 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 10:46:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 10:46:39 [] child_rip+0xa/0x20 2013-05-19 10:46:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 10:46:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 10:46:39 [] ? child_rip+0x0/0x20 2013-05-19 10:46:39 2013-05-19 10:46:39 LustreError: dumping log to /tmp/lustre-log.1368985599.21574 2013-05-19 10:47:35 LNet: Service thread pid 21574 completed after 559.45s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Console [vesta-mds1] log at 2013-05-19 11:00:00 PDT. 
2013-05-19 11:00:56 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) reconnecting 2013-05-19 11:00:56 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 11:01:21 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) reconnecting 2013-05-19 11:01:21 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:01:43 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) reconnecting 2013-05-19 11:01:43 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:01:43 Lustre: Skipped 1 previous similar message 2013-05-19 11:01:46 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:01:46 Lustre: Skipped 1 previous similar message 2013-05-19 11:01:57 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) reconnecting 2013-05-19 11:01:57 Lustre: Skipped 3 previous similar messages 2013-05-19 11:01:57 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:02:08 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) reconnecting 2013-05-19 11:02:08 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:02:08 Lustre: Skipped 1 previous similar message 2013-05-19 11:02:19 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) refused reconnection, still busy with 1 active 
RPCs 2013-05-19 11:02:19 Lustre: Skipped 4 previous similar messages 2013-05-19 11:02:27 LNet: Service thread pid 19290 was inactive for 300.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 11:02:27 Pid: 19290, comm: mdt00_027 2013-05-19 11:02:27 2013-05-19 11:02:27 Call Trace: 2013-05-19 11:02:27 [] schedule_timeout+0x192/0x2e0 2013-05-19 11:02:27 [] ? process_timeout+0x0/0x10 2013-05-19 11:02:27 [] cfs_waitq_timedwait+0x11/0x20 [libcfs] 2013-05-19 11:02:27 [] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] 2013-05-19 11:02:27 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 11:02:27 [] ? default_wake_function+0x0/0x20 2013-05-19 11:02:27 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 11:02:27 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 11:02:27 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 11:02:27 [] mdt_object_lock0+0x33b/0xaf0 [mdt] 2013-05-19 11:02:27 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 11:02:27 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 11:02:27 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 11:02:27 [] mdt_reint_rename+0xab7/0x1b10 [mdt] 2013-05-19 11:02:27 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 11:02:27 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 11:02:27 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 11:02:27 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 11:02:27 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 11:02:27 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 11:02:27 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 11:02:27 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 11:02:27 [] mdt_reint+0x44/0xe0 [mdt] 2013-05-19 11:02:27 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 11:02:27 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 11:02:27 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 11:02:27 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 11:02:27 [] ? 
lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 11:02:27 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 11:02:27 [] ? __wake_up+0x53/0x70 2013-05-19 11:02:27 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 11:02:27 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 11:02:27 [] child_rip+0xa/0x20 2013-05-19 11:02:27 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 11:02:27 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 11:02:27 [] ? child_rip+0x0/0x20 2013-05-19 11:02:27 2013-05-19 11:02:27 LustreError: dumping log to /tmp/lustre-log.1368986547.19290 2013-05-19 11:02:33 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) reconnecting 2013-05-19 11:02:33 Lustre: Skipped 9 previous similar messages 2013-05-19 11:02:33 LNet: Service thread pid 19290 completed after 305.87s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 11:13:28 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) reconnecting 2013-05-19 11:13:28 Lustre: Skipped 9 previous similar messages 2013-05-19 11:13:28 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 11:13:28 Lustre: Skipped 4 previous similar messages 2013-05-19 11:13:53 Lustre: fsv-MDT0000: Client 99d90341-209a-167e-5286-d8a94267e507 (at 172.20.17.7@o2ib500) reconnecting 2013-05-19 11:13:53 Lustre: Skipped 1 previous similar message 2013-05-19 11:13:53 Lustre: fsv-MDT0000: Client 78bdbeed-caca-62c1-546a-c30d23dc899c (at 172.20.17.46@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:14:18 Lustre: fsv-MDT0000: Client 78bdbeed-caca-62c1-546a-c30d23dc899c (at 172.20.17.46@o2ib500) reconnecting 2013-05-19 11:15:02 Lustre: fsv-MDT0000: Client ae1bebd4-b855-945f-584a-0d1cbd554898 (at 172.20.17.29@o2ib500) reconnecting 2013-05-19 11:15:02 Lustre: fsv-MDT0000: Client 
ae1bebd4-b855-945f-584a-0d1cbd554898 (at 172.20.17.29@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:15:27 Lustre: fsv-MDT0000: Client ae1bebd4-b855-945f-584a-0d1cbd554898 (at 172.20.17.29@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:15:41 Lustre: fsv-MDT0000: Client 262d17ec-e707-46b5-4cdb-401369638252 (at 172.20.17.32@o2ib500) reconnecting 2013-05-19 11:15:41 Lustre: Skipped 1 previous similar message 2013-05-19 11:15:51 Lustre: fsv-MDT0000: Client ae1bebd4-b855-945f-584a-0d1cbd554898 (at 172.20.17.29@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:15:52 Lustre: Skipped 1 previous similar message 2013-05-19 11:16:15 LNet: Service thread pid 19951 was inactive for 328.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 11:16:15 Pid: 19951, comm: mdt02_057 2013-05-19 11:16:15 2013-05-19 11:16:15 Call Trace: 2013-05-19 11:16:15 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 11:16:15 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 11:16:15 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 11:16:15 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 11:16:15 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 11:16:15 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 11:16:15 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 11:16:15 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 11:16:15 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 11:16:15 [] mdd_rename+0x4ae/0x2330 [mdd] 2013-05-19 11:16:15 [] ? lu_object_put+0x1ce/0x330 [obdclass] 2013-05-19 11:16:15 [] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs] 2013-05-19 11:16:15 [] mdt_reint_rename+0x13d5/0x1b10 [mdt] 2013-05-19 11:16:15 [] ? 
lu_ucred+0x20/0x30 [obdclass] 2013-05-19 11:16:15 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 11:16:15 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 11:16:15 [] mdt_reint+0x44/0xe0 [mdt] 2013-05-19 11:16:15 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 11:16:15 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 11:16:15 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 11:16:15 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 11:16:15 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 11:16:15 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 11:16:15 [] ? __wake_up+0x53/0x70 2013-05-19 11:16:15 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 11:16:15 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 11:16:15 [] child_rip+0xa/0x20 2013-05-19 11:16:15 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 11:16:15 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 11:16:15 [] ? child_rip+0x0/0x20 2013-05-19 11:16:15 2013-05-19 11:16:15 LustreError: dumping log to /tmp/lustre-log.1368987375.19951 2013-05-19 11:16:15 LNet: Service thread pid 19951 completed after 328.72s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 11:17:17 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) reconnecting 2013-05-19 11:17:17 Lustre: Skipped 4 previous similar messages 2013-05-19 11:17:17 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:17:17 Lustre: Skipped 1 previous similar message 2013-05-19 11:18:27 Lustre: fsv-MDT0000: Client a707e637-e529-165a-d3f6-589f52d895c9 (at 172.20.17.9@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:21:03 Lustre: fsv-MDT0000: Client b1cc8a79-fc54-6daf-240e-142cdd81e769 (at 172.20.17.17@o2ib500) reconnecting 2013-05-19 11:21:03 Lustre: Skipped 3 previous similar messages 2013-05-19 11:21:03 Lustre: fsv-MDT0000: Client b1cc8a79-fc54-6daf-240e-142cdd81e769 (at 172.20.17.17@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:42:12 Lustre: fsv-MDT0000: Client 8b1a6567-b9cf-8125-5b37-2921cb171367 (at 172.20.17.58@o2ib500) reconnecting 2013-05-19 11:42:12 Lustre: Skipped 3 previous similar messages 2013-05-19 11:42:12 Lustre: fsv-MDT0000: Client 8b1a6567-b9cf-8125-5b37-2921cb171367 (at 172.20.17.58@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 11:42:12 Lustre: Skipped 1 previous similar message 2013-05-19 11:42:58 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) reconnecting 2013-05-19 11:42:58 Lustre: Skipped 2 previous similar messages 2013-05-19 11:48:51 Lustre: fsv-MDT0000: Client c8eeb8ae-86e1-99c4-bcfe-97f4e92d9da6 (at 172.20.17.54@o2ib500) reconnecting 2013-05-19 11:48:51 Lustre: fsv-MDT0000: Client c8eeb8ae-86e1-99c4-bcfe-97f4e92d9da6 (at 172.20.17.54@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 11:48:51 Lustre: Skipped 1 previous similar message 2013-05-19 11:59:40 LustreError: 19368:0:(pack_generic.c:770:lustre_msg_string()) can't unpack short string in msg 
ffffc9006673a090 buffer[5] len 5: strlen 0 2013-05-19 11:59:40 LustreError: 19368:0:(layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `name' (5 of 6) in format `LDLM_INTENT_GETATTR': 5 vs. 0 (client) 2013-05-19 11:59:40 req@ffff8800ba7c7000 x1435204924745208/t0(0) o101->9f5c433f-56fd-fbb5-3d26-a182144680ca@172.20.16.11@o2ib500:0/0 lens 576/3304 e 0 to 0 dl 1368990280 ref 1 fl Interpret:/0/ffffffff rc 0/-1 Console [vesta-mds1] log at 2013-05-19 12:00:00 PDT. 2013-05-19 12:01:37 LNet: Service thread pid 21576 was inactive for 436.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 12:01:37 Pid: 21576, comm: mdt03_066 2013-05-19 12:01:37 2013-05-19 12:01:37 Call Trace: 2013-05-19 12:01:37 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 12:01:37 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 12:01:37 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 12:01:37 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 12:01:37 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 12:01:37 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 12:01:37 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 12:01:37 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 12:01:37 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 12:01:37 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 12:01:37 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 12:01:37 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 12:01:37 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 12:01:37 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 12:01:37 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 12:01:37 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 12:01:37 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 12:01:37 [] ? 
llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 12:01:37 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 12:01:37 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 12:01:37 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 12:01:37 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 12:01:37 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 12:01:37 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 12:01:37 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 12:01:37 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 12:01:37 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 12:01:37 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 12:01:37 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 12:01:37 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 12:01:37 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 12:01:37 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 12:01:37 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 12:01:37 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 12:01:37 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 12:01:37 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 12:01:37 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 12:01:37 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 12:01:37 [] ? __wake_up+0x53/0x70 2013-05-19 12:01:37 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 12:01:37 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 12:01:37 [] child_rip+0xa/0x20 2013-05-19 12:01:37 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 12:01:37 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 12:01:37 [] ? child_rip+0x0/0x20 2013-05-19 12:01:37 2013-05-19 12:01:37 LustreError: dumping log to /tmp/lustre-log.1368990097.21576 2013-05-19 12:01:50 LNet: Service thread pid 21576 completed after 449.15s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 12:23:39 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) reconnecting 2013-05-19 12:23:39 Lustre: Skipped 3 previous similar messages 2013-05-19 12:23:39 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 12:23:39 Lustre: Skipped 1 previous similar message 2013-05-19 12:24:04 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) reconnecting 2013-05-19 12:26:18 LNet: Service thread pid 19268 was inactive for 550.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 12:26:18 Pid: 19268, comm: mdt03_021 2013-05-19 12:26:18 2013-05-19 12:26:18 Call Trace: 2013-05-19 12:26:18 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 12:26:18 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 12:26:18 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 12:26:18 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 12:26:18 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 12:26:18 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 12:26:18 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 12:26:18 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 12:26:18 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 12:26:18 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 12:26:18 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 12:26:18 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 12:26:18 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 12:26:18 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 12:26:18 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 12:26:18 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 12:26:18 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 12:26:18 [] ? 
llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 12:26:18 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 12:26:18 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 12:26:18 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 12:26:18 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 12:26:18 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 12:26:18 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 12:26:18 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 12:26:18 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 12:26:18 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 12:26:18 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 12:26:18 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 12:26:18 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 12:26:18 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 12:26:18 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 12:26:18 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 12:26:18 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 12:26:18 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 12:26:18 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 12:26:18 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 12:26:18 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 12:26:19 [] ? __wake_up+0x53/0x70 2013-05-19 12:26:19 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 12:26:19 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 12:26:19 [] child_rip+0xa/0x20 2013-05-19 12:26:19 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 12:26:19 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 12:26:19 [] ? child_rip+0x0/0x20 2013-05-19 12:26:19 2013-05-19 12:26:19 LustreError: dumping log to /tmp/lustre-log.1368991578.19268 2013-05-19 12:26:49 LNet: Service thread pid 19268 completed after 580.40s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 12:50:21 Lustre: fsv-MDT0000: Client c3d12508-b90d-7f1b-28c3-268d68ad3fc9 (at 172.20.17.82@o2ib500) reconnecting 2013-05-19 12:50:21 Lustre: fsv-MDT0000: Client c3d12508-b90d-7f1b-28c3-268d68ad3fc9 (at 172.20.17.82@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 12:50:46 Lustre: fsv-MDT0000: Client c3d12508-b90d-7f1b-28c3-268d68ad3fc9 (at 172.20.17.82@o2ib500) reconnecting 2013-05-19 12:50:46 Lustre: fsv-MDT0000: Client c3d12508-b90d-7f1b-28c3-268d68ad3fc9 (at 172.20.17.82@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 12:51:11 Lustre: fsv-MDT0000: Client c3d12508-b90d-7f1b-28c3-268d68ad3fc9 (at 172.20.17.82@o2ib500) reconnecting 2013-05-19 12:51:11 Lustre: fsv-MDT0000: Client c3d12508-b90d-7f1b-28c3-268d68ad3fc9 (at 172.20.17.82@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 12:51:36 Lustre: fsv-MDT0000: Client c3d12508-b90d-7f1b-28c3-268d68ad3fc9 (at 172.20.17.82@o2ib500) reconnecting 2013-05-19 12:51:36 Lustre: fsv-MDT0000: Client c3d12508-b90d-7f1b-28c3-268d68ad3fc9 (at 172.20.17.82@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 12:51:38 LNet: Service thread pid 19485 was inactive for 204.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 12:51:38 Pid: 19485, comm: mdt_rdpg01_010 2013-05-19 12:51:38 2013-05-19 12:51:38 Call Trace: 2013-05-19 12:51:38 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 12:51:38 [] ? wake_up_process+0x15/0x20 2013-05-19 12:51:38 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 12:51:38 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 12:51:38 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 12:51:38 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 12:51:38 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 12:51:38 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 12:51:38 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 12:51:38 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 12:51:38 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 12:51:38 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 12:51:38 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 12:51:38 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 12:51:38 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 12:51:38 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 12:51:38 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 12:51:38 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 12:51:38 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 12:51:38 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 12:51:38 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 12:51:38 [] ? __wake_up+0x53/0x70 2013-05-19 12:51:38 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 12:51:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 12:51:38 [] child_rip+0xa/0x20 2013-05-19 12:51:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 12:51:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 12:51:38 [] ? child_rip+0x0/0x20 2013-05-19 12:51:38 2013-05-19 12:51:38 LustreError: dumping log to /tmp/lustre-log.1368993098.19485 2013-05-19 12:51:47 LNet: Service thread pid 19485 completed after 213.44s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Console [vesta-mds1] log at 2013-05-19 13:00:00 PDT. 
2013-05-19 13:04:29 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) reconnecting 2013-05-19 13:04:29 Lustre: Skipped 3 previous similar messages 2013-05-19 13:04:29 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:04:29 Lustre: Skipped 1 previous similar message 2013-05-19 13:04:54 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) reconnecting 2013-05-19 13:04:54 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:05:19 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) reconnecting 2013-05-19 13:05:19 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:05:38 LNet: Service thread pid 19487 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 13:05:38 Pid: 19487, comm: mdt_rdpg00_011 2013-05-19 13:05:38 2013-05-19 13:05:38 Call Trace: 2013-05-19 13:05:38 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 13:05:38 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 13:05:38 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 13:05:38 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 13:05:38 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 13:05:38 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 13:05:38 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 13:05:38 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 13:05:38 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 13:05:38 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 13:05:38 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 13:05:38 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 13:05:38 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 13:05:38 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 13:05:38 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 13:05:38 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 13:05:38 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 13:05:38 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 13:05:38 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 13:05:38 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 13:05:38 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 13:05:38 [] ? __wake_up+0x53/0x70 2013-05-19 13:05:38 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 13:05:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:05:38 [] child_rip+0xa/0x20 2013-05-19 13:05:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:05:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:05:38 [] ? child_rip+0x0/0x20 2013-05-19 13:05:38 2013-05-19 13:05:38 LustreError: dumping log to /tmp/lustre-log.1368993938.19487 2013-05-19 13:05:44 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) reconnecting 2013-05-19 13:05:44 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:06:09 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:06:13 LNet: Service thread pid 19487 completed after 234.58s. 
This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 13:06:34 Lustre: fsv-MDT0000: Client c4e5ea7a-59f0-8da5-f023-8e01636bc609 (at 172.20.17.38@o2ib500) reconnecting 2013-05-19 13:06:34 Lustre: Skipped 1 previous similar message 2013-05-19 13:07:51 Lustre: fsv-MDT0000: Client 1e6136b6-7de0-aa1f-25e2-0002dc076bd7 (at 172.20.17.21@o2ib500) reconnecting 2013-05-19 13:07:51 Lustre: fsv-MDT0000: Client 1e6136b6-7de0-aa1f-25e2-0002dc076bd7 (at 172.20.17.21@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:08:41 Lustre: fsv-MDT0000: Client 1e6136b6-7de0-aa1f-25e2-0002dc076bd7 (at 172.20.17.21@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:08:41 Lustre: Skipped 1 previous similar message 2013-05-19 13:10:49 LNet: Service thread pid 19251 was inactive for 822.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 13:10:49 Pid: 19251, comm: mdt03_017 2013-05-19 13:10:49 2013-05-19 13:10:49 Call Trace: 2013-05-19 13:10:49 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 13:10:49 [] ? wake_up_process+0x15/0x20 2013-05-19 13:10:49 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 13:10:49 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 13:10:49 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 13:10:49 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 13:10:49 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 13:10:49 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 13:10:49 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 13:10:49 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 13:10:49 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 13:10:49 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 13:10:49 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 13:10:49 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 13:10:49 [] ? 
llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 13:10:49 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 13:10:49 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 13:10:49 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 13:10:49 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 13:10:49 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 13:10:49 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 13:10:49 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 13:10:49 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 13:10:49 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 13:10:49 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 13:10:49 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 13:10:49 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 13:10:49 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 13:10:49 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 13:10:49 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 13:10:49 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 13:10:49 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 13:10:49 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 13:10:49 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 13:10:49 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 13:10:49 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 13:10:49 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 13:10:49 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 13:10:49 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 13:10:49 [] ? __wake_up+0x53/0x70 2013-05-19 13:10:49 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 13:10:49 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:10:49 [] child_rip+0xa/0x20 2013-05-19 13:10:49 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:10:49 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:10:49 [] ? 
child_rip+0x0/0x20 2013-05-19 13:10:49 2013-05-19 13:10:49 LustreError: dumping log to /tmp/lustre-log.1368994249.19251 2013-05-19 13:13:47 LNet: Service thread pid 19251 completed after 999.88s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 13:24:56 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) reconnecting 2013-05-19 13:24:56 Lustre: Skipped 3 previous similar messages 2013-05-19 13:24:56 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:25:21 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) reconnecting 2013-05-19 13:25:21 Lustre: fsv-MDT0000: Client 7d18f2b3-771d-2e5b-5c24-9f20045e76c0 (at 172.20.17.8@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:25:41 Lustre: fsv-MDT0000: Client ae1bebd4-b855-945f-584a-0d1cbd554898 (at 172.20.17.29@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:26:06 Lustre: fsv-MDT0000: Client ae1bebd4-b855-945f-584a-0d1cbd554898 (at 172.20.17.29@o2ib500) reconnecting 2013-05-19 13:26:06 Lustre: Skipped 2 previous similar messages 2013-05-19 13:31:20 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) reconnecting 2013-05-19 13:31:20 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 13:39:02 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) reconnecting 2013-05-19 13:39:02 Lustre: Skipped 2 previous similar messages 2013-05-19 13:39:02 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:39:02 Lustre: Skipped 1 previous similar message 
2013-05-19 13:44:21 Lustre: fsv-MDT0000: Client ae31f68b-6366-515a-fcc1-119eea2a147e (at 172.20.17.23@o2ib500) reconnecting 2013-05-19 13:44:21 Lustre: Skipped 1 previous similar message 2013-05-19 13:44:21 Lustre: fsv-MDT0000: Client ae31f68b-6366-515a-fcc1-119eea2a147e (at 172.20.17.23@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 13:44:57 LNet: Service thread pid 19387 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 13:44:57 Pid: 19387, comm: mdt_rdpg03_009 2013-05-19 13:44:57 2013-05-19 13:44:57 Call Trace: 2013-05-19 13:44:57 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 13:44:57 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 13:44:57 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 13:44:57 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 13:44:57 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 13:44:57 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 13:44:57 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 13:44:57 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 13:44:57 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 13:44:57 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 13:44:57 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 13:44:57 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 13:44:57 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 13:44:57 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 13:44:57 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 13:44:57 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 13:44:57 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 13:44:57 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 13:44:57 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 13:44:57 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 13:44:57 [] ? __wake_up+0x53/0x70 2013-05-19 13:44:57 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 13:44:57 [] ? 
ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:44:57 [] child_rip+0xa/0x20 2013-05-19 13:44:57 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:44:57 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:44:57 [] ? child_rip+0x0/0x20 2013-05-19 13:44:57 2013-05-19 13:44:57 LustreError: dumping log to /tmp/lustre-log.1368996297.19387 2013-05-19 13:46:23 LNet: Service thread pid 19387 completed after 286.59s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 13:49:58 LNet: Service thread pid 19361 was inactive for 646.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 13:49:58 Pid: 19361, comm: mdt03_045 2013-05-19 13:49:58 2013-05-19 13:49:58 Call Trace: 2013-05-19 13:49:58 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 13:49:58 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 13:49:58 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 13:49:58 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 13:49:58 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 13:49:58 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 13:49:58 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 13:49:58 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 13:49:58 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 13:49:58 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 13:49:58 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 13:49:58 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 13:49:58 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 13:49:58 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 13:49:58 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 13:49:58 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 13:49:58 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 13:49:58 [] ? 
llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 13:49:58 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 13:49:58 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 13:49:58 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 13:49:58 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 13:49:58 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 13:49:58 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 13:49:58 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 13:49:58 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 13:49:58 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 13:49:58 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 13:49:58 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 13:49:58 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 13:49:58 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 13:49:58 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 13:49:58 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 13:49:58 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 13:49:58 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 13:49:58 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 13:49:58 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 13:49:58 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 13:49:58 [] ? __wake_up+0x53/0x70 2013-05-19 13:49:58 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 13:49:58 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:49:58 [] child_rip+0xa/0x20 2013-05-19 13:49:58 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:49:58 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 13:49:58 [] ? child_rip+0x0/0x20 2013-05-19 13:49:58 2013-05-19 13:49:58 LustreError: dumping log to /tmp/lustre-log.1368996598.19361 2013-05-19 13:50:37 LNet: Service thread pid 19361 completed after 684.44s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Console [vesta-mds1] log at 2013-05-19 14:00:00 PDT. 
2013-05-19 14:02:05 Lustre: fsv-MDT0000: Client f6bdb649-0807-346d-2047-6641007ec46d (at 172.20.17.11@o2ib500) reconnecting 2013-05-19 14:02:05 Lustre: Skipped 12 previous similar messages 2013-05-19 14:02:05 Lustre: fsv-MDT0000: Client f6bdb649-0807-346d-2047-6641007ec46d (at 172.20.17.11@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 14:02:05 Lustre: Skipped 9 previous similar messages 2013-05-19 14:02:59 Lustre: fsv-MDT0000: Client b1cc8a79-fc54-6daf-240e-142cdd81e769 (at 172.20.17.17@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 14:02:59 Lustre: Skipped 1 previous similar message 2013-05-19 14:19:32 Lustre: fsv-MDT0000: Client f6b629bd-07e6-84bd-fdc1-6f97b3bfb6a3 (at 172.20.17.55@o2ib500) reconnecting 2013-05-19 14:19:32 Lustre: Skipped 12 previous similar messages 2013-05-19 14:19:32 Lustre: fsv-MDT0000: Client f6b629bd-07e6-84bd-fdc1-6f97b3bfb6a3 (at 172.20.17.55@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 14:19:32 Lustre: Skipped 5 previous similar messages 2013-05-19 14:19:37 LustreError: 19365:0:(pack_generic.c:770:lustre_msg_string()) can't unpack short string in msg ffffc900b7d68368 buffer[5] len 36: strlen 3 2013-05-19 14:19:37 LustreError: 19365:0:(layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `name' (5 of 6) in format `LDLM_INTENT_GETATTR': 36 vs. 
0 (client) 2013-05-19 14:19:37 req@ffff88006cd43c00 x1435204925018956/t0(0) o101->9f5c433f-56fd-fbb5-3d26-a182144680ca@172.20.16.11@o2ib500:0/0 lens 608/3304 e 0 to 0 dl 1368998609 ref 1 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 14:32:22 Lustre: fsv-MDT0000: Client 94c13b4f-cd98-a67c-5c82-e2d28e81cf1e (at 172.20.17.10@o2ib500) reconnecting 2013-05-19 14:32:22 Lustre: Skipped 1 previous similar message 2013-05-19 14:32:22 Lustre: fsv-MDT0000: Client 94c13b4f-cd98-a67c-5c82-e2d28e81cf1e (at 172.20.17.10@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 14:32:47 Lustre: fsv-MDT0000: Client 94c13b4f-cd98-a67c-5c82-e2d28e81cf1e (at 172.20.17.10@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 14:33:04 LNet: Service thread pid 18394 was inactive for 244.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 14:33:04 Pid: 18394, comm: mdt_rdpg03_000 2013-05-19 14:33:04 2013-05-19 14:33:04 Call Trace: 2013-05-19 14:33:04 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 14:33:04 [] ? wake_up_process+0x15/0x20 2013-05-19 14:33:04 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 14:33:04 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 14:33:04 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 14:33:04 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 14:33:04 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 14:33:04 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 14:33:04 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 14:33:04 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 14:33:04 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 14:33:04 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 14:33:04 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 14:33:04 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 14:33:04 [] ? 
lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 14:33:04 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 14:33:04 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 14:33:04 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 14:33:04 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 14:33:04 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 14:33:04 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 14:33:04 [] ? __wake_up+0x53/0x70 2013-05-19 14:33:04 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 14:33:04 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 14:33:04 [] child_rip+0xa/0x20 2013-05-19 14:33:04 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 14:33:04 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 14:33:04 [] ? child_rip+0x0/0x20 2013-05-19 14:33:04 2013-05-19 14:33:04 LustreError: dumping log to /tmp/lustre-log.1368999184.18394 2013-05-19 14:33:12 Lustre: fsv-MDT0000: Client 94c13b4f-cd98-a67c-5c82-e2d28e81cf1e (at 172.20.17.10@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 14:33:31 LNet: Service thread pid 18394 completed after 271.11s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Console [vesta-mds1] log at 2013-05-19 15:00:00 PDT. 
2013-05-19 15:08:39 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) reconnecting 2013-05-19 15:08:39 Lustre: Skipped 3 previous similar messages 2013-05-19 15:08:39 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:26:32 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) reconnecting 2013-05-19 15:26:32 Lustre: Skipped 1 previous similar message 2013-05-19 15:26:32 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:26:43 Lustre: fsv-MDT0000: Client f6bdb649-0807-346d-2047-6641007ec46d (at 172.20.17.11@o2ib500) reconnecting 2013-05-19 15:26:43 Lustre: fsv-MDT0000: Client f6bdb649-0807-346d-2047-6641007ec46d (at 172.20.17.11@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:26:57 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:27:08 Lustre: fsv-MDT0000: Client f6bdb649-0807-346d-2047-6641007ec46d (at 172.20.17.11@o2ib500) reconnecting 2013-05-19 15:27:08 Lustre: Skipped 1 previous similar message 2013-05-19 15:27:08 Lustre: fsv-MDT0000: Client f6bdb649-0807-346d-2047-6641007ec46d (at 172.20.17.11@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:27:19 LNet: Service thread pid 19002 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 15:27:19 Pid: 19002, comm: mdt_rdpg01_003 2013-05-19 15:27:19 2013-05-19 15:27:19 Call Trace: 2013-05-19 15:27:19 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 15:27:19 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 15:27:19 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 15:27:19 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 15:27:19 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 15:27:19 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 15:27:19 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 15:27:19 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 15:27:19 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 15:27:19 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 15:27:19 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 15:27:19 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 15:27:19 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 15:27:19 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 15:27:19 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 15:27:19 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 15:27:19 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 15:27:19 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 15:27:19 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 15:27:19 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 15:27:19 [] ? __wake_up+0x53/0x70 2013-05-19 15:27:19 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 15:27:19 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 15:27:19 [] child_rip+0xa/0x20 2013-05-19 15:27:19 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 15:27:19 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 15:27:19 [] ? child_rip+0x0/0x20 2013-05-19 15:27:19 2013-05-19 15:27:19 LustreError: dumping log to /tmp/lustre-log.1369002439.19002 2013-05-19 15:27:22 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:27:33 Lustre: fsv-MDT0000: Client f6bdb649-0807-346d-2047-6641007ec46d (at 172.20.17.11@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:27:44 LNet: Service thread pid 19245 was inactive for 294.00s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: 2013-05-19 15:27:44 Pid: 19245, comm: mdt01_010 2013-05-19 15:27:44 2013-05-19 15:27:44 Call Trace: 2013-05-19 15:27:44 [] ? account_entity_enqueue+0x7e/0x90 2013-05-19 15:27:44 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 15:27:44 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 15:27:44 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 15:27:44 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 15:27:44 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 15:27:44 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 15:27:44 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 15:27:44 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 15:27:44 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 15:27:44 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 15:27:44 [] mdd_create+0x929/0x1770 [mdd] 2013-05-19 15:27:44 [] ? lod_index_lookup+0x0/0x30 [lod] 2013-05-19 15:27:44 [] mdt_reint_open+0x1422/0x2120 [mdt] 2013-05-19 15:27:44 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 15:27:44 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 15:27:44 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 15:27:44 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 15:27:44 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 15:27:44 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 15:27:44 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 15:27:44 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 15:27:44 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 15:27:44 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 15:27:44 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 15:27:44 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 15:27:44 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 15:27:44 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 15:27:44 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 15:27:44 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 15:27:44 [] ? 
__wake_up+0x53/0x70 2013-05-19 15:27:44 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 15:27:44 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 15:27:44 [] child_rip+0xa/0x20 2013-05-19 15:27:44 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 15:27:44 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 15:27:44 [] ? child_rip+0x0/0x20 2013-05-19 15:27:44 2013-05-19 15:27:44 LustreError: dumping log to /tmp/lustre-log.1369002464.19245 2013-05-19 15:27:47 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) reconnecting 2013-05-19 15:27:47 Lustre: Skipped 2 previous similar messages 2013-05-19 15:27:47 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:27:53 LNet: Service thread pid 19245 completed after 303.15s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 15:28:12 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:28:12 Lustre: Skipped 1 previous similar message 2013-05-19 15:29:02 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:29:02 Lustre: Skipped 1 previous similar message 2013-05-19 15:29:27 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) reconnecting 2013-05-19 15:29:27 Lustre: Skipped 6 previous similar messages 2013-05-19 15:29:30 LNet: Service thread pid 19002 completed after 330.70s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 15:48:05 Lustre: fsv-MDT0000: Client 28201cb4-af31-2cd5-cf08-96faa0861631 (at 172.20.17.2@o2ib500) reconnecting 2013-05-19 15:48:05 Lustre: Skipped 1 previous similar message 2013-05-19 15:48:05 Lustre: fsv-MDT0000: Client 28201cb4-af31-2cd5-cf08-96faa0861631 (at 172.20.17.2@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 15:48:05 Lustre: Skipped 1 previous similar message 2013-05-19 15:48:30 Lustre: fsv-MDT0000: Client 28201cb4-af31-2cd5-cf08-96faa0861631 (at 172.20.17.2@o2ib500) reconnecting 2013-05-19 15:48:30 Lustre: fsv-MDT0000: Client 28201cb4-af31-2cd5-cf08-96faa0861631 (at 172.20.17.2@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 15:48:49 LNet: Service thread pid 18395 was inactive for 200.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 15:48:49 Pid: 18395, comm: mdt_rdpg03_001 2013-05-19 15:48:49 2013-05-19 15:48:49 Call Trace: 2013-05-19 15:48:49 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 15:48:49 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 15:48:49 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 15:48:49 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 15:48:49 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 15:48:49 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 15:48:49 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 15:48:49 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 15:48:49 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 15:48:49 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 15:48:49 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 15:48:49 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 15:48:49 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 15:48:49 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 15:48:49 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 15:48:49 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 15:48:49 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 15:48:49 [] ? 
cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 15:48:49 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 15:48:49 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 15:48:49 [] ? __wake_up+0x53/0x70 2013-05-19 15:48:49 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 15:48:49 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 15:48:49 [] child_rip+0xa/0x20 2013-05-19 15:48:49 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 15:48:49 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 15:48:49 [] ? child_rip+0x0/0x20 2013-05-19 15:48:49 2013-05-19 15:48:49 LustreError: dumping log to /tmp/lustre-log.1369003729.18395 2013-05-19 15:48:55 Lustre: fsv-MDT0000: Client 28201cb4-af31-2cd5-cf08-96faa0861631 (at 172.20.17.2@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 15:48:56 LNet: Service thread pid 18395 completed after 206.39s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 15:49:20 Lustre: fsv-MDT0000: Client 28201cb4-af31-2cd5-cf08-96faa0861631 (at 172.20.17.2@o2ib500) reconnecting 2013-05-19 15:49:20 Lustre: Skipped 1 previous similar message 2013-05-19 15:57:32 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) reconnecting 2013-05-19 15:57:32 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:58:54 Lustre: fsv-MDT0000: Client ac2f1f2a-9d4f-f50a-ed2d-9a176885ef63 (at 172.20.17.24@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 15:58:54 Lustre: Skipped 3 previous similar messages Console [vesta-mds1] log at 2013-05-19 16:00:00 PDT. 
2013-05-19 16:00:30 Lustre: fsv-MDT0000: Client f17ad21c-daf2-8afb-41b9-23bac9d2d9eb (at 172.20.17.30@o2ib500) reconnecting 2013-05-19 16:00:30 Lustre: Skipped 7 previous similar messages 2013-05-19 16:27:24 Lustre: fsv-MDT0000: Client d083cd94-cb92-3b4c-aff1-7f7522c7c5e9 (at 172.20.17.4@o2ib500) reconnecting 2013-05-19 16:27:24 Lustre: Skipped 1 previous similar message 2013-05-19 16:27:24 Lustre: fsv-MDT0000: Client d083cd94-cb92-3b4c-aff1-7f7522c7c5e9 (at 172.20.17.4@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 16:27:24 Lustre: Skipped 2 previous similar messages 2013-05-19 16:29:36 Lustre: fsv-MDT0000: Client 000b40cf-3555-a2d4-e0f9-7163e31e2890 (at 172.20.17.51@o2ib500) reconnecting 2013-05-19 16:29:36 Lustre: fsv-MDT0000: Client bab05b1f-f18a-5ac6-fd7a-5ec9f8672c53 (at 172.20.17.33@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 16:29:36 Lustre: Skipped 2 previous similar messages 2013-05-19 16:29:49 Lustre: lock timed out (enqueued at 1369005989, 200s ago) 2013-05-19 16:29:49 Lustre: Skipped 3 previous similar messages 2013-05-19 16:29:49 Lustre: lock timed out (enqueued at 1369005989, 200s ago) 2013-05-19 16:29:49 Lustre: Skipped 3 previous similar messages 2013-05-19 16:30:09 LNet: Service thread pid 19414 was inactive for 220.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 16:30:09 Pid: 19414, comm: mdt00_053 2013-05-19 16:30:09 2013-05-19 16:30:09 Call Trace: 2013-05-19 16:30:09 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:30:09 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 16:30:09 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 16:30:09 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:30:09 [] ? default_wake_function+0x0/0x20 2013-05-19 16:30:09 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:30:09 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:30:09 [] ? 
mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:30:09 [] mdt_object_lock0+0x33b/0xaf0 [mdt] 2013-05-19 16:30:09 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:30:09 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:30:09 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:30:09 [] mdt_getattr_name_lock+0xe09/0x1960 [mdt] 2013-05-19 16:30:09 [] ? lustre_msg_buf+0x55/0x60 [ptlrpc] 2013-05-19 16:30:09 [] ? __req_capsule_get+0x166/0x700 [ptlrpc] 2013-05-19 16:30:09 [] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] 2013-05-19 16:30:09 [] mdt_intent_getattr+0x29d/0x490 [mdt] 2013-05-19 16:30:09 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:30:09 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:30:09 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:30:09 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:30:09 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:30:09 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:30:09 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:30:09 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:30:09 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:30:09 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:30:09 [] ? default_wake_function+0x0/0x20 2013-05-19 16:30:09 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] child_rip+0xa/0x20 2013-05-19 16:30:09 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] ? child_rip+0x0/0x20 2013-05-19 16:30:09 2013-05-19 16:30:09 LustreError: dumping log to /tmp/lustre-log.1369006209.19414 2013-05-19 16:30:09 LNet: Service thread pid 19265 was inactive for 220.23s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 16:30:09 Pid: 19265, comm: mdt00_019 2013-05-19 16:30:09 2013-05-19 16:30:09 Call Trace: 2013-05-19 16:30:09 [] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:30:09 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 16:30:09 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 16:30:09 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:30:09 [] ? default_wake_function+0x0/0x20 2013-05-19 16:30:09 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:30:09 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:30:09 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:30:09 [] mdt_object_lock0+0x28c/0xaf0 [mdt] 2013-05-19 16:30:09 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:30:09 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:30:09 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:30:09 [] mdt_object_find_lock+0x61/0x170 [mdt] 2013-05-19 16:30:09 [] mdt_reint_open+0x8dc/0x2120 [mdt] 2013-05-19 16:30:09 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 16:30:09 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 16:30:09 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:30:09 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 16:30:09 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 16:30:09 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:30:09 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 16:30:09 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 16:30:09 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 16:30:09 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:30:09 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:30:09 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:30:09 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:30:09 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:30:09 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:30:09 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:30:09 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:30:09 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:30:09 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:30:09 [] ? 
__wake_up+0x53/0x70 2013-05-19 16:30:09 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] child_rip+0xa/0x20 2013-05-19 16:30:09 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] ? child_rip+0x0/0x20 2013-05-19 16:30:09 2013-05-19 16:30:09 Pid: 19308, comm: mdt00_032 2013-05-19 16:30:09 2013-05-19 16:30:09 Call Trace: 2013-05-19 16:30:09 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 16:30:09 [] ? wake_up_process+0x15/0x20 2013-05-19 16:30:09 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 16:30:09 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 16:30:09 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 16:30:09 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 16:30:09 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 16:30:09 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 16:30:09 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 16:30:09 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 16:30:09 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 16:30:09 [] mdd_create+0x929/0x1770 [mdd] 2013-05-19 16:30:09 [] ? lod_index_lookup+0x0/0x30 [lod] 2013-05-19 16:30:09 [] mdt_reint_open+0x1422/0x2120 [mdt] 2013-05-19 16:30:09 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 16:30:09 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 16:30:09 [] ? 
lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:30:09 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 16:30:09 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 16:30:09 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 16:30:09 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:30:09 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:30:09 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:30:09 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:30:09 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:30:09 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:30:09 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:30:09 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:30:09 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:30:09 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:30:09 [] ? __wake_up+0x53/0x70 2013-05-19 16:30:09 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] child_rip+0xa/0x20 2013-05-19 16:30:09 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:30:09 [] ? child_rip+0x0/0x20 2013-05-19 16:30:09 2013-05-19 16:30:26 Lustre: fsv-MDT0000: Client 000b40cf-3555-a2d4-e0f9-7163e31e2890 (at 172.20.17.51@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 16:30:26 Lustre: Skipped 4 previous similar messages 2013-05-19 16:30:51 LNet: Service thread pid 19308 completed after 262.62s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 16:30:51 LNet: Skipped 1 previous similar message 2013-05-19 16:31:16 Lustre: fsv-MDT0000: Client bab05b1f-f18a-5ac6-fd7a-5ec9f8672c53 (at 172.20.17.33@o2ib500) reconnecting 2013-05-19 16:31:16 Lustre: Skipped 7 previous similar messages 2013-05-19 16:31:40 Lustre: fsv-MDT0000: Client bab05b1f-f18a-5ac6-fd7a-5ec9f8672c53 (at 172.20.17.33@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 16:31:40 Lustre: Skipped 3 previous similar messages 2013-05-19 16:33:28 Lustre: lock timed out (enqueued at 1369006094, 314s ago) 2013-05-19 16:33:28 Lustre: Skipped 4 previous similar messages 2013-05-19 16:33:49 Lustre: fsv-MDT0000: Client 1ab9e6c9-242d-ebad-23e5-59a72978ad44 (at 172.20.17.35@o2ib500) reconnecting 2013-05-19 16:33:49 Lustre: Skipped 13 previous similar messages 2013-05-19 16:33:49 Lustre: fsv-MDT0000: Client 1ab9e6c9-242d-ebad-23e5-59a72978ad44 (at 172.20.17.35@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 16:33:49 Lustre: Skipped 12 previous similar messages 2013-05-19 16:34:53 LNet: Service thread pid 19336 was inactive for 504.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 16:34:53 LNet: Skipped 1 previous similar message 2013-05-19 16:34:53 Pid: 19336, comm: mdt01_036 2013-05-19 16:34:53 2013-05-19 16:34:53 Call Trace: 2013-05-19 16:34:53 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:34:53 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 16:34:53 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 16:34:53 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:34:53 [] ? default_wake_function+0x0/0x20 2013-05-19 16:34:53 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:34:53 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:34:53 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:34:53 [] mdt_object_lock0+0x28c/0xaf0 [mdt] 2013-05-19 16:34:53 [] ? 
mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:34:53 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:34:53 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:34:53 [] mdt_object_find_lock+0x61/0x170 [mdt] 2013-05-19 16:34:53 [] mdt_reint_open+0x8dc/0x2120 [mdt] 2013-05-19 16:34:53 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 16:34:53 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 16:34:53 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:34:53 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 16:34:53 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 16:34:53 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:34:53 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 16:34:53 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 16:34:53 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 16:34:53 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:34:53 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:34:53 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:34:53 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:34:53 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:34:53 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:34:53 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:34:53 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:34:53 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:34:53 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:34:53 [] ? __wake_up+0x53/0x70 2013-05-19 16:34:53 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:34:53 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:34:53 [] child_rip+0xa/0x20 2013-05-19 16:34:53 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:34:53 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:34:53 [] ? child_rip+0x0/0x20 2013-05-19 16:34:53 2013-05-19 16:34:53 LustreError: dumping log to /tmp/lustre-log.1369006493.19336 2013-05-19 16:35:37 LNet: Service thread pid 19296 was inactive for 548.00s. 
The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 16:35:37 Pid: 19296, comm: mdt03_027 2013-05-19 16:35:37 2013-05-19 16:35:37 Call Trace: 2013-05-19 16:35:37 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:35:37 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 16:35:37 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 16:35:37 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:35:37 [] ? default_wake_function+0x0/0x20 2013-05-19 16:35:37 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:35:37 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:37 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:37 [] mdt_object_lock0+0x28c/0xaf0 [mdt] 2013-05-19 16:35:37 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:37 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:37 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:35:37 [] mdt_object_find_lock+0x61/0x170 [mdt] 2013-05-19 16:35:37 [] mdt_reint_open+0x8dc/0x2120 [mdt] 2013-05-19 16:35:37 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 16:35:37 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 16:35:37 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:35:37 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 16:35:37 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 16:35:37 [] ? 
lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:35:37 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 16:35:37 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 16:35:37 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 16:35:37 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:35:37 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:35:37 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:35:37 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:35:37 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:35:37 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:35:37 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:35:37 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:35:37 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:35:37 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:35:37 [] ? __wake_up+0x53/0x70 2013-05-19 16:35:37 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:35:37 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:37 [] child_rip+0xa/0x20 2013-05-19 16:35:37 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:37 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:37 [] ? child_rip+0x0/0x20 2013-05-19 16:35:37 2013-05-19 16:35:37 LustreError: dumping log to /tmp/lustre-log.1369006537.19296 2013-05-19 16:35:38 Pid: 19939, comm: mdt02_051 2013-05-19 16:35:38 2013-05-19 16:35:38 Call Trace: 2013-05-19 16:35:38 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:35:38 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 16:35:38 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 16:35:38 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:35:38 [] ? default_wake_function+0x0/0x20 2013-05-19 16:35:38 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:35:38 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:38 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:38 [] mdt_object_lock0+0x28c/0xaf0 [mdt] 2013-05-19 16:35:38 [] ? 
mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:38 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:38 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:35:38 [] mdt_object_find_lock+0x61/0x170 [mdt] 2013-05-19 16:35:38 [] mdt_reint_open+0x8dc/0x2120 [mdt] 2013-05-19 16:35:38 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 16:35:38 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 16:35:38 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:35:38 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 16:35:38 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 16:35:38 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:35:38 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 16:35:38 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 16:35:38 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 16:35:38 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:35:38 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:35:38 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:35:38 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:35:38 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:35:38 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:35:38 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:35:38 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:35:38 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:35:38 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:35:38 [] ? __wake_up+0x53/0x70 2013-05-19 16:35:38 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:35:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:38 [] child_rip+0xa/0x20 2013-05-19 16:35:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:38 [] ? child_rip+0x0/0x20 2013-05-19 16:35:38 2013-05-19 16:35:38 LustreError: dumping log to /tmp/lustre-log.1369006538.19939 2013-05-19 16:35:38 Pid: 19220, comm: mdt02_012 2013-05-19 16:35:38 2013-05-19 16:35:38 Call Trace: 2013-05-19 16:35:38 [] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:35:38 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 16:35:38 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 16:35:38 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:35:38 [] ? default_wake_function+0x0/0x20 2013-05-19 16:35:38 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:35:38 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:38 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:38 [] mdt_object_lock0+0x28c/0xaf0 [mdt] 2013-05-19 16:35:38 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:38 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:38 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:35:38 [] mdt_object_find_lock+0x61/0x170 [mdt] 2013-05-19 16:35:38 [] mdt_reint_open+0x8dc/0x2120 [mdt] 2013-05-19 16:35:38 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 16:35:38 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 16:35:38 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:35:38 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 16:35:38 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 16:35:38 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:35:38 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 16:35:38 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 16:35:38 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 16:35:38 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:35:38 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:35:39 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:35:39 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:35:39 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:35:39 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:35:39 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:35:39 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:35:39 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:35:39 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:35:39 [] ? 
__wake_up+0x53/0x70 2013-05-19 16:35:39 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] child_rip+0xa/0x20 2013-05-19 16:35:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] ? child_rip+0x0/0x20 2013-05-19 16:35:39 2013-05-19 16:35:39 Pid: 19947, comm: mdt02_055 2013-05-19 16:35:39 2013-05-19 16:35:39 Call Trace: 2013-05-19 16:35:39 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:35:39 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 16:35:39 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 16:35:39 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:35:39 [] ? default_wake_function+0x0/0x20 2013-05-19 16:35:39 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:35:39 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:39 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:39 [] mdt_object_lock0+0x33b/0xaf0 [mdt] 2013-05-19 16:35:39 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:39 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:39 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:35:39 [] mdt_getattr_name_lock+0xe09/0x1960 [mdt] 2013-05-19 16:35:39 [] ? lustre_msg_buf+0x55/0x60 [ptlrpc] 2013-05-19 16:35:39 [] ? __req_capsule_get+0x166/0x700 [ptlrpc] 2013-05-19 16:35:39 [] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] 2013-05-19 16:35:39 [] mdt_intent_getattr+0x29d/0x490 [mdt] 2013-05-19 16:35:39 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:35:39 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:35:39 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:35:39 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:35:39 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:35:39 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:35:39 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:35:39 [] ? 
cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:35:39 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:35:39 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:35:39 [] ? __wake_up+0x53/0x70 2013-05-19 16:35:39 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] child_rip+0xa/0x20 2013-05-19 16:35:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] ? child_rip+0x0/0x20 2013-05-19 16:35:39 2013-05-19 16:35:39 LNet: Service thread pid 19960 was inactive for 444.68s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 16:35:39 LNet: Skipped 3 previous similar messages 2013-05-19 16:35:39 Pid: 19960, comm: mdt02_066 2013-05-19 16:35:39 2013-05-19 16:35:39 Call Trace: 2013-05-19 16:35:39 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:35:39 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 16:35:39 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 16:35:39 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:35:39 [] ? default_wake_function+0x0/0x20 2013-05-19 16:35:39 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:35:39 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:39 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:39 [] mdt_object_lock0+0x33b/0xaf0 [mdt] 2013-05-19 16:35:39 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:35:39 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:35:39 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:35:39 [] mdt_getattr_name_lock+0xe09/0x1960 [mdt] 2013-05-19 16:35:39 [] ? lustre_msg_buf+0x55/0x60 [ptlrpc] 2013-05-19 16:35:39 [] ? __req_capsule_get+0x166/0x700 [ptlrpc] 2013-05-19 16:35:39 [] ? 
lustre_msg_get_flags+0x34/0xb0 [ptlrpc] 2013-05-19 16:35:39 [] mdt_intent_getattr+0x29d/0x490 [mdt] 2013-05-19 16:35:39 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:35:39 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:35:39 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:35:39 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:35:39 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:35:39 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:35:39 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:35:39 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:35:39 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:35:39 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:35:39 [] ? __wake_up+0x53/0x70 2013-05-19 16:35:39 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] child_rip+0xa/0x20 2013-05-19 16:35:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:35:39 [] ? child_rip+0x0/0x20 2013-05-19 16:35:39 2013-05-19 16:37:09 Lustre: 19368:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-40), not sending early reply 2013-05-19 16:37:09 req@ffff8800aefcf800 x1435223439814320/t0(0) o101->1ab9e6c9-242d-ebad-23e5-59a72978ad44@172.20.17.35@o2ib500:0/0 lens 584/1152 e 1 to 0 dl 1369006639 ref 2 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 16:37:22 LNet: Service thread pid 19291 was inactive for 548.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. 
2013-05-19 16:37:22 LustreError: dumping log to /tmp/lustre-log.1369006642.19291 2013-05-19 16:37:25 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 656s: evicting client at 172.20.17.51@o2ib500 ns: mdt-ffff88078ca74000 lock: ffff880059f52a00/0x48db03b0804704a9 lrc: 3/0,0 mode: PR/PR res: 8589946734/56834 bits 0x13 rrc: 26 type: IBT flags: 0x200400000020 nid: 172.20.17.51@o2ib500 remote: 0x13e33f995e1810ec expref: 476 pid: 19414 timeout: 4396821365 lvb_type: 0 2013-05-19 16:37:25 Lustre: 19336:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (650:6s); client may timeout. req@ffff8800aefcf800 x1435223439814320/t21513712446(0) o101->1ab9e6c9-242d-ebad-23e5-59a72978ad44@172.20.17.35@o2ib500:0/0 lens 584/600 e 1 to 0 dl 1369006639 ref 1 fl Complete:/0/ffffffff rc 0/-1 2013-05-19 16:37:25 LNet: Service thread pid 19265 completed after 656.30s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 16:37:50 LNet: Service thread pid 19947 completed after 576.24s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 16:37:50 LNet: Skipped 3 previous similar messages 2013-05-19 16:38:06 Lustre: fsv-MDT0000: Client aa10d38b-cba8-2b46-7380-386ced72b68b (at 172.20.17.52@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 16:38:06 Lustre: Skipped 63 previous similar messages 2013-05-19 16:38:51 Lustre: fsv-MDT0000: Client 8190fef7-d1d5-fb3a-f62e-69d18fbcab41 (at 172.20.17.36@o2ib500) reconnecting 2013-05-19 16:38:51 Lustre: Skipped 72 previous similar messages 2013-05-19 16:39:50 Lustre: 21574:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-96), not sending early reply 2013-05-19 16:39:50 req@ffff880be243c000 x1435223490376664/t0(0) o101->aa10d38b-cba8-2b46-7380-386ced72b68b@172.20.17.52@o2ib500:0/0 lens 584/1152 e 1 to 0 dl 1369006800 ref 2 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 16:40:00 LNet: Service thread pid 19224 was inactive for 524.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. 2013-05-19 16:40:00 LNet: Skipped 1 previous similar message 2013-05-19 16:40:00 LustreError: dumping log to /tmp/lustre-log.1369006799.19224 2013-05-19 16:42:19 Lustre: 18998:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-64), not sending early reply 2013-05-19 16:42:19 req@ffff880076df1000 x1435223483849584/t0(0) o101->000b40cf-3555-a2d4-e0f9-7163e31e2890@172.20.17.51@o2ib500:0/0 lens 584/1152 e 1 to 0 dl 1369006949 ref 2 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 16:42:19 Lustre: 18998:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message 2013-05-19 16:43:05 LNet: Service thread pid 19330 was inactive for 734.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 16:43:05 Pid: 19330, comm: mdt03_036 2013-05-19 16:43:05 2013-05-19 16:43:05 Call Trace: 2013-05-19 16:43:05 [] ? 
libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs] 2013-05-19 16:43:05 [] schedule_timeout+0x192/0x2e0 2013-05-19 16:43:05 [] ? process_timeout+0x0/0x10 2013-05-19 16:43:05 [] cfs_waitq_timedwait+0x11/0x20 [libcfs] 2013-05-19 16:43:05 [] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] 2013-05-19 16:43:05 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:43:05 [] ? default_wake_function+0x0/0x20 2013-05-19 16:43:05 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:43:05 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:43:05 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:43:05 [] mdt_object_lock0+0x28c/0xaf0 [mdt] 2013-05-19 16:43:05 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:43:05 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:43:05 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:43:05 [] mdt_object_find_lock+0x61/0x170 [mdt] 2013-05-19 16:43:05 [] mdt_reint_open+0x8dc/0x2120 [mdt] 2013-05-19 16:43:05 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 16:43:05 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 16:43:05 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:43:05 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 16:43:05 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 16:43:05 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:43:05 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 16:43:05 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 16:43:05 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 16:43:05 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:43:05 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:43:05 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:43:05 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:43:05 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:43:05 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:43:05 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:43:05 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:43:05 [] ? 
lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:43:05 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:43:05 [] ? __wake_up+0x53/0x70 2013-05-19 16:43:05 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:43:05 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:43:05 [] child_rip+0xa/0x20 2013-05-19 16:43:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] ? child_rip+0x0/0x20 2013-05-19 16:43:06 2013-05-19 16:43:06 LustreError: dumping log to /tmp/lustre-log.1369006986.19330 2013-05-19 16:43:06 Pid: 21577, comm: mdt03_067 2013-05-19 16:43:06 2013-05-19 16:43:06 Call Trace: 2013-05-19 16:43:06 [] schedule_timeout+0x192/0x2e0 2013-05-19 16:43:06 [] ? process_timeout+0x0/0x10 2013-05-19 16:43:06 [] cfs_waitq_timedwait+0x11/0x20 [libcfs] 2013-05-19 16:43:06 [] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] 2013-05-19 16:43:06 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:43:06 [] ? default_wake_function+0x0/0x20 2013-05-19 16:43:06 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:43:06 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:43:06 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:43:06 [] mdt_object_lock0+0x33b/0xaf0 [mdt] 2013-05-19 16:43:06 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:43:06 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:43:06 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:43:06 [] mdt_getattr_name_lock+0xe09/0x1960 [mdt] 2013-05-19 16:43:06 [] ? lustre_msg_buf+0x55/0x60 [ptlrpc] 2013-05-19 16:43:06 [] ? __req_capsule_get+0x166/0x700 [ptlrpc] 2013-05-19 16:43:06 [] ? 
lustre_msg_get_flags+0x34/0xb0 [ptlrpc] 2013-05-19 16:43:06 [] mdt_intent_getattr+0x29d/0x490 [mdt] 2013-05-19 16:43:06 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:43:06 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:43:06 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:43:06 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:43:06 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:43:06 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:43:06 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:43:06 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:43:06 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:43:06 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:43:06 [] ? default_wake_function+0x0/0x20 2013-05-19 16:43:06 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] child_rip+0xa/0x20 2013-05-19 16:43:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] ? child_rip+0x0/0x20 2013-05-19 16:43:06 2013-05-19 16:43:06 Pid: 19221, comm: mdt03_008 2013-05-19 16:43:06 2013-05-19 16:43:06 Call Trace: 2013-05-19 16:43:06 [] schedule_timeout+0x192/0x2e0 2013-05-19 16:43:06 [] ? process_timeout+0x0/0x10 2013-05-19 16:43:06 [] cfs_waitq_timedwait+0x11/0x20 [libcfs] 2013-05-19 16:43:06 [] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] 2013-05-19 16:43:06 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:43:06 [] ? default_wake_function+0x0/0x20 2013-05-19 16:43:06 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:43:06 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:43:06 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:43:06 [] mdt_object_lock0+0x33b/0xaf0 [mdt] 2013-05-19 16:43:06 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:43:06 [] ? 
ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:43:06 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:43:06 [] mdt_getattr_name_lock+0xe09/0x1960 [mdt] 2013-05-19 16:43:06 [] ? lustre_msg_buf+0x55/0x60 [ptlrpc] 2013-05-19 16:43:06 [] ? __req_capsule_get+0x166/0x700 [ptlrpc] 2013-05-19 16:43:06 [] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] 2013-05-19 16:43:06 [] mdt_intent_getattr+0x29d/0x490 [mdt] 2013-05-19 16:43:06 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:43:06 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:43:06 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:43:06 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:43:06 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:43:06 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:43:06 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:43:06 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:43:06 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:43:06 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:43:06 [] ? __wake_up+0x53/0x70 2013-05-19 16:43:06 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] child_rip+0xa/0x20 2013-05-19 16:43:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:43:06 [] ? 
child_rip+0x0/0x20 2013-05-19 16:43:06 2013-05-19 16:43:28 Lustre: lock timed out (enqueued at 1369006808, 200s ago) 2013-05-19 16:43:53 Lustre: lock timed out (enqueued at 1369006827, 206s ago) 2013-05-19 16:43:53 Lustre: Skipped 3 previous similar messages 2013-05-19 16:43:57 Lustre: lock timed out (enqueued at 1369006251, 786s ago) 2013-05-19 16:43:57 Lustre: Skipped 2 previous similar messages 2013-05-19 16:44:21 Lustre: lock timed out (enqueued at 1369006275, 786s ago) 2013-05-19 16:44:21 Lustre: Skipped 1 previous similar message 2013-05-19 16:46:46 Lustre: fsv-MDT0000: Client 8190fef7-d1d5-fb3a-f62e-69d18fbcab41 (at 172.20.17.36@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 16:46:46 Lustre: Skipped 100 previous similar messages 2013-05-19 16:47:22 Lustre: 19328:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-548), not sending early reply 2013-05-19 16:47:22 req@ffff881000646400 x1435223491919788/t0(0) o101->6bcbefb5-616e-f815-1f4e-8257483c86e7@172.20.17.50@o2ib500:0/0 lens 584/1152 e 2 to 0 dl 1369007252 ref 2 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 16:47:51 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 1177s: evicting client at 172.20.17.34@o2ib500 ns: mdt-ffff88078ca74000 lock: ffff880ace6a2000/0x48db03b0804be6ed lrc: 3/0,0 mode: PR/PR res: 8589946734/56834 bits 0x13 rrc: 23 type: IBT flags: 0x200400000020 nid: 172.20.17.34@o2ib500 remote: 0x96299255e3d10836 expref: 844 pid: 19947 timeout: 4397447463 lvb_type: 0 2013-05-19 16:47:51 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 1177s: evicting client at 172.20.17.34@o2ib500 ns: mdt-ffff88078ca74000 lock: ffff880bcb0b9e00/0x48db03b0804be6e6 lrc: 3/0,0 mode: PR/PR res: 8589946734/56834 bits 0x13 rrc: 23 type: IBT flags: 0x200400000020 nid: 172.20.17.34@o2ib500 remote: 0x8f299255e3d10836 expref: 844 pid: 19960 timeout: 4397447463 lvb_type: 0 
2013-05-19 16:47:51 Lustre: 21571:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (706:471s); client may timeout. req@ffff880be243c000 x1435223490376664/t21513816341(0) o101->aa10d38b-cba8-2b46-7380-386ced72b68b@172.20.17.52@o2ib500:0/0 lens 584/600 e 1 to 0 dl 1369006800 ref 1 fl Complete:/0/ffffffff rc 0/-1 2013-05-19 16:47:51 LNet: Service thread pid 19291 completed after 1176.86s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 16:47:51 Lustre: 21571:0:(service.c:1995:ptlrpc_server_handle_request()) Skipped 1 previous similar message 2013-05-19 16:48:14 Lustre: 21574:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-443), not sending early reply 2013-05-19 16:48:14 req@ffff880b32118c00 x1435223442908652/t0(0) o101->8190fef7-d1d5-fb3a-f62e-69d18fbcab41@172.20.17.36@o2ib500:0/0 lens 592/3304 e 1 to 0 dl 1369007304 ref 2 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 16:48:14 Lustre: 21574:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message 2013-05-19 16:48:54 Lustre: fsv-MDT0000: Client 9a0a0e50-962d-2093-8bae-c1adc19bd31d (at 172.20.17.49@o2ib500) reconnecting 2013-05-19 16:48:54 Lustre: Skipped 115 previous similar messages 2013-05-19 16:49:55 Lustre: 19302:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-145), not sending early reply 2013-05-19 16:49:55 req@ffff880078c14000 x1435223453275576/t0(0) o101->bab05b1f-f18a-5ac6-fd7a-5ec9f8672c53@172.20.17.33@o2ib500:0/0 lens 584/1152 e 0 to 0 dl 1369007405 ref 2 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 16:49:55 Lustre: 19302:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages 2013-05-19 16:56:49 Lustre: fsv-MDT0000: Client 9a0a0e50-962d-2093-8bae-c1adc19bd31d (at 172.20.17.49@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 16:56:49 Lustre: Skipped 115 
previous similar messages 2013-05-19 16:57:30 LNet: Service thread pid 19411 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 16:57:30 Lustre: lock timed out (enqueued at 1369006650, 1200s ago) 2013-05-19 16:57:30 LNet: Skipped 2 previous similar messages 2013-05-19 16:57:30 Pid: 19411, comm: mdt00_052 2013-05-19 16:57:30 2013-05-19 16:57:30 Call Trace: 2013-05-19 16:57:30 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:57:30 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 16:57:30 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 16:57:30 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:57:30 [] ? default_wake_function+0x0/0x20 2013-05-19 16:57:30 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:57:30 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:57:30 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:57:30 [] mdt_object_lock0+0x28c/0xaf0 [mdt] 2013-05-19 16:57:30 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:57:30 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:57:30 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:57:30 [] mdt_object_find_lock+0x61/0x170 [mdt] 2013-05-19 16:57:30 [] mdt_reint_open+0x8dc/0x2120 [mdt] 2013-05-19 16:57:30 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 16:57:30 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 16:57:30 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:57:30 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 16:57:30 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 16:57:30 [] ? 
lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:57:30 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 16:57:30 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 16:57:30 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 16:57:30 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:57:30 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:57:30 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:57:30 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:57:30 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:57:31 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:57:31 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:57:31 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:57:31 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:57:31 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:57:31 [] ? __wake_up+0x53/0x70 2013-05-19 16:57:31 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:57:31 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:57:31 [] child_rip+0xa/0x20 2013-05-19 16:57:31 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:57:31 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:57:31 [] ? child_rip+0x0/0x20 2013-05-19 16:57:31 2013-05-19 16:57:31 LustreError: dumping log to /tmp/lustre-log.1369007851.19411 2013-05-19 16:57:34 Lustre: lock timed out (enqueued at 1369006654, 1200s ago) 2013-05-19 16:57:34 LNet: Service thread pid 19292 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 16:57:34 Pid: 19292, comm: mdt01_024 2013-05-19 16:57:34 2013-05-19 16:57:34 Call Trace: 2013-05-19 16:57:34 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 16:57:34 [] ? ldlm_expired_completion_wait+0x205/0x390 [ptlrpc] 2013-05-19 16:57:34 [] ? ldlm_completion_ast+0x508/0x960 [ptlrpc] 2013-05-19 16:57:34 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 16:57:34 [] ? 
default_wake_function+0x0/0x20 2013-05-19 16:57:34 [] ? ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 16:57:34 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:57:34 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:57:34 [] ? mdt_object_lock0+0x28c/0xaf0 [mdt] 2013-05-19 16:57:34 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 16:57:34 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 16:57:34 [] ? mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 16:57:34 [] ? mdt_object_find_lock+0x61/0x170 [mdt] 2013-05-19 16:57:34 [] ? mdt_reint_open+0x8dc/0x2120 [mdt] 2013-05-19 16:57:34 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 16:57:34 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 16:57:34 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:57:34 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 16:57:34 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 16:57:34 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 16:57:34 [] ? mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 16:57:34 [] ? mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 16:57:34 [] ? mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 16:57:34 [] ? mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 16:57:34 [] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 16:57:34 [] ? ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 16:57:34 [] ? mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 16:57:34 [] ? mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 16:57:34 [] ? mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 16:57:34 [] ? ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 16:57:34 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 16:57:34 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 16:57:34 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 16:57:34 [] ? __wake_up+0x53/0x70 2013-05-19 16:57:34 [] ? ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 16:57:34 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:57:34 [] ? child_rip+0xa/0x20 2013-05-19 16:57:34 [] ? 
ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:57:34 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 16:57:34 [] ? child_rip+0x0/0x20 2013-05-19 16:57:34 2013-05-19 16:57:34 LustreError: dumping log to /tmp/lustre-log.1369007854.19292 2013-05-19 16:57:52 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 1778s: evicting client at 172.20.17.52@o2ib500 ns: mdt-ffff88078ca74000 lock: ffff880b47b97c00/0x48db03b0804be717 lrc: 3/0,0 mode: PR/PR res: 8589946734/56834 bits 0x13 rrc: 22 type: IBT flags: 0x200400000020 nid: 172.20.17.52@o2ib500 remote: 0x990c767355b81bc5 expref: 424 pid: 19291 timeout: 4398048085 lvb_type: 0 2013-05-19 16:57:52 Lustre: 19220:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (1158:620s); client may timeout. req@ffff881000646400 x1435223491919788/t21514129404(0) o101->6bcbefb5-616e-f815-1f4e-8257483c86e7@172.20.17.50@o2ib500:0/0 lens 584/600 e 2 to 0 dl 1369007252 ref 1 fl Complete:/0/ffffffff rc 0/-1 2013-05-19 16:57:52 LNet: Service thread pid 19330 completed after 1620.68s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 16:57:52 LNet: Skipped 1 previous similar message 2013-05-19 16:57:52 Lustre: 19220:0:(service.c:1995:ptlrpc_server_handle_request()) Skipped 1 previous similar message 2013-05-19 16:59:10 Lustre: fsv-MDT0000: Client bab05b1f-f18a-5ac6-fd7a-5ec9f8672c53 (at 172.20.17.33@o2ib500) reconnecting 2013-05-19 16:59:10 Lustre: Skipped 122 previous similar messages Console [vesta-mds1] log at 2013-05-19 17:00:00 PDT. 
2013-05-19 17:00:21 LustreError: 19939:0:(ldlm_lockd.c:1366:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880fffacf800 ns: mdt-ffff88078ca74000 lock: ffff880c03a9d600/0x48db03b081362737 lrc: 3/0,0 mode: CR/CR res: 8589946738/59858 bits 0x9 rrc: 1 type: IBT flags: 0x200000000000 nid: 172.20.17.34@o2ib500 remote: 0x9d299255e3d10836 expref: 4 pid: 19939 timeout: 0 lvb_type: 0 2013-05-19 17:00:21 Lustre: 21577:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (1053:717s); client may timeout. req@ffff8810180d8400 x1435223488730440/t0(0) o101->9a0a0e50-962d-2093-8bae-c1adc19bd31d@172.20.17.49@o2ib500:0/0 lens 592/536 e 1 to 0 dl 1369007304 ref 1 fl Complete:/0/ffffffff rc 0/-1 2013-05-19 17:00:21 LNet: Service thread pid 21577 completed after 1769.67s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 17:00:21 LNet: Skipped 1 previous similar message 2013-05-19 17:00:21 LustreError: 19939:0:(ldlm_lockd.c:1366:ldlm_handle_enqueue0()) Skipped 1 previous similar message 2013-05-19 17:00:31 Lustre: 20160:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-145), not sending early reply 2013-05-19 17:00:31 req@ffff880fefe80000 x1435223490395308/t0(0) o101->aa10d38b-cba8-2b46-7380-386ced72b68b@172.20.17.52@o2ib500:0/0 lens 584/1152 e 0 to 0 dl 1369008041 ref 2 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 17:00:31 Lustre: 20160:0:(service.c:1296:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message 2013-05-19 17:03:33 Lustre: 21576:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-327), not sending early reply 2013-05-19 17:03:33 req@ffff880fff046400 x1435223490376668/t0(0) o101->aa10d38b-cba8-2b46-7380-386ced72b68b@172.20.17.52@o2ib500:0/0 lens 592/3304 e 1 to 0 dl 1369008223 ref 2 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 17:07:04 Lustre: fsv-MDT0000: Client 
bab05b1f-f18a-5ac6-fd7a-5ec9f8672c53 (at 172.20.17.33@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 17:07:04 Lustre: Skipped 73 previous similar messages 2013-05-19 17:08:06 Lustre: lock timed out (enqueued at 1369007286, 1200s ago) 2013-05-19 17:08:06 LNet: Service thread pid 21571 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 17:08:06 Pid: 21571, comm: mdt03_061 2013-05-19 17:08:06 2013-05-19 17:08:06 Call Trace: 2013-05-19 17:08:06 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 17:08:06 [] ? ldlm_expired_completion_wait+0x205/0x390 [ptlrpc] 2013-05-19 17:08:06 [] ? ldlm_completion_ast+0x508/0x960 [ptlrpc] 2013-05-19 17:08:06 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 17:08:06 [] ? default_wake_function+0x0/0x20 2013-05-19 17:08:06 [] ? ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 17:08:06 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 17:08:06 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 17:08:06 [] ? mdt_object_lock0+0x33b/0xaf0 [mdt] 2013-05-19 17:08:06 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 17:08:06 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 17:08:06 [] ? mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 17:08:06 [] ? mdt_getattr_name_lock+0xe09/0x1960 [mdt] 2013-05-19 17:08:06 [] ? cfs_hash_lookup+0x82/0xa0 [libcfs] 2013-05-19 17:08:06 [] ? lustre_msg_clear_flags+0x6c/0xc0 [ptlrpc] 2013-05-19 17:08:06 [] ? mdt_intent_getattr+0x29d/0x490 [mdt] 2013-05-19 17:08:06 [] ? mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 17:08:06 [] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 17:08:06 [] ? ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 17:08:06 [] ? mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 17:08:06 [] ? mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 17:08:06 [] ? mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 17:08:06 [] ? 
ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 17:08:06 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 17:08:06 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 17:08:06 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 17:08:06 [] ? __wake_up+0x53/0x70 2013-05-19 17:08:06 [] ? ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 17:08:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:08:06 [] ? child_rip+0xa/0x20 2013-05-19 17:08:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:08:06 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:08:06 [] ? child_rip+0x0/0x20 2013-05-19 17:08:06 2013-05-19 17:08:06 LustreError: dumping log to /tmp/lustre-log.1369008486.21571 2013-05-19 17:08:06 Lustre: Skipped 1 previous similar message 2013-05-19 17:08:06 Pid: 19217, comm: mdt03_007 2013-05-19 17:08:06 2013-05-19 17:08:06 Call Trace: 2013-05-19 17:08:06 [] cfs_waitq_wait+0xe/0x10 [libcfs] 2013-05-19 17:08:06 [] ldlm_completion_ast+0x57a/0x960 [ptlrpc] 2013-05-19 17:08:06 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 17:08:06 [] ? default_wake_function+0x0/0x20 2013-05-19 17:08:06 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 17:08:06 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 17:08:06 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 17:08:06 [] mdt_object_lock0+0x28c/0xaf0 [mdt] 2013-05-19 17:08:06 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 17:08:06 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 17:08:06 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 17:08:06 [] mdt_object_find_lock+0x61/0x170 [mdt] 2013-05-19 17:08:06 [] mdt_reint_open+0x8dc/0x2120 [mdt] 2013-05-19 17:08:06 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 17:08:06 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 17:08:06 [] ? lu_ucred+0x20/0x30 [obdclass] 2013-05-19 17:08:06 [] ? mdt_ucred+0x15/0x20 [mdt] 2013-05-19 17:08:06 [] ? mdt_root_squash+0x2c/0x410 [mdt] 2013-05-19 17:08:06 [] ? 
lu_ucred+0x20/0x30 [obdclass] 2013-05-19 17:08:06 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 17:08:06 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 17:08:06 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 17:08:06 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 17:08:06 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 17:08:06 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 17:08:06 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 17:08:07 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 17:08:07 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 17:08:07 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 17:08:07 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 17:08:07 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 17:08:07 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 17:08:07 [] ? __wake_up+0x53/0x70 2013-05-19 17:08:07 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 17:08:07 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:08:07 [] child_rip+0xa/0x20 2013-05-19 17:08:07 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:08:07 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:08:07 [] ? child_rip+0x0/0x20 2013-05-19 17:08:07 2013-05-19 17:09:14 Lustre: fsv-MDT0000: Client 1ab9e6c9-242d-ebad-23e5-59a72978ad44 (at 172.20.17.35@o2ib500) reconnecting 2013-05-19 17:09:14 Lustre: Skipped 66 previous similar messages 2013-05-19 17:10:22 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 2371s: evicting client at 172.20.17.49@o2ib500 ns: mdt-ffff88078ca74000 lock: ffff880b5cc57800/0x48db03b0804ccba8 lrc: 3/0,0 mode: PR/PR res: 8589946734/56834 bits 0x13 rrc: 22 type: IBT flags: 0x200400000020 nid: 172.20.17.49@o2ib500 remote: 0x84225fc841fe57e7 expref: 514 pid: 21577 timeout: 4398798044 lvb_type: 0 2013-05-19 17:10:22 Lustre: 19411:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:1217s); client may timeout. 
req@ffff880078c14000 x1435223453275576/t21514287733(0) o101->bab05b1f-f18a-5ac6-fd7a-5ec9f8672c53@172.20.17.33@o2ib500:0/0 lens 584/600 e 0 to 0 dl 1369007405 ref 1 fl Complete:/0/ffffffff rc 0/-1 2013-05-19 17:10:22 LNet: Service thread pid 19292 completed after 1968.30s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 17:10:22 LNet: Skipped 3 previous similar messages 2013-05-19 17:10:22 LustreError: 19470:0:(ldlm_lockd.c:1366:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8807dc231800 ns: mdt-ffff88078ca74000 lock: ffff8800a47cbc00/0x48db03b08199ffb0 lrc: 3/0,0 mode: CR/CR res: 8589946815/178 bits 0x9 rrc: 1 type: IBT flags: 0x200000000000 nid: 172.20.17.51@o2ib500 remote: 0x21e33f995e1810ec expref: 4 pid: 19470 timeout: 0 lvb_type: 0 2013-05-19 17:10:22 Lustre: 19411:0:(service.c:1995:ptlrpc_server_handle_request()) Skipped 7 previous similar messages 2013-05-19 17:17:30 Lustre: fsv-MDT0000: Client ac2f1f2a-9d4f-f50a-ed2d-9a176885ef63 (at 172.20.17.24@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 17:17:30 Lustre: Skipped 23 previous similar messages 2013-05-19 17:17:43 Lustre: lock timed out (enqueued at 1369008863, 200s ago) 2013-05-19 17:28:54 Lustre: fsv-MDT0000: Client a707e637-e529-165a-d3f6-589f52d895c9 (at 172.20.17.9@o2ib500) reconnecting 2013-05-19 17:28:54 Lustre: Skipped 14 previous similar messages 2013-05-19 17:28:54 Lustre: fsv-MDT0000: Client a707e637-e529-165a-d3f6-589f52d895c9 (at 172.20.17.9@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 17:43:50 Lustre: fsv-MDT0000: Client fe31a136-657a-7cdb-5108-426534e71cba (at 172.20.17.142@o2ib500) reconnecting 2013-05-19 17:43:50 Lustre: Skipped 1 previous similar message 2013-05-19 17:43:50 Lustre: fsv-MDT0000: Client fe31a136-657a-7cdb-5108-426534e71cba (at 172.20.17.142@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 17:45:38 LNet: Service 
thread pid 23602 was inactive for 230.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 17:45:38 LNet: Skipped 1 previous similar message 2013-05-19 17:45:38 Pid: 23602, comm: mdt_rdpg01_017 2013-05-19 17:45:38 2013-05-19 17:45:38 Call Trace: 2013-05-19 17:45:38 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 17:45:38 [] ? __mutex_lock_slowpath+0x70/0x180 2013-05-19 17:45:38 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 17:45:38 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 17:45:38 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 17:45:38 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 17:45:38 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 17:45:38 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 17:45:38 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 17:45:38 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 17:45:38 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 17:45:38 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 17:45:38 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 17:45:38 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 17:45:38 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 17:45:38 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 17:45:38 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 17:45:38 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 17:45:38 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 17:45:38 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 17:45:38 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 17:45:38 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 17:45:38 [] ? __wake_up+0x53/0x70 2013-05-19 17:45:38 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 17:45:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:45:38 [] child_rip+0xa/0x20 2013-05-19 17:45:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:45:38 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:45:38 [] ? 
child_rip+0x0/0x20 2013-05-19 17:45:38 2013-05-19 17:45:38 LustreError: dumping log to /tmp/lustre-log.1369010738.23602 2013-05-19 17:45:45 LNet: Service thread pid 19353 was inactive for 436.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 17:45:45 Pid: 19353, comm: mdt01_042 2013-05-19 17:45:45 2013-05-19 17:45:45 Call Trace: 2013-05-19 17:45:45 [] schedule_timeout+0x192/0x2e0 2013-05-19 17:45:45 [] ? process_timeout+0x0/0x10 2013-05-19 17:45:45 [] cfs_waitq_timedwait+0x11/0x20 [libcfs] 2013-05-19 17:45:45 [] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] 2013-05-19 17:45:45 [] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] 2013-05-19 17:45:45 [] ? default_wake_function+0x0/0x20 2013-05-19 17:45:45 [] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] 2013-05-19 17:45:45 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 17:45:45 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 17:45:45 [] mdt_object_lock0+0x33b/0xaf0 [mdt] 2013-05-19 17:45:45 [] ? mdt_blocking_ast+0x0/0x2a0 [mdt] 2013-05-19 17:45:45 [] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] 2013-05-19 17:45:45 [] mdt_object_lock+0x14/0x20 [mdt] 2013-05-19 17:45:45 [] mdt_attr_set+0x8f/0x560 [mdt] 2013-05-19 17:45:45 [] mdt_reint_setattr+0x5f4/0xd10 [mdt] 2013-05-19 17:45:45 [] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc] 2013-05-19 17:45:45 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 17:45:45 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 17:45:45 [] mdt_reint+0x44/0xe0 [mdt] 2013-05-19 17:45:45 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 17:45:45 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 17:45:45 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 17:45:45 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 17:45:45 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 17:45:45 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 17:45:45 [] ? 
__wake_up+0x53/0x70 2013-05-19 17:45:45 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 17:45:45 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:45:45 [] child_rip+0xa/0x20 2013-05-19 17:45:45 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:45:45 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:45:45 [] ? child_rip+0x0/0x20 2013-05-19 17:45:45 2013-05-19 17:45:45 LustreError: dumping log to /tmp/lustre-log.1369010745.19353 2013-05-19 17:46:56 LNet: Service thread pid 23602 completed after 308.32s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 17:46:56 LNet: Skipped 4 previous similar messages 2013-05-19 17:47:30 LNet: Service thread pid 19353 completed after 541.69s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 17:48:56 LNet: Service thread pid 19300 was inactive for 714.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 17:48:56 Pid: 19300, comm: mdt03_028 2013-05-19 17:48:56 2013-05-19 17:48:56 Call Trace: 2013-05-19 17:48:56 [] ? thread_return+0x4e/0x76e 2013-05-19 17:48:56 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 17:48:56 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 17:48:56 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 17:48:56 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 17:48:56 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 17:48:56 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 17:48:56 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 17:48:56 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 17:48:56 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 17:48:56 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 17:48:56 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 17:48:56 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 17:48:56 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 17:48:56 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 17:48:56 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 17:48:56 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 17:48:56 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 17:48:56 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 17:48:56 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 17:48:56 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 17:48:56 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 17:48:56 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 17:48:56 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 17:48:56 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 17:48:56 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 17:48:56 [] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 17:48:56 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 17:48:56 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 17:48:56 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 17:48:56 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 17:48:56 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 17:48:56 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 17:48:56 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 17:48:56 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 17:48:56 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 17:48:56 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 17:48:56 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 17:48:56 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 17:48:56 [] ? __wake_up+0x53/0x70 2013-05-19 17:48:56 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 17:48:56 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:48:56 [] child_rip+0xa/0x20 2013-05-19 17:48:56 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:48:56 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 17:48:56 [] ? child_rip+0x0/0x20 2013-05-19 17:48:56 2013-05-19 17:48:56 LustreError: dumping log to /tmp/lustre-log.1369010936.19300 2013-05-19 17:51:20 LNet: Service thread pid 19300 completed after 858.15s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 17:52:00 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 103815s: evicting client at 172.20.17.73@o2ib500 ns: mdt-ffff88078ca74000 lock: ffff880fde655000/0x48db03b044a02d92 lrc: 3/0,0 mode: PR/PR res: 8589946711/136 bits 0x1b rrc: 4 type: IBT flags: 0x200000000020 nid: 172.20.17.73@o2ib500 remote: 0x8dfb00a0ae4a4799 expref: 295 pid: 19268 timeout: 4401296686 lvb_type: 0 2013-05-19 17:52:00 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 1 previous similar message 2013-05-19 17:55:40 Lustre: lock timed out (enqueued at 1369011080, 260s ago) 2013-05-19 17:55:40 Lustre: Skipped 1 previous similar message 2013-05-19 17:58:34 Lustre: fsv-MDT0000: Client 54c15dda-ee9c-47b9-fd0f-3a3395ede27b (at 172.20.17.57@o2ib500) reconnecting 2013-05-19 17:58:34 Lustre: Skipped 114 previous similar messages 2013-05-19 17:58:34 Lustre: fsv-MDT0000: Client 54c15dda-ee9c-47b9-fd0f-3a3395ede27b (at 172.20.17.57@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 17:58:34 Lustre: Skipped 92 previous similar messages Console [vesta-mds1] log at 2013-05-19 18:00:00 PDT. 2013-05-19 18:03:23 LNet: Service thread pid 19486 was inactive for 394.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 18:03:23 Pid: 19486, comm: mdt_rdpg01_011 2013-05-19 18:03:23 2013-05-19 18:03:23 Call Trace: 2013-05-19 18:03:23 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 18:03:23 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 18:03:23 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 18:03:23 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 18:03:23 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 18:03:23 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 18:03:23 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 18:03:23 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 18:03:23 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 18:03:23 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 18:03:23 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 18:03:23 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 18:03:23 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 18:03:23 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 18:03:23 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 18:03:23 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 18:03:23 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 18:03:23 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 18:03:23 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 18:03:23 [] ? __wake_up+0x53/0x70 2013-05-19 18:03:23 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 18:03:23 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 18:03:23 [] child_rip+0xa/0x20 2013-05-19 18:03:23 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 18:03:23 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 18:03:23 [] ? child_rip+0x0/0x20 2013-05-19 18:03:23 2013-05-19 18:03:23 LustreError: dumping log to /tmp/lustre-log.1369011803.19486 2013-05-19 18:03:28 Lustre: 19303:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-71), not sending early reply 2013-05-19 18:03:28 req@ffff88009c4cf400 x1435222691945352/t0(0) o36->5b2baf6b-fbe0-2dce-ec8f-87e7e0cf53a4@172.20.17.69@o2ib500:0/0 lens 488/960 e 0 to 0 dl 1369011818 ref 2 fl Interpret:/0/ffffffff rc 0/-1 2013-05-19 18:04:34 Lustre: 19466:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (681:56s); client may timeout. 
req@ffff88009c4cf400 x1435222691945352/t21514887401(0) o36->5b2baf6b-fbe0-2dce-ec8f-87e7e0cf53a4@172.20.17.69@o2ib500:0/0 lens 488/424 e 0 to 0 dl 1369011818 ref 1 fl Complete:/0/ffffffff rc 0/-1
2013-05-19 18:05:38 LNet: Service thread pid 19486 completed after 528.28s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
2013-05-19 18:10:52 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) reconnecting
2013-05-19 18:10:52 Lustre: Skipped 21 previous similar messages
2013-05-19 18:10:52 Lustre: fsv-MDT0000: Client 17ff88fb-4048-2eb7-2542-e350dd7981d7 (at 172.20.17.56@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 18:10:52 Lustre: Skipped 16 previous similar messages
2013-05-19 18:24:25 Lustre: fsv-MDT0000: Client c8eeb8ae-86e1-99c4-bcfe-97f4e92d9da6 (at 172.20.17.54@o2ib500) reconnecting
2013-05-19 18:24:25 Lustre: Skipped 1 previous similar message
2013-05-19 18:24:25 Lustre: fsv-MDT0000: Client c8eeb8ae-86e1-99c4-bcfe-97f4e92d9da6 (at 172.20.17.54@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 18:42:34 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) reconnecting
2013-05-19 18:42:34 Lustre: Skipped 4 previous similar messages
2013-05-19 18:42:34 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 18:42:34 Lustre: Skipped 2 previous similar messages
2013-05-19 18:42:59 LNet: Service thread pid 19390 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2013-05-19 18:42:59 Pid: 19390, comm: mdt_rdpg00_006
2013-05-19 18:42:59
2013-05-19 18:42:59 Call Trace:
2013-05-19 18:42:59 [] cv_wait_common+0xed/0x100 [spl]
2013-05-19 18:42:59 [] ?
autoremove_wake_function+0x0/0x40 2013-05-19 18:42:59 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 18:42:59 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 18:42:59 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 18:42:59 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 18:42:59 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 18:42:59 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 18:42:59 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 18:42:59 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 18:42:59 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 18:42:59 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 18:42:59 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 18:42:59 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 18:42:59 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 18:42:59 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 18:42:59 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 18:42:59 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 18:42:59 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 18:42:59 [] ? __wake_up+0x53/0x70 2013-05-19 18:42:59 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 18:42:59 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 18:42:59 [] child_rip+0xa/0x20 2013-05-19 18:42:59 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 18:42:59 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 18:42:59 [] ? child_rip+0x0/0x20 2013-05-19 18:42:59 2013-05-19 18:42:59 LustreError: dumping log to /tmp/lustre-log.1369014179.19390 2013-05-19 18:44:08 LNet: Service thread pid 19390 completed after 268.84s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 18:54:19 Lustre: fsv-MDT0000: Client d083cd94-cb92-3b4c-aff1-7f7522c7c5e9 (at 172.20.17.4@o2ib500) reconnecting 2013-05-19 18:54:19 Lustre: Skipped 13 previous similar messages 2013-05-19 18:54:19 Lustre: fsv-MDT0000: Client d083cd94-cb92-3b4c-aff1-7f7522c7c5e9 (at 172.20.17.4@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 18:54:19 Lustre: Skipped 9 previous similar messages Console [vesta-mds1] log at 2013-05-19 19:00:00 PDT. 2013-05-19 19:03:46 LNet: Service thread pid 19466 was inactive for 440.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 19:03:46 Pid: 19466, comm: mdt01_061 2013-05-19 19:03:46 2013-05-19 19:03:46 Call Trace: 2013-05-19 19:03:46 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 19:03:46 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 19:03:46 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 19:03:46 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 19:03:46 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 19:03:46 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 19:03:46 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 19:03:46 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 19:03:46 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 19:03:46 [] mdd_create+0x929/0x1770 [mdd] 2013-05-19 19:03:46 [] ? lod_index_lookup+0x0/0x30 [lod] 2013-05-19 19:03:46 [] mdt_reint_open+0x1422/0x2120 [mdt] 2013-05-19 19:03:46 [] ? upcall_cache_get_entry+0x28e/0x860 [libcfs] 2013-05-19 19:03:46 [] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc] 2013-05-19 19:03:46 [] ? 
lu_ucred+0x20/0x30 [obdclass] 2013-05-19 19:03:46 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 19:03:46 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 19:03:46 [] mdt_intent_reint+0x1ed/0x4f0 [mdt] 2013-05-19 19:03:46 [] mdt_intent_policy+0x3ae/0x750 [mdt] 2013-05-19 19:03:46 [] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] 2013-05-19 19:03:46 [] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc] 2013-05-19 19:03:46 [] mdt_enqueue+0x46/0x110 [mdt] 2013-05-19 19:03:46 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 19:03:46 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 19:03:46 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 19:03:46 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 19:03:46 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 19:03:46 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 19:03:46 [] ? __wake_up+0x53/0x70 2013-05-19 19:03:46 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 19:03:46 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 19:03:46 [] child_rip+0xa/0x20 2013-05-19 19:03:46 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 19:03:46 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 19:03:46 [] ? child_rip+0x0/0x20 2013-05-19 19:03:46 2013-05-19 19:03:46 LustreError: dumping log to /tmp/lustre-log.1369015426.19466 2013-05-19 19:04:14 LNet: Service thread pid 19466 completed after 468.81s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 19:04:45 Lustre: fsv-MDT0000: Client 9ad1486f-ef5f-da50-df9b-c189d2cf037d (at 172.20.17.1@o2ib500) reconnecting 2013-05-19 19:04:45 Lustre: Skipped 12 previous similar messages 2013-05-19 19:09:49 LNet: Service thread pid 21571 was inactive for 822.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 19:09:49 Pid: 21571, comm: mdt03_061 2013-05-19 19:09:49 2013-05-19 19:09:49 Call Trace: 2013-05-19 19:09:49 [] ? 
prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 19:09:49 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 19:09:49 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 19:09:49 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 19:09:49 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 19:09:49 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 19:09:49 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 19:09:49 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 19:09:49 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 19:09:49 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 19:09:49 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 19:09:49 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 19:09:49 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 19:09:49 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 19:09:49 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 19:09:49 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 19:09:49 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 19:09:49 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 19:09:49 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 19:09:49 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 19:09:49 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 19:09:49 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 19:09:49 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 19:09:49 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 19:09:49 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 19:09:49 [] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 19:09:49 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 19:09:49 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 19:09:49 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 19:09:50 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 19:09:50 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 19:09:50 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 19:09:50 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 19:09:50 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 19:09:50 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 19:09:50 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 19:09:50 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 19:09:50 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 19:09:50 [] ? __wake_up+0x53/0x70 2013-05-19 19:09:50 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 19:09:50 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 19:09:50 [] child_rip+0xa/0x20 2013-05-19 19:09:50 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 19:09:50 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 19:09:50 [] ? child_rip+0x0/0x20 2013-05-19 19:09:50 2013-05-19 19:09:50 LustreError: dumping log to /tmp/lustre-log.1369015790.21571 2013-05-19 19:10:26 LNet: Service thread pid 21571 completed after 859.09s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 19:23:28 Lustre: fsv-MDT0000: Client 6832df97-33e7-ff08-2cf9-099b761023f9 (at 172.20.17.6@o2ib500) reconnecting 2013-05-19 19:23:28 Lustre: fsv-MDT0000: Client 6832df97-33e7-ff08-2cf9-099b761023f9 (at 172.20.17.6@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 19:23:28 Lustre: Skipped 10 previous similar messages 2013-05-19 19:24:31 LNet: Service thread pid 18391 was inactive for 302.00s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: 2013-05-19 19:24:31 Pid: 18391, comm: mdt_rdpg01_001 2013-05-19 19:24:31 2013-05-19 19:24:31 Call Trace: 2013-05-19 19:24:31 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 19:24:31 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 19:24:31 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 19:24:31 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 19:24:31 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 19:24:31 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 19:24:31 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 19:24:31 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 19:24:31 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 19:24:31 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 19:24:31 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 19:24:31 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 19:24:31 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 19:24:31 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 19:24:31 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 19:24:31 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 19:24:31 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 19:24:31 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 19:24:31 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 19:24:31 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 19:24:31 [] ? __wake_up+0x53/0x70 2013-05-19 19:24:31 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 19:24:31 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 19:24:32 [] child_rip+0xa/0x20 2013-05-19 19:24:32 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 19:24:32 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 19:24:32 [] ? 
child_rip+0x0/0x20 2013-05-19 19:24:32 2013-05-19 19:24:32 LustreError: dumping log to /tmp/lustre-log.1369016671.18391 2013-05-19 19:24:44 Lustre: fsv-MDT0000: Client ae7a7435-c2d1-faa2-34d6-ceabad68f922 (at 172.20.17.12@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 19:24:44 Lustre: Skipped 6 previous similar messages 2013-05-19 19:24:57 Lustre: 20160:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-145), not sending early reply 2013-05-19 19:24:57 req@ffff880fdc6c5800 x1434323175441281/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 0 to 0 dl 1369016707 ref 2 fl Interpret:/0/0 rc 0/0 2013-05-19 19:25:31 LNet: Service thread pid 18391 completed after 361.32s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 19:26:03 Lustre: 19358:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:56s); client may timeout. 
req@ffff880fdc6c5800 x1434323175441281/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 0 to 0 dl 1369016707 ref 1 fl Complete:/0/0 rc 0/0
2013-05-19 19:31:24 Lustre: fsv-MDT0000: Client 8b25dee6-6991-0987-01a9-c0fc2fe87bd5 (at 172.20.17.20@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 19:31:24 Lustre: Skipped 7 previous similar messages
2013-05-19 19:36:18 Lustre: fsv-MDT0000: Client 3df45b54-3e31-7ffc-b27d-3a89bb794e89 (at 172.20.17.14@o2ib500) reconnecting
2013-05-19 19:36:18 Lustre: Skipped 22 previous similar messages
2013-05-19 19:49:53 Lustre: 19236:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-145), not sending early reply
2013-05-19 19:49:53 req@ffff880a028a1800 x1434323175472744/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 0 to 0 dl 1369018203 ref 2 fl Interpret:/0/0 rc 0/0
2013-05-19 19:50:48 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) reconnecting
2013-05-19 19:50:48 Lustre: Skipped 1 previous similar message
2013-05-19 19:50:48 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 19:50:48 Lustre: Skipped 1 previous similar message
2013-05-19 19:51:38 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 19:51:38 Lustre: Skipped 1 previous similar message
2013-05-19 19:53:18 Lustre: fsv-MDT0000: Client d083cd94-cb92-3b4c-aff1-7f7522c7c5e9 (at 172.20.17.4@o2ib500) refused reconnection, still busy with 1 active RPCs
2013-05-19 19:53:18 Lustre: Skipped 4 previous similar messages
2013-05-19 19:54:05 Lustre: 21571:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:242s); client may timeout.
req@ffff880a028a1800 x1434323175472744/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 0 to 0 dl 1369018203 ref 1 fl Complete:/0/0 rc 0/0 Console [vesta-mds1] log at 2013-05-19 20:00:00 PDT. 2013-05-19 20:06:49 Lustre: fsv-MDT0000: Client d083cd94-cb92-3b4c-aff1-7f7522c7c5e9 (at 172.20.17.4@o2ib500) reconnecting 2013-05-19 20:06:49 Lustre: Skipped 11 previous similar messages 2013-05-19 20:06:49 Lustre: fsv-MDT0000: Client d083cd94-cb92-3b4c-aff1-7f7522c7c5e9 (at 172.20.17.4@o2ib500) refused reconnection, still busy with 2 active RPCs 2013-05-19 20:06:49 Lustre: Skipped 2 previous similar messages 2013-05-19 20:07:14 Lustre: fsv-MDT0000: Client d083cd94-cb92-3b4c-aff1-7f7522c7c5e9 (at 172.20.17.4@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 20:07:14 Lustre: Skipped 1 previous similar message 2013-05-19 20:09:56 Lustre: fsv-MDT0000: Client a2ee9720-9752-3971-2112-55ef1cacfea0 (at 172.20.17.15@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 20:10:07 LNet: Service thread pid 23163 was inactive for 380.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 20:10:07 Pid: 23163, comm: mdt00_063 2013-05-19 20:10:07 2013-05-19 20:10:07 Call Trace: 2013-05-19 20:10:07 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 20:10:07 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 20:10:07 [] ? mutex_lock+0x1e/0x50 2013-05-19 20:10:07 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 20:10:07 [] txg_wait_synced+0x7b/0xa0 [zfs] 2013-05-19 20:10:07 [] osd_trans_stop+0x365/0x420 [osd_zfs] 2013-05-19 20:10:07 [] lod_trans_stop+0xa4/0x130 [lod] 2013-05-19 20:10:07 [] mdd_trans_stop+0x1d/0x20 [mdd] 2013-05-19 20:10:07 [] mdd_attr_set+0x4d2/0x1390 [mdd] 2013-05-19 20:10:07 [] mdt_attr_set+0x268/0x560 [mdt] 2013-05-19 20:10:07 [] mdt_reint_setattr+0x5f4/0xd10 [mdt] 2013-05-19 20:10:07 [] ? 
lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc] 2013-05-19 20:10:07 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 20:10:07 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 20:10:07 [] mdt_reint+0x44/0xe0 [mdt] 2013-05-19 20:10:07 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 20:10:07 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 20:10:07 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 20:10:07 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 20:10:07 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 20:10:07 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 20:10:07 [] ? __wake_up+0x53/0x70 2013-05-19 20:10:07 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 20:10:07 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:10:07 [] child_rip+0xa/0x20 2013-05-19 20:10:07 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:10:07 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:10:07 [] ? child_rip+0x0/0x20 2013-05-19 20:10:07 2013-05-19 20:10:07 LustreError: dumping log to /tmp/lustre-log.1369019407.23163 2013-05-19 20:10:58 LNet: Service thread pid 23163 completed after 431.77s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 20:11:10 LNet: Service thread pid 19296 was inactive for 330.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 20:11:10 Pid: 19296, comm: mdt03_027 2013-05-19 20:11:10 2013-05-19 20:11:10 Call Trace: 2013-05-19 20:11:10 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 20:11:10 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 20:11:10 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 20:11:10 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 20:11:10 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 20:11:10 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 20:11:11 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 20:11:11 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 20:11:11 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 20:11:11 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 20:11:11 [] mdd_unlink+0x40e/0xe20 [mdd] 2013-05-19 20:11:11 [] mdo_unlink+0x18/0x50 [mdt] 2013-05-19 20:11:11 [] mdt_reint_unlink+0x739/0xfd0 [mdt] 2013-05-19 20:11:11 [] mdt_reint_rec+0x41/0xe0 [mdt] 2013-05-19 20:11:11 [] mdt_reint_internal+0x4e3/0x7d0 [mdt] 2013-05-19 20:11:11 [] mdt_reint+0x44/0xe0 [mdt] 2013-05-19 20:11:11 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 20:11:11 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 20:11:11 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 20:11:11 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 20:11:11 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 20:11:11 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 20:11:11 [] ? __wake_up+0x53/0x70 2013-05-19 20:11:11 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 20:11:11 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:11:11 [] child_rip+0xa/0x20 2013-05-19 20:11:11 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:11:11 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:11:11 [] ? child_rip+0x0/0x20 2013-05-19 20:11:11 2013-05-19 20:11:11 LustreError: dumping log to /tmp/lustre-log.1369019471.19296 2013-05-19 20:11:36 Lustre: fsv-MDT0000: Client a2ee9720-9752-3971-2112-55ef1cacfea0 (at 172.20.17.15@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 20:11:36 Lustre: Skipped 3 previous similar messages 2013-05-19 20:12:26 LNet: Service thread pid 19296 completed after 406.01s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 20:12:37 Lustre: lock timed out (enqueued at 1369019357, 200s ago) 2013-05-19 20:19:45 Lustre: lock timed out (enqueued at 1369019763, 222s ago) 2013-05-19 20:19:57 Lustre: fsv-MDT0000: Client 6358246e-d8d0-553e-4f86-b13e9ef270c7 (at 172.20.17.31@o2ib500) reconnecting 2013-05-19 20:19:57 Lustre: fsv-MDT0000: Client 45f14a6a-ce53-53b0-be61-f901cb11ca1a (at 172.20.17.18@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 20:19:57 Lustre: Skipped 3 previous similar messages 2013-05-19 20:19:57 Lustre: Skipped 15 previous similar messages 2013-05-19 20:20:14 Lustre: lock timed out (enqueued at 1369019792, 222s ago) 2013-05-19 20:21:06 Lustre: lock timed out (enqueued at 1369019844, 222s ago) 2013-05-19 20:21:24 Lustre: lock timed out (enqueued at 1369019862, 222s ago) 2013-05-19 20:21:24 Lustre: Skipped 1 previous similar message 2013-05-19 20:21:48 LNet: Service thread pid 19229 was inactive for 428.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 20:21:48 Pid: 19229, comm: mdt_rdpg03_006 2013-05-19 20:21:48 2013-05-19 20:21:48 Call Trace: 2013-05-19 20:21:48 [] ? __mutex_lock_slowpath+0x70/0x180 2013-05-19 20:21:48 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 20:21:48 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 20:21:48 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 20:21:48 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 20:21:48 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 20:21:48 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 20:21:48 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 20:21:48 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 20:21:48 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 20:21:48 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 20:21:48 [] mdd_attr_set+0x4a3/0x1390 [mdd] 2013-05-19 20:21:48 [] ? 
lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc] 2013-05-19 20:21:48 [] mdt_mfd_close+0x502/0x6e0 [mdt] 2013-05-19 20:21:48 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 20:21:48 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 20:21:48 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 20:21:48 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 20:21:48 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 20:21:48 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 20:21:48 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 20:21:48 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 20:21:48 [] ? default_wake_function+0x0/0x20 2013-05-19 20:21:48 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 20:21:48 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:21:48 [] child_rip+0xa/0x20 2013-05-19 20:21:48 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:21:48 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:21:48 [] ? child_rip+0x0/0x20 2013-05-19 20:21:48 2013-05-19 20:21:48 LustreError: dumping log to /tmp/lustre-log.1369020108.19229 2013-05-19 20:21:48 LNet: Service thread pid 18908 was inactive for 428.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 20:21:48 Pid: 18908, comm: mdt_rdpg03_002 2013-05-19 20:21:48 2013-05-19 20:21:48 Call Trace: 2013-05-19 20:21:48 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 20:21:48 [] ? wake_up_process+0x15/0x20 2013-05-19 20:21:48 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 20:21:48 [] ? 
autoremove_wake_function+0x0/0x40 2013-05-19 20:21:48 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 20:21:48 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 20:21:48 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 20:21:48 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 20:21:48 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 20:21:48 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 20:21:48 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 20:21:48 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 20:21:48 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 20:21:48 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 20:21:48 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 20:21:48 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 20:21:48 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 20:21:48 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 20:21:48 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 20:21:48 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 20:21:48 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 20:21:48 [] ? __wake_up+0x53/0x70 2013-05-19 20:21:48 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 20:21:48 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:21:48 [] child_rip+0xa/0x20 2013-05-19 20:21:48 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:21:48 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:21:48 [] ? child_rip+0x0/0x20 2013-05-19 20:21:48 2013-05-19 20:22:17 LNet: Service thread pid 18908 completed after 457.36s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 20:25:22 Lustre: fsv-MDT0000: Client 6358246e-d8d0-553e-4f86-b13e9ef270c7 (at 172.20.17.31@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 20:25:22 Lustre: Skipped 19 previous similar messages 2013-05-19 20:25:41 LNet: Service thread pid 19229 completed after 661.90s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 20:29:12 Lustre: 21579:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/10), not sending early reply 2013-05-19 20:29:12 req@ffff8810319cac00 x1434323175603504/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 0 to 0 dl 1369020562 ref 2 fl Interpret:/0/0 rc 0/0 2013-05-19 20:30:07 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) reconnecting 2013-05-19 20:30:07 Lustre: Skipped 22 previous similar messages 2013-05-19 20:30:52 Lustre: 19238:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:90s); client may timeout. req@ffff8810319cac00 x1434323175603504/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 0 to 0 dl 1369020562 ref 1 fl Complete:/0/0 rc 0/0 2013-05-19 20:51:55 Lustre: fsv-MDT0000: Client aa38f7c7-33e9-337c-1dbd-c0ccc872b373 (at 172.20.17.26@o2ib500) reconnecting 2013-05-19 20:51:55 Lustre: Skipped 2 previous similar messages 2013-05-19 20:51:55 Lustre: fsv-MDT0000: Client aa38f7c7-33e9-337c-1dbd-c0ccc872b373 (at 172.20.17.26@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 20:51:55 Lustre: Skipped 2 previous similar messages 2013-05-19 20:53:30 Lustre: fsv-MDT0000: Client a2ee9720-9752-3971-2112-55ef1cacfea0 (at 172.20.17.15@o2ib500) reconnecting 2013-05-19 20:53:30 Lustre: Skipped 7 previous similar messages 2013-05-19 20:53:34 Lustre: fsv-MDT0000: Client 40d970fd-fb56-40b5-1331-282547979508 (at 172.20.17.28@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 20:53:34 Lustre: Skipped 6 previous similar messages 2013-05-19 20:55:45 LNet: Service thread pid 19296 was inactive for 540.00s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: 2013-05-19 20:55:45 Pid: 19296, comm: mdt03_027 2013-05-19 20:55:45 2013-05-19 20:55:45 Call Trace: 2013-05-19 20:55:45 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 20:55:45 [] ? wake_up_process+0x15/0x20 2013-05-19 20:55:45 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 20:55:45 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 20:55:45 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 20:55:45 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 20:55:45 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 20:55:45 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 20:55:45 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 20:55:45 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 20:55:45 [] ? dmu_read+0x12b/0x180 [zfs] 2013-05-19 20:55:45 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 20:55:45 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 20:55:45 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 20:55:45 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 20:55:45 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 20:55:45 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 20:55:45 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 20:55:45 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 20:55:45 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 20:55:45 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 20:55:45 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 20:55:45 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 20:55:45 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 20:55:45 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 20:55:45 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 20:55:45 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 20:55:45 [] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 20:55:45 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 20:55:45 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 20:55:45 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 20:55:45 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 20:55:45 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 20:55:45 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 20:55:45 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 20:55:45 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 20:55:45 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 20:55:45 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 20:55:45 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 20:55:45 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 20:55:45 [] ? __wake_up+0x53/0x70 2013-05-19 20:55:45 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 20:55:45 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:55:45 [] child_rip+0xa/0x20 2013-05-19 20:55:45 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:55:45 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 20:55:45 [] ? 
child_rip+0x0/0x20 2013-05-19 20:55:45 2013-05-19 20:55:45 LustreError: dumping log to /tmp/lustre-log.1369022145.19296 2013-05-19 20:56:04 Lustre: fsv-MDT0000: Client 40d970fd-fb56-40b5-1331-282547979508 (at 172.20.17.28@o2ib500) reconnecting 2013-05-19 20:56:04 Lustre: Skipped 12 previous similar messages 2013-05-19 20:56:22 Lustre: fsv-MDT0000: Client f6bdb649-0807-346d-2047-6641007ec46d (at 172.20.17.11@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 20:56:22 Lustre: Skipped 10 previous similar messages 2013-05-19 20:58:09 Lustre: 18467:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-84), not sending early reply 2013-05-19 20:58:09 req@ffff880fec2a2000 x1434323175809620/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 1 to 0 dl 1369022299 ref 2 fl Interpret:/0/0 rc 0/0 Console [vesta-mds1] log at 2013-05-19 21:00:00 PDT. 2013-05-19 21:01:29 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) reconnecting 2013-05-19 21:01:29 Lustre: Skipped 6 previous similar messages 2013-05-19 21:01:29 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 21:01:29 Lustre: Skipped 3 previous similar messages 2013-05-19 21:01:59 Lustre: 19296:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (694:220s); client may timeout. req@ffff880fec2a2000 x1434323175809620/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 1 to 0 dl 1369022299 ref 1 fl Complete:/0/0 rc 0/0 2013-05-19 21:01:59 LNet: Service thread pid 19296 completed after 914.82s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
2013-05-19 21:20:36 Lustre: fsv-MDT0000: Client c8eeb8ae-86e1-99c4-bcfe-97f4e92d9da6 (at 172.20.17.54@o2ib500) reconnecting 2013-05-19 21:20:36 Lustre: Skipped 2 previous similar messages 2013-05-19 21:20:36 Lustre: fsv-MDT0000: Client c8eeb8ae-86e1-99c4-bcfe-97f4e92d9da6 (at 172.20.17.54@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 21:20:36 Lustre: Skipped 1 previous similar message 2013-05-19 21:21:16 LNet: Service thread pid 18990 was inactive for 238.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 21:21:16 Pid: 18990, comm: mdt_rdpg00_002 2013-05-19 21:21:16 2013-05-19 21:21:16 Call Trace: 2013-05-19 21:21:16 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 21:21:16 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 21:21:16 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 21:21:16 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 21:21:16 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 21:21:16 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 21:21:16 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 21:21:16 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 21:21:16 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 21:21:16 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 21:21:16 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 21:21:16 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 21:21:16 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 21:21:16 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 21:21:16 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 21:21:16 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 21:21:16 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 21:21:16 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 21:21:16 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 21:21:16 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 21:21:16 [] ? __wake_up+0x53/0x70 2013-05-19 21:21:16 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 21:21:16 [] ? 
ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:21:17 [] child_rip+0xa/0x20 2013-05-19 21:21:17 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:21:17 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:21:17 [] ? child_rip+0x0/0x20 2013-05-19 21:21:17 2013-05-19 21:21:17 LustreError: dumping log to /tmp/lustre-log.1369023677.18990 2013-05-19 21:21:17 LNet: Service thread pid 22556 was inactive for 238.18s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 21:21:17 Pid: 22556, comm: mdt_rdpg00_014 2013-05-19 21:21:17 2013-05-19 21:21:17 Call Trace: 2013-05-19 21:21:17 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 21:21:17 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 21:21:17 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 21:21:17 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 21:21:17 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 21:21:17 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 21:21:17 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 21:21:17 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 21:21:17 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 21:21:17 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 21:21:17 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 21:21:17 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 21:21:17 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 21:21:17 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 21:21:17 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 21:21:17 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 21:21:17 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 21:21:17 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 21:21:17 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 21:21:17 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 21:21:17 [] ? __wake_up+0x53/0x70 2013-05-19 21:21:17 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 21:21:17 [] ? 
ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:21:17 [] child_rip+0xa/0x20 2013-05-19 21:21:17 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:21:17 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:21:17 [] ? child_rip+0x0/0x20 2013-05-19 21:21:17 2013-05-19 21:24:18 LNet: Service thread pid 22556 completed after 419.78s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 21:24:21 LNet: Service thread pid 18990 completed after 422.20s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 21:24:24 LNet: Service thread pid 19317 was inactive for 682.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 21:24:24 Pid: 19317, comm: mdt03_032 2013-05-19 21:24:24 2013-05-19 21:24:24 Call Trace: 2013-05-19 21:24:24 [] ? try_to_wake_up+0x24e/0x3e0 2013-05-19 21:24:24 [] ? wake_up_process+0x15/0x20 2013-05-19 21:24:24 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 21:24:24 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 21:24:24 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 21:24:24 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 21:24:24 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 21:24:24 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 21:24:24 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 21:24:24 [] llog_write+0x22c/0x440 [obdclass] 2013-05-19 21:24:24 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 21:24:24 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 21:24:24 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 21:24:24 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 21:24:24 [] ? 
llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 21:24:24 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 21:24:24 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 21:24:24 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 21:24:24 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 21:24:24 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 21:24:24 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 21:24:24 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 21:24:24 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 21:24:24 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 21:24:24 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 21:24:24 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 21:24:24 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 21:24:24 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 21:24:24 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 21:24:24 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 21:24:24 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 21:24:24 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 21:24:24 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 21:24:24 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 21:24:24 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 21:24:24 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 21:24:24 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 21:24:24 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 21:24:24 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 21:24:24 [] ? default_wake_function+0x0/0x20 2013-05-19 21:24:24 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 21:24:24 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:24:24 [] child_rip+0xa/0x20 2013-05-19 21:24:24 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:24:24 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:24:24 [] ? 
child_rip+0x0/0x20 2013-05-19 21:24:24 2013-05-19 21:24:24 LustreError: dumping log to /tmp/lustre-log.1369023864.19317 2013-05-19 21:26:28 LNet: Service thread pid 19390 was inactive for 346.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 2013-05-19 21:26:28 Pid: 19390, comm: mdt_rdpg00_006 2013-05-19 21:26:28 2013-05-19 21:26:28 Call Trace: 2013-05-19 21:26:28 [] ? prepare_to_wait_exclusive+0x4e/0x80 2013-05-19 21:26:28 [] cv_wait_common+0xed/0x100 [spl] 2013-05-19 21:26:28 [] ? autoremove_wake_function+0x0/0x40 2013-05-19 21:26:28 [] __cv_wait+0x15/0x20 [spl] 2013-05-19 21:26:28 [] txg_wait_open+0x7b/0xa0 [zfs] 2013-05-19 21:26:28 [] dmu_tx_wait+0xed/0xf0 [zfs] 2013-05-19 21:26:28 [] dmu_tx_assign+0x86/0x480 [zfs] 2013-05-19 21:26:28 [] osd_trans_start+0x9c/0x410 [osd_zfs] 2013-05-19 21:26:28 [] lod_trans_start+0x1b9/0x250 [lod] 2013-05-19 21:26:28 [] mdd_trans_start+0x17/0x20 [mdd] 2013-05-19 21:26:28 [] mdd_close+0x6ae/0xb80 [mdd] 2013-05-19 21:26:28 [] mdt_mfd_close+0x129/0x6e0 [mdt] 2013-05-19 21:26:28 [] mdt_close+0x682/0xac0 [mdt] 2013-05-19 21:26:28 [] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc] 2013-05-19 21:26:28 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 21:26:28 [] mds_readpage_handle+0x15/0x20 [mdt] 2013-05-19 21:26:28 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 21:26:28 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 21:26:28 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 21:26:28 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 21:26:28 [] ? __wake_up+0x53/0x70 2013-05-19 21:26:28 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 21:26:28 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:26:28 [] child_rip+0xa/0x20 2013-05-19 21:26:28 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:26:28 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 21:26:28 [] ? 
child_rip+0x0/0x20 2013-05-19 21:26:28 2013-05-19 21:26:28 LustreError: dumping log to /tmp/lustre-log.1369023988.19390 2013-05-19 21:27:23 Lustre: 19296:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-262), not sending early reply 2013-05-19 21:27:23 req@ffff880a9c313000 x1434323175942767/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 1 to 0 dl 1369024053 ref 2 fl Interpret:/0/0 rc 0/0 2013-05-19 21:30:41 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) reconnecting 2013-05-19 21:30:41 Lustre: Skipped 53 previous similar messages 2013-05-19 21:30:41 Lustre: fsv-MDT0000: Client 39b49672-90d0-b52d-c6ad-7c9af1a746fb (at 172.20.5.108@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 21:30:41 Lustre: Skipped 48 previous similar messages 2013-05-19 21:31:34 Lustre: 19317:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (872:241s); client may timeout. req@ffff880a9c313000 x1434323175942767/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 1 to 0 dl 1369024053 ref 1 fl Complete:/0/0 rc 0/0 2013-05-19 21:31:34 LNet: Service thread pid 19317 completed after 1112.18s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 21:32:25 LNet: Service thread pid 19390 completed after 702.92s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 2013-05-19 21:33:08 sd 1:0:9:0: [sdy] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE 2013-05-19 21:33:08 sd 1:0:9:0: [sdy] Sense Key : Aborted Command [current] 2013-05-19 21:33:08 sd 1:0:9:0: [sdy] Add. Sense: Nak received 2013-05-19 21:33:08 sd 1:0:9:0: [sdy] CDB: Read(10): 28 00 11 e2 9b e2 00 00 05 00 2013-05-19 21:33:08 device-mapper: multipath: Failing path 65:128. 
2013-05-19 21:44:18 Lustre: fsv-MDT0000: Client 87a33d2b-615f-9d07-50f0-4fa24a62e523 (at 172.20.17.41@o2ib500) reconnecting 2013-05-19 21:44:18 Lustre: Skipped 16 previous similar messages 2013-05-19 21:44:18 Lustre: fsv-MDT0000: Client 87a33d2b-615f-9d07-50f0-4fa24a62e523 (at 172.20.17.41@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 21:44:18 Lustre: Skipped 12 previous similar messages 2013-05-19 21:45:19 Lustre: 19326:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-145), not sending early reply 2013-05-19 21:45:19 req@ffff880a2f505800 x1434323175998168/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 0 to 0 dl 1369025129 ref 2 fl Interpret:/0/0 rc 0/0 2013-05-19 21:47:00 Lustre: 20158:0:(service.c:1995:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:91s); client may timeout. req@ffff880a2f505800 x1434323175998168/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/192 e 0 to 0 dl 1369025129 ref 1 fl Complete:/0/0 rc 0/0 2013-05-19 21:54:34 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) reconnecting 2013-05-19 21:54:34 Lustre: Skipped 21 previous similar messages 2013-05-19 21:54:34 Lustre: fsv-MDT0000: Client b9fc389d-a11a-2c50-9959-edeb6fa9f291 (at 172.20.17.22@o2ib500) refused reconnection, still busy with 1 active RPCs 2013-05-19 21:54:34 Lustre: Skipped 20 previous similar messages 2013-05-19 21:59:41 Lustre: 18467:0:(service.c:1296:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-145), not sending early reply 2013-05-19 21:59:41 req@ffff880ffde18800 x1434323176032632/t0(0) o46->39b49672-90d0-b52d-c6ad-7c9af1a746fb@172.20.5.108@o2ib500:0/0 lens 264/224 e 0 to 0 dl 1369025991 ref 2 fl Interpret:/0/0 rc 0/0 Console [vesta-mds1] log at 2013-05-19 22:00:00 PDT. 
2013-05-19 22:00:50 Lustre: lock timed out (enqueued at 1369025850, 200s ago) 2013-05-19 22:03:00 LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 1841s: evicting client at 172.20.5.108@o2ib500 ns: mdt-ffff88078ca74000 lock: ffff880a352ca200/0x48db03b08854eea5 lrc: 3/0,0 mode: PR/PR res: 8589945367/96000 bits 0x13 rrc: 6 type: IBT flags: 0x200000000020 nid: 172.20.5.108@o2ib500 remote: 0x94f247df14d0c0ae expref: 28510 pid: 19585 timeout: 4416356333 lvb_type: 0 2013-05-19 22:03:31 LustreError: 21576:0:(osd_object.c:727:osd_attr_get()) ASSERTION( dt_object_exists(dt) ) failed: 2013-05-19 22:03:31 LustreError: 21576:0:(osd_object.c:727:osd_attr_get()) LBUG 2013-05-19 22:03:31 Pid: 21576, comm: mdt03_066 2013-05-19 22:03:31 2013-05-19 22:03:31 Call Trace: 2013-05-19 22:03:31 [] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 2013-05-19 22:03:31 [] lbug_with_loc+0x47/0xb0 [libcfs] 2013-05-19 22:03:31 [] osd_attr_get+0x1a2/0x1e0 [osd_zfs] 2013-05-19 22:03:31 [] llog_osd_write_rec+0x160/0x11f0 [obdclass] 2013-05-19 22:03:31 [] llog_write_rec+0xc8/0x290 [obdclass] 2013-05-19 22:03:31 [] llog_write+0x2f5/0x440 [obdclass] 2013-05-19 22:03:31 [] llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 22:03:31 [] llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 22:03:31 [] llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 22:03:31 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 22:03:31 [] ? 
llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 22:03:31 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 22:03:31 [] llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 22:03:31 [] llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 22:03:31 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 22:03:31 [] llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 22:03:31 [] llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 22:03:31 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 22:03:31 [] llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 22:03:31 [] llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 22:03:31 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 22:03:31 [] llog_cancel+0x58/0x250 [obdclass] 2013-05-19 22:03:31 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 22:03:31 [] mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 22:03:31 [] mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 22:03:31 [] mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 22:03:31 [] mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 22:03:31 [] mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 22:03:31 [] mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 22:03:31 [] mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 22:03:31 [] mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 22:03:31 [] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 22:03:31 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 22:03:31 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 22:03:31 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 22:03:31 [] ? __wake_up+0x53/0x70 2013-05-19 22:03:31 [] ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 22:03:31 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 22:03:31 [] child_rip+0xa/0x20 2013-05-19 22:03:31 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 22:03:31 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 22:03:31 [] ? 
child_rip+0x0/0x20 2013-05-19 22:03:31 2013-05-19 22:03:31 Kernel panic - not syncing: LBUG 2013-05-19 22:03:31 Pid: 21576, comm: mdt03_066 Tainted: P --------------- 2.6.32-358.6z2.1.1chaos.ch5.1.x86_64 #1 2013-05-19 22:03:31 Call Trace: 2013-05-19 22:03:31 [] ? panic+0xa7/0x16f 2013-05-19 22:03:31 [] ? lbug_with_loc+0x9b/0xb0 [libcfs] 2013-05-19 22:03:31 [] ? osd_attr_get+0x1a2/0x1e0 [osd_zfs] 2013-05-19 22:03:31 [] ? llog_osd_write_rec+0x160/0x11f0 [obdclass] 2013-05-19 22:03:31 [] ? llog_write_rec+0xc8/0x290 [obdclass] 2013-05-19 22:03:31 [] ? llog_write+0x2f5/0x440 [obdclass] 2013-05-19 22:03:31 [] ? llog_cancel_rec+0xbc/0x560 [obdclass] 2013-05-19 22:03:31 [] ? llog_cat_cancel_records+0xfe/0x260 [obdclass] 2013-05-19 22:03:31 [] ? llog_changelog_cancel_cb+0x141/0x1d0 [mdd] 2013-05-19 22:03:31 [] ? llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 22:03:31 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 22:03:31 [] ? llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 22:03:31 [] ? llog_cat_process_cb+0x2e2/0x390 [obdclass] 2013-05-19 22:03:32 [] ? llog_process_thread+0x8fb/0xe00 [obdclass] 2013-05-19 22:03:32 [] ? llog_cat_process_cb+0x0/0x390 [obdclass] 2013-05-19 22:03:32 [] ? llog_process_or_fork+0x12d/0x660 [obdclass] 2013-05-19 22:03:32 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 22:03:32 [] ? llog_changelog_cancel_cb+0x0/0x1d0 [mdd] 2013-05-19 22:03:32 [] ? llog_cat_process+0x19/0x20 [obdclass] 2013-05-19 22:03:32 [] ? llog_changelog_cancel+0x5f/0x1c0 [mdd] 2013-05-19 22:03:32 [] ? llog_cat_process_or_fork+0x89/0x280 [obdclass] 2013-05-19 22:03:32 [] ? llog_cancel+0x58/0x250 [obdclass] 2013-05-19 22:03:32 [] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2013-05-19 22:03:32 [] ? mdd_changelog_llog_cancel+0x12e/0x240 [mdd] 2013-05-19 22:03:32 [] ? mdd_changelog_user_purge+0x360/0x540 [mdd] 2013-05-19 22:03:32 [] ? mdd_iocontrol+0x2a3/0xbd0 [mdd] 2013-05-19 22:03:32 [] ? mdt_ioc_child+0x149/0x1d0 [mdt] 2013-05-19 22:03:32 [] ? 
mdt_iocontrol+0x2e8/0x7a0 [mdt] 2013-05-19 22:03:32 [] ? mdt_set_info+0x1e6/0x480 [mdt] 2013-05-19 22:03:32 [] ? mdt_handle_common+0x648/0x1660 [mdt] 2013-05-19 22:03:32 [] ? mds_regular_handle+0x15/0x20 [mdt] 2013-05-19 22:03:32 [] ? ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] 2013-05-19 22:03:32 [] ? cfs_timer_arm+0xe/0x10 [libcfs] 2013-05-19 22:03:32 [] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 2013-05-19 22:03:32 [] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 2013-05-19 22:03:32 [] ? __wake_up+0x53/0x70 2013-05-19 22:03:32 [] ? ptlrpc_main+0xb75/0x1870 [ptlrpc] 2013-05-19 22:03:32 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 22:03:32 [] ? child_rip+0xa/0x20 2013-05-19 22:03:32 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 22:03:32 [] ? ptlrpc_main+0x0/0x1870 [ptlrpc] 2013-05-19 22:03:32 [] ? child_rip+0x0/0x20 2013-05-19 22:03:32 Initializing cgroup subsys cpuset 2013-05-19 22:03:32 Initializing cgroup subsys cpu 2013-05-19 22:03:32 Linux version 2.6.32-358.6z2.1.1chaos.ch5.1.x86_64 (mockbuild@builder2) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Wed Apr 3 16:05:11 PDT 2013