Sep 4 09:54:32 cs04r-sc-mds03-02 kernel: LDISKFS-fs warning (device dm-4): ldiskfs_multi_mount_protect: MMP interval 42 higher than expected, please wait.
Sep 4 09:54:32 cs04r-sc-mds03-02 kernel:
Sep 4 09:55:14 cs04r-sc-mds03-02 kernel: LDISKFS-fs (dm-4): warning: maximal mount count reached, running e2fsck is recommended
Sep 4 09:55:14 cs04r-sc-mds03-02 kernel: LDISKFS-fs (dm-4): recovery complete
Sep 4 09:55:14 cs04r-sc-mds03-02 kernel: LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. quota=off. Opts:
Sep 4 09:55:21 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 09:55:22 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 09:55:24 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 09:55:24 cs04r-sc-mds03-02 kernel: Lustre: Skipped 1 previous similar message
Sep 4 09:55:26 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 09:55:26 cs04r-sc-mds03-02 kernel: Lustre: Skipped 4 previous similar messages
Sep 4 09:55:34 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 09:55:34 cs04r-sc-mds03-02 kernel: Lustre: Skipped 11 previous similar messages
Sep 4 09:55:36 cs04r-sc-mds03-02 kernel: LDISKFS-fs warning (device dm-5): ldiskfs_multi_mount_protect: MMP interval 42 higher than expected, please wait.
Sep 4 09:55:36 cs04r-sc-mds03-02 kernel:
Sep 4 09:55:42 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 09:55:42 cs04r-sc-mds03-02 kernel: Lustre: Skipped 10 previous similar messages
Sep 4 09:55:58 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 09:55:58 cs04r-sc-mds03-02 kernel: Lustre: Skipped 40 previous similar messages
Sep 4 09:56:21 cs04r-sc-mds03-02 kernel: LDISKFS-fs (dm-5): warning: maximal mount count reached, running e2fsck is recommended
Sep 4 09:56:21 cs04r-sc-mds03-02 kernel: LDISKFS-fs (dm-5): recovery complete
Sep 4 09:56:21 cs04r-sc-mds03-02 kernel: LDISKFS-fs (dm-5): mounted filesystem with ordered data mode. quota=off. Opts:
Sep 4 09:56:24 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1409820984/real 1409820984] req@ffff880fb62ecc00 x1477157865914384/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409820989 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 09:56:24 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 172.23.142.183@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 09:56:25 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 172.23.122.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 09:56:27 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 10.144.144.33@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 09:56:27 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 10.144.144.33@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 09:56:27 cs04r-sc-mds03-02 kernel: LustreError: Skipped 2 previous similar messages
Sep 4 09:56:27 cs04r-sc-mds03-02 kernel: LustreError: Skipped 5 previous similar messages
Sep 4 09:56:29 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 10.144.148.5@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 09:56:29 cs04r-sc-mds03-02 kernel: LustreError: Skipped 4 previous similar messages
Sep 4 09:56:30 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 09:56:30 cs04r-sc-mds03-02 kernel: Lustre: Skipped 53 previous similar messages
Sep 4 09:56:33 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 172.23.132.31@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 09:56:33 cs04r-sc-mds03-02 kernel: LustreError: Skipped 7 previous similar messages
Sep 4 09:56:35 cs04r-sc-mds03-02 kernel: LustreError: 43873:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880fb62ec800 x1477157865914388/t0(0) o253->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 4 09:56:35 cs04r-sc-mds03-02 kernel: LustreError: 43873:0:(obd_mount_server.c:1136:server_register_target()) lustre03-MDT0000: error registering with the MGS: rc = -5 (not fatal)
Sep 4 09:56:41 cs04r-sc-mds03-02 kernel: LustreError: 43873:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880fb62ec800 x1477157865914392/t0(0) o101->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 4 09:56:41 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 10.144.140.29@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 09:56:41 cs04r-sc-mds03-02 kernel: LustreError: Skipped 20 previous similar messages
Sep 4 09:56:47 cs04r-sc-mds03-02 kernel: LustreError: 43873:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880fb62ec800 x1477157865914396/t0(0) o101->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 4 09:56:47 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Not available for connect from 10.144.140.15@o2ib (not set up)
Sep 4 09:56:47 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: used disk, loading
Sep 4 09:56:47 cs04r-sc-mds03-02 kernel: LustreError: 43947:0:(sec_config.c:1121:sptlrpc_target_local_read_conf()) missing llog context
Sep 4 09:56:47 cs04r-sc-mds03-02 kernel: Lustre: 43947:0:(mdt_handler.c:5246:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.
Sep 4 09:56:53 cs04r-sc-mds03-02 kernel: LustreError: 43873:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880f1d06d800 x1477157865914400/t0(0) o101->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 4 09:56:59 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821009/real 1409821009] req@ffff882013bf2400 x1477157865914408/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409821019 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 09:56:59 cs04r-sc-mds03-02 kernel: LustreError: 43873:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880f1d06d800 x1477157865914412/t0(0) o101->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 4 09:56:59 cs04r-sc-mds03-02 kernel: LustreError: 13a-8: Failed to get MGS log params and no local copy.
Sep 4 09:57:10 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821025/real 1409821025] req@ffff881f70ae1000 x1477157865914420/t0(0) o38->lustre03-MDT0000-lwp-MDT0000@10.144.144.1@o2ib:12/10 lens 400/544 e 0 to 1 dl 1409821030 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 09:57:11 cs04r-sc-mds03-02 kernel: LustreError: 43873:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff880f1d06d800 x1477157865914424/t0(0) o101->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Sep 4 09:57:11 cs04r-sc-mds03-02 kernel: LustreError: 43873:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Sep 4 09:57:11 cs04r-sc-mds03-02 kernel: LustreError: 13a-8: Failed to get MGS log params and no local copy.
Sep 4 09:57:30 cs04r-sc-mds03-02 kernel: LustreError: 11-0: lustre03-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Sep 4 09:57:34 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 09:57:34 cs04r-sc-mds03-02 kernel: Lustre: Skipped 114 previous similar messages
Sep 4 09:57:43 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDD0000: changelog on
Sep 4 09:57:43 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Will be in recovery for at least 5:00, or until 70 clients reconnect
Sep 4 09:57:43 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client 147020c2-191d-3bff-313d-addaa7ebb06d (at 172.23.107.32@tcp), waiting for all 70 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:59
Sep 4 09:57:44 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client 73e510f8-7fd7-6a13-d94c-6f15f0898f41 (at 172.23.132.12@tcp), waiting for all 70 known clients (2 recovered, 1 in progress, and 0 evicted) to recover in 14:34
Sep 4 09:57:45 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821050/real 1409821050] req@ffff881fc2567400 x1477157865914436/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409821065 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 09:57:45 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client ce7b52f4-7c37-7ae9-1ae0-0cb5f21e02b6 (at 172.23.86.32@tcp), waiting for all 70 known clients (3 recovered, 2 in progress, and 0 evicted) to recover in 14:32
Sep 4 09:57:45 cs04r-sc-mds03-02 kernel: Lustre: Skipped 7 previous similar messages
Sep 4 09:57:47 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client f7be1296-9985-357c-d4f6-dffc55b70074 (at 172.23.120.33@tcp), waiting for all 70 known clients (7 recovered, 4 in progress, and 0 evicted) to recover in 14:30
Sep 4 09:57:47 cs04r-sc-mds03-02 kernel: Lustre: Skipped 13 previous similar messages
Sep 4 09:57:52 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client 442c4dc2-330a-5927-4aec-81bcd2a6fca8 (at 172.23.130.3@tcp), waiting for all 70 known clients (13 recovered, 15 in progress, and 0 evicted) to recover in 14:26
Sep 4 09:57:52 cs04r-sc-mds03-02 kernel: Lustre: Skipped 23 previous similar messages
Sep 4 09:58:00 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client fef7cdd2-990f-337e-24a3-8e3447bb4674 (at 172.23.88.60@tcp), waiting for all 70 known clients (23 recovered, 32 in progress, and 0 evicted) to recover in 14:18
Sep 4 09:58:00 cs04r-sc-mds03-02 kernel: Lustre: Skipped 47 previous similar messages
Sep 4 09:58:15 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821075/real 1409821075] req@ffff881fb5570c00 x1477157865914448/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409821095 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 09:58:16 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client c29919b1-d6d0-7f38-cdec-0a50c0b3345c (at 172.23.136.19@tcp), waiting for all 70 known clients (30 recovered, 40 in progress, and 0 evicted) to recover in 14:01
Sep 4 09:58:16 cs04r-sc-mds03-02 kernel: Lustre: Skipped 102 previous similar messages
Sep 4 09:58:45 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821100/real 1409821100] req@ffff881fc5592800 x1477157865914460/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409821125 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 09:58:48 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client d385aaf1-7127-45d8-e931-6ba00b7a8b77 (at 172.23.134.69@tcp), waiting for all 70 known clients (30 recovered, 40 in progress, and 0 evicted) to recover in 13:29
Sep 4 09:58:48 cs04r-sc-mds03-02 kernel: Lustre: Skipped 237 previous similar messages
Sep 4 09:59:40 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821150/real 1409821150] req@ffff881fc5ddf000 x1477157865914480/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409821180 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 09:59:52 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client f7be1296-9985-357c-d4f6-dffc55b70074 (at 172.23.120.33@tcp), waiting for all 70 known clients (30 recovered, 40 in progress, and 0 evicted) to recover in 12:25
Sep 4 09:59:52 cs04r-sc-mds03-02 kernel: Lustre: Skipped 527 previous similar messages
Sep 4 10:00:23 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 10:00:23 cs04r-sc-mds03-02 kernel: Lustre: Skipped 21 previous similar messages
Sep 4 10:00:35 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821200/real 1409821200] req@ffff881fcfcbd000 x1477157865914500/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409821235 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 10:00:58 cs04r-sc-mds03-02 kernel: LustreError: 44052:0:(ldlm_lib.c:1751:check_for_next_transno()) lustre03-MDT0000: waking for gap in transno, VBR is OFF (skip: 165414918530, ql: 22, comp: 48, conn: 70, next: 165414918531, last_committed: 165414918529)
Sep 4 10:00:58 cs04r-sc-mds03-02 kernel: LustreError: 44052:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 6 of 32
Sep 4 10:00:59 cs04r-sc-mds03-02 kernel: LustreError: 44052:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 2 of 32
Sep 4 10:00:59 cs04r-sc-mds03-02 kernel: LustreError: 44052:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 28 previous similar messages
Sep 4 10:01:01 cs04r-sc-mds03-02 kernel: LustreError: 44052:0:(ldlm_lib.c:1751:check_for_next_transno()) lustre03-MDT0000: waking for gap in transno, VBR is OFF (skip: 165414918562, ql: 21, comp: 49, conn: 70, next: 165414918563, last_committed: 165414918529)
Sep 4 10:01:01 cs04r-sc-mds03-02 kernel: LustreError: 44052:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 3 of 32
Sep 4 10:01:03 cs04r-sc-mds03-02 kernel: LustreError: 44052:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 20 of 32
Sep 4 10:01:03 cs04r-sc-mds03-02 kernel: LustreError: 44052:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 126 previous similar messages
Sep 4 10:02:01 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client 2f9e3e1d-5b49-88d4-c6fd-9f4516784015 (at 172.23.132.21@tcp), waiting for all 70 known clients (38 recovered, 32 in progress, and 0 evicted) to recover in 12:33
Sep 4 10:02:01 cs04r-sc-mds03-02 kernel: Lustre: Skipped 1107 previous similar messages
Sep 4 10:02:25 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821300/real 1409821300] req@ffff881f67ffb000 x1477157865914540/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409821345 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 10:02:25 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: INFO: task tgt_recov:44052 blocked for more than 120 seconds.
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: Not tainted 2.6.32-431.17.1.el6_lustre.x86_64 #1
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: tgt_recov D 0000000000000001 0 44052 2 0x00000080
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: ffff880fe1539da0 0000000000000046 ffff880fe1539d00 ffff880fe1539d64
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: ffff881f00000000 ffff88103fc28800 0000000000000046 0000000000000046
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: ffff880fe0689ab8 ffff880fe1539fd8 000000000000fbc8 ffff880fe0689ab8
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: Call Trace:
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] ? check_for_next_transno+0x0/0x590 [ptlrpc]
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] ? autoremove_wake_function+0x0/0x40
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] target_recovery_thread+0x76a/0x1920 [ptlrpc]
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] ? default_wake_function+0x12/0x20
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] kthread+0x96/0xa0
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] child_rip+0xa/0x20
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] ? kthread+0x0/0xa0
Sep 4 10:04:28 cs04r-sc-mds03-02 kernel: [] ? child_rip+0x0/0x20
Sep 4 10:04:40 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821425/real 1409821425] req@ffff881f72afb000 x1477157865914588/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409821480 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 10:04:40 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 4 10:06:17 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Denying connection for new client 317e54a4-dbda-ea53-80e5-e5088dcf34cd (at 172.23.82.234@tcp), waiting for all 70 known clients (38 recovered, 32 in progress, and 0 evicted) to recover in 8:17
Sep 4 10:06:17 cs04r-sc-mds03-02 kernel: Lustre: Skipped 2227 previous similar messages
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: INFO: task tgt_recov:44052 blocked for more than 120 seconds.
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: Not tainted 2.6.32-431.17.1.el6_lustre.x86_64 #1
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: tgt_recov D 0000000000000001 0 44052 2 0x00000080
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: ffff880fe1539da0 0000000000000046 ffff880fe1539d00 ffff880fe1539d64
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: ffff881f00000000 ffff88103fc28800 0000000000000046 0000000000000046
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: ffff880fe0689ab8 ffff880fe1539fd8 000000000000fbc8 ffff880fe0689ab8
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: Call Trace:
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] ? check_for_next_transno+0x0/0x590 [ptlrpc]
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] ? autoremove_wake_function+0x0/0x40
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] target_recovery_thread+0x76a/0x1920 [ptlrpc]
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] ? default_wake_function+0x12/0x20
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] kthread+0x96/0xa0
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] child_rip+0xa/0x20
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] ? kthread+0x0/0xa0
Sep 4 10:06:28 cs04r-sc-mds03-02 kernel: [] ? child_rip+0x0/0x20
Sep 4 10:08:28 cs04r-sc-mds03-02 kernel: INFO: task tgt_recov:44052 blocked for more than 120 seconds.
Sep 4 10:08:28 cs04r-sc-mds03-02 kernel: Not tainted 2.6.32-431.17.1.el6_lustre.x86_64 #1
Sep 4 10:08:28 cs04r-sc-mds03-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 4 10:08:28 cs04r-sc-mds03-02 kernel: tgt_recov D 0000000000000001 0 44052 2 0x00000080
Sep 4 10:08:28 cs04r-sc-mds03-02 kernel: ffff880fe1539da0 0000000000000046 ffff880fe1539d00 ffff880fe1539d64
Sep 4 10:08:28 cs04r-sc-mds03-02 kernel: ffff881f00000000 ffff88103fc28800 0000000000000046 0000000000000046
Sep 4 10:08:28 cs04r-sc-mds03-02 kernel: ffff880fe0689ab8 ffff880fe1539fd8 000000000000fbc8 ffff880fe0689ab8
Sep 4 10:08:28 cs04r-sc-mds03-02 kernel: Call Trace:
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] ? check_for_next_transno+0x0/0x590 [ptlrpc]
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] ? autoremove_wake_function+0x0/0x40
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] target_recovery_thread+0x76a/0x1920 [ptlrpc]
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] ? default_wake_function+0x12/0x20
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] kthread+0x96/0xa0
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] child_rip+0xa/0x20
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] ? kthread+0x0/0xa0
Sep 4 10:08:29 cs04r-sc-mds03-02 kernel: [] ? child_rip+0x0/0x20
Sep 4 10:09:11 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 10:09:40 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409821725/real 1409821725] req@ffff880fd40fbc00 x1477157865914700/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409821780 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 10:09:40 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Sep 4 10:10:13 cs04r-sc-mds03-02 kernel: Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
Sep 4 10:10:13 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1507:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=30, delay=0(jiff)
Sep 4 10:10:13 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-720s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff880fde0b4c00 x1477164410653452/t0(0) o400->73dc5093-e411-5e4b-09fe-acdc969dc29b@10.144.140.24@o2ib:0/0 lens 224/0 e 1 to 0 dl 1409821093 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
Sep 4 10:10:14 cs04r-sc-mds03-02 kernel: Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
Sep 4 10:10:14 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1507:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=30, delay=0(jiff)
Sep 4 10:10:14 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-720s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff880fc9cfc800 x1477161319262524/t0(0) o400->b4fc08ec-c8f1-e8ac-0cc3-ec4ccfb644b9@10.144.148.5@o2ib:0/0 lens 224/0 e 1 to 0 dl 1409821094 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
Sep 4 10:10:16 cs04r-sc-mds03-02 kernel: Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
Sep 4 10:10:16 cs04r-sc-mds03-02 kernel: Lustre: Skipped 2 previous similar messages
Sep 4 10:10:16 cs04r-sc-mds03-02 kernel: Lustre: 44050:0:(service.c:1507:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=30, delay=0(jiff)
Sep 4 10:10:16 cs04r-sc-mds03-02 kernel: Lustre: 44050:0:(service.c:1507:ptlrpc_at_check_timed()) Skipped 2 previous similar messages
Sep 4 10:10:16 cs04r-sc-mds03-02 kernel: Lustre: 44050:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-720s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff881fc2e73800 x1477164376758036/t0(0) o400->d5017538-02b0-2b3d-8db1-d5cfd2c2cdb4@10.144.140.18@o2ib:0/0 lens 224/0 e 1 to 0 dl 1409821096 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
Sep 4 10:10:16 cs04r-sc-mds03-02 kernel: Lustre: 44050:0:(service.c:1304:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
Sep 4 10:10:18 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client 73dc5093-e411-5e4b-09fe-acdc969dc29b (at 10.144.140.24@o2ib) reconnecting, waiting for 70 clients in recovery for 4:15
Sep 4 10:10:18 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client 73dc5093-e411-5e4b-09fe-acdc969dc29b (at 10.144.140.24@o2ib) refused reconnection, still busy with 1 active RPCs
Sep 4 10:10:19 cs04r-sc-mds03-02 kernel: Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
Sep 4 10:10:19 cs04r-sc-mds03-02 kernel: Lustre: Skipped 4 previous similar messages
Sep 4 10:10:19 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1507:ptlrpc_at_check_timed()) earlyQ=3 reqQ=0 recA=0, svcEst=30, delay=0(jiff)
Sep 4 10:10:19 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1507:ptlrpc_at_check_timed()) Skipped 4 previous similar messages
Sep 4 10:10:19 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-720s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff880fabfcfc00 x1477159873722396/t0(0) o101->3dddf195-6c14-0382-3584-f23c10ca3089@10.144.140.45@o2ib:0/0 lens 328/0 e 1 to 0 dl 1409821099 ref 2 fl Complete:/40/ffffffff rc 0/-1
Sep 4 10:10:19 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1304:ptlrpc_at_send_early_reply()) Skipped 8 previous similar messages
Sep 4 10:10:20 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client f9c46ed7-4938-e016-8327-5b9348e02c11 (at 10.144.140.28@o2ib) reconnecting, waiting for 70 clients in recovery for 4:14
Sep 4 10:10:20 cs04r-sc-mds03-02 kernel: Lustre: Skipped 1 previous similar message
Sep 4 10:10:20 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client f9c46ed7-4938-e016-8327-5b9348e02c11 (at 10.144.140.28@o2ib) refused reconnection, still busy with 1 active RPCs
Sep 4 10:10:20 cs04r-sc-mds03-02 kernel: Lustre: Skipped 1 previous similar message
Sep 4 10:10:21 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client d5017538-02b0-2b3d-8db1-d5cfd2c2cdb4 (at 10.144.140.18@o2ib) reconnecting, waiting for 70 clients in recovery for 4:12
Sep 4 10:10:21 cs04r-sc-mds03-02 kernel: Lustre: Skipped 1 previous similar message
Sep 4 10:10:21 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client d5017538-02b0-2b3d-8db1-d5cfd2c2cdb4 (at 10.144.140.18@o2ib) refused reconnection, still busy with 1 active RPCs
Sep 4 10:10:21 cs04r-sc-mds03-02 kernel: Lustre: Skipped 1 previous similar message
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: Skipped 15 previous similar messages
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1507:ptlrpc_at_check_timed()) earlyQ=5 reqQ=0 recA=0, svcEst=30, delay=0(jiff)
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1507:ptlrpc_at_check_timed()) Skipped 15 previous similar messages
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-720s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff880f01c5c000 x1477161376631540/t0(0) o101->b7ea9fd8-7c7a-26af-2a08-23f2dee92771@10.144.148.8@o2ib:0/0 lens 328/0 e 1 to 0 dl 1409821104 ref 2 fl Complete:/40/ffffffff rc 0/-1
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: 43900:0:(service.c:1304:ptlrpc_at_send_early_reply()) Skipped 22 previous similar messages
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client a1af0604-b9b7-6114-f91c-2d845fcfa6b6 (at 10.144.148.11@o2ib) reconnecting, waiting for 70 clients in recovery for 4:10
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: Skipped 4 previous similar messages
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client a1af0604-b9b7-6114-f91c-2d845fcfa6b6 (at 10.144.148.11@o2ib) refused reconnection, still busy with 5 active RPCs
Sep 4 10:10:24 cs04r-sc-mds03-02 kernel: Lustre: Skipped 4 previous similar messages
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: INFO: task tgt_recov:44052 blocked for more than 120 seconds.
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: Not tainted 2.6.32-431.17.1.el6_lustre.x86_64 #1
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: tgt_recov D 0000000000000001 0 44052 2 0x00000080
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: ffff880fe1539da0 0000000000000046 ffff880fe1539d00 ffff880fe1539d64
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: ffff881f00000000 ffff88103fc28800 0000000000000046 0000000000000046
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: ffff880fe0689ab8 ffff880fe1539fd8 000000000000fbc8 ffff880fe0689ab8
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: Call Trace:
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] ? check_for_next_transno+0x0/0x590 [ptlrpc]
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] ? autoremove_wake_function+0x0/0x40
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] target_recovery_thread+0x76a/0x1920 [ptlrpc]
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] ? default_wake_function+0x12/0x20
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] kthread+0x96/0xa0
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] child_rip+0xa/0x20
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] ? kthread+0x0/0xa0
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: [] ? child_rip+0x0/0x20
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client b7ea9fd8-7c7a-26af-2a08-23f2dee92771 (at 10.144.148.8@o2ib) reconnecting, waiting for 70 clients in recovery for 4:05
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: Lustre: Skipped 10 previous similar messages
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client b7ea9fd8-7c7a-26af-2a08-23f2dee92771 (at 10.144.148.8@o2ib) refused reconnection, still busy with 5 active RPCs
Sep 4 10:10:29 cs04r-sc-mds03-02 kernel: Lustre: Skipped 10 previous similar messages
Sep 4 10:10:32 cs04r-sc-mds03-02 kernel: Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
Sep 4 10:10:32 cs04r-sc-mds03-02 kernel: Lustre: Skipped 1401 previous similar messages
Sep 4 10:10:32 cs04r-sc-mds03-02 kernel: Lustre: 43897:0:(service.c:1507:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=30, delay=0(jiff)
Sep 4 10:10:32 cs04r-sc-mds03-02 kernel: Lustre: 43897:0:(service.c:1507:ptlrpc_at_check_timed()) Skipped 1401 previous similar messages
Sep 4 10:10:32 cs04r-sc-mds03-02 kernel: Lustre: 43897:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-720s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff880fdbd4c000 x1477785492006728/t0(0) o400->236562b1-33c3-e0f4-8b11-bf713229c9a7@10.144.140.1@o2ib:0/0 lens 224/0 e 1 to 0 dl 1409821112 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
Sep 4 10:10:32 cs04r-sc-mds03-02 kernel: Lustre: 43897:0:(service.c:1304:ptlrpc_at_send_early_reply()) Skipped 1420 previous similar messages
Sep 4 10:10:37 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client b908e27b-d80f-6cb4-ed4f-20ad973da908 (at 10.144.140.10@o2ib) reconnecting, waiting for 70 clients in recovery for 3:56
Sep 4 10:10:37 cs04r-sc-mds03-02 kernel: Lustre: Skipped 13 previous similar messages
Sep 4 10:10:37 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client b908e27b-d80f-6cb4-ed4f-20ad973da908 (at 10.144.140.10@o2ib) refused reconnection, still busy with 1 active RPCs
Sep 4 10:10:37 cs04r-sc-mds03-02 kernel: Lustre: Skipped 13 previous similar messages
Sep 4 10:10:49 cs04r-sc-mds03-02 kernel: Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
Sep 4 10:10:49 cs04r-sc-mds03-02 kernel: Lustre: 44261:0:(service.c:1507:ptlrpc_at_check_timed()) earlyQ=2 reqQ=0 recA=0, svcEst=30, delay=0(jiff)
Sep 4 10:10:49 cs04r-sc-mds03-02 kernel: Lustre: 44261:0:(service.c:1507:ptlrpc_at_check_timed()) Skipped 8 previous similar messages
Sep 4 10:10:49 cs04r-sc-mds03-02 kernel: Lustre: 44261:0:(service.c:1304:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-720s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff880fc2f1f800 x1477164390390156/t0(0) o101->efa920d9-02ca-1dfe-ed8c-392e71004f22@10.144.140.22@o2ib:0/0 lens 328/0 e 1 to 0 dl 1409821129 ref 2 fl Complete:/40/ffffffff rc 0/-1
Sep 4 10:10:49 cs04r-sc-mds03-02 kernel: Lustre: 44261:0:(service.c:1304:ptlrpc_at_send_early_reply()) Skipped 18 previous similar messages
Sep 4 10:10:49 cs04r-sc-mds03-02 kernel: Lustre: Skipped 9689 previous similar messages
Sep 4 10:10:55 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client efa920d9-02ca-1dfe-ed8c-392e71004f22 (at 10.144.140.22@o2ib) reconnecting, waiting for 70 clients in recovery for 3:39
Sep 4 10:10:55 cs04r-sc-mds03-02 kernel: Lustre: Skipped 8 previous similar messages
Sep 4 10:10:55 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client efa920d9-02ca-1dfe-ed8c-392e71004f22 (at 10.144.140.22@o2ib) refused reconnection, still busy with 12211 active RPCs
Sep 4 10:10:55 cs04r-sc-mds03-02 kernel: Lustre: Skipped 8 previous similar messages
Sep 4 10:11:27 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client 7370f304-797d-a51e-55c6-3163f15fe32c (at 10.144.148.2@o2ib) reconnecting, waiting for 70 clients in recovery for 0:38
Sep 4 10:11:27 cs04r-sc-mds03-02 kernel: Lustre: Skipped 38 previous similar messages
Sep 4 10:11:27 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client eb0003b9-85a1-8083-f921-d64a55fb5972 (at 10.144.140.31@o2ib) refused reconnection, still busy with 1 active RPCs
Sep 4 10:11:27 cs04r-sc-mds03-02 kernel: Lustre: Skipped 27 previous similar messages
Sep 4 10:11:31 cs04r-sc-mds03-02 kernel: Lustre: 44052:0:(ldlm_lib.c:2088:target_recovery_thread()) too long recovery - read logs
Sep 4 10:11:31 cs04r-sc-mds03-02 kernel: LustreError: dumping log to /tmp/lustre-log.1409821891.44052
Sep 4 10:11:31 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Recovery over after 13:48, of 70 clients 70 recovered and 0 were evicted.
Sep 4 10:11:31 cs04r-sc-mds03-02 kernel: LustreError: 44732:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 8 of 32
Sep 4 10:11:31 cs04r-sc-mds03-02 kernel: LustreError: 44732:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 126 previous similar messages
Sep 4 10:11:31 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client d5017538-02b0-2b3d-8db1-d5cfd2c2cdb4 (at 10.144.140.18@o2ib) reconnecting
Sep 4 10:11:32 cs04r-sc-mds03-02 kernel: LustreError: 44422:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 0 of 32
Sep 4 10:11:32 cs04r-sc-mds03-02 kernel: LustreError: 44422:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 231 previous similar messages
Sep 4 10:11:33 cs04r-sc-mds03-02 kernel: LustreError: 44107:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 2 of 32
Sep 4 10:11:33 cs04r-sc-mds03-02 kernel: LustreError: 44107:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 796 previous similar messages
Sep 4 10:11:33 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client cff513e1-95e6-9d6d-db11-ba402dec73a7 (at 10.144.140.21@o2ib) reconnecting
Sep 4 10:11:33 cs04r-sc-mds03-02 kernel: Lustre: Skipped 2 previous similar messages
Sep 4 10:11:34 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client 581391ca-4c94-702e-f940-dabe74c30d17 (at 10.144.140.6@o2ib) reconnecting
Sep 4 10:11:35 cs04r-sc-mds03-02 kernel: LustreError: 43946:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 8 of 32
Sep 4 10:11:35 cs04r-sc-mds03-02 kernel: LustreError: 43946:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 1394 previous similar messages
Sep 4 10:11:37 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client b8ba9dd4-f60c-5e3a-5441-a06bd4180225 (at 10.144.140.30@o2ib) reconnecting
Sep 4 10:11:37 cs04r-sc-mds03-02 kernel: Lustre: Skipped 4 previous similar messages
Sep 4 10:11:39 cs04r-sc-mds03-02 kernel: LustreError: 43956:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 27 of 32
Sep 4 10:11:39 cs04r-sc-mds03-02 kernel: LustreError: 43956:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 2161 previous similar messages
Sep 4 10:11:41 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client d915d417-858f-ffbe-e9b3-d21844d8200b (at 10.144.140.29@o2ib) reconnecting
Sep 4 10:11:41 cs04r-sc-mds03-02 kernel: Lustre: Skipped 3 previous similar messages
Sep 4 10:11:47 cs04r-sc-mds03-02 kernel: LustreError: 44103:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 14 of 32
Sep 4 10:11:47 cs04r-sc-mds03-02 kernel: LustreError: 44103:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 5511 previous similar messages
Sep 4 10:11:52 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Client eb0003b9-85a1-8083-f921-d64a55fb5972 (at 10.144.140.31@o2ib) reconnecting
Sep 4 10:11:52 cs04r-sc-mds03-02 kernel: Lustre: Skipped 10 previous similar messages
Sep 4 10:12:03 cs04r-sc-mds03-02 kernel: LustreError: 43904:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 15 of 32
Sep 4 10:12:03 cs04r-sc-mds03-02 kernel: LustreError: 43904:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 12897 previous similar messages
Sep 4 10:12:35 cs04r-sc-mds03-02 kernel: LustreError: 44107:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 8 of 32
Sep 4 10:12:35 cs04r-sc-mds03-02 kernel: LustreError: 44107:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 32795 previous similar messages
Sep 4 10:13:39 cs04r-sc-mds03-02 kernel: LustreError: 44732:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 22 of 32
Sep 4 10:13:39 cs04r-sc-mds03-02 kernel: LustreError: 44732:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 79547 previous similar messages
Sep 4 10:15:47 cs04r-sc-mds03-02 kernel: LustreError: 44420:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 27 of 32
Sep 4 10:15:47 cs04r-sc-mds03-02 kernel: LustreError: 44420:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 260053 previous similar messages
Sep 4 10:18:25 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409822250/real 1409822250] req@ffff880fd1f9b000 x1477157865914956/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409822305 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 10:18:25 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Sep 4 10:20:03 cs04r-sc-mds03-02 kernel: LustreError: 44343:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 26 of 32
Sep 4 10:20:03 cs04r-sc-mds03-02 kernel: LustreError: 44343:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 1306785 previous similar messages
Sep 4 10:25:48 cs04r-sc-mds03-02 kernel: LNet: There was an unexpected network error while writing to 172.23.73.33: -110.
Sep 4 10:28:35 cs04r-sc-mds03-02 kernel: LustreError: 1508:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 9 of 32
Sep 4 10:28:35 cs04r-sc-mds03-02 kernel: LustreError: 1508:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 1112084 previous similar messages
Sep 4 10:28:53 cs04r-sc-mds03-02 kernel: Lustre: MGS: haven't heard from client 318d5d0a-ff1f-d42b-6072-0e5bbff90a39 (at 172.23.73.33@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff881fd6dbc400, cur 1409822933 expire 1409822783 last 1409822706
Sep 4 10:29:40 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409822925/real 1409822925] req@ffff880a42bf6c00 x1477157865915216/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409822980 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 10:29:40 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep 4 10:35:48 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 10:35:48 cs04r-sc-mds03-02 kernel: Lustre: Skipped 1 previous similar message
Sep 4 10:38:35 cs04r-sc-mds03-02 kernel: LustreError: 44105:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 8 of 32
Sep 4 10:38:35 cs04r-sc-mds03-02 kernel: LustreError: 44105:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 727545 previous similar messages
Sep 4 10:40:55 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409823600/real 1409823600] req@ffff880a856f9000 x1477157865915472/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409823655 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 10:40:55 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep 4 10:48:35 cs04r-sc-mds03-02 kernel: LustreError: 43902:0:(lod_lov.c:674:validate_lod_and_idx()) lustre03-MDT0000-mdtlov: bad idx: 28 of 32
Sep 4 10:48:35 cs04r-sc-mds03-02 kernel: LustreError: 43902:0:(lod_lov.c:674:validate_lod_and_idx()) Skipped 336222 previous similar messages
Sep 4 10:52:10 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409824275/real 1409824275] req@ffff88094a92f800 x1477157865915724/t0(0) o250->MGC10.144.144.1@o2ib@10.144.144.1@o2ib:26/25 lens 400/544 e 0 to 1 dl 1409824330 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 4 10:52:10 cs04r-sc-mds03-02 kernel: Lustre: 17803:0:(client.c:1908:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep 4 10:54:10 cs04r-sc-mds03-02 kernel: Lustre: MGS: non-config logname received: params
Sep 4 10:54:34 cs04r-sc-mds03-02 kernel: Lustre: MGS: haven't heard from client d931487f-eb86-6ad4-01c6-a80dfbbd57d1 (at 172.23.136.7@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff881fd01c1800, cur 1409824474 expire 1409824324 last 1409824247
Sep 4 10:57:43 cs04r-sc-mds03-02 kernel: Lustre: Failing over lustre03-MDT0000
Sep 4 10:57:43 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Not available for connect from 172.23.130.21@tcp (stopping)
Sep 4 10:57:43 cs04r-sc-mds03-02 kernel: Lustre: Skipped 1 previous similar message
Sep 4 10:57:44 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Not available for connect from 10.144.140.50@o2ib (stopping)
Sep 4 10:57:44 cs04r-sc-mds03-02 kernel: Lustre: Skipped 3 previous similar messages
Sep 4 10:57:45 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Not available for connect from 172.23.146.23@tcp (stopping)
Sep 4 10:57:45 cs04r-sc-mds03-02 kernel: Lustre: Skipped 12 previous similar messages
Sep 4 10:57:47 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Not available for connect from 10.144.140.37@o2ib (stopping)
Sep 4 10:57:47 cs04r-sc-mds03-02 kernel: Lustre: Skipped 21 previous similar messages
Sep 4 10:57:49 cs04r-sc-mds03-02 kernel: Lustre: server umount MGS complete
Sep 4 10:57:51 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Not available for connect from 172.23.132.16@tcp (stopping)
Sep 4 10:57:51 cs04r-sc-mds03-02 kernel: Lustre: Skipped 36 previous similar messages
Sep 4 10:57:59 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Not available for connect from 172.23.146.14@tcp (stopping)
Sep 4 10:57:59 cs04r-sc-mds03-02 kernel: Lustre: Skipped 82 previous similar messages
Sep 4 10:58:16 cs04r-sc-mds03-02 kernel: Lustre: lustre03-MDT0000: Not available for connect from 172.23.136.44@tcp (stopping)
Sep 4 10:58:16 cs04r-sc-mds03-02 kernel: Lustre: Skipped 109 previous similar messages
Sep 4 10:58:34 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 10.144.140.15@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 10:58:34 cs04r-sc-mds03-02 kernel: LustreError: Skipped 15 previous similar messages
Sep 4 10:58:37 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 172.23.132.104@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 10:58:37 cs04r-sc-mds03-02 kernel: LustreError: Skipped 5 previous similar messages
Sep 4 10:58:41 cs04r-sc-mds03-02 kernel: LustreError: 137-5: lustre03-MDT0000_UUID: not available for connect from 10.144.148.5@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 4 10:58:41 cs04r-sc-mds03-02 kernel: LustreError: Skipped 11 previous similar messages
Sep 4 10:58:48 cs04r-sc-mds03-02 kernel: Lustre: server umount lustre03-MDT0000 complete