[LU-5255] MDS grinds to a halt - reboot required | Created: 25/Jun/14 | Updated: 26/Jan/17 | Resolved: 26/Jan/17 |
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Daire Byrne (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 14663 |
| Description |
We have hit this at least 3 times in as many weeks on our MDS server. A reboot is required to bring the filesystem back to life. Here is the relevant part of /var/log/messages:

Jun 25 13:55:53 bmds1 kernel: LNet: Service thread pid 22737 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 13:55:53 bmds1 kernel: Pid: 22737, comm: mdt00_021
Jun 25 13:55:53 bmds1 kernel:
Jun 25 13:55:53 bmds1 kernel: Call Trace:
Jun 25 13:55:53 bmds1 kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
Jun 25 13:55:53 bmds1 kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e66d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06be01d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0df04f8>] mdt_object_open_lock+0x1c8/0x510 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dd4bfa>] ? mdt_attr_get_complex+0x38a/0x770 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0df56b3>] mdt_open_by_fid_lock+0x443/0x7d0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0df629b>] mdt_reint_open+0x56b/0x20c0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa040282e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e6dcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa057df50>] ? lu_ucred+0x20/0x30 [obdclass]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1015>] ? mdt_ucred+0x15/0x20 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0ddd15c>] ? mdt_root_squash+0x2c/0x410 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa057df50>] ? lu_ucred+0x20/0x30 [obdclass]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0de1911>] mdt_reint_rec+0x41/0xe0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc6ae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc706d>] mdt_intent_reint+0x1ed/0x520 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e7bac>] ?
lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 13:55:53 bmds1 kernel: Jun 25 13:55:53 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403700953.22737 Jun 25 13:55:53 bmds1 kernel: Lustre: lock timed out (enqueued at 1403700753, 200s ago) Jun 25 13:55:53 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403700953.6869 Jun 25 13:55:53 bmds1 kernel: LNet: Service thread pid 6869 was inactive for 200.26s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 13:55:53 bmds1 kernel: Pid: 6869, comm: mdt01_003 Jun 25 13:55:53 bmds1 kernel: Jun 25 13:55:53 bmds1 kernel: Call Trace: Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e676e>] ? cfs_waitq_del+0xe/0x10 [libcfs] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e83c4>] ? 
lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 13:55:53 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 13:55:53 bmds1 kernel: Jun 25 13:56:43 bmds1 kernel: LNet: Service thread pid 22446 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 13:56:43 bmds1 kernel: Pid: 22446, comm: mdt02_016 Jun 25 13:56:43 bmds1 kernel: Jun 25 13:56:43 bmds1 kernel: Call Trace: Jun 25 13:56:43 bmds1 kernel: Lustre: lock timed out (enqueued at 1403700803, 200s ago) Jun 25 13:56:43 bmds1 kernel: Lustre: Skipped 1 previous similar message Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06b7643>] ? ldlm_cli_cancel_local+0x2b3/0x470 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa03e66fe>] ? cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06be0aa>] ? ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06bd758>] ? ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc7c7b>] ? mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc84f4>] ? mdt_object_lock+0x14/0x20 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dd75a9>] ? 
mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dd83ad>] ? mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc4f1e>] ? mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa069e831>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06c51ef>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc53a6>] ? mdt_enqueue+0x46/0xe0 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dcba97>] ? mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0e053f5>] ? mds_regular_handle+0x15/0x20 [mdt] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f73c8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f875e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 13:56:43 bmds1 kernel: [<ffffffff8100c0c0>] ? 
child_rip+0x0/0x20 Jun 25 13:56:43 bmds1 kernel: Jun 25 13:56:43 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403701003.22446 Jun 25 14:06:00 bmds1 kernel: Lustre: 22504:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply Jun 25 14:06:00 bmds1 kernel: req@ffff88217d305400 x1468581949625152/t0(0) o101->0bb08de7-0631-0e52-3ff1-946375c89d85@10.21.22.25@tcp:0/0 lens 576/3448 e 5 to 0 dl 1403701565 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:06:00 bmds1 kernel: Lustre: 22402:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply Jun 25 14:06:00 bmds1 kernel: req@ffff882d71613800 x1468571541943816/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/1152 e 5 to 0 dl 1403701565 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:06:50 bmds1 kernel: Lustre: 6920:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply Jun 25 14:06:50 bmds1 kernel: req@ffff8833cbe53c00 x1468568450342728/t0(0) o101->b69b8925-7476-e90c-d5d5-edb6b51853ec@10.21.22.24@tcp:0/0 lens 608/3448 e 5 to 0 dl 1403701615 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:07:56 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:07:56 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:07:56 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting Jun 25 14:07:56 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:08:21 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:08:21 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:08:46 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:08:46 bmds1 kernel: Lustre: Skipped 1 previous similar message Jun 25 14:08:46 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:08:46 bmds1 kernel: Lustre: Skipped 1 previous similar message Jun 25 14:09:11 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:09:11 bmds1 kernel: Lustre: Skipped 2 previous similar messages Jun 25 14:09:11 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:09:11 bmds1 kernel: Lustre: Skipped 2 previous similar messages Jun 25 14:09:36 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:09:36 bmds1 kernel: Lustre: Skipped 2 previous similar messages Jun 25 14:09:36 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:09:36 bmds1 kernel: Lustre: Skipped 2 previous similar messages Jun 25 14:09:44 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) 
### lock callback timer expired after 113s: evicting client at 10.21.22.25@tcp ns: mdt-bravo-MDT0000_UUID lock: ffff881d38f2f240/0x6ae950 Jun 25 14:10:01 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:10:01 bmds1 kernel: Lustre: Skipped 2 previous similar messages Jun 25 14:10:01 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:10:01 bmds1 kernel: Lustre: Skipped 2 previous similar messages Jun 25 14:10:26 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:10:26 bmds1 kernel: Lustre: Skipped 1 previous similar message Jun 25 14:10:26 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:10:26 bmds1 kernel: Lustre: Skipped 1 previous similar message Jun 25 14:11:16 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:11:16 bmds1 kernel: Lustre: Skipped 3 previous similar messages Jun 25 14:11:16 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:11:16 bmds1 kernel: Lustre: Skipped 3 previous similar messages Jun 25 14:11:46 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 2200s: evicting client at 10.21.22.23@tcp ns: mdt-bravo-MDT0000_UUID lock: ffff881656d94480/0x6ae90 Jun 25 14:13:23 bmds1 kernel: Lustre: lock timed out (enqueued at 1403701803, 200s ago) Jun 25 14:15:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403701926, 200s ago) Jun 25 14:16:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403701986, 200s ago) Jun 25 14:16:52 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702012, 200s ago) Jun 25 14:18:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702106, 200s ago) Jun 25 14:19:26 bmds1 kernel: LNet: Service thread pid 6870 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:19:26 bmds1 kernel: Pid: 6870, comm: mdt00_003 Jun 25 14:19:26 bmds1 kernel: Jun 25 14:19:26 bmds1 kernel: Call Trace: Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06b7643>] ? ldlm_cli_cancel_local+0x2b3/0x470 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0 Jun 25 14:19:26 bmds1 kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10 Jun 25 14:19:26 bmds1 kernel: [<ffffffffa03e66d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06be01d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc1a90>] ? 
mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:19:26 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:19:26 bmds1 kernel: Jun 25 14:19:26 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403702366.6870 Jun 25 14:19:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702166, 200s ago) Jun 25 14:21:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702286, 200s ago) Jun 25 14:22:27 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702347, 200s ago) Jun 25 14:22:28 bmds1 kernel: LNet: Service thread pid 22401 was inactive for 262.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:22:28 bmds1 kernel: Pid: 22401, comm: mdt00_014 Jun 25 14:22:28 bmds1 kernel: Jun 25 14:22:28 bmds1 kernel: Call Trace: Jun 25 14:22:28 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffff81063410>] ? 
default_wake_function+0x0/0x20 Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:22:28 bmds1 kernel: [<ffffffff8100c0c0>] ? 
child_rip+0x0/0x20 Jun 25 14:22:28 bmds1 kernel: Jun 25 14:22:28 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403702548.22401 Jun 25 14:22:33 bmds1 kernel: Lustre: 6944:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply Jun 25 14:22:33 bmds1 kernel: req@ffff881dbaa01800 x1468581963167316/t0(0) o101->0bb08de7-0631-0e52-3ff1-946375c89d85@10.21.22.25@tcp:0/0 lens 608/3448 e 0 to 0 dl 1403702558 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:22:39 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting Jun 25 14:22:39 bmds1 kernel: Lustre: Skipped 3 previous similar messages Jun 25 14:22:39 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:22:39 bmds1 kernel: Lustre: Skipped 3 previous similar messages Jun 25 14:23:04 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting Jun 25 14:23:04 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:23:29 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting Jun 25 14:23:29 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:23:29 bmds1 kernel: LNet: Service thread pid 6908 was inactive for 262.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:23:29 bmds1 kernel: Pid: 6908, comm: mdt00_006 Jun 25 14:23:29 bmds1 kernel: Jun 25 14:23:29 bmds1 kernel: Call Trace: Jun 25 14:23:29 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06e83c4>] ? 
lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:23:29 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:23:29 bmds1 kernel: Jun 25 14:23:29 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403702609.6908 Jun 25 14:24:19 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting Jun 25 14:24:19 bmds1 kernel: Lustre: Skipped 1 previous similar message Jun 25 14:24:19 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:24:19 bmds1 kernel: Lustre: Skipped 1 previous similar message Jun 25 14:24:24 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 108s: evicting client at 10.21.22.25@tcp ns: mdt-bravo-MDT0000_UUID lock: ffff882efc845000/0x6ae950 Jun 25 14:24:24 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 1 previous similar message Jun 25 14:24:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702466, 200s ago) Jun 25 14:24:36 bmds1 kernel: Lustre: 22403:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply Jun 25 14:24:36 bmds1 kernel: req@ffff880d89de9000 x1468571551937816/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 0 to 0 dl 1403702681 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:25:32 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:25:32 bmds1 kernel: Lustre: Skipped 2 previous similar messages Jun 25 14:25:32 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 8 active RPCs Jun 25 14:25:32 bmds1 kernel: Lustre: Skipped 2 previous similar messages Jun 25 14:25:36 bmds1 kernel: Lustre: 
30120:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply Jun 25 14:25:36 bmds1 kernel: req@ffff88204db56800 x1468571552939332/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 0 to 0 dl 1403702741 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:26:02 bmds1 kernel: Lustre: 22450:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply Jun 25 14:26:02 bmds1 kernel: req@ffff88404d9c8400 x1468568453283516/t0(0) o101->b69b8925-7476-e90c-d5d5-edb6b51853ec@10.21.22.24@tcp:0/0 lens 608/3448 e 0 to 0 dl 1403702767 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:26:27 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 711s: evicting client at 10.21.22.23@tcp ns: mdt-bravo-MDT0000_UUID lock: ffff88022e542900/0x6ae950 Jun 25 14:27:36 bmds1 kernel: Lustre: 30242:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply Jun 25 14:27:36 bmds1 kernel: req@ffff8815c622b400 x1468571555018544/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 0 to 0 dl 1403702861 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:27:48 bmds1 kernel: Lustre: bravo-MDT0000: Client b69b8925-7476-e90c-d5d5-edb6b51853ec (at 10.21.22.24@tcp) reconnecting Jun 25 14:27:48 bmds1 kernel: Lustre: Skipped 6 previous similar messages Jun 25 14:27:48 bmds1 kernel: Lustre: bravo-MDT0000: Client b69b8925-7476-e90c-d5d5-edb6b51853ec (at 10.21.22.24@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:27:48 bmds1 kernel: Lustre: Skipped 6 previous similar messages Jun 25 14:29:08 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 728s: evicting client at 10.21.22.24@tcp ns: mdt-bravo-MDT0000_UUID lock: ffff883a8d51f000/0x6ae950 Jun 25 14:29:40 bmds1 kernel: LNet: Service thread pid 22365 was inactive for 514.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:29:40 bmds1 kernel: Pid: 22365, comm: mdt00_013 Jun 25 14:29:40 bmds1 kernel: Jun 25 14:29:40 bmds1 kernel: Call Trace: Jun 25 14:29:40 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06bdb30>] ? 
ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:29:40 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:29:40 bmds1 kernel: Jun 25 14:29:40 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403702980.22365 Jun 25 14:30:03 bmds1 kernel: LNet: Service thread pid 6954 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:30:03 bmds1 kernel: Pid: 6954, comm: mdt01_011 Jun 25 14:30:03 bmds1 kernel: Jun 25 14:30:03 bmds1 kernel: Call Trace: Jun 25 14:30:03 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc1a90>] ? 
mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:30:03 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:30:03 bmds1 kernel: Jun 25 14:30:03 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703003.6954 Jun 25 14:30:40 bmds1 kernel: LNet: Service thread pid 6933 was inactive for 514.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:30:40 bmds1 kernel: Pid: 6933, comm: mdt00_009 Jun 25 14:30:40 bmds1 kernel: Jun 25 14:30:40 bmds1 kernel: Call Trace: Jun 25 14:30:40 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? 
mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:30:40 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:30:40 bmds1 kernel: Jun 25 14:30:40 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703040.6933 Jun 25 14:32:06 bmds1 kernel: LNet: Service thread pid 6937 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:32:06 bmds1 kernel: Pid: 6937, comm: mdt00_010 Jun 25 14:32:06 bmds1 kernel: Jun 25 14:32:06 bmds1 kernel: Call Trace: Jun 25 14:32:06 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? 
mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:32:06 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:32:06 bmds1 kernel: Jun 25 14:32:06 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703126.6937 Jun 25 14:32:29 bmds1 kernel: Lustre: 22454:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-383), not sending early reply Jun 25 14:32:29 bmds1 kernel: req@ffff881e34688000 x1468571555875772/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 4 to 0 dl 1403703154 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:32:52 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702972, 200s ago) Jun 25 14:32:52 bmds1 kernel: Lustre: Skipped 1 previous similar message Jun 25 14:33:06 bmds1 kernel: LNet: Service thread pid 6910 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:33:06 bmds1 kernel: Pid: 6910, comm: mdt00_008 Jun 25 14:33:06 bmds1 kernel: Jun 25 14:33:06 bmds1 kernel: Call Trace: Jun 25 14:33:06 bmds1 kernel: [<ffffffffa03f62d1>] ? 
libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:33:06 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:33:06 bmds1 kernel: Jun 25 14:33:06 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703186.6910 Jun 25 14:33:32 bmds1 kernel: LNet: Service thread pid 6612 was inactive for 1200.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. 
Jun 25 14:33:32 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703212.6612 Jun 25 14:34:56 bmds1 kernel: Lustre: 6957:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-410), not sending early reply Jun 25 14:34:56 bmds1 kernel: req@ffff883b77b15400 x1468571557733836/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 2 to 0 dl 1403703301 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:35:06 bmds1 kernel: LNet: Service thread pid 6905 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:35:06 bmds1 kernel: Pid: 6905, comm: mdt00_005 Jun 25 14:35:06 bmds1 kernel: Jun 25 14:35:06 bmds1 kernel: Call Trace: Jun 25 14:35:06 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffff81055ad3>] ? 
__wake_up+0x53/0x70 Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:35:06 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:35:06 bmds1 kernel: Jun 25 14:35:06 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703306.6905 Jun 25 14:35:57 bmds1 kernel: Lustre: 22588:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-410), not sending early reply Jun 25 14:35:57 bmds1 kernel: req@ffff881fb5a31000 x1468571558663648/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 2 to 0 dl 1403703362 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:42:02 bmds1 kernel: Lustre: 22454:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply Jun 25 14:42:02 bmds1 kernel: req@ffff88113611a800 x1468571563425972/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 608/3448 e 0 to 0 dl 1403703727 ref 2 fl Interpret:/0/0 rc 0/0 Jun 25 14:42:08 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:42:08 bmds1 kernel: Lustre: Skipped 3 previous similar messages Jun 25 14:42:08 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:42:08 bmds1 kernel: Lustre: Skipped 3 previous similar messages Jun 25 14:49:32 bmds1 kernel: LNet: Service thread pid 6605 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 25 14:49:32 bmds1 kernel: Pid: 6605, comm: mdt00_000 Jun 25 14:49:32 bmds1 kernel: Jun 25 14:49:32 bmds1 kernel: Call Trace: Jun 25 14:49:32 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06e83c4>] ? 
lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:49:32 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:49:32 bmds1 kernel: Jun 25 14:49:32 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403704172.6605 Jun 25 14:49:34 bmds1 kernel: Pid: 22449, comm: mdt02_019 Jun 25 14:49:34 bmds1 kernel: Jun 25 14:49:34 bmds1 kernel: Call Trace: Jun 25 14:49:34 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06e83c4>] ? 
lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:49:34 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:49:34 bmds1 kernel: Jun 25 14:49:34 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403704174.22449 Jun 25 14:49:36 bmds1 kernel: Pid: 22504, comm: mdt01_021 Jun 25 14:49:36 bmds1 kernel: Jun 25 14:49:36 bmds1 kernel: Call Trace: Jun 25 14:49:36 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06e83c4>] ? 
lustre_msg_get_flags+0x34/0xb0 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Jun 25 14:49:36 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 25 14:49:36 bmds1 kernel: Jun 25 14:49:36 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403704176.22504 Jun 25 14:50:53 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting Jun 25 14:50:53 bmds1 kernel: Lustre: Skipped 62 previous similar messages Jun 25 14:50:53 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 14:50:53 bmds1 kernel: Lustre: Skipped 62 previous similar messages Jun 25 15:00:55 bmds1 kernel: Lustre: bravo-MDT0000: Client b69b8925-7476-e90c-d5d5-edb6b51853ec (at 10.21.22.24@tcp) reconnecting Jun 25 15:00:55 bmds1 kernel: Lustre: Skipped 72 previous similar messages Jun 25 15:00:55 bmds1 kernel: Lustre: bravo-MDT0000: Client b69b8925-7476-e90c-d5d5-edb6b51853ec (at 10.21.22.24@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 15:00:55 bmds1 kernel: Lustre: Skipped 72 previous similar messages Jun 25 15:10:57 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting Jun 25 15:10:57 bmds1 kernel: Lustre: Skipped 72 previous similar messages Jun 25 15:10:57 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs Jun 25 15:10:57 bmds1 kernel: Lustre: Skipped 72 previous similar messages Jun 25 15:18:37 bmds1 init: tty (/dev/tty2) main process (8061) killed by TERM signal Jun 25 15:18:37 bmds1 init: tty (/dev/tty3) main process (8063) killed by TERM signal Jun 25 15:18:37 bmds1 init: tty (/dev/tty4) main process (8065) killed by TERM signal Jun 25 15:18:37 bmds1 init: tty (/dev/tty5) main process (8069) killed by TERM signal Jun 25 15:18:37 bmds1 init: tty (/dev/tty6) main process (8071) 
killed by TERM signal |
| Comments |
| Comment by Daire Byrne (Inactive) [ 25/Jun/14 ] |
|
And on a related note - why don't the NFS exporters, and subsequently the NFS clients, recover seamlessly? We have lots of reports of IO errors being returned to NFS clients, which then requires us to restart all the running tasks. With a combination of Lustre's recovery and hard-mounted NFSv3 clients this shouldn't happen, right? It seems like new opens/closes failed on the NFS clients, but existing open files did recover as expected (the process hung until the filesystem came back). Why do new opens/closes return IO errors instead of hanging until the filesystem comes back online? |
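For context, a hard-mounted NFSv3 client of such a re-export would typically be set up along these lines (a generic sketch; the server name, path, and options here are illustrative, not the site's exact configuration). With a hard mount the client is expected to retry indefinitely rather than return EIO to applications:
# illustrative hard-mounted NFSv3 client mount of the Lustre re-export
mount -t nfs -o hard,intr,nfsvers=3,tcp,timeo=600,retrans=2 lustre5:/user_data /mnt/user_data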
| Comment by Oleg Drokin [ 25/Jun/14 ] |
|
Do you have many stripes in use, and how many OSTs are in this system? Also, were any problems reported on the clients around the time the MDS entered this state? |
| Comment by Daire Byrne (Inactive) [ 25/Jun/14 ] |
|
We don't stripe - there are 102 OSTs across 6 servers. A snippet of the interesting bits of a client log: Jun 25 14:07:17 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.132:885 for /user_data (/user_data) Jun 25 14:07:56 lustre5 kernel: Lustre: 15430:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403700753/real 1403700753] req@ffff88039237f400 x1468581949625152/t0(0) o101->bravo-MDT0000-mdc-ffff880603c1cc00@10.21.22.50@tcp:12/10 lens 576/3448 e 5 to 1 dl 1403701676 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Jun 25 14:07:56 lustre5 kernel: Lustre: 15430:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 22 previous similar messages Jun 25 14:07:56 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff880603c1cc00: Connection to bravo-MDT0000 (at 10.21.22.50@tcp) was lost; in progress operations using this service will wait for recovery to complete Jun 25 14:07:56 lustre5 kernel: Lustre: Skipped 16 previous similar messages Jun 25 14:07:56 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:07:56 lustre5 kernel: LustreError: Skipped 1 previous similar message Jun 25 14:08:21 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:08:46 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:09:11 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:09:36 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:10:01 lustre5 kernel: LustreError: 167-0: bravo-MDT0000-mdc-ffff880603c1cc00: This client was evicted by bravo-MDT0000; in progress operations using this service will fail. 
Jun 25 14:10:01 lustre5 kernel: LustreError: 15443:0:(mdc_locks.c:840:mdc_enqueue()) ldlm_cli_enqueue: -5 Jun 25 14:10:01 lustre5 kernel: LustreError: 12482:0:(file.c:163:ll_close_inode_openhandle()) inode 144116128220761786 mdc close failed: rc = -5 Jun 25 14:10:01 lustre5 kernel: LustreError: 12482:0:(file.c:163:ll_close_inode_openhandle()) Skipped 148 previous similar messages Jun 25 14:10:01 lustre5 kernel: LustreError: 15430:0:(file.c:2716:ll_inode_revalidate_fini()) bravo: revalidate FID [0x20000dae1:0x16c4e:0x0] error: rc = -5 Jun 25 14:10:01 lustre5 kernel: LustreError: 15430:0:(file.c:2716:ll_inode_revalidate_fini()) Skipped 3 previous similar messages Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: LustreError: 29154:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0 (bravo-MDT0000-mdc-ffff880603c1cc00), error -5 Jun 25 14:10:01 lustre5 kernel: LustreError: 29154:0:(lmv_obd.c:1289:lmv_statfs()) Skipped 1717 previous similar messages Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: LustreError: 29154:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -5 Jun 25 14:10:01 lustre5 kernel: LustreError: 29154:0:(llite_lib.c:1610:ll_statfs_internal()) Skipped 1807 previous similar messages Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: LustreError: 8335:0:(vvp_io.c:1230:vvp_io_init()) bravo: refresh file layout [0x20000daa1:0x1ef7a:0x0] error -108. Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108 .. .. 8,000 lines of nfsd errno: -108 .. Jun 25 14:24:44 lustre5 kernel: LustreError: 17760:0:(ldlm_resource.c:804:ldlm_resource_complain()) bravo-MDT0000-mdc-ffff880603c1cc00: namespace resource [0x20000daa1:0x1ef7a:0x0].0 (ffff880850d55080) refcount nonzero (1) after lock cleanup; forcing cleanup. 
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: LustreError: 17760:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x20000daa1:0x1ef7a:0x0].0 (ffff880850d55080) refcount = 2 Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108 Jun 25 14:24:44 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff880603c1cc00: Connection restored to bravo-MDT0000 (at 10.21.22.50@tcp) Jun 25 14:25:38 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.46.139:647 for /user_data (/user_data) Jun 25 14:27:18 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.132:774 for /user_data (/user_data) Jun 25 14:30:06 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.103:902 for /user_data (/user_data) Jun 25 14:30:24 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.56.211:863 for /user_data (/user_data) Jun 25 14:32:09 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.54.10:997 for /user_data (/user_data) Jun 25 14:32:46 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.46.206:963 for /user_data (/user_data) Jun 25 14:33:21 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.132:995 for /user_data (/user_data) Jun 25 14:36:44 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.56.211:818 for /user_data (/user_data) Jun 25 14:37:13 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.132:1002 for /user_data (/user_data) Jun 25 14:38:05 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.46.64:1020 for /user_data (/user_data) Jun 25 14:38:27 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.46.206:646 for /user_data (/user_data) Jun 25 14:38:47 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.103:830 for /user_data (/user_data) Jun 25 14:39:38 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.151:944 for /user_data (/user_data) Jun 25 14:40:12 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.102:896 for /user_data (/user_data) Jun 25 14:40:19 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.102:993 for /user_data (/user_data) Jun 25 14:42:12 lustre5 kernel: Lustre: 15436:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403702976/real 1403702976] req@ffff88085425d400 x1468581978403632/t0(0) o101->brav o-MDT0000-mdc-ffff880603c1cc00@10.21.22.50@tcp:12/10 lens 608/3448 e 0 to 1 dl 1403703732 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Jun 25 14:42:12 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff880603c1cc00: Connection to bravo-MDT0000 (at 10.21.22.50@tcp) was lost; in progress operations using this service will wait for recovery to complete Jun 25 14:42:12 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. 
Jun 25 14:42:37 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:43:02 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:43:08 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.46.64:664 for /user_data (/user_data) Jun 25 14:43:23 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.132:750 for /user_data (/user_data) Jun 25 14:43:27 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:43:52 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:44:42 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:44:42 lustre5 kernel: LustreError: Skipped 1 previous similar message Jun 25 14:45:57 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:45:57 lustre5 kernel: LustreError: Skipped 2 previous similar messages Jun 25 14:47:45 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.102:779 for /user_data (/user_data) Jun 25 14:48:27 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:48:27 lustre5 kernel: LustreError: Skipped 5 previous similar messages Jun 25 14:53:02 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 14:53:02 lustre5 kernel: LustreError: Skipped 10 previous similar messages Jun 25 15:01:47 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. Jun 25 15:01:47 lustre5 kernel: LustreError: Skipped 20 previous similar messages Jun 25 15:10:16 lustre5 kernel: LustreError: 21903:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0 (bravo-MDT0000-mdc-ffff880603c1cc00), error -16 Jun 25 15:10:16 lustre5 kernel: LustreError: 21903:0:(lmv_obd.c:1289:lmv_statfs()) Skipped 5 previous similar messages Jun 25 15:10:16 lustre5 kernel: LustreError: 21903:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -16 Jun 25 15:10:16 lustre5 kernel: LustreError: 21903:0:(llite_lib.c:1610:ll_statfs_internal()) Skipped 5 previous similar messages Jun 25 15:10:19 lustre5 kernel: LustreError: 21904:0:(dir.c:433:ll_get_dir_page()) read cache page: [0x200000007:0x1:0x0] at 0: rc -4 Jun 25 15:10:19 lustre5 kernel: LustreError: 21904:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -4 Jun 25 15:10:19 lustre5 kernel: LustreError: 21904:0:(dir.c:595:ll_dir_read()) Skipped 15 previous similar messages Jun 25 15:12:12 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16. 
Jun 25 15:12:12 lustre5 kernel: LustreError: Skipped 24 previous similar messages Jun 25 15:14:54 lustre5 kernel: LustreError: 22411:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0 (bravo-MDT0000-mdc-ffff880603c1cc00), error -16 Jun 25 15:14:54 lustre5 kernel: LustreError: 22411:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -16 Jun 25 15:16:47 lustre5 rshd[22567]: root@batch1.dhcp.dneg.com as root: cmd='/usr/bin/tail -100 /var/log/messages | egrep 'nfs|aquire_lock'' Jun 25 15:19:17 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff8805fd416000: Connection to bravo-MDT0000 (at 10.21.22.50@tcp) was lost; in progress operations using this service will wait for recovery to complete Jun 25 15:22:44 lustre5 kernel: Lustre: 3862:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706157/real 1403706157] req@ffff8805813b3400 x1468581992549260/t0(0) o400->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 224/224 e 0 to 1 dl 1403706164 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:22:44 lustre5 kernel: LustreError: 166-1: MGC10.21.22.50@tcp: Connection to MGS (at 10.21.22.50@tcp) was lost; in progress operations using this service will fail Jun 25 15:22:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706164/real 1403706164] req@ffff88039c23a400 x1468581992550088/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706170 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:23:20 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706189/real 1403706192] req@ffff88035edce000 x1468581992550096/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706200 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:23:23 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706157/real 1403706157] req@ffff880407e56400 x1468581992549676/t0(0) o38->bravo-MDT0000-mdc-ffff8805fd416000@10.21.22.50@tcp:12/10 lens 400/544 e 0 to 1 dl 1403706203 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:23:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706214/real 1403706217] req@ffff8802a7ea1400 x1468581992550916/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706230 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:23:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 1 previous similar message Jun 25 15:24:20 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706239/real 1403706242] req@ffff880add1ee800 x1468581992551744/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706260 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:24:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706264/real 1403706267] req@ffff880badd91c00 x1468581992552564/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706290 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:24:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Jun 25 15:25:44 lustre5 kernel: Lustre: 
3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706289/real 1403706292] req@ffff88099ed52c00 x1468581992553796/t0(0) o38->bravo-MDT0000-mdc-ffff8805fd416000@10.21.22.50@tcp:12/10 lens 400/544 e 0 to 1 dl 1403706344 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:26:40 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706364/real 1403706367] req@ffff8808a519e000 x1468581992555844/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706400 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:26:40 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Jun 25 15:28:05 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706439/real 1403706441] req@ffff880a2ad93000 x1468581992558308/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706485 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 25 15:28:05 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Jun 25 15:28:34 lustre5 kernel: Lustre: Evicted from MGS (at 10.21.22.50@tcp) after server handle changed from 0x6ae95a77b3c55db to 0xca8b7428d5cc6ff2 Jun 25 15:28:34 lustre5 kernel: Lustre: MGC10.21.22.50@tcp: Connection restored to MGS (at 10.21.22.50@tcp) Jun 25 15:28:34 lustre5 kernel: LustreError: 3850:0:(client.c:2696:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88036a75fc00 x1468581974497852/t177813398500(177813398500) o101->bravo-MDT0000-mdc-ffff880603c1cc00@10.21.22.50@tcp:12/10 lens 576/544 e 0 to 0 dl 1403706571 ref 2 fl Interpret:RP/4/0 rc 301/301 Jun 25 15:28:34 lustre5 kernel: LustreError: 3850:0:(client.c:2696:ptlrpc_replay_interpret()) Skipped 39 previous similar messages Jun 25 15:28:34 lustre5 kernel: LustreError: 3850:0:(client.c:2696:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff8809243c2800 x1468581983767496/t177813992103(177813992103) o101->bravo-MDT0000-mdc-ffff880603c1cc00@10.21.22.50@tcp:12/10 lens 592/544 e 0 to 0 dl 1403706571 ref 2 fl Interpret:RP/4/0 rc 301/301 Jun 25 15:28:34 lustre5 kernel: LustreError: 3850:0:(client.c:2696:ptlrpc_replay_interpret()) Skipped 100 previous similar messages Jun 25 15:28:36 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff8805fd416000: Connection restored to bravo-MDT0000 (at 10.21.22.50@tcp) Jun 25 15:29:31 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.46.225:846 for /user_data (/user_data) Jun 25 15:30:07 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.46.206:753 for /user_data (/user_data) Jun 25 15:30:49 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.103:915 for /user_data (/user_data) Jun 25 15:31:31 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.102:835 for /user_data (/user_data) |
| Comment by Peter Jones [ 25/Jun/14 ] |
|
Bruno
Oleg suspects that this may be an example of
Thanks
Peter |
| Comment by Bruno Faccini (Inactive) [ 27/Jun/14 ] |
|
Peter, given that there is no striping in use, and that the situation does not clear/recover after the MDS has evicted the clients, my feeling is that the problem looks different from the scenarios addressed in
Daire, you indicated that this has already occurred 3 times in 3 weeks, right? If so, have you been able to take a forced crash dump of the MDS instead of simply rebooting it? Or at least, did you take the time to capture a full stack trace (Alt+SysRq+T or echo t > /proc/sysrq-trigger) before rebooting the MDS? |
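For reference, a minimal sketch of those capture steps (assuming sysrq is enabled and kdump is already set up on the MDS):
# make sure the magic SysRq interface is enabled
echo 1 > /proc/sys/kernel/sysrq
# dump the stack of every task to the kernel log (equivalent to Alt+SysRq+T)
echo t > /proc/sysrq-trigger
# force a crash so that kdump captures a vmcore (equivalent to Alt+SysRq+C)
echo c > /proc/sysrq-trigger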
| Comment by Daire Byrne (Inactive) [ 27/Jun/14 ] |
|
No, we did not take any traces before rebooting. I can certainly make sure the team here knows to do that next time. And yes, it has occurred 3-4 times in the last month. |
| Comment by Daire Byrne (Inactive) [ 14/Jul/14 ] |
|
So this just happened again, but unfortunately we could not even get a local console on the server to trigger a sysrq. We had to power cycle. |
| Comment by Daire Byrne (Inactive) [ 14/Jul/14 ] |
|
So we managed to get a partial sysrq-c/kdump. After we hit the power button we got a minute or so of console access. I'm attaching the dmesg, but I'm not sure how useful the data is... |
| Comment by Bruno Faccini (Inactive) [ 18/Jul/14 ] |
|
Hello Daire, More strange are the following processes/stacks : <6>rc S 0000000000000001 0 31964 1 0x00000000 <4> ffff8814a4671e78 0000000000000082 0000000000000000 0000000000000029 <4> ffff882d77417280 000000000238c0b8 ffff88107406a2d8 0000000000000007 <4> ffff885fc1d9c638 ffff8814a4671fd8 000000000000fb88 ffff885fc1d9c638 <4>Call Trace: <4> [<ffffffff81072a94>] do_wait+0x1e4/0x240 <4> [<ffffffff81072b93>] sys_wait4+0xa3/0x100 <4> [<ffffffff81071130>] ? child_wait_callback+0x0/0x70 <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b <6>K75netfs S 0000000000000001 0 32228 31964 0x00000000 <4> ffff882fe332fe78 0000000000000086 0000000000000000 0000000000000029 <4> ffff885fe4db68c0 0000000000f27990 ffff885fe6053c38 0000000000000007 <4> ffff885fe6168638 ffff882fe332ffd8 000000000000fb88 ffff885fe6168638 <4>Call Trace: <4> [<ffffffff81072a94>] do_wait+0x1e4/0x240 <4> [<ffffffff81072b93>] sys_wait4+0xa3/0x100 <4> [<ffffffff81071130>] ? child_wait_callback+0x0/0x70 <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b <6>umount R running task 0 32242 32228 0x00000000 <4> ffff882d5be15748 0000000000000000 ffffea00645bc5e8 ffff88000001c9c0 <4> ffff882d5be15778 ffffffff8112f15d ffff882d5be15778 ffff88000001c9c0 <4> ffff88000001c9c0 ffff88000001c9c0 ffff882d5be15808 ffffffff81130245 <4>Call Trace: <4> [<ffffffff81170ae0>] ? mem_cgroup_get_reclaim_stat_from_page+0x20/0x70 <4> [<ffffffff8127cbf1>] radix_tree_gang_lookup_slot+0x81/0xf0 <4> [<ffffffffa0b75e67>] ? jbd2_journal_try_to_free_buffers+0xa7/0x150 [jbd2] <4> [<ffffffff8113031d>] ? ____pagevec_lru_deactivate+0x20d/0x2d0 <4> [<ffffffffa0c32901>] ? ldiskfs_fh_to_dentry+0x1/0x20 [ldiskfs] <4> [<ffffffff81130558>] ? deactivate_page+0x58/0x80 <4> [<ffffffff81130a85>] ? invalidate_mapping_pages+0x165/0x1e0 <4> [<ffffffff811b631b>] ? invalidate_bdev+0x3b/0x50 <4> [<ffffffff811e1a49>] ? vfs_quota_disable+0x3a9/0x7d0 <4> [<ffffffffa0c0dfa3>] ? ldiskfs_mark_inode_dirty+0x83/0x1f0 [ldiskfs] <4> [<ffffffffa0c36900>] ? ldiskfs_quota_off+0xe0/0x1f0 [ldiskfs] <4> [<ffffffff81183eb6>] ? deactivate_super+0x46/0x80 <4> [<ffffffff811a21bf>] ? mntput_no_expire+0xbf/0x110 <4> [<ffffffffa0c816d9>] ? osd_umount+0x79/0x150 [osd_ldiskfs] <4> [<ffffffffa0c864c7>] ? osd_device_fini+0x147/0x190 [osd_ldiskfs] <4> [<ffffffffa0509bf7>] ? class_cleanup+0x577/0xda0 [obdclass] <4> [<ffffffffa04deb36>] ? class_name2dev+0x56/0xe0 [obdclass] <4> [<ffffffffa050b4dc>] ? class_process_config+0x10bc/0x1c80 [obdclass] <4> [<ffffffffa0504d03>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] <4> [<ffffffffa050c219>] ? class_manual_cleanup+0x179/0x6f0 [obdclass] <4> [<ffffffffa03a3717>] ? cfs_waitq_broadcast+0x17/0x20 [libcfs] <4> [<ffffffffa04dced2>] ? class_export_put+0x102/0x2c0 [obdclass] <4> [<ffffffffa0c87655>] ? osd_obd_disconnect+0x1c5/0x1d0 [osd_ldiskfs] <4> [<ffffffffa050e2ae>] ? lustre_put_lsi+0x17e/0x1100 [obdclass] <4> [<ffffffffa0516ff8>] ? lustre_common_put_super+0x5f8/0xc40 [obdclass] <4> [<ffffffffa0540b7a>] ? server_put_super+0x1ca/0xf00 [obdclass] <4> [<ffffffff8118363b>] ? generic_shutdown_super+0x5b/0xe0 <4> [<ffffffff81183726>] ? kill_anon_super+0x16/0x60 <4> [<ffffffffa050e0d6>] ? lustre_kill_super+0x36/0x60 [obdclass] <4> [<ffffffff81183ec7>] ? deactivate_super+0x57/0x80 <4> [<ffffffff811a21bf>] ? mntput_no_expire+0xbf/0x110 <4> [<ffffffff811a2c2b>] ? sys_umount+0x7b/0x3a0 <4> [<ffffffff810864b1>] ? sigprocmask+0x71/0x110 <4> [<ffffffff8100b072>] ? 
system_call_fastpath+0x16/0x1b Which clearly indicates that a shutdown was started and the MDT was being unmounted, and that this may have been the case for quite a long time: <4>runnable tasks: <4> task PID tree-key switches prio exec-runtime sum-exec sum-sleep <4>---------------------------------------------------------------------------------------------------------- <4>R umount 32242 291143865.264277 402 120 291143865.264277 33752.842001 36352.713363 / <4> Last, about the failed crash-dump attempt via Alt+SysRq+c: is kdump correctly installed and configured? |
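For reference, a quick way to check that on an EL6 node (a sketch; these are the stock RHEL6 paths and tools):
# is the kdump service enabled and running?
chkconfig --list kdump
service kdump status
# was memory reserved for the capture kernel at boot?
grep crashkernel /proc/cmdline
# where will the vmcore be written? (default is /var/crash)
grep -v '^#' /etc/kdump.conf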
| Comment by Daire Byrne (Inactive) [ 22/Jul/14 ] |
|
Yeah, it looks like the dump isn't any good. Basically the console was unresponsive, so we had to power cycle, which means the dump is actually from when the machine was shutting down. That is probably why the MDT threads are idle and the kdump was incomplete. We will have to try again next time we see this. The frequency seems to have dropped to around once a month now. |
| Comment by Daire Byrne (Inactive) [ 28/Aug/14 ] |
|
We had another instance of this. Couldn't login to the console etc. Managed to sysrq crash the system. I've attached the vmcore-dmesg.txt but I'm not sure if there is anything too interesting in it. |
| Comment by Bruno Faccini (Inactive) [ 29/Aug/14 ] |
|
Thanks Daire, there is interesting info in this new vmcore-dmesg.txt log! |
| Comment by Daire Byrne (Inactive) [ 01/Sep/14 ] |
|
Bruno, I've uploaded the (5GB) vmcore and syslog messages to Dropbox - https://db.tt/qJpidSFG. I hope it is of use to you. |
| Comment by Bruno Faccini (Inactive) [ 17/Sep/14 ] |
|
Hello Daire, It thus seems to me that we face a kernel/VM weakness (maybe a tuning issue?) under heavy memory pressure (on a server/MDS with 360GB of memory!), causing contention on zone->lru_lock (and the associated mm/vm structs/lists) and thus high response delays for all threads/processes, including the MDS ones. So my understanding is that, due to what I just described, operations on the whole node/server are really slow, including the MDS ones. |
| Comment by Daire Byrne (Inactive) [ 18/Sep/14 ] |
|
Okay, cheers. Well, we don't make any changes of note to the default EL6 tunings. The only things I can see that may be related are:
# jb: pdflush tunables. linuxnfs have 384GB ram so flush to disk fast
vm.dirty_background_bytes = 2147483648
vm.dirty_bytes = 4294967296
vm.dirty_expire_centisecs = 500
vm.min_free_kbytes = 2097152
# daire: allow some swapping (default=40)
vm.swappiness = 20
But I can't imagine any of these would be relevant. It is true that the whole server grinds to a halt but I guess we are no closer to figuring out what causes it. |
| Comment by Bruno Faccini (Inactive) [ 18/Sep/14 ] |
|
I think the best fix will be to prevent the MDS node from entering the vm/memory-reclaim code, which does not seem to handle this case well on a huge-memory node. I have made an audit of memory usage in the crash dump and here are the significant points:
Why don't you tune/limit this MDS so that buffers cannot fill all of memory, leaving enough room for the MDS processes/threads and slabs without the need for memory reclaim? |
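A minimal sketch of the kind of tuning being suggested (the values are illustrative and would need to be sized for this node):
# keep a larger reserve of free pages so allocations do not have to wait on reclaim
sysctl -w vm.min_free_kbytes=4194304
# reclaim dentry/inode caches more aggressively relative to the page cache (default 100)
sysctl -w vm.vfs_cache_pressure=150
# as a blunt instrument, periodically drop the clean page cache
echo 1 > /proc/sys/vm/drop_caches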
| Comment by Daire Byrne (Inactive) [ 18/Sep/14 ] |
|
Okay, I will look into limiting the buffer size (/proc/sys/vm/swappiness and/or vm.min_free_kbytes?). The main reason so many inodes end up in the buffer cache is that we use this filesystem to back up all our production storage, so a full scan occurs every night across ~157987516 files. |
| Comment by Bruno Faccini (Inactive) [ 19/Sep/14 ] |
|
I think that in your particular configuration (huge-memory node!) and usage pattern (a full scan of the FS causing a lot of MDT/metadata blocks to fill the buffer cache), the best parameter to change to address this would be swappiness, setting it higher, even up to 100 (%). |
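For example, applying that both at runtime and persistently (standard sysctl conventions):
# apply immediately
sysctl -w vm.swappiness=100
# make it persistent across reboots
echo 'vm.swappiness = 100' >> /etc/sysctl.conf
sysctl -p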
| Comment by Daire Byrne (Inactive) [ 13/Oct/14 ] |
|
So it happened again last night after a couple of weeks with swappiness=90. I'll try setting it to 100, but I can't really see much difference in the memory usage graphs. I've attached the messages from overnight. |
| Comment by Daire Byrne (Inactive) [ 13/Oct/14 ] |
|
Maybe this is related to |
| Comment by Bruno Faccini (Inactive) [ 23/Oct/14 ] |
|
Hello Daire, Would be cool if you could provide me with at least the "cat /proc/meminfo" output just before you run the "echo 1 > /proc/sys/vm/drop_caches" cmd, and again the "cat /proc/meminfo" output just after. |
| Comment by Daire Byrne (Inactive) [ 28/Oct/14 ] |
|
Bruno, I have actually set up a cron job to run "echo 3 > /proc/sys/vm/drop_caches" every Saturday at midnight (graph included). As you can see, the memory quickly gets filled again on these busy filesystems (our backups also start at midnight). I've attached the meminfo before and after dropping caches as you requested (also seen at the end of the attached graph). We are currently running with swappiness=100. I'll add the extra_free_kbytes setting as well. All we can do is wait a few weeks to see if it falls over again. |
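A rough sketch of such a weekly job with the before/after meminfo capture folded in (the cron schedule, file name, and log paths here are illustrative, not the exact crontab in use):
# hypothetical /etc/cron.d/drop_caches entry: Saturday at midnight
0 0 * * 6 root cat /proc/meminfo > /var/log/meminfo.before && sync && echo 3 > /proc/sys/vm/drop_caches && cat /proc/meminfo > /var/log/meminfo.after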
| Comment by Daire Byrne (Inactive) [ 03/Dec/14 ] |
|
Just to update - we have not seen this memory pressure crash in 51 days. The kswapd thread constantly uses 50-100% CPU, but if that helps keep the server running then so be it.
# free
total used free shared buffers cached
Mem: 396857076 366020780 30836296 0 345853496 12948
-/+ buffers/cache: 20154336 376702740
Swap: 16777208 763064 16014144
top - 11:59:34 up 51 days, 3:24, 1 user, load average: 2.43, 2.27, 2.75
Tasks: 1718 total, 2 running, 1716 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 5.0%sy, 0.0%ni, 87.6%id, 7.2%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 396857076k total, 366033176k used, 30823900k free, 345864496k buffers
Swap: 16777208k total, 763100k used, 16014108k free, 16216k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
499 root 20 0 0 0 0 D 81.0 0.0 6426:53 kswapd1
5061 root 20 0 0 0 0 S 9.0 0.0 6033:55 socknal_sd03_00
6836 root 20 0 0 0 0 S 9.0 0.0 615:03.16 mdt07_016
6747 root 20 0 0 0 0 D 5.4 0.0 405:22.90 mdt03_043
6833 root 20 0 0 0 0 S 5.4 0.0 585:46.06 mdt07_013
16840 root 20 0 27136 2708 1076 R 5.4 0.0 0:00.06 top
17061 root 20 0 0 0 0 S 5.4 0.0 563:14.19 mdt00_050
21161 root 20 0 0 0 0 S 5.4 0.0 476:50.74 mdt03_039
vm.min_free_kbytes = 4194304
vm.extra_free_kbytes = 4194304
vm.vfs_cache_pressure = 100
vm.swappiness = 100
00 23 * * 0 echo 3 > /proc/sys/vm/drop_caches
|
| Comment by Bruno Faccini (Inactive) [ 04/Dec/14 ] |
|
Hello Daire, |
| Comment by Daire Byrne (Inactive) [ 23/Jan/15 ] |
|
Still no recurrence of this since making the sysctl tweaks. You can close it now. Cheers. |
| Comment by Peter Jones [ 23/Jan/15 ] |
|
That's good news - thanks Daire! |
| Comment by Daire Byrne (Inactive) [ 16/Mar/15 ] |
|
We may have just seen this again after many months with swappiness=100. Same symptoms - the MDS becomes unresponsive and requires a hard reboot. I've attached the messages - if it looks like a different issue I can open a new ticket. Cheers. |
| Comment by Daire Byrne (Inactive) [ 16/Mar/15 ] |
|
It may still be us. Apparently we are setting swappiness=20 and then swappiness=100 directly afterwards every hour. Don't you love config management systems.... |
| Comment by Bruno Faccini (Inactive) [ 24/Mar/15 ] |
|
Hello Daire, |
| Comment by Daire Byrne (Inactive) [ 24/Jun/15 ] |
|
Sorry, I forgot about this ticket. You can close it again now - we have fixed our config management system. |
| Comment by Peter Jones [ 24/Jun/15 ] |
|
ok - thanks Daire |
| Comment by Daire Byrne (Inactive) [ 07/Dec/15 ] |
|
Sorry... this is still an issue (running out of memory) - we just have to auto-reboot the server once every month. We recently updated to v2.5.3 and it seems like the frequency of the MDS crashes has increased to weekly reboots. We are still running with:
vm.min_free_kbytes = 4194304
vm.extra_free_kbytes = 4194304
vm.swappiness = 100
Is there anything else we can try to avoid running out of memory so frequently? I'll try and get another sysrq dump of this with v2.5.3, but it's tricky with such a large memory node. |
| Comment by Bruno Faccini (Inactive) [ 07/Dec/15 ] |
|
Hello Daire, |
| Comment by Daire Byrne (Inactive) [ 07/Dec/15 ] |
More like we detect the memory problem first and then auto-reboot the server (if free<10G). It used to happen once a month, give or take a week. The load on these servers is pretty consistent from day to day.
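A minimal sketch of that kind of check (the threshold parsing and action are illustrative; the real monitoring/reboot logic is presumably more involved):
# reboot (or alert) if MemFree drops below ~10 GB (values in /proc/meminfo are in kB)
free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
if [ "$free_kb" -lt 10485760 ]; then
    logger "MDS free memory below 10GB (${free_kb} kB), triggering reboot"
    # shutdown -r now    # action left commented out in this sketch
fi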
No, not yet. I have instructed our engineers to do this next time, and I have disabled the automatic reboot so they can maybe sysrq it.
I will endeavor to get that information to you in a few days' time. In the meantime, here is the output from the MDS server after only 1 day of uptime since the last incident.
bmds1 / # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 4982 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 17497 MB
node distances:
node 0 1
0: 10 20
1: 20 10
bmds1 / # cat /proc/zoneinfo
Node 0, zone DMA
pages free 3934
min 60
low 135
high 150
scanned 0
spanned 4095
present 3835
nr_free_pages 3934
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
numa_hit 1
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 1
nr_anon_transparent_hugepages 0
protection: (0, 1931, 129191, 129191)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 2
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 3
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 4
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 5
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 6
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 7
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 8
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 9
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 10
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 11
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 12
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 13
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 14
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 15
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 16
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 17
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 18
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 19
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 20
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 21
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 22
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 23
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 24
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 25
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 26
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 27
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 28
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 29
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 30
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 31
count: 0
high: 0
batch: 1
vm stats threshold: 12
all_unreclaimable: 1
prev_priority: 12
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 110046
min 7833
low 17624
high 19582
scanned 0
spanned 1044480
present 494376
nr_free_pages 110046
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
numa_hit 2029
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 2017
numa_other 12
nr_anon_transparent_hugepages 0
protection: (0, 0, 127260, 127260)
pagesets
cpu: 0
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 1
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 2
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 3
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 4
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 5
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 6
count: 30
high: 186
batch: 31
vm stats threshold: 60
cpu: 7
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 8
count: 1
high: 186
batch: 31
vm stats threshold: 60
cpu: 9
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 10
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 12
count: 1
high: 186
batch: 31
vm stats threshold: 60
cpu: 13
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 14
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 15
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 16
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 17
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 18
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 19
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 20
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 21
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 22
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 23
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 24
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 25
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 26
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 27
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 28
count: 30
high: 186
batch: 31
vm stats threshold: 60
cpu: 29
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 30
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 31
count: 0
high: 186
batch: 31
vm stats threshold: 60
all_unreclaimable: 1
prev_priority: 12
start_pfn: 4096
inactive_ratio: 3
Node 0, zone Normal
pages free 1161377
min 516243
low 1161546
high 1290607
scanned 0
spanned 33030144
present 32578560
nr_free_pages 1161377
nr_inactive_anon 16264
nr_active_anon 4646
nr_inactive_file 13336802
nr_active_file 13708442
nr_unevictable 0
nr_mlock 0
nr_anon_pages 18320
nr_mapped 3845
nr_file_pages 27045274
nr_dirty 12702
nr_writeback 0
nr_slab_reclaimable 3244308
nr_slab_unreclaimable 1087317
nr_page_table_pages 1220
nr_kernel_stack 597
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 30
numa_hit 7687506784
numa_miss 25118087
numa_foreign 3245810767
numa_interleave 60017
numa_local 7687483560
numa_other 25141311
nr_anon_transparent_hugepages 5
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 87
high: 186
batch: 31
vm stats threshold: 125
cpu: 1
count: 100
high: 186
batch: 31
vm stats threshold: 125
cpu: 2
count: 186
high: 186
batch: 31
vm stats threshold: 125
cpu: 3
count: 122
high: 186
batch: 31
vm stats threshold: 125
cpu: 4
count: 158
high: 186
batch: 31
vm stats threshold: 125
cpu: 5
count: 163
high: 186
batch: 31
vm stats threshold: 125
cpu: 6
count: 182
high: 186
batch: 31
vm stats threshold: 125
cpu: 7
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 8
count: 161
high: 186
batch: 31
vm stats threshold: 125
cpu: 9
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 10
count: 86
high: 186
batch: 31
vm stats threshold: 125
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 12
count: 61
high: 186
batch: 31
vm stats threshold: 125
cpu: 13
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 14
count: 169
high: 186
batch: 31
vm stats threshold: 125
cpu: 15
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 16
count: 54
high: 186
batch: 31
vm stats threshold: 125
cpu: 17
count: 2
high: 186
batch: 31
vm stats threshold: 125
cpu: 18
count: 128
high: 186
batch: 31
vm stats threshold: 125
cpu: 19
count: 3
high: 186
batch: 31
vm stats threshold: 125
cpu: 20
count: 142
high: 186
batch: 31
vm stats threshold: 125
cpu: 21
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 22
count: 44
high: 186
batch: 31
vm stats threshold: 125
cpu: 23
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 24
count: 165
high: 186
batch: 31
vm stats threshold: 125
cpu: 25
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 26
count: 148
high: 186
batch: 31
vm stats threshold: 125
cpu: 27
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 28
count: 172
high: 186
batch: 31
vm stats threshold: 125
cpu: 29
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 30
count: 170
high: 186
batch: 31
vm stats threshold: 125
cpu: 31
count: 0
high: 186
batch: 31
vm stats threshold: 125
all_unreclaimable: 0
prev_priority: 12
start_pfn: 1048576
inactive_ratio: 35
Node 1, zone Normal
pages free 4336363
min 524437
low 1179983
high 1311092
scanned 0
spanned 33554432
present 33095680
nr_free_pages 4336363
nr_inactive_anon 15126
nr_active_anon 2017
nr_inactive_file 12266036
nr_active_file 12800791
nr_unevictable 0
nr_mlock 0
nr_anon_pages 14012
nr_mapped 2008
nr_file_pages 25067210
nr_dirty 127401
nr_writeback 0
nr_slab_reclaimable 2577430
nr_slab_unreclaimable 876760
nr_page_table_pages 1226
nr_kernel_stack 957
nr_unstable 0
nr_bounce 0
nr_vmscan_write 25
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 34
numa_hit 3879043142
numa_miss 3245810767
numa_foreign 25118087
numa_interleave 60079
numa_local 3878729990
numa_other 3246123919
nr_anon_transparent_hugepages 6
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 183
high: 186
batch: 31
vm stats threshold: 125
cpu: 1
count: 185
high: 186
batch: 31
vm stats threshold: 125
cpu: 2
count: 46
high: 186
batch: 31
vm stats threshold: 125
cpu: 3
count: 159
high: 186
batch: 31
vm stats threshold: 125
cpu: 4
count: 35
high: 186
batch: 31
vm stats threshold: 125
cpu: 5
count: 27
high: 186
batch: 31
vm stats threshold: 125
cpu: 6
count: 204
high: 186
batch: 31
vm stats threshold: 125
cpu: 7
count: 194
high: 186
batch: 31
vm stats threshold: 125
cpu: 8
count: 24
high: 186
batch: 31
vm stats threshold: 125
cpu: 9
count: 101
high: 186
batch: 31
vm stats threshold: 125
cpu: 10
count: 152
high: 186
batch: 31
vm stats threshold: 125
cpu: 11
count: 59
high: 186
batch: 31
vm stats threshold: 125
cpu: 12
count: 34
high: 186
batch: 31
vm stats threshold: 125
cpu: 13
count: 158
high: 186
batch: 31
vm stats threshold: 125
cpu: 14
count: 32
high: 186
batch: 31
vm stats threshold: 125
cpu: 15
count: 129
high: 186
batch: 31
vm stats threshold: 125
cpu: 16
count: 32
high: 186
batch: 31
vm stats threshold: 125
cpu: 17
count: 189
high: 186
batch: 31
vm stats threshold: 125
cpu: 18
count: 30
high: 186
batch: 31
vm stats threshold: 125
cpu: 19
count: 165
high: 186
batch: 31
vm stats threshold: 125
cpu: 20
count: 33
high: 186
batch: 31
vm stats threshold: 125
cpu: 21
count: 43
high: 186
batch: 31
vm stats threshold: 125
cpu: 22
count: 183
high: 186
batch: 31
vm stats threshold: 125
cpu: 23
count: 45
high: 186
batch: 31
vm stats threshold: 125
cpu: 24
count: 155
high: 186
batch: 31
vm stats threshold: 125
cpu: 25
count: 130
high: 186
batch: 31
vm stats threshold: 125
cpu: 26
count: 41
high: 186
batch: 31
vm stats threshold: 125
cpu: 27
count: 178
high: 186
batch: 31
vm stats threshold: 125
cpu: 28
count: 167
high: 186
batch: 31
vm stats threshold: 125
cpu: 29
count: 122
high: 186
batch: 31
vm stats threshold: 125
cpu: 30
count: 184
high: 186
batch: 31
vm stats threshold: 125
cpu: 31
count: 163
high: 186
batch: 31
vm stats threshold: 125
all_unreclaimable: 0
prev_priority: 12
start_pfn: 34078720
inactive_ratio: 35
|
| Comment by Bruno Faccini (Inactive) [ 09/Dec/15 ] |
|
Well, it is a bit strange that processes/threads running on Node 1 are heavily trying to allocate on Node 0 ... |
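To keep an eye on those cross-node allocations over time, the standard counters can be sampled like this (a sketch using stock tools):
# per-node allocation counters (numa_hit / numa_miss / numa_foreign)
numastat
# free memory per NUMA node
numactl --hardware | grep free
# the same counters, per zone, as already captured in the zoneinfo output above
grep -E 'numa_(hit|miss|foreign)' /proc/zoneinfo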
| Comment by Daire Byrne (Inactive) [ 09/Dec/15 ] |
|
We don't have any special ptlrpcd configuration. Here's another grab of the numactl and zoneinfo output (after 3 days of uptime now), plus the requested lspci -tv.
bmds1 /root # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 4981 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 16123 MB
node distances:
node 0 1
0: 10 20
1: 20 10
bmds1 /root # cat /proc/zoneinfo
Node 0, zone DMA
pages free 3934
min 60
low 135
high 150
scanned 0
spanned 4095
present 3835
nr_free_pages 3934
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
numa_hit 1
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 1
nr_anon_transparent_hugepages 0
protection: (0, 1931, 129191, 129191)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 2
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 3
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 4
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 5
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 6
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 7
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 8
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 9
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 10
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 11
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 12
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 13
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 14
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 15
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 16
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 17
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 18
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 19
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 20
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 21
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 22
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 23
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 24
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 25
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 26
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 27
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 28
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 29
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 30
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 31
count: 0
high: 0
batch: 1
vm stats threshold: 12
all_unreclaimable: 1
prev_priority: 12
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 110077
min 7833
low 17624
high 19582
scanned 0
spanned 1044480
present 494376
nr_free_pages 110077
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
numa_hit 3722
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 3705
numa_other 17
nr_anon_transparent_hugepages 0
protection: (0, 0, 127260, 127260)
pagesets
cpu: 0
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 1
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 2
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 3
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 4
count: 1
high: 186
batch: 31
vm stats threshold: 60
cpu: 5
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 6
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 7
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 8
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 9
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 10
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 12
count: 30
high: 186
batch: 31
vm stats threshold: 60
cpu: 13
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 14
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 15
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 16
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 17
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 18
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 19
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 20
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 21
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 22
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 23
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 24
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 25
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 26
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 27
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 28
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 29
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 30
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 31
count: 0
high: 186
batch: 31
vm stats threshold: 60
all_unreclaimable: 1
prev_priority: 12
start_pfn: 4096
inactive_ratio: 3
Node 0, zone Normal
pages free 1161500
min 516243
low 1161546
high 1290607
scanned 0
spanned 33030144
present 32578560
nr_free_pages 1161500
nr_inactive_anon 16343
nr_active_anon 4794
nr_inactive_file 13047605
nr_active_file 13835116
nr_unevictable 0
nr_mlock 0
nr_anon_pages 19541
nr_mapped 1956
nr_file_pages 26882808
nr_dirty 16798
nr_writeback 0
nr_slab_reclaimable 3075559
nr_slab_unreclaimable 1416404
nr_page_table_pages 1319
nr_kernel_stack 601
nr_unstable 0
nr_bounce 0
nr_vmscan_write 3854
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 31
numa_hit 24437944447
numa_miss 201935750
numa_foreign 9386764171
numa_interleave 60017
numa_local 24437875632
numa_other 202004565
nr_anon_transparent_hugepages 3
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 152
high: 186
batch: 31
vm stats threshold: 125
cpu: 1
count: 119
high: 186
batch: 31
vm stats threshold: 125
cpu: 2
count: 214
high: 186
batch: 31
vm stats threshold: 125
cpu: 3
count: 148
high: 186
batch: 31
vm stats threshold: 125
cpu: 4
count: 168
high: 186
batch: 31
vm stats threshold: 125
cpu: 5
count: 127
high: 186
batch: 31
vm stats threshold: 125
cpu: 6
count: 56
high: 186
batch: 31
vm stats threshold: 125
cpu: 7
count: 137
high: 186
batch: 31
vm stats threshold: 125
cpu: 8
count: 236
high: 186
batch: 31
vm stats threshold: 125
cpu: 9
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 10
count: 179
high: 186
batch: 31
vm stats threshold: 125
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 12
count: 187
high: 186
batch: 31
vm stats threshold: 125
cpu: 13
count: 11
high: 186
batch: 31
vm stats threshold: 125
cpu: 14
count: 61
high: 186
batch: 31
vm stats threshold: 125
cpu: 15
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 16
count: 16
high: 186
batch: 31
vm stats threshold: 125
cpu: 17
count: 137
high: 186
batch: 31
vm stats threshold: 125
cpu: 18
count: 62
high: 186
batch: 31
vm stats threshold: 125
cpu: 19
count: 145
high: 186
batch: 31
vm stats threshold: 125
cpu: 20
count: 101
high: 186
batch: 31
vm stats threshold: 125
cpu: 21
count: 130
high: 186
batch: 31
vm stats threshold: 125
cpu: 22
count: 183
high: 186
batch: 31
vm stats threshold: 125
cpu: 23
count: 116
high: 186
batch: 31
vm stats threshold: 125
cpu: 24
count: 187
high: 186
batch: 31
vm stats threshold: 125
cpu: 25
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 26
count: 185
high: 186
batch: 31
vm stats threshold: 125
cpu: 27
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 28
count: 222
high: 186
batch: 31
vm stats threshold: 125
cpu: 29
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 30
count: 157
high: 186
batch: 31
vm stats threshold: 125
cpu: 31
count: 0
high: 186
batch: 31
vm stats threshold: 125
all_unreclaimable: 0
prev_priority: 12
start_pfn: 1048576
inactive_ratio: 35
Node 1, zone Normal
pages free 3876044
min 524437
low 1179983
high 1311092
scanned 0
spanned 33554432
present 33095680
nr_free_pages 3876044
nr_inactive_anon 16221
nr_active_anon 564
nr_inactive_file 12740495
nr_active_file 13455109
nr_unevictable 0
nr_mlock 0
nr_anon_pages 13551
nr_mapped 2654
nr_file_pages 26198065
nr_dirty 141941
nr_writeback 0
nr_slab_reclaimable 2174549
nr_slab_unreclaimable 606113
nr_page_table_pages 1296
nr_kernel_stack 975
nr_unstable 0
nr_bounce 0
nr_vmscan_write 890
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 32
numa_hit 10920091963
numa_miss 9386764171
numa_foreign 201935750
numa_interleave 60079
numa_local 10919375847
numa_other 9387480287
nr_anon_transparent_hugepages 6
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 193
high: 186
batch: 31
vm stats threshold: 125
cpu: 1
count: 195
high: 186
batch: 31
vm stats threshold: 125
cpu: 2
count: 51
high: 186
batch: 31
vm stats threshold: 125
cpu: 3
count: 124
high: 186
batch: 31
vm stats threshold: 125
cpu: 4
count: 71
high: 186
batch: 31
vm stats threshold: 125
cpu: 5
count: 226
high: 186
batch: 31
vm stats threshold: 125
cpu: 6
count: 82
high: 186
batch: 31
vm stats threshold: 125
cpu: 7
count: 203
high: 186
batch: 31
vm stats threshold: 125
cpu: 8
count: 73
high: 186
batch: 31
vm stats threshold: 125
cpu: 9
count: 158
high: 186
batch: 31
vm stats threshold: 125
cpu: 10
count: 168
high: 186
batch: 31
vm stats threshold: 125
cpu: 11
count: 114
high: 186
batch: 31
vm stats threshold: 125
cpu: 12
count: 56
high: 186
batch: 31
vm stats threshold: 125
cpu: 13
count: 54
high: 186
batch: 31
vm stats threshold: 125
cpu: 14
count: 170
high: 186
batch: 31
vm stats threshold: 125
cpu: 15
count: 149
high: 186
batch: 31
vm stats threshold: 125
cpu: 16
count: 231
high: 186
batch: 31
vm stats threshold: 125
cpu: 17
count: 199
high: 186
batch: 31
vm stats threshold: 125
cpu: 18
count: 54
high: 186
batch: 31
vm stats threshold: 125
cpu: 19
count: 38
high: 186
batch: 31
vm stats threshold: 125
cpu: 20
count: 78
high: 186
batch: 31
vm stats threshold: 125
cpu: 21
count: 206
high: 186
batch: 31
vm stats threshold: 125
cpu: 22
count: 57
high: 186
batch: 31
vm stats threshold: 125
cpu: 23
count: 62
high: 186
batch: 31
vm stats threshold: 125
cpu: 24
count: 214
high: 186
batch: 31
vm stats threshold: 125
cpu: 25
count: 96
high: 186
batch: 31
vm stats threshold: 125
cpu: 26
count: 171
high: 186
batch: 31
vm stats threshold: 125
cpu: 27
count: 77
high: 186
batch: 31
vm stats threshold: 125
cpu: 28
count: 186
high: 186
batch: 31
vm stats threshold: 125
cpu: 29
count: 179
high: 186
batch: 31
vm stats threshold: 125
cpu: 30
count: 180
high: 186
batch: 31
vm stats threshold: 125
cpu: 31
count: 82
high: 186
batch: 31
vm stats threshold: 125
all_unreclaimable: 0
prev_priority: 12
start_pfn: 34078720
inactive_ratio: 35
bmds1 /root # lspci -tv
-+-[0000:7f]-+-08.0 Intel Corporation Xeon E5/Core i7 QPI Link 0
| +-09.0 Intel Corporation Xeon E5/Core i7 QPI Link 1
| +-0a.0 Intel Corporation Xeon E5/Core i7 Power Control Unit 0
| +-0a.1 Intel Corporation Xeon E5/Core i7 Power Control Unit 1
| +-0a.2 Intel Corporation Xeon E5/Core i7 Power Control Unit 2
| +-0a.3 Intel Corporation Xeon E5/Core i7 Power Control Unit 3
| +-0b.0 Intel Corporation Xeon E5/Core i7 Interrupt Control Registers
| +-0b.3 Intel Corporation Xeon E5/Core i7 Semaphore and Scratchpad Configuration Registers
| +-0c.0 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0c.1 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0c.2 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0c.3 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0c.6 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 0
| +-0c.7 Intel Corporation Xeon E5/Core i7 System Address Decoder
| +-0d.0 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0d.1 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0d.2 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0d.3 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0d.6 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 1
| +-0e.0 Intel Corporation Xeon E5/Core i7 Processor Home Agent
| +-0e.1 Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring
| +-0f.0 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Registers
| +-0f.1 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller RAS Registers
| +-0f.2 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 0
| +-0f.3 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 1
| +-0f.4 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 2
| +-0f.5 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 3
| +-0f.6 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 4
| +-10.0 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0
| +-10.1 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1
| +-10.2 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 0
| +-10.3 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 1
| +-10.4 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2
| +-10.5 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3
| +-10.6 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 2
| +-10.7 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 3
| +-11.0 Intel Corporation Xeon E5/Core i7 DDRIO
| +-13.0 Intel Corporation Xeon E5/Core i7 R2PCIe
| +-13.1 Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor
| +-13.4 Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers
| +-13.5 Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor
| \-13.6 Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor
+-[0000:40]-+-01.0-[41]----00.0 LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
| +-03.0-[42]----00.0 LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2
| +-05.0 Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management
| \-05.2 Intel Corporation Xeon E5/Core i7 Control Status and Global Errors
+-[0000:3f]-+-08.0 Intel Corporation Xeon E5/Core i7 QPI Link 0
| +-09.0 Intel Corporation Xeon E5/Core i7 QPI Link 1
| +-0a.0 Intel Corporation Xeon E5/Core i7 Power Control Unit 0
| +-0a.1 Intel Corporation Xeon E5/Core i7 Power Control Unit 1
| +-0a.2 Intel Corporation Xeon E5/Core i7 Power Control Unit 2
| +-0a.3 Intel Corporation Xeon E5/Core i7 Power Control Unit 3
| +-0b.0 Intel Corporation Xeon E5/Core i7 Interrupt Control Registers
| +-0b.3 Intel Corporation Xeon E5/Core i7 Semaphore and Scratchpad Configuration Registers
| +-0c.0 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0c.1 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0c.2 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0c.3 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0c.6 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 0
| +-0c.7 Intel Corporation Xeon E5/Core i7 System Address Decoder
| +-0d.0 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0d.1 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0d.2 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0d.3 Intel Corporation Xeon E5/Core i7 Unicast Register 0
| +-0d.6 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 1
| +-0e.0 Intel Corporation Xeon E5/Core i7 Processor Home Agent
| +-0e.1 Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring
| +-0f.0 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Registers
| +-0f.1 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller RAS Registers
| +-0f.2 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 0
| +-0f.3 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 1
| +-0f.4 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 2
| +-0f.5 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 3
| +-0f.6 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 4
| +-10.0 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0
| +-10.1 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1
| +-10.2 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 0
| +-10.3 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 1
| +-10.4 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2
| +-10.5 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3
| +-10.6 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 2
| +-10.7 Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 3
| +-11.0 Intel Corporation Xeon E5/Core i7 DDRIO
| +-13.0 Intel Corporation Xeon E5/Core i7 R2PCIe
| +-13.1 Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor
| +-13.4 Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers
| +-13.5 Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor
| \-13.6 Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor
\-[0000:00]-+-00.0 Intel Corporation Xeon E5/Core i7 DMI2
+-01.0-[01]--+-00.0 Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet
| +-00.1 Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet
| +-00.2 Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet
| \-00.3 Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet
+-02.0-[03]--
+-02.2-[02]----00.0 LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
+-03.0-[04]----00.0 Virident Systems Inc. Virident FlashMAX Drive V2
+-05.0 Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management
+-05.2 Intel Corporation Xeon E5/Core i7 Control Status and Global Errors
+-11.0-[05]--
+-16.0 Intel Corporation C600/X79 series chipset MEI Controller #1
+-16.1 Intel Corporation C600/X79 series chipset MEI Controller #2
+-1a.0 Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #2
+-1c.0-[06]--
+-1c.7-[07-0b]----00.0-[08-0b]--+-00.0-[09-0a]----00.0-[0a]----00.0 Matrox Electronics Systems Ltd. G200eR2
| \-01.0-[0b]--
+-1d.0 Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #1
+-1e.0-[0c]--
+-1f.0 Intel Corporation C600/X79 series chipset LPC Controller
\-1f.2 Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller
|
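On the ptlrpcd configuration point above, a small sketch for listing whatever the loaded Lustre modules were actually started with; parameter names vary between versions, so this simply dumps what is exposed rather than assuming any particular tunable exists:

# Tunables exposed by the loaded modules (e.g. libcfs cpu_npartitions / cpu_pattern, if present)
grep -H . /sys/module/libcfs/parameters/* /sys/module/ptlrpc/parameters/* 2>/dev/null

# Any options passed at module load time
grep -rEi 'options (libcfs|lnet|ptlrpc|lustre|mdt)' /etc/modprobe.d/ 2>/dev/null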
| Comment by Daire Byrne (Inactive) [ 14/Dec/15 ] |
|
After 8 days of uptime - still waiting for the inevitable crash...
bmds1 # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 8025 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 17273 MB
node distances:
node 0 1
0: 10 20
1: 20 10
bmds1 /root # cat /proc/zoneinfo
Node 0, zone DMA
pages free 3934
min 60
low 135
high 150
scanned 0
spanned 4095
present 3835
nr_free_pages 3934
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
numa_hit 1
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 1
nr_anon_transparent_hugepages 0
protection: (0, 1931, 129191, 129191)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 2
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 3
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 4
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 5
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 6
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 7
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 8
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 9
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 10
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 11
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 12
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 13
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 14
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 15
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 16
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 17
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 18
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 19
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 20
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 21
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 22
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 23
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 24
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 25
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 26
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 27
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 28
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 29
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 30
count: 0
high: 0
batch: 1
vm stats threshold: 12
cpu: 31
count: 0
high: 0
batch: 1
vm stats threshold: 12
all_unreclaimable: 1
prev_priority: 12
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 110046
min 7833
low 17624
high 19582
scanned 0
spanned 1044480
present 494376
nr_free_pages 110046
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
numa_hit 9738
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 9695
numa_other 43
nr_anon_transparent_hugepages 0
protection: (0, 0, 127260, 127260)
pagesets
cpu: 0
count: 30
high: 186
batch: 31
vm stats threshold: 60
cpu: 1
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 2
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 3
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 4
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 5
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 6
count: 1
high: 186
batch: 31
vm stats threshold: 60
cpu: 7
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 8
count: 30
high: 186
batch: 31
vm stats threshold: 60
cpu: 9
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 10
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 12
count: 1
high: 186
batch: 31
vm stats threshold: 60
cpu: 13
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 14
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 15
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 16
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 17
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 18
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 19
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 20
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 21
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 22
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 23
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 24
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 25
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 26
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 27
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 28
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 29
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 30
count: 0
high: 186
batch: 31
vm stats threshold: 60
cpu: 31
count: 0
high: 186
batch: 31
vm stats threshold: 60
all_unreclaimable: 1
prev_priority: 12
start_pfn: 4096
inactive_ratio: 3
Node 0, zone Normal
pages free 2036613
min 516243
low 1161546
high 1290607
scanned 0
spanned 33030144
present 32578560
nr_free_pages 2036613
nr_inactive_anon 4511
nr_active_anon 262
nr_inactive_file 14046304
nr_active_file 14206134
nr_unevictable 0
nr_mlock 0
nr_anon_pages 2802
nr_mapped 728
nr_file_pages 28253303
nr_dirty 96376
nr_writeback 0
nr_slab_reclaimable 1740250
nr_slab_unreclaimable 518722
nr_page_table_pages 1323
nr_kernel_stack 619
nr_unstable 0
nr_bounce 0
nr_vmscan_write 486599
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 1
numa_hit 71200515152
numa_miss 2298983520
numa_foreign 18750545338
numa_interleave 60017
numa_local 71200327746
numa_other 2299170926
nr_anon_transparent_hugepages 3
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 148
high: 186
batch: 31
vm stats threshold: 125
cpu: 1
count: 158
high: 186
batch: 31
vm stats threshold: 125
cpu: 2
count: 75
high: 186
batch: 31
vm stats threshold: 125
cpu: 3
count: 174
high: 186
batch: 31
vm stats threshold: 125
cpu: 4
count: 162
high: 186
batch: 31
vm stats threshold: 125
cpu: 5
count: 181
high: 186
batch: 31
vm stats threshold: 125
cpu: 6
count: 173
high: 186
batch: 31
vm stats threshold: 125
cpu: 7
count: 178
high: 186
batch: 31
vm stats threshold: 125
cpu: 8
count: 142
high: 186
batch: 31
vm stats threshold: 125
cpu: 9
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 10
count: 183
high: 186
batch: 31
vm stats threshold: 125
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 12
count: 175
high: 186
batch: 31
vm stats threshold: 125
cpu: 13
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 14
count: 214
high: 186
batch: 31
vm stats threshold: 125
cpu: 15
count: 1
high: 186
batch: 31
vm stats threshold: 125
cpu: 16
count: 49
high: 186
batch: 31
vm stats threshold: 125
cpu: 17
count: 164
high: 186
batch: 31
vm stats threshold: 125
cpu: 18
count: 167
high: 186
batch: 31
vm stats threshold: 125
cpu: 19
count: 174
high: 186
batch: 31
vm stats threshold: 125
cpu: 20
count: 113
high: 186
batch: 31
vm stats threshold: 125
cpu: 21
count: 176
high: 186
batch: 31
vm stats threshold: 125
cpu: 22
count: 166
high: 186
batch: 31
vm stats threshold: 125
cpu: 23
count: 165
high: 186
batch: 31
vm stats threshold: 125
cpu: 24
count: 60
high: 186
batch: 31
vm stats threshold: 125
cpu: 25
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 26
count: 181
high: 186
batch: 31
vm stats threshold: 125
cpu: 27
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 28
count: 165
high: 186
batch: 31
vm stats threshold: 125
cpu: 29
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 30
count: 184
high: 186
batch: 31
vm stats threshold: 125
cpu: 31
count: 2
high: 186
batch: 31
vm stats threshold: 125
all_unreclaimable: 0
prev_priority: 12
start_pfn: 1048576
inactive_ratio: 35
Node 1, zone Normal
pages free 6797220
min 524437
low 1179983
high 1311092
scanned 0
spanned 33554432
present 33095680
nr_free_pages 6797220
nr_inactive_anon 10776
nr_active_anon 1964
nr_inactive_file 12050770
nr_active_file 12364474
nr_unevictable 0
nr_mlock 0
nr_anon_pages 10537
nr_mapped 3388
nr_file_pages 24416518
nr_dirty 41224
nr_writeback 0
nr_slab_reclaimable 1120374
nr_slab_unreclaimable 531249
nr_page_table_pages 1346
nr_kernel_stack 996
nr_unstable 0
nr_bounce 0
nr_vmscan_write 8957
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 1
numa_hit 27356716669
numa_miss 18750545338
numa_foreign 2298983520
numa_interleave 60079
numa_local 27355207924
numa_other 18752054083
nr_anon_transparent_hugepages 3
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 158
high: 186
batch: 31
vm stats threshold: 125
cpu: 1
count: 135
high: 186
batch: 31
vm stats threshold: 125
cpu: 2
count: 159
high: 186
batch: 31
vm stats threshold: 125
cpu: 3
count: 184
high: 186
batch: 31
vm stats threshold: 125
cpu: 4
count: 155
high: 186
batch: 31
vm stats threshold: 125
cpu: 5
count: 173
high: 186
batch: 31
vm stats threshold: 125
cpu: 6
count: 185
high: 186
batch: 31
vm stats threshold: 125
cpu: 7
count: 39
high: 186
batch: 31
vm stats threshold: 125
cpu: 8
count: 139
high: 186
batch: 31
vm stats threshold: 125
cpu: 9
count: 118
high: 186
batch: 31
vm stats threshold: 125
cpu: 10
count: 171
high: 186
batch: 31
vm stats threshold: 125
cpu: 11
count: 138
high: 186
batch: 31
vm stats threshold: 125
cpu: 12
count: 175
high: 186
batch: 31
vm stats threshold: 125
cpu: 13
count: 69
high: 186
batch: 31
vm stats threshold: 125
cpu: 14
count: 173
high: 186
batch: 31
vm stats threshold: 125
cpu: 15
count: 180
high: 186
batch: 31
vm stats threshold: 125
cpu: 16
count: 180
high: 186
batch: 31
vm stats threshold: 125
cpu: 17
count: 7
high: 186
batch: 31
vm stats threshold: 125
cpu: 18
count: 185
high: 186
batch: 31
vm stats threshold: 125
cpu: 19
count: 28
high: 186
batch: 31
vm stats threshold: 125
cpu: 20
count: 169
high: 186
batch: 31
vm stats threshold: 125
cpu: 21
count: 200
high: 186
batch: 31
vm stats threshold: 125
cpu: 22
count: 153
high: 186
batch: 31
vm stats threshold: 125
cpu: 23
count: 28
high: 186
batch: 31
vm stats threshold: 125
cpu: 24
count: 182
high: 186
batch: 31
vm stats threshold: 125
cpu: 25
count: 139
high: 186
batch: 31
vm stats threshold: 125
cpu: 26
count: 131
high: 186
batch: 31
vm stats threshold: 125
cpu: 27
count: 60
high: 186
batch: 31
vm stats threshold: 125
cpu: 28
count: 173
high: 186
batch: 31
vm stats threshold: 125
cpu: 29
count: 160
high: 186
batch: 31
vm stats threshold: 125
cpu: 30
count: 154
high: 186
batch: 31
vm stats threshold: 125
cpu: 31
count: 102
high: 186
batch: 31
vm stats threshold: 125
all_unreclaimable: 0
prev_priority: 12
start_pfn: 34078720
inactive_ratio: 35
|
| Comment by Bruno Faccini (Inactive) [ 15/Dec/15 ] |
|
Well, according to the evolution of the memory counters/stats, the free memory thresholds seem to be working OK, so you may be hitting a very particular situation at the time of the OOMs. I hope that we will again be able to learn something from a crash-dump. |
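The watermark comparison described here can be scripted against /proc/zoneinfo; a sketch that assumes the field layout seen in the dumps pasted in this ticket:

# Print free pages next to the min/low/high watermarks for every zone
awk '/^Node/ { zone = $0 }
     $1 == "pages" && $2 == "free" { free = $3 }
     $1 == "min"  { min = $2 }
     $1 == "low"  { low = $2 }
     $1 == "high" { printf "%s: free=%s min=%s low=%s high=%s\n", zone, free, min, low, $2 }' /proc/zoneinfo

Roughly speaking, kswapd is woken once free drops below low, and allocations fall back to direct reclaim as free approaches min.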
| Comment by Daire Byrne (Inactive) [ 04/Jan/16 ] |
|
The server became pretty unresponsive this morning. I tried to take a crash dump, but kdump wasn't configured properly to catch it. For what it's worth I grabbed another zoneinfo while in this state (<8G free). Trying to drop caches had essentially no effect and didn't free up any memory (a sketch for capturing a slab breakdown in this state follows this output).
bmds1 /root # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 2463 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 2049 MB
node distances:
node 0 1
0: 10 20
1: 20 10
bmds1 /root # cat /proc/zoneinfo
Node 0, zone DMA
pages free 3934
min 60
low 135
high 150
scanned 0
spanned 4095
present 3835
nr_free_pages 3934
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
numa_hit 1
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 1
nr_anon_transparent_hugepages 0
protection: (0, 1931, 129191, 129191)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 2
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 3
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 4
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 5
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 6
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 7
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 8
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 9
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 10
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 11
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 12
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 13
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 14
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 15
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 16
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 17
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 18
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 19
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 20
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 21
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 22
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 23
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 24
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 25
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 26
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 27
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 28
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 29
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 30
count: 0
high: 0
batch: 1
vm stats threshold: 2
cpu: 31
count: 0
high: 0
batch: 1
vm stats threshold: 2
all_unreclaimable: 1
prev_priority: 12
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 110108
min 7833
low 17624
high 19582
scanned 0
spanned 1044480
present 494376
nr_free_pages 110108
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
numa_hit 29013
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 28306
numa_other 707
nr_anon_transparent_hugepages 0
protection: (0, 0, 127260, 127260)
pagesets
cpu: 0
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 1
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 2
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 3
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 4
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 5
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 6
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 7
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 8
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 9
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 10
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 12
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 13
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 14
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 15
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 16
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 17
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 18
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 19
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 20
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 21
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 22
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 23
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 24
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 25
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 26
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 27
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 28
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 29
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 30
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 31
count: 0
high: 186
batch: 31
vm stats threshold: 125
all_unreclaimable: 0
prev_priority: 12
start_pfn: 4096
inactive_ratio: 3
Node 0, zone Normal
pages free 517075
min 516243
low 1161546
high 1290607
scanned 288
spanned 33030144
present 32578560
nr_free_pages 517075
nr_inactive_anon 107
nr_active_anon 97
nr_inactive_file 14924374
nr_active_file 14936877
nr_unevictable 0
nr_mlock 0
nr_anon_pages 184
nr_mapped 378
nr_file_pages 29861476
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 1493381
nr_slab_unreclaimable 681908
nr_page_table_pages 1240
nr_kernel_stack 624
nr_unstable 0
nr_bounce 0
nr_vmscan_write 42937040
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
numa_hit 156088061106
numa_miss 4460728532
numa_foreign 150915871653
numa_interleave 60017
numa_local 156087747488
numa_other 4461042150
nr_anon_transparent_hugepages 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 1
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 2
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 3
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 4
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 5
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 6
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 7
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 8
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 9
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 10
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 12
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 13
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 14
count: 30
high: 186
batch: 31
vm stats threshold: 125
cpu: 15
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 16
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 17
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 18
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 19
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 20
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 21
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 22
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 23
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 24
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 25
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 26
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 27
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 28
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 29
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 30
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 31
count: 0
high: 186
batch: 31
vm stats threshold: 125
all_unreclaimable: 0
prev_priority: 12
start_pfn: 1048576
inactive_ratio: 35
Node 1, zone Normal
pages free 525475
min 524437
low 1179983
high 1311092
scanned 2624
spanned 33554432
present 33095680
nr_free_pages 525475
nr_inactive_anon 2427
nr_active_anon 2424
nr_inactive_file 15429332
nr_active_file 15429025
nr_unevictable 0
nr_mlock 0
nr_anon_pages 4709
nr_mapped 1574
nr_file_pages 30859896
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 970998
nr_slab_unreclaimable 523574
nr_page_table_pages 1526
nr_kernel_stack 1082
nr_unstable 0
nr_bounce 0
nr_vmscan_write 156051144
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 1
numa_hit 88088497663
numa_miss 150915871653
numa_foreign 4460728532
numa_interleave 60079
numa_local 88081349155
numa_other 150923020161
nr_anon_transparent_hugepages 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 1
count: 27
high: 186
batch: 31
vm stats threshold: 125
cpu: 2
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 3
count: 1
high: 186
batch: 31
vm stats threshold: 125
cpu: 4
count: 1
high: 186
batch: 31
vm stats threshold: 125
cpu: 5
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 6
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 7
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 8
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 9
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 10
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 12
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 13
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 14
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 15
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 16
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 17
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 18
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 19
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 20
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 21
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 22
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 23
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 24
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 25
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 26
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 27
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 28
count: 2
high: 186
batch: 31
vm stats threshold: 125
cpu: 29
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 30
count: 0
high: 186
batch: 31
vm stats threshold: 125
cpu: 31
count: 0
high: 186
batch: 31
vm stats threshold: 125
all_unreclaimable: 0
prev_priority: 12
start_pfn: 34078720
inactive_ratio: 35
|
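Given how large nr_slab_reclaimable and nr_slab_unreclaimable are in the dump above, one thing worth capturing while the node is in this state (a diagnostic sketch, not a fix) is a slab breakdown; drop_caches can only release slab objects that are no longer in use, so anything pinned (held ldlm locks, for example) stays put:

# High-level split: page cache vs slab vs anonymous memory
grep -E 'MemFree|^Cached|^Slab|SReclaimable|SUnreclaim' /proc/meminfo

# Top slab consumers in one shot (Lustre/ldlm caches normally show up near the top on a busy MDS)
slabtop -o | head -n 30

# NUMA reclaim policy (0 = fall back to the other node instead of reclaiming locally first)
cat /proc/sys/vm/zone_reclaim_mode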
| Comment by Bruno Faccini (Inactive) [ 07/Jan/16 ] |
|
Hello Daire, |
| Comment by Daire Byrne (Inactive) [ 09/Feb/16 ] |
|
Sorry for the late reply, we have been struggling to get a useful vmcore. We trimmed all the cache and buffer pages out of the kdump, but the vmcore still won't fit on the root drive (the server has 400G of RAM). This actually highlighted something I don't fully understand: when the server is getting close to falling over (less than ~10G free), running "sync ; echo 3 > /proc/sys/vm/drop_caches" does not seem to free up any memory at all, whereas a server that has only been up for a couple of days will free up at least half of its 400G of memory after issuing that command. The server itself also becomes sluggish - it takes ages to ssh in and the console is extremely slow. There isn't anything else useful in the logs other than the resulting Lustre timeouts. And yes, we are still using "vm.[min,extra]_free_kbytes = 4194304", but we know the server is getting close to crashing when there is less than ~10G free. I'll try to get the thread stacks next time too. |
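On the vmcore-size and thread-stack points, a sketch of the stock RHEL6-era tooling; the paths are placeholders, so adjust them to the local setup:

# In /etc/kdump.conf: compress the dump and drop zero, cache, user and free pages,
# which normally shrinks the vmcore to a small fraction of RAM:
#   core_collector makedumpfile -c -d 31
#   path /var/crash              # any filesystem large enough for the trimmed vmcore

# Thread stacks without a full dump (requires kernel.sysrq to be enabled); the traces land in the kernel log:
echo t > /proc/sysrq-trigger
dmesg > /tmp/task-stacks.$(hostname).txt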
| Comment by Daire Byrne (Inactive) [ 26/Jan/17 ] |
|
Just to follow up on this: we upgraded from v2.4.1 to v2.8 a couple of months ago and have not seen this issue since (with the same workload). We also took the opportunity to upgrade from RHEL6 to RHEL7 at the same time, so one or the other of those upgrades resolved the issue. Before the upgrades we were still hitting this problem about once a week and simply rebooted when it occurred. |
| Comment by Peter Jones [ 26/Jan/17 ] |
|
Glad to hear it. Thanks for the update, Daire. |