[LU-5255] MDS grinds to halt - reboot required Created: 25/Jun/14  Updated: 26/Jan/17  Resolved: 26/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Daire Byrne (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: cmds1-mem.png, cmds1-ram.png, meminfo.after, meminfo.before, messages, messages.cmds1, vmcore-dmesg.txt, vmcore-dmesg.txt.gz
Severity: 3
Rank (Obsolete): 14663

Description

We have hit this at least three times in as many weeks on our MDS server, and each time a reboot is required to bring the filesystem back to life.
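For reference, the /tmp/lustre-log.* files mentioned in the messages below are the binary Lustre debug dumps written by the watchdog. A minimal sketch of how such dumps can be turned into text before attaching, assuming a stock lctl on the MDS (the dump path is taken from the log below; the output filenames are just examples):

# convert a binary watchdog dump into readable text (input path copied from the log)
lctl debug_file /tmp/lustre-log.1403700953.22737 /tmp/lustre-log.1403700953.22737.txt
# also snapshot the in-memory debug buffer while the MDS is still wedged
lctl debug_kernel /tmp/lustre-debug-snapshot.txt

Here is the relevant part of /var/log/messages: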

Jun 25 13:55:53 bmds1 kernel: LNet: Service thread pid 22737 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 13:55:53 bmds1 kernel: Pid: 22737, comm: mdt00_021
Jun 25 13:55:53 bmds1 kernel: 
Jun 25 13:55:53 bmds1 kernel: Call Trace:
Jun 25 13:55:53 bmds1 kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
Jun 25 13:55:53 bmds1 kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e66d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06be01d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0df04f8>] mdt_object_open_lock+0x1c8/0x510 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dd4bfa>] ? mdt_attr_get_complex+0x38a/0x770 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0df56b3>] mdt_open_by_fid_lock+0x443/0x7d0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0df629b>] mdt_reint_open+0x56b/0x20c0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa040282e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e6dcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa057df50>] ? lu_ucred+0x20/0x30 [obdclass]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1015>] ? mdt_ucred+0x15/0x20 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0ddd15c>] ? mdt_root_squash+0x2c/0x410 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa057df50>] ? lu_ucred+0x20/0x30 [obdclass]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0de1911>] mdt_reint_rec+0x41/0xe0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc6ae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc706d>] mdt_intent_reint+0x1ed/0x520 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 13:55:53 bmds1 kernel: 
Jun 25 13:55:53 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403700953.22737
Jun 25 13:55:53 bmds1 kernel: Lustre: lock timed out (enqueued at 1403700753, 200s ago)
Jun 25 13:55:53 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403700953.6869
Jun 25 13:55:53 bmds1 kernel: LNet: Service thread pid 6869 was inactive for 200.26s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 13:55:53 bmds1 kernel: Pid: 6869, comm: mdt01_003
Jun 25 13:55:53 bmds1 kernel: 
Jun 25 13:55:53 bmds1 kernel: Call Trace:
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e676e>] ? cfs_waitq_del+0xe/0x10 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 13:55:53 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 13:55:53 bmds1 kernel: 
Jun 25 13:56:43 bmds1 kernel: LNet: Service thread pid 22446 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 13:56:43 bmds1 kernel: Pid: 22446, comm: mdt02_016
Jun 25 13:56:43 bmds1 kernel: 
Jun 25 13:56:43 bmds1 kernel: Call Trace:
Jun 25 13:56:43 bmds1 kernel: Lustre: lock timed out (enqueued at 1403700803, 200s ago)
Jun 25 13:56:43 bmds1 kernel: Lustre: Skipped 1 previous similar message
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06b7643>] ? ldlm_cli_cancel_local+0x2b3/0x470 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa03e66fe>] ? cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06be0aa>] ? ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06bd758>] ? ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc7c7b>] ? mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc84f4>] ? mdt_object_lock+0x14/0x20 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dd75a9>] ? mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dd83ad>] ? mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc4f1e>] ? mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa069e831>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06c51ef>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dc53a6>] ? mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0dcba97>] ? mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa0e053f5>] ? mds_regular_handle+0x15/0x20 [mdt]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f73c8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f875e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 13:56:43 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 13:56:43 bmds1 kernel: 
Jun 25 13:56:43 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403701003.22446
Jun 25 14:06:00 bmds1 kernel: Lustre: 22504:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
Jun 25 14:06:00 bmds1 kernel:  req@ffff88217d305400 x1468581949625152/t0(0) o101->0bb08de7-0631-0e52-3ff1-946375c89d85@10.21.22.25@tcp:0/0 lens 576/3448 e 5 to 0 dl 1403701565 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:06:00 bmds1 kernel: Lustre: 22402:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
Jun 25 14:06:00 bmds1 kernel:  req@ffff882d71613800 x1468571541943816/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/1152 e 5 to 0 dl 1403701565 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:06:50 bmds1 kernel: Lustre: 6920:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
Jun 25 14:06:50 bmds1 kernel:  req@ffff8833cbe53c00 x1468568450342728/t0(0) o101->b69b8925-7476-e90c-d5d5-edb6b51853ec@10.21.22.24@tcp:0/0 lens 608/3448 e 5 to 0 dl 1403701615 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:07:56 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:07:56 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:07:56 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting
Jun 25 14:07:56 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:08:21 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:08:21 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:08:46 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:08:46 bmds1 kernel: Lustre: Skipped 1 previous similar message
Jun 25 14:08:46 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:08:46 bmds1 kernel: Lustre: Skipped 1 previous similar message
Jun 25 14:09:11 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:09:11 bmds1 kernel: Lustre: Skipped 2 previous similar messages
Jun 25 14:09:11 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:09:11 bmds1 kernel: Lustre: Skipped 2 previous similar messages
Jun 25 14:09:36 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:09:36 bmds1 kernel: Lustre: Skipped 2 previous similar messages
Jun 25 14:09:36 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:09:36 bmds1 kernel: Lustre: Skipped 2 previous similar messages
Jun 25 14:09:44 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 113s: evicting client at 10.21.22.25@tcp  ns: mdt-bravo-MDT0000_UUID lock: ffff881d38f2f240/0x6ae950
Jun 25 14:10:01 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:10:01 bmds1 kernel: Lustre: Skipped 2 previous similar messages
Jun 25 14:10:01 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:10:01 bmds1 kernel: Lustre: Skipped 2 previous similar messages
Jun 25 14:10:26 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:10:26 bmds1 kernel: Lustre: Skipped 1 previous similar message
Jun 25 14:10:26 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:10:26 bmds1 kernel: Lustre: Skipped 1 previous similar message
Jun 25 14:11:16 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:11:16 bmds1 kernel: Lustre: Skipped 3 previous similar messages
Jun 25 14:11:16 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:11:16 bmds1 kernel: Lustre: Skipped 3 previous similar messages
Jun 25 14:11:46 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 2200s: evicting client at 10.21.22.23@tcp  ns: mdt-bravo-MDT0000_UUID lock: ffff881656d94480/0x6ae90
Jun 25 14:13:23 bmds1 kernel: Lustre: lock timed out (enqueued at 1403701803, 200s ago)
Jun 25 14:15:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403701926, 200s ago)
Jun 25 14:16:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403701986, 200s ago)
Jun 25 14:16:52 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702012, 200s ago)
Jun 25 14:18:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702106, 200s ago)
Jun 25 14:19:26 bmds1 kernel: LNet: Service thread pid 6870 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:19:26 bmds1 kernel: Pid: 6870, comm: mdt00_003
Jun 25 14:19:26 bmds1 kernel: 
Jun 25 14:19:26 bmds1 kernel: Call Trace:
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06b7643>] ? ldlm_cli_cancel_local+0x2b3/0x470 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
Jun 25 14:19:26 bmds1 kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa03e66d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06be01d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:19:26 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:19:26 bmds1 kernel: 
Jun 25 14:19:26 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403702366.6870
Jun 25 14:19:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702166, 200s ago)
Jun 25 14:21:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702286, 200s ago)
Jun 25 14:22:27 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702347, 200s ago)
Jun 25 14:22:28 bmds1 kernel: LNet: Service thread pid 22401 was inactive for 262.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:22:28 bmds1 kernel: Pid: 22401, comm: mdt00_014
Jun 25 14:22:28 bmds1 kernel: 
Jun 25 14:22:28 bmds1 kernel: Call Trace:
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:22:28 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:22:28 bmds1 kernel: 
Jun 25 14:22:28 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403702548.22401
Jun 25 14:22:33 bmds1 kernel: Lustre: 6944:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Jun 25 14:22:33 bmds1 kernel:  req@ffff881dbaa01800 x1468581963167316/t0(0) o101->0bb08de7-0631-0e52-3ff1-946375c89d85@10.21.22.25@tcp:0/0 lens 608/3448 e 0 to 0 dl 1403702558 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:22:39 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting
Jun 25 14:22:39 bmds1 kernel: Lustre: Skipped 3 previous similar messages
Jun 25 14:22:39 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:22:39 bmds1 kernel: Lustre: Skipped 3 previous similar messages
Jun 25 14:23:04 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting
Jun 25 14:23:04 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:23:29 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting
Jun 25 14:23:29 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:23:29 bmds1 kernel: LNet: Service thread pid 6908 was inactive for 262.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:23:29 bmds1 kernel: Pid: 6908, comm: mdt00_006
Jun 25 14:23:29 bmds1 kernel: 
Jun 25 14:23:29 bmds1 kernel: Call Trace:
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:23:29 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:23:29 bmds1 kernel: 
Jun 25 14:23:29 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403702609.6908
Jun 25 14:24:19 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting
Jun 25 14:24:19 bmds1 kernel: Lustre: Skipped 1 previous similar message
Jun 25 14:24:19 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:24:19 bmds1 kernel: Lustre: Skipped 1 previous similar message
Jun 25 14:24:24 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 108s: evicting client at 10.21.22.25@tcp  ns: mdt-bravo-MDT0000_UUID lock: ffff882efc845000/0x6ae950
Jun 25 14:24:24 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 1 previous similar message
Jun 25 14:24:26 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702466, 200s ago)
Jun 25 14:24:36 bmds1 kernel: Lustre: 22403:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Jun 25 14:24:36 bmds1 kernel:  req@ffff880d89de9000 x1468571551937816/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 0 to 0 dl 1403702681 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:25:32 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:25:32 bmds1 kernel: Lustre: Skipped 2 previous similar messages
Jun 25 14:25:32 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 8 active RPCs
Jun 25 14:25:32 bmds1 kernel: Lustre: Skipped 2 previous similar messages
Jun 25 14:25:36 bmds1 kernel: Lustre: 30120:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Jun 25 14:25:36 bmds1 kernel:  req@ffff88204db56800 x1468571552939332/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 0 to 0 dl 1403702741 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:26:02 bmds1 kernel: Lustre: 22450:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Jun 25 14:26:02 bmds1 kernel:  req@ffff88404d9c8400 x1468568453283516/t0(0) o101->b69b8925-7476-e90c-d5d5-edb6b51853ec@10.21.22.24@tcp:0/0 lens 608/3448 e 0 to 0 dl 1403702767 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:26:27 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 711s: evicting client at 10.21.22.23@tcp  ns: mdt-bravo-MDT0000_UUID lock: ffff88022e542900/0x6ae950
Jun 25 14:27:36 bmds1 kernel: Lustre: 30242:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Jun 25 14:27:36 bmds1 kernel:  req@ffff8815c622b400 x1468571555018544/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 0 to 0 dl 1403702861 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:27:48 bmds1 kernel: Lustre: bravo-MDT0000: Client b69b8925-7476-e90c-d5d5-edb6b51853ec (at 10.21.22.24@tcp) reconnecting
Jun 25 14:27:48 bmds1 kernel: Lustre: Skipped 6 previous similar messages
Jun 25 14:27:48 bmds1 kernel: Lustre: bravo-MDT0000: Client b69b8925-7476-e90c-d5d5-edb6b51853ec (at 10.21.22.24@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:27:48 bmds1 kernel: Lustre: Skipped 6 previous similar messages
Jun 25 14:29:08 bmds1 kernel: LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 728s: evicting client at 10.21.22.24@tcp  ns: mdt-bravo-MDT0000_UUID lock: ffff883a8d51f000/0x6ae950
Jun 25 14:29:40 bmds1 kernel: LNet: Service thread pid 22365 was inactive for 514.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:29:40 bmds1 kernel: Pid: 22365, comm: mdt00_013
Jun 25 14:29:40 bmds1 kernel: 
Jun 25 14:29:40 bmds1 kernel: Call Trace:
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:29:40 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:29:40 bmds1 kernel: 
Jun 25 14:29:40 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403702980.22365
Jun 25 14:30:03 bmds1 kernel: LNet: Service thread pid 6954 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:30:03 bmds1 kernel: Pid: 6954, comm: mdt01_011
Jun 25 14:30:03 bmds1 kernel: 
Jun 25 14:30:03 bmds1 kernel: Call Trace:
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:30:03 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:30:03 bmds1 kernel: 
Jun 25 14:30:03 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703003.6954
Jun 25 14:30:40 bmds1 kernel: LNet: Service thread pid 6933 was inactive for 514.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:30:40 bmds1 kernel: Pid: 6933, comm: mdt00_009
Jun 25 14:30:40 bmds1 kernel: 
Jun 25 14:30:40 bmds1 kernel: Call Trace:
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:30:40 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:30:40 bmds1 kernel: 
Jun 25 14:30:40 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703040.6933
Jun 25 14:32:06 bmds1 kernel: LNet: Service thread pid 6937 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:32:06 bmds1 kernel: Pid: 6937, comm: mdt00_010
Jun 25 14:32:06 bmds1 kernel: 
Jun 25 14:32:06 bmds1 kernel: Call Trace:
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:32:06 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:32:06 bmds1 kernel: 
Jun 25 14:32:06 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703126.6937
Jun 25 14:32:29 bmds1 kernel: Lustre: 22454:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-383), not sending early reply
Jun 25 14:32:29 bmds1 kernel:  req@ffff881e34688000 x1468571555875772/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 4 to 0 dl 1403703154 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:32:52 bmds1 kernel: Lustre: lock timed out (enqueued at 1403702972, 200s ago)
Jun 25 14:32:52 bmds1 kernel: Lustre: Skipped 1 previous similar message
Jun 25 14:33:06 bmds1 kernel: LNet: Service thread pid 6910 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:33:06 bmds1 kernel: Pid: 6910, comm: mdt00_008
Jun 25 14:33:06 bmds1 kernel: 
Jun 25 14:33:06 bmds1 kernel: Call Trace:
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:33:06 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:33:06 bmds1 kernel: 
Jun 25 14:33:06 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703186.6910
Jun 25 14:33:32 bmds1 kernel: LNet: Service thread pid 6612 was inactive for 1200.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Jun 25 14:33:32 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703212.6612
Jun 25 14:34:56 bmds1 kernel: Lustre: 6957:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-410), not sending early reply
Jun 25 14:34:56 bmds1 kernel:  req@ffff883b77b15400 x1468571557733836/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 2 to 0 dl 1403703301 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:35:06 bmds1 kernel: LNet: Service thread pid 6905 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:35:06 bmds1 kernel: Pid: 6905, comm: mdt00_005
Jun 25 14:35:06 bmds1 kernel: 
Jun 25 14:35:06 bmds1 kernel: Call Trace:
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:35:06 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:35:06 bmds1 kernel: 
Jun 25 14:35:06 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403703306.6905
Jun 25 14:35:57 bmds1 kernel: Lustre: 22588:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-410), not sending early reply
Jun 25 14:35:57 bmds1 kernel:  req@ffff881fb5a31000 x1468571558663648/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 576/3448 e 2 to 0 dl 1403703362 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:42:02 bmds1 kernel: Lustre: 22454:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Jun 25 14:42:02 bmds1 kernel:  req@ffff88113611a800 x1468571563425972/t0(0) o101->1bfa42e7-94df-b8e8-2a7b-b32039aac0ce@10.21.22.23@tcp:0/0 lens 608/3448 e 0 to 0 dl 1403703727 ref 2 fl Interpret:/0/0 rc 0/0
Jun 25 14:42:08 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:42:08 bmds1 kernel: Lustre: Skipped 3 previous similar messages
Jun 25 14:42:08 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:42:08 bmds1 kernel: Lustre: Skipped 3 previous similar messages
Jun 25 14:49:32 bmds1 kernel: LNet: Service thread pid 6605 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 25 14:49:32 bmds1 kernel: Pid: 6605, comm: mdt00_000
Jun 25 14:49:32 bmds1 kernel: 
Jun 25 14:49:32 bmds1 kernel: Call Trace:
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:49:32 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:49:32 bmds1 kernel: 
Jun 25 14:49:32 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403704172.6605
Jun 25 14:49:34 bmds1 kernel: Pid: 22449, comm: mdt02_019
Jun 25 14:49:34 bmds1 kernel: 
Jun 25 14:49:34 bmds1 kernel: Call Trace:
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:49:34 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:49:34 bmds1 kernel: 
Jun 25 14:49:34 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403704174.22449
Jun 25 14:49:36 bmds1 kernel: Pid: 22504, comm: mdt01_021
Jun 25 14:49:36 bmds1 kernel: 
Jun 25 14:49:36 bmds1 kernel: Call Trace:
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa03f62d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa03e66fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06be0aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06b9790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06bd758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc7c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06bdb30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc84f4>] mdt_object_lock+0x14/0x20 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dd75a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06e6135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa070e646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06e83c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dd83ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc4f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa069e831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06c51ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dc53a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0dcba97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06e7bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa0e053f5>] mds_regular_handle+0x15/0x20 [mdt]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f73c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa03e65de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa03f7d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06ee729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f875e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffffa06f7c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jun 25 14:49:36 bmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 25 14:49:36 bmds1 kernel: 
Jun 25 14:49:36 bmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1403704176.22504
Jun 25 14:50:53 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) reconnecting
Jun 25 14:50:53 bmds1 kernel: Lustre: Skipped 62 previous similar messages
Jun 25 14:50:53 bmds1 kernel: Lustre: bravo-MDT0000: Client 1bfa42e7-94df-b8e8-2a7b-b32039aac0ce (at 10.21.22.23@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 14:50:53 bmds1 kernel: Lustre: Skipped 62 previous similar messages
Jun 25 15:00:55 bmds1 kernel: Lustre: bravo-MDT0000: Client b69b8925-7476-e90c-d5d5-edb6b51853ec (at 10.21.22.24@tcp) reconnecting
Jun 25 15:00:55 bmds1 kernel: Lustre: Skipped 72 previous similar messages
Jun 25 15:00:55 bmds1 kernel: Lustre: bravo-MDT0000: Client b69b8925-7476-e90c-d5d5-edb6b51853ec (at 10.21.22.24@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 15:00:55 bmds1 kernel: Lustre: Skipped 72 previous similar messages
Jun 25 15:10:57 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) reconnecting
Jun 25 15:10:57 bmds1 kernel: Lustre: Skipped 72 previous similar messages
Jun 25 15:10:57 bmds1 kernel: Lustre: bravo-MDT0000: Client 0bb08de7-0631-0e52-3ff1-946375c89d85 (at 10.21.22.25@tcp) refused reconnection, still busy with 1 active RPCs
Jun 25 15:10:57 bmds1 kernel: Lustre: Skipped 72 previous similar messages
Jun 25 15:18:37 bmds1 init: tty (/dev/tty2) main process (8061) killed by TERM signal
Jun 25 15:18:37 bmds1 init: tty (/dev/tty3) main process (8063) killed by TERM signal
Jun 25 15:18:37 bmds1 init: tty (/dev/tty4) main process (8065) killed by TERM signal
Jun 25 15:18:37 bmds1 init: tty (/dev/tty5) main process (8069) killed by TERM signal
Jun 25 15:18:37 bmds1 init: tty (/dev/tty6) main process (8071) killed by TERM signal


 Comments   
Comment by Daire Byrne (Inactive) [ 25/Jun/14 ]

And on a related note - why do the NFS-exporting servers not recover seamlessly, and subsequently the NFS clients? We have lots of reports of IO errors being returned to NFS clients, which then requires us to restart all the running tasks. With a combination of Lustre's recovery and hard-mounted NFSv3 clients this shouldn't happen, right?

It seems like new opens/closes failed on the NFS clients but existing open files did recover as expected (the process hung until the filesystem came back). Why do new opens/closes return IO errors instead of hanging until the filesystem comes back online?

Comment by Oleg Drokin [ 25/Jun/14 ]

Do you have many stripes in use? How many OSTs are in this system?

Also, were any problems reported on the clients around the time the MDS entered this state?

Comment by Daire Byrne (Inactive) [ 25/Jun/14 ]

We don't stripe - there are 102 OSTs across 6 servers. A snippet of the interesting bits of a client log:

Jun 25 14:07:17 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.132:885 for /user_data (/user_data)
Jun 25 14:07:56 lustre5 kernel: Lustre: 15430:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403700753/real 1403700753]  req@ffff88039237f400 x1468581949625152/t0(0) o101->bravo-MDT0000-mdc-ffff880603c1cc00@10.21.22.50@tcp:12/10 lens 576/3448 e 5 to 1 dl 1403701676 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jun 25 14:07:56 lustre5 kernel: Lustre: 15430:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
Jun 25 14:07:56 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff880603c1cc00: Connection to bravo-MDT0000 (at 10.21.22.50@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jun 25 14:07:56 lustre5 kernel: Lustre: Skipped 16 previous similar messages
Jun 25 14:07:56 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:07:56 lustre5 kernel: LustreError: Skipped 1 previous similar message
Jun 25 14:08:21 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:08:46 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:09:11 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:09:36 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:10:01 lustre5 kernel: LustreError: 167-0: bravo-MDT0000-mdc-ffff880603c1cc00: This client was evicted by bravo-MDT0000; in progress operations using this service will fail.
Jun 25 14:10:01 lustre5 kernel: LustreError: 15443:0:(mdc_locks.c:840:mdc_enqueue()) ldlm_cli_enqueue: -5
Jun 25 14:10:01 lustre5 kernel: LustreError: 12482:0:(file.c:163:ll_close_inode_openhandle()) inode 144116128220761786 mdc close failed: rc = -5
Jun 25 14:10:01 lustre5 kernel: LustreError: 12482:0:(file.c:163:ll_close_inode_openhandle()) Skipped 148 previous similar messages
Jun 25 14:10:01 lustre5 kernel: LustreError: 15430:0:(file.c:2716:ll_inode_revalidate_fini()) bravo: revalidate FID [0x20000dae1:0x16c4e:0x0] error: rc = -5
Jun 25 14:10:01 lustre5 kernel: LustreError: 15430:0:(file.c:2716:ll_inode_revalidate_fini()) Skipped 3 previous similar messages
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: LustreError: 29154:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0 (bravo-MDT0000-mdc-ffff880603c1cc00), error -5
Jun 25 14:10:01 lustre5 kernel: LustreError: 29154:0:(lmv_obd.c:1289:lmv_statfs()) Skipped 1717 previous similar messages
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: LustreError: 29154:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -5
Jun 25 14:10:01 lustre5 kernel: LustreError: 29154:0:(llite_lib.c:1610:ll_statfs_internal()) Skipped 1807 previous similar messages
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: LustreError: 8335:0:(vvp_io.c:1230:vvp_io_init()) bravo: refresh file layout [0x20000daa1:0x1ef7a:0x0] error -108.
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:10:01 lustre5 kernel: nfsd: non-standard errno: -108
..
.. 8,000 lines of nfsd errno: -108
..
Jun 25 14:24:44 lustre5 kernel: LustreError: 17760:0:(ldlm_resource.c:804:ldlm_resource_complain()) bravo-MDT0000-mdc-ffff880603c1cc00: namespace resource [0x20000daa1:0x1ef7a:0x0].0 (ffff880850d55080) refcount nonzero (1) after lock cleanup; forcing cleanup.
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: LustreError: 17760:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x20000daa1:0x1ef7a:0x0].0 (ffff880850d55080) refcount = 2
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: nfsd: non-standard errno: -108
Jun 25 14:24:44 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff880603c1cc00: Connection restored to bravo-MDT0000 (at 10.21.22.50@tcp)
Jun 25 14:25:38 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.46.139:647 for /user_data (/user_data)
Jun 25 14:27:18 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.132:774 for /user_data (/user_data)
Jun 25 14:30:06 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.103:902 for /user_data (/user_data)
Jun 25 14:30:24 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.56.211:863 for /user_data (/user_data)
Jun 25 14:32:09 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.54.10:997 for /user_data (/user_data)
Jun 25 14:32:46 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.46.206:963 for /user_data (/user_data)
Jun 25 14:33:21 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.132:995 for /user_data (/user_data)
Jun 25 14:36:44 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.56.211:818 for /user_data (/user_data)
Jun 25 14:37:13 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.132:1002 for /user_data (/user_data)
Jun 25 14:38:05 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.46.64:1020 for /user_data (/user_data)
Jun 25 14:38:27 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.46.206:646 for /user_data (/user_data)
Jun 25 14:38:47 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.103:830 for /user_data (/user_data)
Jun 25 14:39:38 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.151:944 for /user_data (/user_data)
Jun 25 14:40:12 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.102:896 for /user_data (/user_data)
Jun 25 14:40:19 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.102:993 for /user_data (/user_data)
Jun 25 14:42:12 lustre5 kernel: Lustre: 15436:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403702976/real 1403702976]  req@ffff88085425d400 x1468581978403632/t0(0) o101->bravo-MDT0000-mdc-ffff880603c1cc00@10.21.22.50@tcp:12/10 lens 608/3448 e 0 to 1 dl 1403703732 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jun 25 14:42:12 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff880603c1cc00: Connection to bravo-MDT0000 (at 10.21.22.50@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jun 25 14:42:12 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:42:37 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:43:02 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:43:08 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.46.64:664 for /user_data (/user_data)
Jun 25 14:43:23 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.132:750 for /user_data (/user_data)
Jun 25 14:43:27 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:43:52 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:44:42 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:44:42 lustre5 kernel: LustreError: Skipped 1 previous similar message
Jun 25 14:45:57 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:45:57 lustre5 kernel: LustreError: Skipped 2 previous similar messages
Jun 25 14:47:45 lustre5 rpc.mountd[15409]: authenticated unmount request from 10.21.20.102:779 for /user_data (/user_data)
Jun 25 14:48:27 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:48:27 lustre5 kernel: LustreError: Skipped 5 previous similar messages
Jun 25 14:53:02 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 14:53:02 lustre5 kernel: LustreError: Skipped 10 previous similar messages
Jun 25 15:01:47 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 15:01:47 lustre5 kernel: LustreError: Skipped 20 previous similar messages
Jun 25 15:10:16 lustre5 kernel: LustreError: 21903:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0 (bravo-MDT0000-mdc-ffff880603c1cc00), error -16
Jun 25 15:10:16 lustre5 kernel: LustreError: 21903:0:(lmv_obd.c:1289:lmv_statfs()) Skipped 5 previous similar messages
Jun 25 15:10:16 lustre5 kernel: LustreError: 21903:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -16
Jun 25 15:10:16 lustre5 kernel: LustreError: 21903:0:(llite_lib.c:1610:ll_statfs_internal()) Skipped 5 previous similar messages
Jun 25 15:10:19 lustre5 kernel: LustreError: 21904:0:(dir.c:433:ll_get_dir_page()) read cache page: [0x200000007:0x1:0x0] at 0: rc -4
Jun 25 15:10:19 lustre5 kernel: LustreError: 21904:0:(dir.c:595:ll_dir_read()) error reading dir [0x200000007:0x1:0x0] at 0: rc -4
Jun 25 15:10:19 lustre5 kernel: LustreError: 21904:0:(dir.c:595:ll_dir_read()) Skipped 15 previous similar messages
Jun 25 15:12:12 lustre5 kernel: LustreError: 11-0: bravo-MDT0000-mdc-ffff880603c1cc00: Communicating with 10.21.22.50@tcp, operation mds_connect failed with -16.
Jun 25 15:12:12 lustre5 kernel: LustreError: Skipped 24 previous similar messages
Jun 25 15:14:54 lustre5 kernel: LustreError: 22411:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0 (bravo-MDT0000-mdc-ffff880603c1cc00), error -16
Jun 25 15:14:54 lustre5 kernel: LustreError: 22411:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -16
Jun 25 15:16:47 lustre5 rshd[22567]: root@batch1.dhcp.dneg.com as root: cmd='/usr/bin/tail -100 /var/log/messages | egrep 'nfs|aquire_lock''
Jun 25 15:19:17 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff8805fd416000: Connection to bravo-MDT0000 (at 10.21.22.50@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jun 25 15:22:44 lustre5 kernel: Lustre: 3862:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706157/real 1403706157]  req@ffff8805813b3400 x1468581992549260/t0(0) o400->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 224/224 e 0 to 1 dl 1403706164 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:22:44 lustre5 kernel: LustreError: 166-1: MGC10.21.22.50@tcp: Connection to MGS (at 10.21.22.50@tcp) was lost; in progress operations using this service will fail
Jun 25 15:22:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706164/real 1403706164]  req@ffff88039c23a400 x1468581992550088/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706170 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:23:20 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706189/real 1403706192]  req@ffff88035edce000 x1468581992550096/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706200 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:23:23 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706157/real 1403706157]  req@ffff880407e56400 x1468581992549676/t0(0) o38->bravo-MDT0000-mdc-ffff8805fd416000@10.21.22.50@tcp:12/10 lens 400/544 e 0 to 1 dl 1403706203 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:23:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706214/real 1403706217]  req@ffff8802a7ea1400 x1468581992550916/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706230 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:23:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jun 25 15:24:20 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706239/real 1403706242]  req@ffff880add1ee800 x1468581992551744/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706260 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:24:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706264/real 1403706267]  req@ffff880badd91c00 x1468581992552564/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706290 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:24:50 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jun 25 15:25:44 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706289/real 1403706292]  req@ffff88099ed52c00 x1468581992553796/t0(0) o38->bravo-MDT0000-mdc-ffff8805fd416000@10.21.22.50@tcp:12/10 lens 400/544 e 0 to 1 dl 1403706344 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:26:40 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706364/real 1403706367]  req@ffff8808a519e000 x1468581992555844/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706400 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:26:40 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jun 25 15:28:05 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1403706439/real 1403706441]  req@ffff880a2ad93000 x1468581992558308/t0(0) o250->MGC10.21.22.50@tcp@10.21.22.50@tcp:26/25 lens 400/544 e 0 to 1 dl 1403706485 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 25 15:28:05 lustre5 kernel: Lustre: 3850:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Jun 25 15:28:34 lustre5 kernel: Lustre: Evicted from MGS (at 10.21.22.50@tcp) after server handle changed from 0x6ae95a77b3c55db to 0xca8b7428d5cc6ff2
Jun 25 15:28:34 lustre5 kernel: Lustre: MGC10.21.22.50@tcp: Connection restored to MGS (at 10.21.22.50@tcp)
Jun 25 15:28:34 lustre5 kernel: LustreError: 3850:0:(client.c:2696:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88036a75fc00 x1468581974497852/t177813398500(177813398500) o101->bravo-MDT0000-mdc-ffff880603c1cc00@10.21.22.50@tcp:12/10 lens 576/544 e 0 to 0 dl 1403706571 ref 2 fl Interpret:RP/4/0 rc 301/301
Jun 25 15:28:34 lustre5 kernel: LustreError: 3850:0:(client.c:2696:ptlrpc_replay_interpret()) Skipped 39 previous similar messages
Jun 25 15:28:34 lustre5 kernel: LustreError: 3850:0:(client.c:2696:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff8809243c2800 x1468581983767496/t177813992103(177813992103) o101->bravo-MDT0000-mdc-ffff880603c1cc00@10.21.22.50@tcp:12/10 lens 592/544 e 0 to 0 dl 1403706571 ref 2 fl Interpret:RP/4/0 rc 301/301
Jun 25 15:28:34 lustre5 kernel: LustreError: 3850:0:(client.c:2696:ptlrpc_replay_interpret()) Skipped 100 previous similar messages
Jun 25 15:28:36 lustre5 kernel: Lustre: bravo-MDT0000-mdc-ffff8805fd416000: Connection restored to bravo-MDT0000 (at 10.21.22.50@tcp)
Jun 25 15:29:31 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.46.225:846 for /user_data (/user_data)
Jun 25 15:30:07 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.46.206:753 for /user_data (/user_data)
Jun 25 15:30:49 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.103:915 for /user_data (/user_data)
Jun 25 15:31:31 lustre5 rpc.mountd[15409]: authenticated mount request from 10.21.20.102:835 for /user_data (/user_data)
Comment by Peter Jones [ 25/Jun/14 ]

Bruno

Oleg suspects that this may be an example of LU-2827. Do you agree?

Thanks

Peter

Comment by Bruno Faccini (Inactive) [ 27/Jun/14 ]

Peter, given that no striping is used, and that the situation does not clear/recover after the MDS has evicted clients, my feeling is that this problem looks different from the scenarios addressed in LU-2827 (and LU-4584).

Daire, you indicated that this problem has already occurred 3 times in 3 weeks, right? If so, have you been able to take a forced crash-dump of the MDS instead of simply rebooting it? Or, at least, did you take the time to capture a full stack trace (Alt+SysRq+T or echo t > /proc/sys/sysrq-trigger) before rebooting the MDS?
If not, can this be done next time?
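For reference, a minimal sketch of the capture steps (standard kernel SysRq/kdump mechanics, nothing Lustre-specific; adjust for your console setup):

# make sure the magic SysRq interface is enabled
echo 1 > /proc/sys/kernel/sysrq
# dump the stack traces of all tasks to the kernel log/serial console
echo t > /proc/sys/sysrq-trigger
# or, if kdump is configured, force a crash-dump (the node will reboot afterwards)
echo c > /proc/sys/sysrq-trigger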

Comment by Daire Byrne (Inactive) [ 27/Jun/14 ]

No we did not take any traces before rebooting. I can certainly make sure the team here know to do that next time. And yes it has occurred 3-4 times in the last month.

Comment by Daire Byrne (Inactive) [ 14/Jul/14 ]

So this just happened again but unfortunately we could not even get a local console on the server to trigger a sysreq. We had to power cycle.

Comment by Daire Byrne (Inactive) [ 14/Jul/14 ]

So we managed to get a partial sysreq c/kdump. After we hit the power button we got a minute or so of console access. I'm attaching the dmesg but I'm not sure how useful the data is....

Comment by Bruno Faccini (Inactive) [ 18/Jul/14 ]

Hello Daire,
I went through the (partial?) console output you attached (as ) and I am a bit puzzled by what I found inside ...
First of all, I don't find any hung MDT thread with the same kind of stack that was dumped as part of your submission text. In fact, all of the dumped MDT thread stacks show only idle threads (with a ptlrpc_main()/ptlrpc_wait_event()/cfs_waitq_wait() last/top call sequence)!

More strange are the following processes/stacks:

<6>rc            S 0000000000000001     0 31964      1 0x00000000
<4> ffff8814a4671e78 0000000000000082 0000000000000000 0000000000000029
<4> ffff882d77417280 000000000238c0b8 ffff88107406a2d8 0000000000000007
<4> ffff885fc1d9c638 ffff8814a4671fd8 000000000000fb88 ffff885fc1d9c638
<4>Call Trace:
<4> [<ffffffff81072a94>] do_wait+0x1e4/0x240
<4> [<ffffffff81072b93>] sys_wait4+0xa3/0x100
<4> [<ffffffff81071130>] ? child_wait_callback+0x0/0x70
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

<6>K75netfs      S 0000000000000001     0 32228  31964 0x00000000
<4> ffff882fe332fe78 0000000000000086 0000000000000000 0000000000000029
<4> ffff885fe4db68c0 0000000000f27990 ffff885fe6053c38 0000000000000007
<4> ffff885fe6168638 ffff882fe332ffd8 000000000000fb88 ffff885fe6168638
<4>Call Trace:
<4> [<ffffffff81072a94>] do_wait+0x1e4/0x240
<4> [<ffffffff81072b93>] sys_wait4+0xa3/0x100
<4> [<ffffffff81071130>] ? child_wait_callback+0x0/0x70
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

<6>umount        R  running task        0 32242  32228 0x00000000
<4> ffff882d5be15748 0000000000000000 ffffea00645bc5e8 ffff88000001c9c0
<4> ffff882d5be15778 ffffffff8112f15d ffff882d5be15778 ffff88000001c9c0
<4> ffff88000001c9c0 ffff88000001c9c0 ffff882d5be15808 ffffffff81130245
<4>Call Trace:
<4> [<ffffffff81170ae0>] ? mem_cgroup_get_reclaim_stat_from_page+0x20/0x70
<4> [<ffffffff8127cbf1>] radix_tree_gang_lookup_slot+0x81/0xf0
<4> [<ffffffffa0b75e67>] ? jbd2_journal_try_to_free_buffers+0xa7/0x150 [jbd2]
<4> [<ffffffff8113031d>] ? ____pagevec_lru_deactivate+0x20d/0x2d0
<4> [<ffffffffa0c32901>] ? ldiskfs_fh_to_dentry+0x1/0x20 [ldiskfs]
<4> [<ffffffff81130558>] ? deactivate_page+0x58/0x80
<4> [<ffffffff81130a85>] ? invalidate_mapping_pages+0x165/0x1e0
<4> [<ffffffff811b631b>] ? invalidate_bdev+0x3b/0x50
<4> [<ffffffff811e1a49>] ? vfs_quota_disable+0x3a9/0x7d0
<4> [<ffffffffa0c0dfa3>] ? ldiskfs_mark_inode_dirty+0x83/0x1f0 [ldiskfs]
<4> [<ffffffffa0c36900>] ? ldiskfs_quota_off+0xe0/0x1f0 [ldiskfs]
<4> [<ffffffff81183eb6>] ? deactivate_super+0x46/0x80
<4> [<ffffffff811a21bf>] ? mntput_no_expire+0xbf/0x110
<4> [<ffffffffa0c816d9>] ? osd_umount+0x79/0x150 [osd_ldiskfs]
<4> [<ffffffffa0c864c7>] ? osd_device_fini+0x147/0x190 [osd_ldiskfs]
<4> [<ffffffffa0509bf7>] ? class_cleanup+0x577/0xda0 [obdclass]
<4> [<ffffffffa04deb36>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa050b4dc>] ? class_process_config+0x10bc/0x1c80 [obdclass]
<4> [<ffffffffa0504d03>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
<4> [<ffffffffa050c219>] ? class_manual_cleanup+0x179/0x6f0 [obdclass]
<4> [<ffffffffa03a3717>] ? cfs_waitq_broadcast+0x17/0x20 [libcfs]
<4> [<ffffffffa04dced2>] ? class_export_put+0x102/0x2c0 [obdclass]
<4> [<ffffffffa0c87655>] ? osd_obd_disconnect+0x1c5/0x1d0 [osd_ldiskfs]
<4> [<ffffffffa050e2ae>] ? lustre_put_lsi+0x17e/0x1100 [obdclass]
<4> [<ffffffffa0516ff8>] ? lustre_common_put_super+0x5f8/0xc40 [obdclass]
<4> [<ffffffffa0540b7a>] ? server_put_super+0x1ca/0xf00 [obdclass]
<4> [<ffffffff8118363b>] ? generic_shutdown_super+0x5b/0xe0
<4> [<ffffffff81183726>] ? kill_anon_super+0x16/0x60
<4> [<ffffffffa050e0d6>] ? lustre_kill_super+0x36/0x60 [obdclass]
<4> [<ffffffff81183ec7>] ? deactivate_super+0x57/0x80
<4> [<ffffffff811a21bf>] ? mntput_no_expire+0xbf/0x110
<4> [<ffffffff811a2c2b>] ? sys_umount+0x7b/0x3a0
<4> [<ffffffff810864b1>] ? sigprocmask+0x71/0x110
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

These clearly indicate that a shutdown was started and the MDT was being unmounted, and this may have been the case for quite a long time:

<4>runnable tasks:
<4>            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
<4>----------------------------------------------------------------------------------------------------------
<4>R         umount 32242 291143865.264277       402   120 291143865.264277     33752.842001     36352.713363 /
<4>

Last, about the failed crash-dump attempt via Alt+SysRq+C: is kdump correctly installed and configured?
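(As a quick sanity check, assuming a stock EL6 setup, you can confirm that a crash kernel is reserved and that the kdump service is active:)

grep -o 'crashkernel=[^ ]*' /proc/cmdline
service kdump status
chkconfig --list kdump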

Comment by Daire Byrne (Inactive) [ 22/Jul/14 ]

Yeah, it looks like the dump isn't any good. Basically the console was unresponsive so we had to power cycle, and so the dump is actually from when the machine was shutting down. That is probably why the MDT threads are idle and the kdump was incomplete. We will have to try again next time we see this. The frequency seems to have dropped to around once a month now.

Comment by Daire Byrne (Inactive) [ 28/Aug/14 ]

We had another instance of this. Couldn't login to the console etc. Managed to sysrq crash the system. I've attached the vmcore-dmesg.txt but I'm not sure if there is anything too interesting in it.

Comment by Bruno Faccini (Inactive) [ 29/Aug/14 ]

Thanks Daire, there is interesting info in this new vmcore-dmesg.txt log!
Eth link down/up messages (with associated/related Lustre timeout/eviction messages?), numerous "power limit" messages that seem to indicate some CPU power regulation, stack dumps of threads stuck awaiting identity upcall completion, a stack dump for one thread stuck in memory alloc/shrink, ...
The problem with this log is that there are no timestamps to help determine the sequence of all these messages, so can you provide this MDS's syslog? And by the way, can you make the crash-dump available?

Comment by Daire Byrne (Inactive) [ 01/Sep/14 ]

Bruno,

I've uploaded the (5GB) vmcore and syslog messages to dropbox - https://db.tt/qJpidSFG . I hope it is of use to you.

Comment by Bruno Faccini (Inactive) [ 17/Sep/14 ]

Hello Daire,
Sorry to be late in giving you feedback about this crash-dump.
Its analysis shows that:
_ some mdt_... threads are stuck for hundreds of seconds waiting for the identity_upcall execution.
_ the khelper thread is spinning on zone->lru_lock, trying to reclaim memory so it can fork.
_ a lot of other threads are competing to do the same.

It thus seems to me that we face a kernel/vm weakness (maybe a tuning issue?) under heavy memory pressure (on a server/MDS with 360GB of memory!), causing contention on zone->lru_lock (and the associated mm/vm structs/lists) and thus high response delays for all threads/processes, including the MDS ones.

So my understanding is that, due to what I just described, all node/server operations are really slow, including the MDS ones.

Comment by Daire Byrne (Inactive) [ 18/Sep/14 ]

Okay, cheers. Well we don't make any changes of note to the default EL6 tunings. The only things I can see that may be related are:

# jb: pdflush tunables. linuxnfs have 384GB ram so flush to disk fast
vm.dirty_background_bytes = 2147483648
vm.dirty_bytes = 4294967296
vm.dirty_expire_centisecs = 500

vm.min_free_kbytes = 2097152

# daire: allow some swapping (default=40)
vm.swappiness = 20

But I can't imagine any of these would be relevant. It is true that the whole server grinds to a halt but I guess we are no closer to figuring out what causes it.

Comment by Bruno Faccini (Inactive) [ 18/Sep/14 ]

I think the best fix will be to prevent the MDS node from entering the vm/memory-reclaim code, which does not seem to handle this case well on a huge-memory node.

I have made an audit of memory usage in the crash-dump and here are the significant points:
_ 340GB is used by buffers!
_ almost all of it for MDT device inodes
_ there are a lot of processes (automount, nfsd, smbd, java, ...) that are not expected to run on an MDS.

Why don't you tune/limit this MDS so that buffers cannot fill all of memory, leaving enough room for the MDS processes/threads and slabs without the need for memory reclaim?

Comment by Daire Byrne (Inactive) [ 18/Sep/14 ]

Okay, I will look into limiting the buffer size (/proc/sys/vm/swappiness and/or vm.min_free_kbytes?). The main reason so many inodes end up in the buffer cache is that we use this filesystem to back up all our production storage, so a full scan occurs every night across ~157987516 files.

Comment by Bruno Faccini (Inactive) [ 19/Sep/14 ]

I think that in your particular configuration (huge-memory node!) and usage case (a full scan of the FS causing a lot of MDT/metadata blocks to fill the buffer cache), the best parameter to change would be swappiness: set it higher, even to 100 (%).
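A minimal way to apply this (standard sysctl mechanics, nothing Lustre-specific):

# apply immediately on the running MDS
sysctl -w vm.swappiness=100
# and persist it across reboots
echo 'vm.swappiness = 100' >> /etc/sysctl.conf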

Comment by Daire Byrne (Inactive) [ 13/Oct/14 ]

So it happened again last night, after a couple of weeks with swappiness=90. I'll try setting it to 100, but I can't really see much difference in the memory usage graphs. I've attached the messages from overnight.

Comment by Daire Byrne (Inactive) [ 13/Oct/14 ]

Maybe this is related to LU-5726? We are constantly deleting large numbers of files from the server (cleaning up old server snapshots).

Comment by Bruno Faccini (Inactive) [ 23/Oct/14 ]

Hello Daire,
First of all, I am sorry that our idea to change the vm.swappiness parameter has not been successful.
Also, before we start to suspect a leak, I would like you to try a few more things on the MDS during the full scan (see the sketch after this list):
1) monitor the memory/page-cache usage, with regular "free" commands or, better, "cat /proc/meminfo", or with "sar -r".
2) at some point when memory usage becomes quite high, run "echo 1 > /proc/sys/vm/drop_caches" and continue to monitor to see whether the page-cache usage numbers change. If they do not, we may start to think about a possible leak situation and maybe LU-5726 ...
3) if it works, then it would be good to try the extra_free_kbytes parameter, setting it to the same value as min_free_kbytes (2097152?) as a first try.

It would be great if you could provide me with at least the "cat /proc/meminfo" output just before you run the "echo 1 > /proc/sys/vm/drop_caches" command, and again the "cat /proc/meminfo" output just after.
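For instance, a minimal sketch of the requested monitoring (plain shell; the file names and the 60-second interval are just examples):

# snapshot before dropping caches
cat /proc/meminfo > meminfo.before
# drop clean page-cache pages only
echo 1 > /proc/sys/vm/drop_caches
# snapshot after, then keep watching
cat /proc/meminfo > meminfo.after
while true; do date; grep -E 'MemFree|Buffers|Cached' /proc/meminfo; sleep 60; done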

Comment by Daire Byrne (Inactive) [ 28/Oct/14 ]

Bruno,

I have actually set up a cron job to run "echo 3 > /proc/sys/vm/drop_caches" every Saturday at midnight (graph included). As you can see, the memory quickly fills up again on these busy filesystems (our backups also start at midnight). I've attached the meminfo before and after dropping caches as you requested (also seen at the end of the attached graph).

We are currently running with swappiness=100. I'll add the extra_free_kbytes setting as well. All we can do is wait a few weeks to see if it falls over again.

Comment by Daire Byrne (Inactive) [ 03/Dec/14 ]

Just to update - we have not seen this memory pressure crash in 51 days. The kswapd thread constantly uses 50-100% CPU, but if that helps keep the server running then so be it.

# free
             total       used       free     shared    buffers     cached
Mem:     396857076  366020780   30836296          0  345853496      12948
-/+ buffers/cache:   20154336  376702740
Swap:     16777208     763064   16014144

top - 11:59:34 up 51 days,  3:24,  1 user,  load average: 2.43, 2.27, 2.75
Tasks: 1718 total,   2 running, 1716 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  5.0%sy,  0.0%ni, 87.6%id,  7.2%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  396857076k total, 366033176k used, 30823900k free, 345864496k buffers
Swap: 16777208k total,   763100k used, 16014108k free,    16216k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                     
  499 root      20   0     0    0    0 D 81.0  0.0   6426:53 kswapd1                                                                                                                                                                     
 5061 root      20   0     0    0    0 S  9.0  0.0   6033:55 socknal_sd03_00                                                                                                                                                             
 6836 root      20   0     0    0    0 S  9.0  0.0 615:03.16 mdt07_016                                                                                                                                                                   
 6747 root      20   0     0    0    0 D  5.4  0.0 405:22.90 mdt03_043                                                                                                                                                                   
 6833 root      20   0     0    0    0 S  5.4  0.0 585:46.06 mdt07_013                                                                                                                                                                   
16840 root      20   0 27136 2708 1076 R  5.4  0.0   0:00.06 top                                                                                                                                                                         
17061 root      20   0     0    0    0 S  5.4  0.0 563:14.19 mdt00_050
21161 root      20   0     0    0    0 S  5.4  0.0 476:50.74 mdt03_039

vm.min_free_kbytes = 4194304
vm.extra_free_kbytes = 4194304
vm.vfs_cache_pressure = 100
vm.swappiness = 100

00 23 * * 0 echo 3 > /proc/sys/vm/drop_caches
Comment by Bruno Faccini (Inactive) [ 04/Dec/14 ]

Hello Daire,
Thanks for this update.
I think, and your graph showing a constant 360GB of buffers in memory seems to prove it, that we may have found the right VM tuning for your huge-memory MDS node.
The fact that "echo 3 > /proc/sys/vm/drop_caches" frees almost all of the buffers that were filling memory is proof that we are not facing a memory-leak condition.
The fact that the buffers quickly re-fill memory clearly indicates that your MDS activity is heavy and probably constant, and also that the once-per-week drop of caches is not necessary.
Last, kswapd being constantly CPU-bound and regularly moving data out of memory is a consequence of the VM tuning we applied against the heavy metadata/full-scan activity. But since the CPU load average of your system stays low, this has almost no impact on the scheduling of other threads/work.

Comment by Daire Byrne (Inactive) [ 23/Jan/15 ]

Still no recurrence of this since making the sysctl tweaks. You can close it now. Cheers.

Comment by Peter Jones [ 23/Jan/15 ]

That's good news - thanks Daire!

Comment by Daire Byrne (Inactive) [ 16/Mar/15 ]

We may have just seen this again after many months with swappiness=100. Same symptoms - the MDS becomes unresponsive and requires a hard reboot.

I've attached the messages - if it looks like a different issue I can open a new ticket.

Cheers.

Comment by Daire Byrne (Inactive) [ 16/Mar/15 ]

It may still be us. Apparently we are setting swappiness=20 and then swappiness=100 directly afterwards every hour. Don't you love config management systems....
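(A quick way to catch this sort of flapping, independent of the config management tool, is simply to watch the live value for a while:)

# print the effective value once a minute; any flip between 20 and 100 shows up immediately
while true; do date; cat /proc/sys/vm/swappiness; sleep 60; done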

Comment by Bruno Faccini (Inactive) [ 24/Mar/15 ]

Hello Daire,
The messages log you have posted looks similar to the previous occurrences, but since the threads/stacks that are dumped are only consequences, it is difficult to confirm that this was the same situation. Did you have time to take a full stack-trace dump (Alt+SysRq+T), or better a crash-dump?

Comment by Daire Byrne (Inactive) [ 24/Jun/15 ]

Sorry, I forgot about this ticket. You can close it again now - we have fixed our config management system.

Comment by Peter Jones [ 24/Jun/15 ]

ok - thanks Daire

Comment by Daire Byrne (Inactive) [ 07/Dec/15 ]

Sorry... this is still an issue (running out of memory) - we just have to auto reboot the server once every month. We recently updated to v2.5.3 and it seems like the frequency of the MDS crashes has increased to weekly reboots. We are still running with:

vm.min_free_kbytes = 4194304
vm.extra_free_kbytes = 4194304
vm.swappiness = 100

Is there anything else we can try to avoid running out of memory so frequently? I'll try to get another sysrq dump of this with v2.5.3, but it's tricky with such a large-memory node.

Comment by Bruno Faccini (Inactive) [ 07/Dec/15 ]

Hello Daire,
Sorry that this is still an issue.
Do you mean that before the 2.5.3 upgrade you also had to force an MDS reboot every month to avoid the OOM?
Do you have a crash-dump available from one of these recent/2.5.3 OOM situations?
Can you also post the output of the "numactl --hardware" and "cat /proc/zoneinfo" commands? Ideally after the MDS has been running under the production workload for quite a long time.

Comment by Daire Byrne (Inactive) [ 07/Dec/15 ]

Do you mean that before the 2.5.3 upgrade you also had to force an MDS reboot every month to avoid the OOM?

More like we detect the memory problem first and then auto-reboot the server (if free < 10G). It used to happen once a month, give or take a week. The load on these servers is pretty consistent from day to day.
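Something along these lines (a hypothetical sketch of the check, not our exact script):

# reboot if MemFree drops below ~10GB (/proc/meminfo reports kB)
free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
[ "$free_kb" -lt 10485760 ] && /sbin/reboot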

Do you have a crash-dump available from one of these recent/2.5.3 OOM situations?

No, not yet. I have instructed our engineers to do this next time, and I have disabled the automatic reboot so they can maybe sysrq it.

Can you also post the "numactl --hardware" and "cat /proc/zoneinfo" cmds output

I will endeavor to get that information to you in a few days' time. In the meantime, here is the output from the MDS server after only 1 day of uptime since the last incident.

bmds1 / # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 4982 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 17497 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10

bmds1 / # cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3934
        min      60
        low      135
        high     150
        scanned  0
        spanned  4095
        present  3835
    nr_free_pages 3934
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    numa_hit     1
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   0
    numa_other   1
    nr_anon_transparent_hugepages 0
        protection: (0, 1931, 129191, 129191)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 2
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 3
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 4
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 5
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 6
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 7
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 8
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 9
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 10
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 11
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 12
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 13
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 14
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 15
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 16
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 17
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 18
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 19
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 20
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 21
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 22
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 23
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 24
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 25
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 26
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 27
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 28
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 29
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 30
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 31
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         1
  inactive_ratio:    1
Node 0, zone    DMA32
  pages free     110046
        min      7833
        low      17624
        high     19582
        scanned  0
        spanned  1044480
        present  494376
    nr_free_pages 110046
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    numa_hit     2029
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   2017
    numa_other   12
    nr_anon_transparent_hugepages 0
        protection: (0, 0, 127260, 127260)
  pagesets
    cpu: 0
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 1
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 2
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 3
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 4
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 5
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 6
              count: 30
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 7
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 8
              count: 1
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 9
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 10
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 12
              count: 1
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 13
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 14
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 15
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 16
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 17
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 18
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 19
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 20
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 21
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 22
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 23
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 24
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 25
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 26
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 27
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 28
              count: 30
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 29
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 30
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 31
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         4096
  inactive_ratio:    3
Node 0, zone   Normal
  pages free     1161377
        min      516243
        low      1161546
        high     1290607
        scanned  0
        spanned  33030144
        present  32578560
    nr_free_pages 1161377
    nr_inactive_anon 16264
    nr_active_anon 4646
    nr_inactive_file 13336802
    nr_active_file 13708442
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 18320
    nr_mapped    3845
    nr_file_pages 27045274
    nr_dirty     12702
    nr_writeback 0
    nr_slab_reclaimable 3244308
    nr_slab_unreclaimable 1087317
    nr_page_table_pages 1220
    nr_kernel_stack 597
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     30
    numa_hit     7687506784
    numa_miss    25118087
    numa_foreign 3245810767
    numa_interleave 60017
    numa_local   7687483560
    numa_other   25141311
    nr_anon_transparent_hugepages 5
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 87
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 1
              count: 100
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 2
              count: 186
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 3
              count: 122
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 4
              count: 158
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 5
              count: 163
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 6
              count: 182
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 7
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 8
              count: 161
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 9
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 10
              count: 86
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 12
              count: 61
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 13
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 14
              count: 169
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 15
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 16
              count: 54
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 17
              count: 2
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 18
              count: 128
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 19
              count: 3
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 20
              count: 142
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 21
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 22
              count: 44
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 23
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 24
              count: 165
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 25
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 26
              count: 148
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 27
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 28
              count: 172
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 29
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 30
              count: 170
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 31
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         1048576
  inactive_ratio:    35
Node 1, zone   Normal
  pages free     4336363
        min      524437
        low      1179983
        high     1311092
        scanned  0
        spanned  33554432
        present  33095680
    nr_free_pages 4336363
    nr_inactive_anon 15126
    nr_active_anon 2017
    nr_inactive_file 12266036
    nr_active_file 12800791
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 14012
    nr_mapped    2008
    nr_file_pages 25067210
    nr_dirty     127401
    nr_writeback 0
    nr_slab_reclaimable 2577430
    nr_slab_unreclaimable 876760
    nr_page_table_pages 1226
    nr_kernel_stack 957
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 25
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     34
    numa_hit     3879043142
    numa_miss    3245810767
    numa_foreign 25118087
    numa_interleave 60079
    numa_local   3878729990
    numa_other   3246123919
    nr_anon_transparent_hugepages 6
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 183
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 1
              count: 185
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 2
              count: 46
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 3
              count: 159
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 4
              count: 35
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 5
              count: 27
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 6
              count: 204
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 7
              count: 194
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 8
              count: 24
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 9
              count: 101
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 10
              count: 152
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 11
              count: 59
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 12
              count: 34
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 13
              count: 158
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 14
              count: 32
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 15
              count: 129
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 16
              count: 32
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 17
              count: 189
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 18
              count: 30
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 19
              count: 165
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 20
              count: 33
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 21
              count: 43
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 22
              count: 183
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 23
              count: 45
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 24
              count: 155
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 25
              count: 130
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 26
              count: 41
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 27
              count: 178
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 28
              count: 167
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 29
              count: 122
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 30
              count: 184
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 31
              count: 163
              high:  186
              batch: 31
  vm stats threshold: 125
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         34078720
  inactive_ratio:    35
Comment by Bruno Faccini (Inactive) [ 09/Dec/15 ]

Well, it is a bit strange that processes/threads running on Node 1 are so heavily trying to allocate on Node 0 ...
It would be interesting to see whether the next set of numbers you gather confirms this.
Also, do you use any special parameters for ptlrpcd thread binding/neighbor configuration?
Could you also post the output of "lspci -tv" from the MDS?
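(For reference, a minimal sketch of how those settings and counters could be checked on the MDS; this is not from the ticket, and it assumes a stock numactl install with the lustre modules loaded. The globs simply print whatever module parameters exist, since the exact ptlrpcd/libcfs parameter names vary between Lustre versions.)

grep . /sys/module/ptlrpc/parameters/* 2>/dev/null      # prints all ptlrpc module parameters, including any ptlrpcd binding/thread-count tunables
grep . /sys/module/libcfs/parameters/cpu_* 2>/dev/null  # libcfs CPU-partition (CPT) settings, if any are exposed on this build
numastat                                                # per-node numa_hit/numa_miss/numa_foreign counters, handy for comparing successive samples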

Comment by Daire Byrne (Inactive) [ 09/Dec/15 ]

We don't have any special ptlrpcd configuration. Here's another grab of the numactl and zoneinfo output (now after 3 days of uptime), plus the requested lspci -tv output.

bmds1 /root # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 4981 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 16123 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

bmds1 /root # cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3934
        min      60
        low      135
        high     150
        scanned  0
        spanned  4095
        present  3835
    nr_free_pages 3934
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    numa_hit     1
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   0
    numa_other   1
    nr_anon_transparent_hugepages 0
        protection: (0, 1931, 129191, 129191)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 2
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 3
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 4
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 5
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 6
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 7
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 8
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 9
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 10
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 11
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 12
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 13
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 14
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 15
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 16
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 17
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 18
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 19
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 20
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 21
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 22
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 23
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 24
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 25
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 26
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 27
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 28
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 29
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 30
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 31
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         1
  inactive_ratio:    1
Node 0, zone    DMA32
  pages free     110077
        min      7833
        low      17624
        high     19582
        scanned  0
        spanned  1044480
        present  494376
    nr_free_pages 110077
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    numa_hit     3722
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   3705
    numa_other   17
    nr_anon_transparent_hugepages 0
        protection: (0, 0, 127260, 127260)
  pagesets
    cpu: 0
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 1
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 2
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 3
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 4
              count: 1
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 5
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 6
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 7
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 8
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 9
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 10
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 12
              count: 30
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 13
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 14
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 15
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 16
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 17
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 18
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 19
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 20
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 21
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 22
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 23
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 24
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 25
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 26
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 27
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 28
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 29
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 30
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 31
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         4096
  inactive_ratio:    3
Node 0, zone   Normal
  pages free     1161500
        min      516243
        low      1161546
        high     1290607
        scanned  0
        spanned  33030144
        present  32578560
    nr_free_pages 1161500
    nr_inactive_anon 16343
    nr_active_anon 4794
    nr_inactive_file 13047605
    nr_active_file 13835116
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 19541
    nr_mapped    1956
    nr_file_pages 26882808
    nr_dirty     16798
    nr_writeback 0
    nr_slab_reclaimable 3075559
    nr_slab_unreclaimable 1416404
    nr_page_table_pages 1319
    nr_kernel_stack 601
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 3854
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     31
    numa_hit     24437944447
    numa_miss    201935750
    numa_foreign 9386764171
    numa_interleave 60017
    numa_local   24437875632
    numa_other   202004565
    nr_anon_transparent_hugepages 3
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 152
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 1
              count: 119
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 2
              count: 214
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 3
              count: 148
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 4
              count: 168
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 5
              count: 127
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 6
              count: 56
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 7
              count: 137
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 8
              count: 236
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 9
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 10
              count: 179
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 12
              count: 187
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 13
              count: 11
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 14
              count: 61
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 15
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 16
              count: 16
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 17
              count: 137
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 18
              count: 62
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 19
              count: 145
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 20
              count: 101
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 21
              count: 130
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 22
              count: 183
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 23
              count: 116
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 24
              count: 187
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 25
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 26
              count: 185
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 27
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 28
              count: 222
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 29
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 30
              count: 157
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 31
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         1048576
  inactive_ratio:    35
Node 1, zone   Normal
  pages free     3876044
        min      524437
        low      1179983
        high     1311092
        scanned  0
        spanned  33554432
        present  33095680
    nr_free_pages 3876044
    nr_inactive_anon 16221
    nr_active_anon 564
    nr_inactive_file 12740495
    nr_active_file 13455109
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 13551
    nr_mapped    2654
    nr_file_pages 26198065
    nr_dirty     141941
    nr_writeback 0
    nr_slab_reclaimable 2174549
    nr_slab_unreclaimable 606113
    nr_page_table_pages 1296
    nr_kernel_stack 975
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 890
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     32
    numa_hit     10920091963
    numa_miss    9386764171
    numa_foreign 201935750
    numa_interleave 60079
    numa_local   10919375847
    numa_other   9387480287
    nr_anon_transparent_hugepages 6
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 193
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 1
              count: 195
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 2
              count: 51
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 3
              count: 124
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 4
              count: 71
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 5
              count: 226
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 6
              count: 82
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 7
              count: 203
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 8
              count: 73
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 9
              count: 158
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 10
              count: 168
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 11
              count: 114
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 12
              count: 56
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 13
              count: 54
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 14
              count: 170
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 15
              count: 149
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 16
              count: 231
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 17
              count: 199
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 18
              count: 54
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 19
              count: 38
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 20
              count: 78
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 21
              count: 206
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 22
              count: 57
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 23
              count: 62
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 24
              count: 214
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 25
              count: 96
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 26
              count: 171
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 27
              count: 77
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 28
              count: 186
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 29
              count: 179
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 30
              count: 180
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 31
              count: 82
              high:  186
              batch: 31
  vm stats threshold: 125
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         34078720
  inactive_ratio:    35

bmds1 /root # lspci -tv
-+-[0000:7f]-+-08.0  Intel Corporation Xeon E5/Core i7 QPI Link 0
 |           +-09.0  Intel Corporation Xeon E5/Core i7 QPI Link 1
 |           +-0a.0  Intel Corporation Xeon E5/Core i7 Power Control Unit 0
 |           +-0a.1  Intel Corporation Xeon E5/Core i7 Power Control Unit 1
 |           +-0a.2  Intel Corporation Xeon E5/Core i7 Power Control Unit 2
 |           +-0a.3  Intel Corporation Xeon E5/Core i7 Power Control Unit 3
 |           +-0b.0  Intel Corporation Xeon E5/Core i7 Interrupt Control Registers
 |           +-0b.3  Intel Corporation Xeon E5/Core i7 Semaphore and Scratchpad Configuration Registers
 |           +-0c.0  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0c.1  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0c.2  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0c.3  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0c.6  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 0
 |           +-0c.7  Intel Corporation Xeon E5/Core i7 System Address Decoder
 |           +-0d.0  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0d.1  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0d.2  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0d.3  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0d.6  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 1
 |           +-0e.0  Intel Corporation Xeon E5/Core i7 Processor Home Agent
 |           +-0e.1  Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring
 |           +-0f.0  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Registers
 |           +-0f.1  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller RAS Registers
 |           +-0f.2  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 0
 |           +-0f.3  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 1
 |           +-0f.4  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 2
 |           +-0f.5  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 3
 |           +-0f.6  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 4
 |           +-10.0  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0
 |           +-10.1  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1
 |           +-10.2  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 0
 |           +-10.3  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 1
 |           +-10.4  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2
 |           +-10.5  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3
 |           +-10.6  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 2
 |           +-10.7  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 3
 |           +-11.0  Intel Corporation Xeon E5/Core i7 DDRIO
 |           +-13.0  Intel Corporation Xeon E5/Core i7 R2PCIe
 |           +-13.1  Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor
 |           +-13.4  Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers
 |           +-13.5  Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor
 |           \-13.6  Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor
 +-[0000:40]-+-01.0-[41]----00.0  LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
 |           +-03.0-[42]----00.0  LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2
 |           +-05.0  Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management
 |           \-05.2  Intel Corporation Xeon E5/Core i7 Control Status and Global Errors
 +-[0000:3f]-+-08.0  Intel Corporation Xeon E5/Core i7 QPI Link 0
 |           +-09.0  Intel Corporation Xeon E5/Core i7 QPI Link 1
 |           +-0a.0  Intel Corporation Xeon E5/Core i7 Power Control Unit 0
 |           +-0a.1  Intel Corporation Xeon E5/Core i7 Power Control Unit 1
 |           +-0a.2  Intel Corporation Xeon E5/Core i7 Power Control Unit 2
 |           +-0a.3  Intel Corporation Xeon E5/Core i7 Power Control Unit 3
 |           +-0b.0  Intel Corporation Xeon E5/Core i7 Interrupt Control Registers
 |           +-0b.3  Intel Corporation Xeon E5/Core i7 Semaphore and Scratchpad Configuration Registers
 |           +-0c.0  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0c.1  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0c.2  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0c.3  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0c.6  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 0
 |           +-0c.7  Intel Corporation Xeon E5/Core i7 System Address Decoder
 |           +-0d.0  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0d.1  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0d.2  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0d.3  Intel Corporation Xeon E5/Core i7 Unicast Register 0
 |           +-0d.6  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 1
 |           +-0e.0  Intel Corporation Xeon E5/Core i7 Processor Home Agent
 |           +-0e.1  Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring
 |           +-0f.0  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Registers
 |           +-0f.1  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller RAS Registers
 |           +-0f.2  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 0
 |           +-0f.3  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 1
 |           +-0f.4  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 2
 |           +-0f.5  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 3
 |           +-0f.6  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 4
 |           +-10.0  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0
 |           +-10.1  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1
 |           +-10.2  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 0
 |           +-10.3  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 1
 |           +-10.4  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2
 |           +-10.5  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3
 |           +-10.6  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 2
 |           +-10.7  Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 3
 |           +-11.0  Intel Corporation Xeon E5/Core i7 DDRIO
 |           +-13.0  Intel Corporation Xeon E5/Core i7 R2PCIe
 |           +-13.1  Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor
 |           +-13.4  Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers
 |           +-13.5  Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor
 |           \-13.6  Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor
 \-[0000:00]-+-00.0  Intel Corporation Xeon E5/Core i7 DMI2
             +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet
             |            +-00.1  Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet
             |            +-00.2  Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet
             |            \-00.3  Broadcom Corporation NetXtreme II BCM57800 1/10 Gigabit Ethernet
             +-02.0-[03]--
             +-02.2-[02]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
             +-03.0-[04]----00.0  Virident Systems Inc. Virident FlashMAX Drive V2
             +-05.0  Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management
             +-05.2  Intel Corporation Xeon E5/Core i7 Control Status and Global Errors
             +-11.0-[05]--
             +-16.0  Intel Corporation C600/X79 series chipset MEI Controller #1
             +-16.1  Intel Corporation C600/X79 series chipset MEI Controller #2
             +-1a.0  Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #2
             +-1c.0-[06]--
             +-1c.7-[07-0b]----00.0-[08-0b]--+-00.0-[09-0a]----00.0-[0a]----00.0  Matrox Electronics Systems Ltd. G200eR2
             |                               \-01.0-[0b]--
             +-1d.0  Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #1
             +-1e.0-[0c]--
             +-1f.0  Intel Corporation C600/X79 series chipset LPC Controller
             \-1f.2  Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller
Comment by Daire Byrne (Inactive) [ 14/Dec/15 ]

After 8 days of uptime - still waiting for the inevitable crash...

bmds1 # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 8025 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 17273 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

bmds1 /root # cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3934
        min      60
        low      135
        high     150
        scanned  0
        spanned  4095
        present  3835
    nr_free_pages 3934
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    numa_hit     1
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   0
    numa_other   1
    nr_anon_transparent_hugepages 0
        protection: (0, 1931, 129191, 129191)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 2
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 3
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 4
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 5
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 6
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 7
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 8
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 9
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 10
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 11
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 12
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 13
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 14
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 15
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 16
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 17
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 18
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 19
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 20
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 21
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 22
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 23
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 24
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 25
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 26
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 27
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 28
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 29
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 30
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
    cpu: 31
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 12
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         1
  inactive_ratio:    1
Node 0, zone    DMA32
  pages free     110046
        min      7833
        low      17624
        high     19582
        scanned  0
        spanned  1044480
        present  494376
    nr_free_pages 110046
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    numa_hit     9738
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   9695
    numa_other   43
    nr_anon_transparent_hugepages 0
        protection: (0, 0, 127260, 127260)
  pagesets
    cpu: 0
              count: 30
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 1
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 2
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 3
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 4
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 5
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 6
              count: 1
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 7
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 8
              count: 30
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 9
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 10
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 12
              count: 1
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 13
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 14
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 15
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 16
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 17
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 18
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 19
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 20
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 21
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 22
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 23
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 24
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 25
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 26
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 27
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 28
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 29
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 30
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
    cpu: 31
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 60
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         4096
  inactive_ratio:    3
Node 0, zone   Normal
  pages free     2036613
        min      516243
        low      1161546
        high     1290607
        scanned  0
        spanned  33030144
        present  32578560
    nr_free_pages 2036613
    nr_inactive_anon 4511
    nr_active_anon 262
    nr_inactive_file 14046304
    nr_active_file 14206134
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 2802
    nr_mapped    728
    nr_file_pages 28253303
    nr_dirty     96376
    nr_writeback 0
    nr_slab_reclaimable 1740250
    nr_slab_unreclaimable 518722
    nr_page_table_pages 1323
    nr_kernel_stack 619
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 486599
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     1
    numa_hit     71200515152
    numa_miss    2298983520
    numa_foreign 18750545338
    numa_interleave 60017
    numa_local   71200327746
    numa_other   2299170926
    nr_anon_transparent_hugepages 3
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 148
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 1
              count: 158
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 2
              count: 75
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 3
              count: 174
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 4
              count: 162
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 5
              count: 181
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 6
              count: 173
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 7
              count: 178
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 8
              count: 142
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 9
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 10
              count: 183
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 12
              count: 175
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 13
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 14
              count: 214
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 15
              count: 1
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 16
              count: 49
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 17
              count: 164
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 18
              count: 167
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 19
              count: 174
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 20
              count: 113
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 21
              count: 176
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 22
              count: 166
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 23
              count: 165
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 24
              count: 60
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 25
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 26
              count: 181
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 27
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 28
              count: 165
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 29
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 30
              count: 184
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 31
              count: 2
              high:  186
              batch: 31
  vm stats threshold: 125
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         1048576
  inactive_ratio:    35
Node 1, zone   Normal
  pages free     6797220
        min      524437
        low      1179983
        high     1311092
        scanned  0
        spanned  33554432
        present  33095680
    nr_free_pages 6797220
    nr_inactive_anon 10776
    nr_active_anon 1964
    nr_inactive_file 12050770
    nr_active_file 12364474
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 10537
    nr_mapped    3388
    nr_file_pages 24416518
    nr_dirty     41224
    nr_writeback 0
    nr_slab_reclaimable 1120374
    nr_slab_unreclaimable 531249
    nr_page_table_pages 1346
    nr_kernel_stack 996
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 8957
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     1
    numa_hit     27356716669
    numa_miss    18750545338
    numa_foreign 2298983520
    numa_interleave 60079
    numa_local   27355207924
    numa_other   18752054083
    nr_anon_transparent_hugepages 3
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 158
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 1
              count: 135
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 2
              count: 159
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 3
              count: 184
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 4
              count: 155
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 5
              count: 173
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 6
              count: 185
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 7
              count: 39
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 8
              count: 139
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 9
              count: 118
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 10
              count: 171
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 11
              count: 138
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 12
              count: 175
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 13
              count: 69
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 14
              count: 173
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 15
              count: 180
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 16
              count: 180
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 17
              count: 7
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 18
              count: 185
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 19
              count: 28
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 20
              count: 169
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 21
              count: 200
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 22
              count: 153
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 23
              count: 28
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 24
              count: 182
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 25
              count: 139
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 26
              count: 131
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 27
              count: 60
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 28
              count: 173
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 29
              count: 160
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 30
              count: 154
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 31
              count: 102
              high:  186
              batch: 31
  vm stats threshold: 125
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         34078720
  inactive_ratio:    35
Comment by Bruno Faccini (Inactive) [ 15/Dec/15 ]

Well, according to the evolution of the memory counters/stats, the free-memory thresholds seem to be working OK, so you may be hitting a very specific situation at the time of the OOMs. I hope we will again be able to learn something from a crash-dump.
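
For reference, a minimal sketch of how the configured reserve can be compared against the per-zone watermarks shown in the zoneinfo above. It assumes 4 kB pages and the /proc/zoneinfo layout of this RHEL6-era kernel; the sum of the per-zone "min" watermarks should roughly match vm.min_free_kbytes.

# compare the configured reserve against the per-zone watermarks (pages -> kB, 4 kB pages assumed)
sysctl vm.min_free_kbytes
awk '$1=="Node"{node=$2; zone=$4}
     $1=="min"||$1=="low"||$1=="high"{printf "Node %s zone %-8s %-4s %d kB\n", node, zone, $1, $2*4}' /proc/zoneinfo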

Comment by Daire Byrne (Inactive) [ 04/Jan/16 ]

The server became pretty unresponsive this morning. I tried to take a crash dump, but kdump wasn't configured properly to catch it.

For what it's worth I grabbed another zoneinfo while in this state (<8G free). Trying to drop caches had essentially no effect and didn't free up any memory.
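
For reference, a minimal way to quantify the "no effect" observation, assuming the standard /proc/meminfo fields on this kernel:

grep -E 'MemFree|^Cached|Slab|SReclaimable|SUnreclaim' /proc/meminfo   # before
sync; echo 3 > /proc/sys/vm/drop_caches
grep -E 'MemFree|^Cached|Slab|SReclaimable|SUnreclaim' /proc/meminfo   # after

On a healthy node, Cached and SReclaimable should collapse and MemFree should jump after the second step.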

bmds1 /root # numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 2463 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 2049 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
bmds1 /root # cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3934
        min      60
        low      135
        high     150
        scanned  0
        spanned  4095
        present  3835
    nr_free_pages 3934
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    numa_hit     1
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   0
    numa_other   1
    nr_anon_transparent_hugepages 0
        protection: (0, 1931, 129191, 129191)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 2
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 3
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 4
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 5
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 6
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 7
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 8
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 9
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 10
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 11
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 12
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 13
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 14
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 15
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 16
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 17
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 18
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 19
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 20
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 21
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 22
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 23
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 24
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 25
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 26
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 27
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 28
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 29
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 30
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
    cpu: 31
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 2
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         1
  inactive_ratio:    1
Node 0, zone    DMA32
  pages free     110108
        min      7833
        low      17624
        high     19582
        scanned  0
        spanned  1044480
        present  494376
    nr_free_pages 110108
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    numa_hit     29013
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   28306
    numa_other   707
    nr_anon_transparent_hugepages 0
        protection: (0, 0, 127260, 127260)
  pagesets
    cpu: 0
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 1
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 2
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 3
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 4
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 5
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 6
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 7
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 8
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 9
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 10
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 12
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 13
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 14
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 15
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 16
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 17
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 18
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 19
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 20
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 21
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 22
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 23
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 24
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 25
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 26
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 27
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 28
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 29
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 30
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 31
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096
  inactive_ratio:    3
Node 0, zone   Normal
  pages free     517075
        min      516243
        low      1161546
        high     1290607
        scanned  288
        spanned  33030144
        present  32578560
    nr_free_pages 517075
    nr_inactive_anon 107
    nr_active_anon 97
    nr_inactive_file 14924374
    nr_active_file 14936877
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 184
    nr_mapped    378
    nr_file_pages 29861476
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 1493381
    nr_slab_unreclaimable 681908
    nr_page_table_pages 1240
    nr_kernel_stack 624
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 42937040
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    numa_hit     156088061106
    numa_miss    4460728532
    numa_foreign 150915871653
    numa_interleave 60017
    numa_local   156087747488
    numa_other   4461042150
    nr_anon_transparent_hugepages 0
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 1
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 2
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 3
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 4
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 5
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 6
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 7
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 8
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 9
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 10
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 12
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 13
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 14
              count: 30
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 15
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 16
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 17
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 18
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 19
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 20
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 21
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 22
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 23
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 24
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 25
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 26
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 27
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 28
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 29
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 30
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 31
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         1048576
  inactive_ratio:    35
Node 1, zone   Normal
  pages free     525475
        min      524437
        low      1179983
        high     1311092
        scanned  2624
        spanned  33554432
        present  33095680
    nr_free_pages 525475
    nr_inactive_anon 2427
    nr_active_anon 2424
    nr_inactive_file 15429332
    nr_active_file 15429025
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 4709
    nr_mapped    1574
    nr_file_pages 30859896
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 970998
    nr_slab_unreclaimable 523574
    nr_page_table_pages 1526
    nr_kernel_stack 1082
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 156051144
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     1
    numa_hit     88088497663
    numa_miss    150915871653
    numa_foreign 4460728532
    numa_interleave 60079
    numa_local   88081349155
    numa_other   150923020161
    nr_anon_transparent_hugepages 0
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 1
              count: 27
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 2
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 3
              count: 1
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 4
              count: 1
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 5
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 6
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 7
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 8
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 9
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 10
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 12
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 13
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 14
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 15
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 16
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 17
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 18
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 19
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 20
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 21
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 22
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 23
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 24
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 25
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 26
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 27
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 28
              count: 2
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 29
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 30
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
    cpu: 31
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 125
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         34078720
  inactive_ratio:    35
Comment by Bruno Faccini (Inactive) [ 07/Jan/16 ]

Hello Daire,
If I understand your last update correctly, it is the MDS/MDT functionality that "became pretty unresponsive", not the hosting MDS node itself?
Was there any interesting message in the node's dmesg/syslog?
It is too bad that you were not able to get/force a crash-dump because kdump was not configured, but have you fixed that now? It would also have been useful to capture a trace of all thread stacks (Alt+SysRq+T or equivalent).
Also, you indicate "<8G free"; do you relate that to the "vm.[min,extra]_free_kbytes = 4194304" parameters, and are they still in place?
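
For reference, a rough sketch of that data capture on a RHEL6-class node. Service names and the sysrq path are the usual ones but may differ per setup, and vm.extra_free_kbytes is an out-of-tree knob that exists on this site's kernel:

# confirm kdump is armed before the next incident
service kdump status
# dump all task stacks to the kernel log (shell equivalent of Alt+SysRq+T)
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg | tail -n 200
# confirm the quoted reserve settings are still applied
sysctl vm.min_free_kbytes vm.extra_free_kbytes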

Comment by Daire Byrne (Inactive) [ 09/Feb/16 ]

Sorry for the late reply; we have been struggling to get a useful vmcore. We trimmed all the cache and buffer pages out of the kdump, but it still won't fit on the root drive (the server has 400G of RAM). This actually highlighted something I don't fully understand: when the server is getting close to falling over (less than ~10G free), running "sync ; echo 3 > /proc/sys/vm/drop_caches" does not seem to free up any memory at all.
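
For completeness, the usual RHEL6 kdump filtering knobs referred to above: dump level 31 excludes zero pages, page cache, user-process pages and free pages, and -c compresses the rest, which normally shrinks a vmcore to a small fraction of RAM. The exact /etc/kdump.conf contents are site-specific.

# /etc/kdump.conf (RHEL6-style example)
path /var/crash
core_collector makedumpfile -c -d 31
# rebuild the kdump initrd and re-arm the service
service kdump restart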

By contrast, a server that has been up for just a couple of days will free up at least half of the 400G of memory after issuing that command.

The actual server node becomes sluggish: it takes ages to ssh in, and the console is extremely slow. There isn't anything else useful in the logs other than the resulting Lustre timeouts. And yes, we are still using "vm.[min,extra]_free_kbytes = 4194304", but we know the server is getting close to crashing when there is less than ~10G free.

I'll try to get the thread stacks next time too.

Comment by Daire Byrne (Inactive) [ 26/Jan/17 ]

Just to follow up on this - we upgraded from v2.4.1 -> v2.8 a couple of months ago and have not seen this issue since (with the same workload). We also took the opportunity to upgrade from RHEL6 -> RHEL7 at the same time.

So one or the other of these upgrades resolved the issue. Before the upgrades, we were still hitting this problem about once a week, but simply rebooted when it occurred.

Comment by Peter Jones [ 26/Jan/17 ]

Glad to hear it. Thanks for the update, Daire.

Generated at Sat Feb 10 01:49:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.