[LU-9885] Huge amount of costs in ptlrpc_wait_event() at file creation Created: 16/Aug/17  Updated: 01/Sep/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14976 Changing tbf policy induces high CPU ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After the metadata performance improvements from LU-9796, we are hitting a Lustre-level bottleneck for file creation at scale.
We collected the cost of all Lustre functions during file creation, and ptlrpc_wait_event() took a huge amount of time.

[root@mds05 ~]# /work/ihara/perf-tools/bin/funccost -L
Start setting lustre functions...
Setting lustre functions done...
Tracing "[lustre modules functions]"... Ctrl-C to end.
/work/ihara/perf-tools/bin/funccost: line 158: tracing_enabled: Permission denied
^C
FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)       
ptlrpc_wait_event              1299908760           1745217      744.84      
ptlrpc_server_handle_request   279979924            876419       319.46      
tgt_request_handle             268911684            637004       422.15      
tgt_enqueue                    230880245            318456       725.00      
ldlm_lock_enqueue              230872744            1273842      181.24      
ldlm_handle_enqueue0           230791830            318456       724.72      
mdt_intent_policy              224904050            318456       706.23      
mdt_intent_opc                 223658324            318456       702.32      
mdt_intent_reint               223196495            318456       700.87      

Why is ptlrpc_wait_event() costing so much time here?



 Comments   
Comment by Andreas Dilger [ 16/Aug/17 ]

Is this CPU time or elapsed wall clock time? The client threads are all sleeping in ptlrpc_wait_event() for the reply from the MDS. Having more RPCs in flight means that more threads will be waiting in parallel, even if it increases the aggregate create rate.

If this is elapsed wall-clock time, then that is normal and not much can be done about it. If this is CPU time, then it might be made more efficient.

Comment by Wang Shilong (Inactive) [ 16/Aug/17 ]

Hi Andreas,

We used the ftrace function profiler, which should not include sleep time, so the profile results here should be CPU time.
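For reference, the ftrace function profiler is driven through tracefs roughly as follows (a hedged sketch of what the funccost script wraps; the mount point, filter pattern, and duration are assumptions, not the script's exact behaviour):

cd /sys/kernel/debug/tracing
echo 'ptlrpc_*' > set_ftrace_filter   # restrict profiling to the ptlrpc functions of interest
echo 0 > options/sleep-time           # if available, exclude time a task spends scheduled out
echo 1 > function_profile_enabled     # start accumulating per-function hit counts and times
sleep 30                              # let the file-creation workload run
echo 0 > function_profile_enabled     # stop profiling
cat trace_stat/function*              # per-CPU tables: Function, Hit, Time, Avg, s^2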

Comment by John Hammond [ 16/Aug/17 ]

Is this on 2.7.0 or a descendant? Could you describe the exact Lustre revision and test used here?

Comment by Peter Jones [ 17/Aug/17 ]

Do you see the same behaviour with Lustre 2.10?

Comment by Andreas Dilger [ 17/Aug/17 ]

I was looking into the ptlrpc_wait_event() code, and the only part that can take a lot of CPU time is:

        l_wait_event_exclusive_head(svcpt->scp_waitq,
                                ptlrpc_thread_stopping(thread) ||
                                ptlrpc_server_request_incoming(svcpt) ||
                                ptlrpc_server_request_pending(svcpt, false) ||
                                ptlrpc_rqbd_pending(svcpt) ||
                                ptlrpc_at_check(svcpt), &lwi);

and of all those checks, the only one that is not just a simple variable/flag check is ptlrpc_server_request_pending(). It looks like this may be hooking into the NRS code. Is it possible you have some NRS TBF patches/policies on this system that are causing this function to be slow? Alternatively, if there is memory pressure on the MDS, ptlrpc_wait_event() may just be getting called very often because of the 0.1s timeout being set in this case:

                if (ptlrpc_rqbd_pending(svcpt) &&
                    ptlrpc_server_post_idle_rqbds(svcpt) < 0) {
                        /* I just failed to repost request buffers.
                         * Wait for a timeout (unless something else
                         * happens) before I try again */
                        svcpt->scp_rqbd_timeout = cfs_time_seconds(1) / 10;
                        CDEBUG(D_RPCTRACE, "Posted buffers: %d\n",
                               svcpt->scp_nrqbds_posted);
                }

What would be useful here is a flame graph which shows where the CPU time is actually being spent: http://www.brendangregg.com/flamegraphs.html
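A typical recipe for that on the MDS would be something like the following (a sketch using perf and the standard FlameGraph scripts; the sampling rate, duration, and output file names are only placeholders):

git clone https://github.com/brendangregg/FlameGraph
perf record -F 99 -a -g -- sleep 30                       # sample all CPUs with call stacks while creates run
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded
./FlameGraph/flamegraph.pl out.folded > mds-create.svg    # open the SVG and look for ptlrpc_wait_event()
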
Comment by Patrick Farrell (Inactive) [ 18/Aug/17 ]

For what it's worth, I often see a lot of time spent here when dealing with high rates of small messages. In my experience, when looking more closely, the culprit is the time spent doing the sleep/wake. My data is client side, but the code is the same.

I'm not familiar with the tool you're using; does it break down the time spent in that function further (especially in what that function calls)? If not, I'd suggest using perf record/perf report (the flame graphs Andreas mentioned are a way to visualize the sort of data it generates); it makes it easy to see where time is spent in a function and in the functions called from it. If you are just charting total time per function, then when the time is spread across a few sub-calls, those callees could be far down the list yet together account for the time in the parent function.
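Something along these lines would give that callee breakdown (a sketch; the sampling window is arbitrary, and how well the call stacks resolve depends on how the kernel and Lustre modules were built):

perf record -a -g -- sleep 30    # sample all CPUs with call stacks while the create test is running
perf report                      # interactive breakdown; expand ptlrpc_wait_event() to see where its time goes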

Comment by Shuichi Ihara (Inactive) [ 18/Aug/17 ]

Thanks Andreas and Patrick
Yes, we just used a simple perf-tools script to confirm whether there are any places consuming a lot of CPU. Agreed, for now we need to break down the CPU time within each function; we will try perf to find exactly where the time is being spent.
