[LU-9885] Huge amount of costs in ptlrpc_wait_event() at file creation Created: 16/Aug/17 Updated: 01/Sep/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Peter Jones |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After metadata performance improvements, we profiled Lustre function costs on the MDS (mds05):

    [root@mds05 ~]# /work/ihara/perf-tools/bin/funccost -L
    Start setting lustre functions...
    Setting lustre functions done...
    Tracing "[lustre modules functions]"... Ctrl-C to end.
    /work/ihara/perf-tools/bin/funccost: line 158: tracing_enabled: Permission denied
    ^C
    FUNC                           TOTAL_TIME(us)      COUNT    AVG(us)
    ptlrpc_wait_event                  1299908760    1745217     744.84
    ptlrpc_server_handle_request        279979924     876419     319.46
    tgt_request_handle                  268911684     637004     422.15
    tgt_enqueue                         230880245     318456     725.00
    ldlm_lock_enqueue                   230872744    1273842     181.24
    ldlm_handle_enqueue0                230791830     318456     724.72
    mdt_intent_policy                   224904050     318456     706.23
    mdt_intent_opc                      223658324     318456     702.32
    mdt_intent_reint                    223196495     318456     700.87

Why is ptlrpc_wait_event() taking a lot of cost here? |
| Comments |
| Comment by Andreas Dilger [ 16/Aug/17 ] | |
|
Is this CPU time or elapsed wall clock time? The client threads are all sleeping in ptlrpc_wait_event() for the reply from the MDS. Having more RPCs in flight means that more threads will be waiting in parallel, even if it increases the aggregate create rate. If this is elapsed wall-clock time, then that is normal and not much can be done about it. If this is CPU time, then it might be made more efficient. | |
| Comment by Wang Shilong (Inactive) [ 16/Aug/17 ] | |
|
Hi Andreas, we used the ftrace function profiler, which should not include sleep time, to collect the profile results above. | |
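For reference, here is a minimal sketch of what the underlying ftrace function profiler looks like when driven by hand (funccost wraps a similar sequence; this assumes debugfs is mounted at /sys/kernel/debug, and the exact options funccost sets may differ):

    cd /sys/kernel/debug/tracing

    # Restrict profiling to the functions of interest
    echo 'ptlrpc_wait_event ptlrpc_server_handle_request tgt_request_handle' > set_ftrace_filter

    # With the function_graph-based profiler, the 'sleep-time' option controls
    # whether time a task spends scheduled out is charged to the function.
    # Disable it so only on-CPU time (plus nested calls) is accumulated.
    echo 0 > options/sleep-time

    echo 1 > function_profile_enabled
    sleep 30            # run the file-creation workload in the meantime
    echo 0 > function_profile_enabled

    # Per-CPU hit counts, total and average times (one file per CPU)
    cat trace_stat/function*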
| Comment by John Hammond [ 16/Aug/17 ] | |
|
Is this on 2.7.0 or a descendant? Could you describe the exact Lustre revision and the test used here? | |
| Comment by Peter Jones [ 17/Aug/17 ] | |
|
Do you see the same behaviour with Lustre 2.10? | |
| Comment by Andreas Dilger [ 17/Aug/17 ] | |
|
I was looking into the ptlrpc_wait_event() code, and the only part that can take a lot of CPU time is:

    l_wait_event_exclusive_head(svcpt->scp_waitq,
                                ptlrpc_thread_stopping(thread) ||
                                ptlrpc_server_request_incoming(svcpt) ||
                                ptlrpc_server_request_pending(svcpt, false) ||
                                ptlrpc_rqbd_pending(svcpt) ||
                                ptlrpc_at_check(svcpt), &lwi);

Of all those checks, the only one that is not just a simple variable/flag check is ptlrpc_server_request_pending(), and it looks like this may be hooking into the NRS code. Is it possible you have some NRS TBF patches/policies on this system that are causing this function to be slow?

Alternately, if there is memory pressure on the MDS, ptlrpc_wait_event() may just be called a lot due to the 0.1s timeout being set in this case:

    if (ptlrpc_rqbd_pending(svcpt) &&
        ptlrpc_server_post_idle_rqbds(svcpt) < 0) {
            /* I just failed to repost request buffers.
             * Wait for a timeout (unless something else
             * happens) before I try again */
            svcpt->scp_rqbd_timeout = cfs_time_seconds(1) / 10;
            CDEBUG(D_RPCTRACE, "Posted buffers: %d\n",
                   svcpt->scp_nrqbds_posted);
    }

What would be useful here is a flame graph, which shows where the CPU time is actually being spent. | |
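To help narrow down the NRS question, a minimal sketch for checking which NRS policies are active on the server services (the parameter path is from memory and may differ slightly between Lustre versions):

    # List the NRS policies configured for each PTLRPC service on this server;
    # a non-default policy such as "tbf" being active is what is suspected to
    # make ptlrpc_server_request_pending() more expensive per wakeup.
    lctl get_param *.*.*.nrs_policies

    # If a non-default policy is active and not needed, the service can be
    # reverted to the default FIFO policy, e.g. (OSS bulk I/O service shown):
    # lctl set_param ost.OSS.ost_io.nrs_policies="fifo"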
| Comment by Patrick Farrell (Inactive) [ 18/Aug/17 ] | |
|
For what it's worth, I often see a lot of time spent here when dealing with high rates of small messages. In my experience, when looking more closely, the culprit is the time spent doing the sleep/wake. My data is client side, but the code is the same. I'm not familiar with the tool you're using; does it break down the time spent in that function further (especially in what that function calls)? If not, I'd suggest using perf record/perf report (the flame graphs Andreas mentioned are a way to visualize the sort of data it generates) - it makes it easy to see where time is spent in a function and in the functions it calls. If you are just charting total time per function, the time could be spread across a few sub-calls that sit far down the list individually but together account for the time in that function. | |
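As a concrete starting point, a minimal sketch of the perf / flame-graph workflow suggested above (assumes perf and Brendan Gregg's FlameGraph scripts are available on the MDS; the /opt/FlameGraph install path is illustrative):

    # Sample on-CPU stacks across all CPUs on the MDS while the create
    # workload is running (99 Hz, 30 seconds, with call graphs)
    perf record -F 99 -a -g -- sleep 30

    # Text breakdown of where the CPU time actually goes, including the
    # callees of ptlrpc_wait_event()
    perf report --stdio

    # Or render an on-CPU flame graph from the same data
    perf script | /opt/FlameGraph/stackcollapse-perf.pl | \
        /opt/FlameGraph/flamegraph.pl > mds-oncpu.svg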
| Comment by Shuichi Ihara (Inactive) [ 18/Aug/17 ] | |
|
Thanks Andreas and Patrick |