[LU-13365] spin_lock in after_reply() eats up most of CPU Created: 17/Mar/20 Updated: 28/Jun/20 Resolved: 28/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Shuichi Ihara | Assignee: | Wang Shilong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | master |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
There are two clients, but one of them showed 25% slower writes than the other. According to a flamegraph of CPU time on each client, the slow client spent a large amount of CPU time in spin_lock called from after_reply(). The fast client did not show that CPU time in after_reply().

fast client (2 x Platinum 8160 CPU, 192GB memory, 1 x IB-EDR)
Max Write: 11219.97 MiB/sec (11765.00 MB/sec)

slow client (1 x Gold 5218 CPU, 96GB memory, 1 x IB-HDR100)
Max Write: 9278.14 MiB/sec (9728.84 MB/sec) |
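For context, a minimal sketch of the contention pattern the flamegraph points at. This is an illustration only, not the real after_reply() code; the function name after_reply_sketch is a hypothetical stand-in, while struct ptlrpc_request, struct obd_import and imp_lock are the existing Lustre names:

        /*
         * Illustration only (not the real Lustre after_reply()): every reply
         * completion on an import takes the same import-wide spinlock, so all
         * ptlrpcd threads finishing RPCs for that import serialize here.
         */
        static int after_reply_sketch(struct ptlrpc_request *req)
        {
                struct obd_import *imp = req->rq_import;

                spin_lock(&imp->imp_lock);
                /* bookkeeping shared by all in-flight requests of this import,
                 * e.g. last-committed transno and the replay list, is updated
                 * under this one lock */
                spin_unlock(&imp->imp_lock);

                return 0;
        }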
| Comments |
| Comment by Wang Shilong (Inactive) [ 17/Mar/20 ] |
|
Some thoughts on the problem: it looks like imp_lock is hot. A good starting point could be to split imp_lock, e.g. use a separate spinlock to protect the list heads, since currently imp_lock protects a lot of state. Another point is that these are still global lists; it might be possible to split the list heads into several, or even into per-CPU list heads, and merge them when replay happens. However, that might break some optimizations we need for recovery. |
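A minimal sketch of the per-CPU list idea described above. The names pcpu_req_list and imp_sending_pcpu are hypothetical, and allocation/initialization via alloc_percpu() is omitted; this is not a proposed patch, just an illustration of splitting a globally locked list and splicing the pieces back together when a single view is needed (e.g. at replay time):

        #include <linux/list.h>
        #include <linux/percpu.h>
        #include <linux/spinlock.h>

        /* Hypothetical per-CPU list: each CPU has its own lock and head,
         * so the fast path only contends on the local CPU's lock. */
        struct pcpu_req_list {
                spinlock_t       prl_lock;
                struct list_head prl_head;
        };

        static struct pcpu_req_list __percpu *imp_sending_pcpu;

        /* Fast path: add to the local CPU's list. */
        static void req_list_add(struct list_head *item)
        {
                struct pcpu_req_list *prl = this_cpu_ptr(imp_sending_pcpu);

                spin_lock(&prl->prl_lock);
                list_add_tail(item, &prl->prl_head);
                spin_unlock(&prl->prl_lock);
        }

        /* Slow path: splice every CPU's list into one list for replay. */
        static void req_list_merge(struct list_head *all)
        {
                int cpu;

                for_each_possible_cpu(cpu) {
                        struct pcpu_req_list *prl =
                                per_cpu_ptr(imp_sending_pcpu, cpu);

                        spin_lock(&prl->prl_lock);
                        list_splice_tail_init(&prl->prl_head, all);
                        spin_unlock(&prl->prl_lock);
                }
        }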
| Comment by Wang Shilong (Inactive) [ 17/Mar/20 ] |
|
Just noticed that one of the main differences is that there are more after_reply() samples on the slower client, and ptlrpcd() is called twice as often there. It might be related to the client's LNet or CPU partition configuration. This might help explain why the slow client has more contention on imp_lock. |
| Comment by Gerrit Updater [ 18/Mar/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37969 |
| Comment by Shuichi Ihara [ 18/Jun/20 ] |
|
I didn't see big improvements from patch https://review.whamcloud.com/37969 at this point. Rather, a sufficiently large max_cached_mb helps regardless of whether the patch is applied.

1 x client, Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 96GB RAM, 1 x HDR100

# mpirun --allow-run-as-root -np 16 /work/tools/bin/ior -t 1m -b 16g -e -F -vv -o /ai400/file -w -i 3 -d 10

max_cached_mb=47549 (default)

access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)    wr/rd(s)  close(s)   total(s)  iter
------    ---------  ----------  ---------  --------   --------  --------   --------  ----
write     9402       16777216    1024.00    0.002728   27.88     0.664711   27.88     0
write     9107       16777216    1024.00    0.002969   28.79     1.86       28.79     1
write     9463       16777216    1024.00    0.002562   27.70     1.09       27.70     2
Max Write: 9462.51 MiB/sec (9922.16 MB/sec)

master with patch

access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)    wr/rd(s)  close(s)   total(s)  iter
------    ---------  ----------  ---------  --------   --------  --------   --------  ----
write     9289       16777216    1024.00    0.002968   28.22     1.45       28.22     0
write     9392       16777216    1024.00    0.002479   27.91     1.87       27.91     1
write     9227       16777216    1024.00    0.002484   28.41     1.91       28.41     2
Max Write: 9391.89 MiB/sec (9848.11 MB/sec)

max_cached_mb=80000

access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)    wr/rd(s)  close(s)   total(s)  iter
------    ---------  ----------  ---------  --------   --------  --------   --------  ----
write     11336      16777216    1024.00    0.002728   23.12     0.781030   23.12     0
write     11154      16777216    1024.00    0.002595   23.50     0.665123   23.50     1
write     11377      16777216    1024.00    0.002804   23.04     1.12       23.04     2
Max Write: 11377.33 MiB/sec (11930.00 MB/sec)

master with patch

access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)    wr/rd(s)  close(s)   total(s)  iter
------    ---------  ----------  ---------  --------   --------  --------   --------  ----
write     11316      16777216    1024.00    0.003285   23.17     0.866756   23.17     0
write     11282      16777216    1024.00    0.002597   23.24     0.355386   23.24     1
write     11128      16777216    1024.00    0.002604   23.56     1.40       23.56     2
Max Write: 11315.76 MiB/sec (11865.43 MB/sec) |
| Comment by Shuichi Ihara [ 18/Jun/20 ] |
|
The setup is 8 OSTs, 16M RPCs, and max_rpcs_in_flight=16, so the required buffer works out to 8 x 16M x 16 = 2GB. Even the default max_cached_mb=47549 is plenty compared to that requirement. It is understandable that a much larger max_cached_mb relaxes things, but we don't want to spend a huge amount of memory on max_cached_mb. |
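A tiny worked example of that calculation (plain C, just to make the arithmetic explicit; not Lustre code, and the variable names are hypothetical):

        #include <stdio.h>

        /* Back-of-the-envelope check of the buffer needed to keep
         * max_rpcs_in_flight RPCs of 16M in flight on every OST. */
        int main(void)
        {
                unsigned int num_osts = 8;
                unsigned int rpc_size_mb = 16;      /* 16M RPCs */
                unsigned int rpcs_in_flight = 16;   /* osc.*.max_rpcs_in_flight */
                unsigned int required_mb = num_osts * rpc_size_mb * rpcs_in_flight;

                /* prints 2048 MB, far below the default max_cached_mb=47549 */
                printf("required buffer: %u MB\n", required_mb);
                return 0;
        }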
| Comment by Gerrit Updater [ 28/Jun/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37969/ |
| Comment by Peter Jones [ 28/Jun/20 ] |
|
Landed for 2.14 |