
spin_lock in after_reply() eats up most of the CPU

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.14.0
    • Fix Version/s: Lustre 2.14.0
    • Labels: None
    • Branch: master
    • Severity: 3

    Description

      There are two clients, but one client's writes were 25% slower than the other's. According to a flamegraph of CPU time on each client, the slow client spent a significant amount of CPU time in spin_lock under after_reply(), while the fast client showed no such CPU time in after_reply().
      The workload is a simple 1 MB transfer-size, file-per-process (FPP) IOR run (mpirun -np 16 ior -t 1m -b 16g -e -F -vv -o /fast/file -w).
      Here are the two clients' node configurations and performance results.

      fast client (2 x Platinum 8160 CPU, 192GB memory, 1 x IB-EDR)
      Max Write: 11219.97 MiB/sec (11765.00 MB/sec)
      slow client (1 x Gold 5218 CPU, 96GB memory, 1 x IB-HDR100)
      Max Write: 9278.14 MiB/sec (9728.84 MB/sec)
      

      Attachments

        Activity

          [LU-13365] spin_lock in after_reply() eats up most of the CPU
          pjones Peter Jones added a comment -

          Landed for 2.14


          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37969/
          Subject: LU-13365 ldlm: check slv and limit before updating
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 3116b9e19dc09a4a8b73c2c4733df5fe4596e041
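
          The patch subject suggests the fix checks the server-supplied SLV (server lock volume) and limit and skips the update when nothing changed, so reply handlers stop serializing on an exclusive lock for every RPC. A minimal userspace sketch of that pattern, assuming a reader/writer lock and hypothetical names (client_pool, pool_slv, pool_limit, pool_update are illustrative, not the actual Lustre code):

          /*
           * Hypothetical sketch of "check slv and limit before updating":
           * read the current values in shared mode and only take the
           * exclusive lock when the reply actually carries new values.
           */
          #include <pthread.h>
          #include <stdint.h>

          struct client_pool {
              pthread_rwlock_t pool_lock;
              uint64_t pool_slv;   /* server lock volume from last reply */
              uint32_t pool_limit; /* server-imposed lock count limit */
          };

          void pool_update(struct client_pool *p, uint64_t new_slv, uint32_t new_limit)
          {
              uint64_t old_slv;
              uint32_t old_limit;

              /* Cheap shared-mode read: many replies carry unchanged values. */
              pthread_rwlock_rdlock(&p->pool_lock);
              old_slv = p->pool_slv;
              old_limit = p->pool_limit;
              pthread_rwlock_unlock(&p->pool_lock);

              if (old_slv == new_slv && old_limit == new_limit)
                  return; /* unchanged: no exclusive lock, no contention */

              pthread_rwlock_wrlock(&p->pool_lock);
              p->pool_slv = new_slv;
              p->pool_limit = new_limit;
              pthread_rwlock_unlock(&p->pool_lock);
          }

          Under a streaming-write workload most replies should carry unchanged values, so the exclusive section would be skipped almost every time, removing most of the spin time seen in the flamegraph.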

          sihara Shuichi Ihara added a comment - edited

          The setup is 8 OSTs, 16 MB RPCs, and max_rpcs_in_flight=16, so the required buffer is 8 x 16 MB x 16 = 2 GB. Even the default max_cached_mb=47549 is more than enough against that requirement. It is understandable that a larger max_cached_mb relaxes things, but we don't want to spend a huge amount of memory on max_cached_mb.
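
          As a quick sanity check of the arithmetic above (an illustrative C snippet, not Lustre code):

          #include <stdio.h>

          int main(void)
          {
              /* 8 OSTs x 16 MB RPCs x max_rpcs_in_flight=16 */
              unsigned int required_mb = 8 * 16 * 16;

              printf("required buffer: %u MB\n", required_mb); /* 2048 MB */
              return 0;
          }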

          sihara Shuichi Ihara added a comment - edited

          I didn't see big improvements from patch https://review.whamcloud.com/37969 at this point. Rather, a sufficiently large max_cached_mb helps regardless of whether the patch is applied.

          1 x client, Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 96GB RAM, 1 x HDR100
          
          # mpirun --allow-run-as-root -np 16 /work/tools/bin/ior -t 1m -b 16g -e -F -vv -o /ai400/file -w -i 3 -d 10
          

          max_cached_mb=47549 (default)
          master without patch

          access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
          ------    ---------  ---------- ---------  --------   --------   --------   -------- ----
          write     9402       16777216   1024.00    0.002728   27.88      0.664711   27.88      0   
          write     9107       16777216   1024.00    0.002969   28.79      1.86       28.79      1   
          write     9463       16777216   1024.00    0.002562   27.70      1.09       27.70      2   
          Max Write: 9462.51 MiB/sec (9922.16 MB/sec)
          

          master with patch

          access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
          ------    ---------  ---------- ---------  --------   --------   --------   -------- ----
          write     9289       16777216   1024.00    0.002968   28.22      1.45       28.22      0   
          write     9392       16777216   1024.00    0.002479   27.91      1.87       27.91      1   
          write     9227       16777216   1024.00    0.002484   28.41      1.91       28.41      2   
          Max Write: 9391.89 MiB/sec (9848.11 MB/sec)
          

          max_cached_mb=80000
          master without patch

          access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
          ------    ---------  ---------- ---------  --------   --------   --------   -------- ----
          write     11336      16777216   1024.00    0.002728   23.12      0.781030   23.12      0   
          write     11154      16777216   1024.00    0.002595   23.50      0.665123   23.50      1   
          write     11377      16777216   1024.00    0.002804   23.04      1.12       23.04      2   
          Max Write: 11377.33 MiB/sec (11930.00 MB/sec)
          

          master with patch

          access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
          ------    ---------  ---------- ---------  --------   --------   --------   -------- ----
          write     11316      16777216   1024.00    0.003285   23.17      0.866756   23.17      0   
          write     11282      16777216   1024.00    0.002597   23.24      0.355386   23.24      1   
          write     11128      16777216   1024.00    0.002604   23.56      1.40       23.56      2   
          Max Write: 11315.76 MiB/sec (11865.43 MB/sec)
          

          Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37969
          Subject: LU-13365 ldlm: check slv and limit before updating
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: a4ac84f7dbe85a82c3a41d11e1d0fa28350992af


          Just noticed one of the main differences: there appear to be more after_reply() samples on the slower client, and ptlrpcd() is called twice as often there. It might be related to the client's LNet or CPU partition configuration?

          This might help explain why the slow client has more contention problems with imp_lock.

          wshilong Wang Shilong (Inactive) added a comment - edited

          Some thoughts on the problem: it looks like imp_lock is hot.

          A good start could be to split imp_lock, e.g. use separate spinlocks to protect the list heads, since imp_lock currently seems to protect many pieces of state.

          Another point: this is still a global list; it might be possible to split the list head into several, or even go to per-CPU list heads (see the sketch below). We could merge these list_heads when replay happens, but that might break some optimizations that we need for recovery.
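
          As a rough userspace illustration of the split/per-CPU list idea (hypothetical names and structure; not a patch against the real import code):

          /*
           * Requests are hashed into NBUCKETS singly-linked lists, each
           * guarded by its own spinlock, so concurrent senders rarely
           * contend on the same lock. A slow path (e.g. replay) must
           * walk every bucket to see all requests.
           */
          #include <pthread.h>
          #include <stddef.h>

          #define NBUCKETS 16

          struct req_node {
              struct req_node *next;
              unsigned long xid; /* stand-in for the request identity */
          };

          struct req_table {
              pthread_spinlock_t locks[NBUCKETS];
              struct req_node *heads[NBUCKETS];
          };

          void req_table_init(struct req_table *t)
          {
              for (int i = 0; i < NBUCKETS; i++) {
                  pthread_spin_init(&t->locks[i], PTHREAD_PROCESS_PRIVATE);
                  t->heads[i] = NULL;
              }
          }

          void req_add(struct req_table *t, struct req_node *req)
          {
              int b = req->xid % NBUCKETS; /* or index by CPU for per-CPU lists */

              pthread_spin_lock(&t->locks[b]);
              req->next = t->heads[b];
              t->heads[b] = req;
              pthread_spin_unlock(&t->locks[b]);
          }

          The trade-off mentioned above is visible here: the add path scales with the bucket count, but anything needing a consistent view of all requests (replay, recovery) must lock and walk every bucket.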


          People

            Assignee: wshilong Wang Shilong (Inactive)
            Reporter: sihara Shuichi Ihara
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: