[LU-13365] spin_lock in after_reply() eat up most of cpu Created: 17/Mar/20  Updated: 28/Jun/20  Resolved: 28/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Shuichi Ihara Assignee: Wang Shilong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

master


Attachments: File ior-fast-client.svg     File ior-slow-client.svg    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There are two clients, and one of them showed about 25% slower write performance than the other. According to a flamegraph of CPU time on each client, the slow client spent a significant amount of CPU time in spin_lock inside after_reply(), while the fast client showed no such CPU time in after_reply().
The workload is a simple 1MB-transfer, file-per-process (FPP) IOR run. (mpirun -np 16 ior -t 1m -b 16g -e -F -vv -o /fast/file -w)
Here are the two clients' node information and performance results.

fast client (2 x Platinum 8160 CPU, 192GB memory, 1 x IB-EDR)
Max Write: 11219.97 MiB/sec (11765.00 MB/sec)
slow client (1 x Gold 5218 CPU, 96GB memory, 1 x IB-HDR100)
Max Write: 9278.14 MiB/sec (9728.84 MB/sec)


 Comments   
Comment by Wang Shilong (Inactive) [ 17/Mar/20 ]

Some thoughts on the problem: it looks like imp_lock is hot.

A good start could be to split imp_lock, e.g. use separate spinlocks to protect the list heads, since currently imp_lock seems to protect a lot of different state.

Another point is that this is still a global list; it might be possible to split the list head into several, or even into per-CPU list heads, and merge them back when replay happens. However, this might break some optimizations that we need for recovery. A rough sketch of the per-CPU idea follows.
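A minimal sketch of that per-CPU idea, assuming hypothetical names (imp_req_shard, imp_shard_add(), imp_shard_merge() are illustrative, not the real struct obd_import layout): each CPU gets its own list head protected by its own spinlock, and the shards are spliced back into one list when replay needs the global view.

/* Illustrative sketch only: shard the global sent-request list into
 * per-CPU lists, each with its own lock, so after_reply() running on
 * different CPUs no longer serializes on a single imp_lock.
 * All names here are hypothetical, not the actual obd_import layout. */
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct imp_req_shard {
	spinlock_t		irs_lock;	/* protects irs_list only */
	struct list_head	irs_list;	/* per-CPU slice of sent requests */
};

static DEFINE_PER_CPU(struct imp_req_shard, imp_req_shards);

static void imp_shard_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct imp_req_shard *shard = per_cpu_ptr(&imp_req_shards, cpu);

		spin_lock_init(&shard->irs_lock);
		INIT_LIST_HEAD(&shard->irs_list);
	}
}

static void imp_shard_add(struct list_head *req_item)
{
	struct imp_req_shard *shard = get_cpu_ptr(&imp_req_shards);

	spin_lock(&shard->irs_lock);
	list_add_tail(req_item, &shard->irs_list);
	spin_unlock(&shard->irs_lock);
	put_cpu_ptr(&imp_req_shards);
}

/* For replay/recovery, splice every shard back into one list under a
 * single lock so the existing ordered-walk logic keeps working. */
static void imp_shard_merge(struct list_head *all, spinlock_t *all_lock)
{
	int cpu;

	spin_lock(all_lock);
	for_each_possible_cpu(cpu) {
		struct imp_req_shard *shard = per_cpu_ptr(&imp_req_shards, cpu);

		spin_lock(&shard->irs_lock);
		list_splice_tail_init(&shard->irs_list, all);
		spin_unlock(&shard->irs_lock);
	}
	spin_unlock(all_lock);
}

One caveat, as noted above: recovery currently relies on a single ordered list, so the merge step would need to preserve whatever ordering the replay code depends on.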

Comment by Wang Shilong (Inactive) [ 17/Mar/20 ]

Just noticed one of the main differences: there appear to be more after_reply() samples on the slower client, and ptlrpcd() is called about twice as much on the slower client. It might be related to the client LNet or CPU partition configuration.

This might help explain why the slow client has more contention on imp_lock.

Comment by Gerrit Updater [ 18/Mar/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37969
Subject: LU-13365 ldlm: check slv and limit before updating
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a4ac84f7dbe85a82c3a41d11e1d0fa28350992af
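The subject describes a check-before-update approach. As a general illustration of that pattern (not the actual ldlm code; the struct and field names below are hypothetical), the reply path can compare the incoming values against the cached ones and only take the spinlock when something actually changed:

/* Illustrative pattern only, not the actual Lustre ldlm code: compare
 * the SLV and limit carried in a reply against the cached values before
 * taking the lock, so replies that change nothing skip the spin_lock. */
#include <linux/compiler.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct pool_state {
	spinlock_t	ps_lock;	/* protects ps_slv and ps_limit */
	u64		ps_slv;		/* cached server lock volume */
	u32		ps_limit;	/* cached lock count limit */
};

static void pool_maybe_update(struct pool_state *ps, u64 new_slv, u32 new_limit)
{
	/* Lockless fast path: if nothing changed, do not touch the lock. */
	if (READ_ONCE(ps->ps_slv) == new_slv &&
	    READ_ONCE(ps->ps_limit) == new_limit)
		return;

	spin_lock(&ps->ps_lock);
	ps->ps_slv = new_slv;
	ps->ps_limit = new_limit;
	spin_unlock(&ps->ps_lock);
}

Since unchanged values are the common case on a busy client, most replies never touch the lock; a stale lockless read only costs an occasional unnecessary trip through the slow path.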

Comment by Shuichi Ihara [ 18/Jun/20 ]

I didn't see big improvements from patch https://review.whamcloud.com/37969 at this point. Instead, a large enough max_cached_mb helps regardless of whether the patch is applied.

1 x client, Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 96GB RAM, 1 x HDR100

# mpirun --allow-run-as-root -np 16 /work/tools/bin/ior -t 1m -b 16g -e -F -vv -o /ai400/file -w -i 3 -d 10

max_cached_mb=47549 (default)
master without patch

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
------    ---------  ---------- ---------  --------   --------   --------   -------- ----
write     9402       16777216   1024.00    0.002728   27.88      0.664711   27.88      0   
write     9107       16777216   1024.00    0.002969   28.79      1.86       28.79      1   
write     9463       16777216   1024.00    0.002562   27.70      1.09       27.70      2   
Max Write: 9462.51 MiB/sec (9922.16 MB/sec)

master with patch

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
------    ---------  ---------- ---------  --------   --------   --------   -------- ----
write     9289       16777216   1024.00    0.002968   28.22      1.45       28.22      0   
write     9392       16777216   1024.00    0.002479   27.91      1.87       27.91      1   
write     9227       16777216   1024.00    0.002484   28.41      1.91       28.41      2   
Max Write: 9391.89 MiB/sec (9848.11 MB/sec)

max_cached_mb=80000
master without patch

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
------    ---------  ---------- ---------  --------   --------   --------   -------- ----
write     11336      16777216   1024.00    0.002728   23.12      0.781030   23.12      0   
write     11154      16777216   1024.00    0.002595   23.50      0.665123   23.50      1   
write     11377      16777216   1024.00    0.002804   23.04      1.12       23.04      2   
Max Write: 11377.33 MiB/sec (11930.00 MB/sec)

master with patch

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
------    ---------  ---------- ---------  --------   --------   --------   -------- ----
write     11316      16777216   1024.00    0.003285   23.17      0.866756   23.17      0   
write     11282      16777216   1024.00    0.002597   23.24      0.355386   23.24      1   
write     11128      16777216   1024.00    0.002604   23.56      1.40       23.56      2   
Max Write: 11315.76 MiB/sec (11865.43 MB/sec)
Comment by Shuichi Ihara [ 18/Jun/20 ]

The setup is 8 OSTs, 16M RPC size and max_rpcs_in_flight=16, so the formula for the required buffer is 8 x 16M x 16 = 2GB. Even the default max_cached_mb=47549 is more than enough for that requirement. It is understandable that a larger max_cached_mb relaxes things, but we don't want to spend a huge amount of memory on max_cached_mb.
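Spelling the arithmetic out with the numbers above (a simple restatement, no new measurements), the default cache already has roughly 23x headroom over the nominal requirement:

\text{required buffer} = N_{\text{OST}} \times S_{\text{RPC}} \times \text{max\_rpcs\_in\_flight} = 8 \times 16\,\text{MiB} \times 16 = 2048\,\text{MiB} \approx 2\,\text{GiB}

\frac{\text{max\_cached\_mb (default)}}{\text{required buffer}} = \frac{47549\,\text{MiB}}{2048\,\text{MiB}} \approx 23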

Comment by Gerrit Updater [ 28/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37969/
Subject: LU-13365 ldlm: check slv and limit before updating
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3116b9e19dc09a4a8b73c2c4733df5fe4596e041

Comment by Peter Jones [ 28/Jun/20 ]

Landed for 2.14
