[LU-15725] Client side Mdtest File Read Regression introduced with fix for LU-11623 Created: 06/Apr/22  Updated: 10/May/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Petros Koutoupis Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11623 Allow caching of open-created dentries Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing 2.15 and comparing it to our 2.12 branch, I observed a noticeable client-side file read regression. After a git bisect, I narrowed it down to the patch https://review.whamcloud.com/38763 "LU-11623 llite: hash just created files if lock allows".
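
For anyone reproducing the bisect, a minimal sketch of the workflow is below; the tag, build, and test steps are placeholders rather than the exact commands used for this report:

# Illustrative bisect session (placeholder refs; not the exact commands from this report)
git bisect start
git bisect bad HEAD          # tree that shows the read regression (2.15-based)
git bisect good v2_12_0      # known-good baseline
# at each step: rebuild/install the client, rerun the mdtest read phase,
# then mark the commit:
#   git bisect good    # or: git bisect bad
git bisect reset             # when finished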

After reverting the patch, read performance was immediately restored, but at the expense of the large file stat improvement.

File stats
Original (2.12 results): 399893
Before Revert (2.15): 683490 -> +73%
After Revert (2.15): 401637

File Reads
Original (2.12 results): 297644
Before Revert (2.15): 250536 -> -15%
After Revert (2.15): 295096

mdtest script:

#!/bin/bash

# Run parameters
NODES=21                    # client nodes
PPN=16                      # MPI tasks per node
PROCS=$(( NODES * PPN ))    # 336 total tasks
MDT_COUNT=1                 # MDTs backing the test directory
PAUSED=120                  # pre-iteration delay in seconds (mdtest -p)

# Unique working directory per task (-u), empty files
srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -E -T -r -n $(( MDT_COUNT * 1048576 / PROCS )) -u -d /mnt/kjlmo13/pkoutoupis/mdt0/test.$(date +"%Y%m%d.%H%M%S") 2>&1 | tee f_mdt0_0k_ost_uniq.out

# Same run, but writing and reading 32 KiB per file (-w/-e)
srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -w 32768 -E -e 32768 -T -r -n $(( MDT_COUNT * 1048576 / PROCS )) -u -d /mnt/kjlmo13/pkoutoupis/mdt0/test.$(date +"%Y%m%d.%H%M%S") 2>&1 | tee f_mdt0_32k_ost_uniq.out
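
For reference, the per-task and per-node workload implied by these parameters works out roughly as below (shell arithmetic only; values are derived from the script above):

echo $(( 1048576 / (21 * 16) ))                         # 3120 files per task
echo $(( (1048576 / (21 * 16)) * 16 ))                  # ~49,920 files per node
echo $(( (1048576 / (21 * 16)) * 16 * 32768 / 2**20 ))  # ~1560 MiB written per node in the 32 KiB run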

 



 Comments   
Comment by Andreas Dilger [ 06/Apr/22 ]

Petros, it would be useful if you edited your original description to indicate "git describe" versions for the "Original" and "Before Revert" tests. Is "Original" the commit before the LU-11623 patch, and "After Revert" on master with that patch reverted? Or is "Original" the 2.12.x test results?

Comment by Andreas Dilger [ 06/Apr/22 ]

The first thing to check is whether there is something that is not being done correctly in this case. Unfortunately, the original patch did not show the "File read" results, or the regression might have been more visible at the time. In some cases, performance issues like this are caused by incorrectly conflicting/cancelling the lock on the client, and it might be possible to "have your lock and read it too" by avoiding the extra cancellation(s) or by handling the cancellation (if needed) efficiently with ELC (Early Lock Cancellation).

In situations where there is no single "good answer" for whether the extra lock should be taken or not, it may be that a weighted history of what is done to the file is needed (e.g. similar to patch https://review.whamcloud.com/46696 "LU-15546 mdt: keep history of mdt_reint_open() lock") so that performance can be dynamically optimized for the current workload (stat() vs. read() intensive, or "don't grant the extra lock under heavy contention"). IMHO, this is preferable to any kind of static tunable that just enables/disables the extra locking, which would be sub-optimal at one time or another.

Comment by Lai Siyao [ 07/Apr/22 ]

Petros, what are the results of "Directory stat" before and after the revert?

Comment by Oleg Drokin [ 07/Apr/22 ]

There was a follow-on patch https://review.whamcloud.com/#/c/33585/ that was not landed for a variety of reasons; I wonder if it could be tried too.

Comment by Petros Koutoupis [ 07/Apr/22 ]

Andreas,

I modified the description. I hope that clarifies things.

 

Lai,

Directory stats are unchanged in all cases.

Comment by Lai Siyao [ 25/Apr/22 ]

The last patch of LU-11623, https://review.whamcloud.com/#/c/33585/, has been updated; will you cherry-pick it and try again?

Comment by Petros Koutoupis [ 25/Apr/22 ]

@Lai Siyao,

I cherry-picked the patch on top of 2.15.0 RC3 and reran the same tests. Unfortunately, the file read performance looks worse.

2.15.0 RC3 without the patch:

[ ... ]
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File stat                 :     710652.674     680830.320     695315.322      10282.708
   File read                 :     267242.290     211957.110     243331.807      20164.563
[ ... ]

2.15.0 RC3 with the patch:

[ ... ]
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File stat                 :     704615.924     665430.996     690638.517      13355.073
   File read                 :     255746.075     194060.211     226496.114      21414.336
[ ... ]

Comment by Lai Siyao [ 29/Apr/22 ]

Petros, https://review.whamcloud.com/#/c/33585/ has been updated; local testing looks promising.

Comment by Petros Koutoupis [ 09/May/22 ]

With the updated patch cherry-picked on top of 2.15:

   File stat                 :     703505.087     689172.890     696705.795       4933.824
   File read                 :     270560.870     217336.834     248256.171      17416.326 

There does not seem to be much difference compared to 2.15.0 RC3 without the patch. Please refer to my mdtest script above for the testing parameters. Thank you for working on this.

Comment by Lai Siyao [ 10/May/22 ]

I did more testing. It looks like when the total number of files is too large, the client can't cache all the locks, so the cached locks won't help. I'll see how to improve this.
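
One quick way to check whether the client LDLM lock LRU is the limiting factor might be the following (the wildcard and the example value are illustrative):

# On a client, after the create/stat phases:
lctl get_param ldlm.namespaces.*mdc*.lock_count   # DLM locks currently cached in the MDC namespace
lctl get_param ldlm.namespaces.*mdc*.lru_size     # 0 means the LRU is sized dynamically
lctl get_param ldlm.namespaces.*mdc*.lru_max_age  # how long unused locks are kept

# For an experiment, pin the LRU to a size larger than the per-node file count (~50k here):
lctl set_param ldlm.namespaces.*mdc*.lru_size=65536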

Comment by Andreas Dilger [ 10/May/22 ]

Petros, Lai,
Has there been any kind of analysis done as to where the read performance is being lost with/without the open lock? Is there an increase in DLM locks/cancellations (MDT or OST), extra RPCs being sent, overhead in the VFS, a delay in cancelling the DLM lock that increases latency on the mdtest read operation, or something else?

Collecting a flame graph during the test on the client and server with/without the open cache would definitely help isolate where the time is being spent. Initially I thought it might relate to the delay in cancelling the open lock when a second client node is reading the file, and that hurts read performance (either because of the extra lock cancel, or possibly delayed flushing due to write cache). However, there is a 120s sleep between phases, and I didn't see the "mdtest -N stride" option being used to force file access from a different node, so reads should be local to the node that wrote the file.
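
A sketch of that kind of collection, assuming perf and the FlameGraph scripts (https://github.com/brendangregg/FlameGraph) are available on the nodes; durations, wildcards and output names are placeholders:

# On a client (and similarly on the MDS/OSS) while the read phase is running:
perf record -F 99 -a -g -- sleep 60
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > mdtest_read.svg

# VFS/RPC and DLM lock activity around the read phase:
lctl set_param llite.*.stats=clear mdc.*.stats=clear
#   ... run the read phase ...
lctl get_param llite.*.stats                       # per-mount VFS operation counts
lctl get_param mdc.*.stats                         # MDC RPC counts
lctl get_param ldlm.namespaces.*.lock_count        # cached DLM locks per namespace
lctl get_param ldlm.namespaces.*.pool.cancel_rate  # lock cancel rate, if pool stats are exposed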

There are only about 50k files and 1.6GB of data being created on each client, so this shouldn't exceed the client cache size, and reads should be "free" in this case.
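
A quick sanity check of that assumption could be:

lctl get_param llite.*.max_cached_mb   # client data cache limit for this mount
free -m                                # compare against the ~1.6GB of file data per node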
