Massive directory metadata operation performance decrease (LU-14146)

[LU-6864] DNE3: Support multiple modify RPCs in flight for MDT-MDT connection Created: 17/Jul/15  Updated: 30/Aug/22  Resolved: 06/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Technical task Priority: Minor
Reporter: Gregoire Pichon Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: dne3, patch, performance, recovery

Attachments: mdtest-32kb-dom.png, mdtest-32kb-ost.png, mdtest-dir.png, mdtest-zero-size.png
Issue Links:
Related
is related to LU-14761 DNE2 Metadata degradation Open
is related to LU-16065 replay-single test_81a: rm remote dir... Open
is related to LU-16126 Lustre 2.15.51 mdtest fails with MPI_... Closed
is related to LU-5319 Support multiple slots per client in ... Resolved
is related to LU-9436 DNE2 - performance improvement with w... Open
is related to LU-11999 DNE performance improvement Resolved
is related to LU-6753 Fix several minor improvements to mul... Resolved
is related to LU-12125 Allow parallel rename of regular files Resolved

 Description   

This feature complements LU-5319, which implements support for multiple modify RPCs in flight on the MDC-MDT connection.

It will improve the performance of cross-MDT metadata modify operations while preserving the correctness of recovery handling (request resend and MDT recovery).
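
As a rough illustration of what "multiple modify RPCs in flight" means, the sketch below bounds the number of concurrent modify requests with a counting semaphore. It is a user-space toy model, not the Lustre implementation (which is kernel C code under lustre/osp). The names `ModRpcSlots`, `send`, and `run_demo` are hypothetical, and the recovery side (request resend, MDT recovery) is not modeled at all.

```python
import threading
import time


class ModRpcSlots:
    """Hypothetical limiter: at most `max_in_flight` modify RPCs at once."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self.in_flight = 0   # RPCs currently holding a slot
        self.peak = 0        # highest concurrency observed

    def send(self, rpc):
        # Block until one of the slots is free, then run the request.
        with self._slots:
            with self._lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            try:
                return rpc()
            finally:
                with self._lock:
                    self.in_flight -= 1


def run_demo(max_in_flight=8, n_rpcs=32):
    """Issue n_rpcs fake modify RPCs from separate threads."""
    slots = ModRpcSlots(max_in_flight)
    results = []
    results_lock = threading.Lock()

    def fake_rpc(i):
        time.sleep(0.01)     # stand-in for the remote MDT doing the work
        return i

    def worker(i):
        reply = slots.send(lambda: fake_rpc(i))
        with results_lock:
            results.append(reply)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_rpcs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return slots.peak, sorted(results)


if __name__ == "__main__":
    peak, replies = run_demo()
    print("peak in flight:", peak, "replies:", len(replies))
```

With a limit of 1 the model degenerates to one modify request at a time; raising the limit lets independent requests overlap, which is the performance effect this ticket targets for the MDT-MDT connection.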



 Comments   
Comment by Gregoire Pichon [ 17/Jul/15 ]

I have updated the patch http://review.whamcloud.com/#/c/14375/ to implement this feature.

Comment by Peter Jones [ 17/Jul/15 ]

Thanks Gregoire. This work is queued up for 2.9

Comment by Gerrit Updater [ 15/May/18 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/32412
Subject: LU-6864 mdc: move RPC semaphore code to lustre/osp
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4ba17e20f890b52141bc742500199da3c682c8bd

Comment by Andreas Dilger [ 15/May/18 ]

Note that the above patch is not a replacement for Grégoire's patch. It is a code cleanup to move the MDC semaphore to the lustre/osp directory, and when http://review.whamcloud.com/14375 is updated that code should probably be removed. In the meantime we avoid allocating the mdc_rpc_lock for each client->MDT connection.

Comment by Gerrit Updater [ 01/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32412/
Subject: LU-6864 mdc: move RPC semaphore code to lustre/osp
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 040ca57f2ebd46b3a50bbec279839286ff20bffc

Comment by James A Simmons [ 06/Jun/21 ]

The 14375 patch has been revived. Currently I'm collecting data to see if it helps with DNE2 scaling issues.

Comment by James A Simmons [ 15/Jun/21 ]

https://www.opensfs.org/wp-content/uploads/Evaluation-of-DoM-SNE-scaling_Simmons_revised051821.pdf

Comparing this data with my LUG talk, which used the same hardware, we see improvements in file removals and stats. I do see a regression in reads of small files. At this point I don't know whether that is due to recent changes or to the patch itself. This is using the new default setting of max_mod_rpcs_in_flight = 8. I need to bump it up to see if I can get more out of it.

Comment by Andreas Dilger [ 16/Jun/21 ]

James, thanks for testing this out. The graphs would be much easier to compare if they had the before/after results for the same tests on a single graph.

As for max_mod_rpcs_in_flight for the MDS, I think it is totally reasonable to increase this higher than 8, since we don't want clients to bottleneck on the MDSes if they are busy handling lots of client requests. The MDS has max_rpcs_in_flight=512, so it wouldn't be unreasonable to have the MDS tune this to match mds.MDS.mdt_out.threads_max.

Comment by Gerrit Updater [ 08/Mar/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46735
Subject: LU-6864 osp: manage number of modify RPCs in flight
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dce4b2800030f981b91249227daee40966ae4afd

Comment by Gerrit Updater [ 06/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/14375/
Subject: LU-6864 osp: manage number of modify RPCs in flight
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 23028efcae01bf1274a68fd2dd379fbb33300e82

Comment by Gerrit Updater [ 16/Jun/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47656
Subject: LU-6864 tests: properly skip sanity/245b in interop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 91fa883754380b935b92d6e01fe5b1063483cc0c

Comment by Gerrit Updater [ 20/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47656/
Subject: LU-6864 tests: properly skip sanity/245b in interop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c4ebdc96061ae9c24ac471b2866f2087bc3e98d4
