
[LU-16126] Lustre 2.15.51 mdtest fails with MPI_Abort errors while adjusting max_rpcs_in_progress and using a large number of clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major

    Description

      I have been tuning the max_rpcs_in_progress parameter on the MDTs in my system. Environment:
      64 MDTs, one per MDS
      64 OSTs, one per OSS
      512 clients
      All using 2.15.51

      On the MDTs, the osp.lustre-MDT0001-osp-MDT0000.max_rpcs_in_progress parameter defaults to 0. If I change this value to anything higher (even 1) and run mdtest with 16 processes per node (PPN) across all 512 clients, I start to see MPI errors:

      V-1: Rank   0 Line  2565    Operation               Duration              Rate
      V-1: Rank   0 Line  2566    ---------               --------              ----
      V-1: Rank   0 Line  1957 main: * iteration 1 *
      V-2: Rank   0 Line  1966 main (for j loop): making o.testdir, '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0'
      V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0, currDepth = 0...
      V-2: Rank   0 Line  1889 Making directory '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/'
      V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/, currDepth = 1...
      V-1: Rank   0 Line  2033 V-1: main:   Tree creation     :          2.250 sec,          0.444 ops/sec
      delaying 30 seconds . . .
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7328
      srun: error: mo0627: tasks 7328-7343: Exited with exit code 255
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7329
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7330
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7331
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7332
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7333
      

       
      The mdtest log does not point to a real root cause, and the client and MDS server logs don't provide any clues either. The interesting part is that if I use fewer clients, for example 16, I can tune that parameter as high as 4096 (and beyond) with no issues.
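
      For reference, the adjustment being tested is just an lctl set_param on the MDS; a minimal sketch, with the value chosen purely for illustration:

      # run on the MDS hosting MDT0000; the target name is the one quoted above
      lctl set_param osp.lustre-MDT0001-osp-MDT0000.max_rpcs_in_progress=64
      # verify the change
      lctl get_param osp.*.max_rpcs_in_progress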

      I seem to have narrowed this down to commit 23028efcae ("LU-6864 osp: manage number of modify RPCs in flight"). If I revert this commit, I cannot reproduce the error.
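
      For completeness, the revert experiment amounts to something like the following against a lustre-release checkout (rebuilding and reinstalling the server packages before rerunning mdtest is implied):

      git revert 23028efcae   # "LU-6864 osp: manage number of modify RPCs in flight"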


          Activity


            simmonsja James A Simmons added a comment:

            I took another look at this and in my testing I see:

            /usr/src/lustre-2.15.52/lustre/tests# ../utils/lctl get_param *.*.max_rpcs_*
            mdc.lustre-MDT0000-mdc-ffff997ccd58f000.max_rpcs_in_flight=8
            osc.lustre-OST0000-osc-MDT0000.max_rpcs_in_flight=8
            osc.lustre-OST0000-osc-MDT0000.max_rpcs_in_progress=4096
            osc.lustre-OST0000-osc-ffff997ccd58f000.max_rpcs_in_flight=8
            osc.lustre-OST0001-osc-MDT0000.max_rpcs_in_flight=8
            osc.lustre-OST0001-osc-MDT0000.max_rpcs_in_progress=4096
            osc.lustre-OST0001-osc-ffff997ccd58f000.max_rpcs_in_flight=8
            osp.lustre-OST0000-osc-MDT0000.max_rpcs_in_flight=8
            osp.lustre-OST0000-osc-MDT0000.max_rpcs_in_progress=4096
            osp.lustre-OST0001-osc-MDT0000.max_rpcs_in_flight=8
            osp.lustre-OST0001-osc-MDT0000.max_rpcs_in_progress=4096

            Nathan, what version of Lustre are you running where you see a setting of zero?

             


            "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49182
            Subject: LU-16126 ldlm: set default rpcs_in_flight to one
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ee2925431a7c2fb0559fd02a4e10a9b4a85f57aa
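
            For anyone who wants to try the change locally, it can be fetched from Gerrit in the usual way (the ref below assumes the standard refs/changes layout and the patch set 1 listed above):

            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/82/49182/1
            git checkout FETCH_HEAD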


            simmonsja James A Simmons added a comment:

            Yeah, the initial value is wrong.


            adilger Andreas Dilger added a comment:

            So maybe the issue here is that the max_rpcs_in_progress value shouldn't show "0" initially?


            simmonsja James A Simmons added a comment:

            That is the same behavior for mdc.*.max_rpcs_in_progress. It's due to the test:

              if (max > OBD_MAX_RIF_MAX || max < 1)
                      return -ERANGE;

            in obd_set_max_rpcs_in_flight(). Setting it back to 1 restores the default; zero would mean never sending anything, which we don't want.
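
            In practice that means 1 is the smallest value the tunable will accept, so putting it back to the intended default looks like this (target name borrowed from Nathan's transcript below, purely for illustration):

            lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=1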


            koutoupis Petros Koutoupis added a comment:

            Nathan, I did notice that but forgot about that quirk until now. Thank you for refreshing my memory.


            nrutman Nathan Rutman added a comment:

            Another problem: you can't set it back to 0:

            [root@cslmo4902 ~]# lctl get_param *.*.max_rpcs_in_progress
            osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=0
            osp.testfs-MDT0001-osp-MDT0000.max_rpcs_in_progress=0
            [root@cslmo4902 ~]# lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=100
            osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=100
            [root@cslmo4902 ~]# lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=0
            error: set_param: setting /sys/fs/lustre/osp/testfs-MDT0000-osp-MDT0001/max_rpcs_in_progress=0: Numerical result out of range
            

            koutoupis Petros Koutoupis added a comment:

            I went back to collect more data and, unfortunately, I am unable to reproduce the original issue. Closing this ticket unless I can observe the problem again.


            adilger Andreas Dilger added a comment:

            Note that we don't regularly test with 64 MDTs, so there may be some kind of thread starvation happening if all of the MDTs are sending RPCs to each other and the service thread count is not high enough.

            Even before debug logs, getting the console logs from the client and MDS nodes would be useful (e.g. if clients are being evicted, stacks are dumped, etc.).
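
            If thread starvation is suspected, a quick first check would be to compare the MDT service thread limits with the running thread count on each MDS (parameter names assume the standard mds.MDS.mdt.* tunables):

            lctl get_param mds.MDS.mdt.threads_min mds.MDS.mdt.threads_max mds.MDS.mdt.threads_started
            # if threads_started sits at threads_max, the ceiling can be raised, e.g.:
            lctl set_param mds.MDS.mdt.threads_max=1024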


            simmonsja James A Simmons added a comment:

            Do you have kernel logs? I do wonder if this is related to LU-16065.


            People

              Assignee: adilger Andreas Dilger
              Reporter: koutoupis Petros Koutoupis
              Votes: 0
              Watchers: 7
