Details
- Bug
- Resolution: Fixed
- Major
Description
I have been trying to tune the max_rpcs_in_progress parameter on the MDTs in my system. Environment:
64 MDTs, one per MDS
64 OSTs, one per OSS
512 clients
All running Lustre 2.15.51
On the MDTs, the osp.lustre-MDT0001-osp-MDT0000.max_rpcs_in_progress parameter defaults to 0. If I set it to any higher value (even 1) and run mdtest with 16 processes per node (PPN) on all 512 clients, I start to see MPI errors:
V-1: Rank 0 Line 2565 Operation Duration Rate
V-1: Rank 0 Line 2566 --------- -------- ----
V-1: Rank 0 Line 1957 main: * iteration 1 *
V-2: Rank 0 Line 1966 main (for j loop): making o.testdir, '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0'
V-1: Rank 0 Line 1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0, currDepth = 0...
V-2: Rank 0 Line 1889 Making directory '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/'
V-1: Rank 0 Line 1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/, currDepth = 1...
V-1: Rank 0 Line 2033 V-1: main: Tree creation : 2.250 sec, 0.444 ops/sec
delaying 30 seconds . . .
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7328
srun: error: mo0627: tasks 7328-7343: Exited with exit code 255
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7329
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7330
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7331
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7332
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7333
The mdtest log does not reveal a root cause, and the client and MDS logs do not provide any clues either. Interestingly, with fewer clients (for example, 16), I can raise the parameter to 4096 and higher without any issues.
I may have narrowed this down to commit 23028efcae ("LU-6864 osp: manage number of modify RPCs in flight"). If I revert this commit, I cannot reproduce the error.
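For anyone trying to reproduce this, here is a rough sketch of how the per-MDT parameter names and the matching lctl set_param commands could be generated. It only builds command strings and does not run them. It assumes the naming pattern of the one parameter quoted above (osp.lustre-MDT0001-osp-MDT0000) carries over to every other MDT pair with 4-digit hexadecimal indices. The helper names osp_param and set_param_commands are hypothetical, and whether the value was changed on every MDT pair or only some is not stated in this report.

```python
# Sketch only: builds `lctl set_param` command strings, does not execute them.
# Assumption: every MDT-to-MDT OSP device follows the pattern seen in the
# report, osp.<fsname>-MDT<remote>-osp-MDT<local>, with 4-digit hex indices.

def osp_param(fsname, remote_mdt, local_mdt, tunable="max_rpcs_in_progress"):
    """Return the parameter name for the OSP device on local_mdt that talks to remote_mdt."""
    return f"osp.{fsname}-MDT{remote_mdt:04x}-osp-MDT{local_mdt:04x}.{tunable}"


def set_param_commands(fsname, local_mdt, mdt_count, value):
    """Return one command per remote MDT, as they would be run on local_mdt's MDS."""
    return [
        f"lctl set_param {osp_param(fsname, remote, local_mdt)}={value}"
        for remote in range(mdt_count)
        if remote != local_mdt  # an MDT has no OSP device pointing at itself
    ]


if __name__ == "__main__":
    # Values from the environment described above: fsname "lustre", 64 MDTs.
    commands = set_param_commands("lustre", local_mdt=0, mdt_count=64, value=1)
    print(commands[0])
    print(len(commands), "commands for MDT0000")

    # Scale of the failing run: 512 clients at 16 processes per node.
    print(512 * 16, "MPI ranks in total")
```

The first generated command targets the same parameter named in the description, and the rank count is consistent with the srun line in the log, where one node carries tasks 7328-7343 (16 tasks).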
Issue Links
- is related to: LU-6864 DNE3: Support multiple modify RPCs in flight for MDT-MDT connection (Resolved)