[LU-16126] Lustre 2.15.51 mdtest fails with MPI_Abort errors while adjusting max_rpcs_in_progress and using large number of clients Created: 30/Aug/22  Updated: 23/Jan/23  Resolved: 07/Sep/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Petros Koutoupis Assignee: Andreas Dilger
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6864 DNE3: Support multiple modify RPCs in... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I have been trying to tune the knob of the max_rpcs_in_progress parameter on the MDTs in my system. Environment:
64 MDTs, one per MDS
64 OSTs, one per OSS
512 clients
All using 2.15.51

On the MDTs, the osp.lustre-MDT0001-osp-MDT0000.max_rpcs_in_progress parameter defaults to 0. If I change this value to anything above that (including 1) and run mdtest with 16 PPN per client across all 512 clients, I start to see MPI errors:

V-1: Rank   0 Line  2565    Operation               Duration              Rate
V-1: Rank   0 Line  2566    ---------               --------              ----
V-1: Rank   0 Line  1957 main: * iteration 1 *
V-2: Rank   0 Line  1966 main (for j loop): making o.testdir, '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0'
V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0, currDepth = 0...
V-2: Rank   0 Line  1889 Making directory '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/'
V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/, currDepth = 1...
V-1: Rank   0 Line  2033 V-1: main:   Tree creation     :          2.250 sec,          0.444 ops/sec
delaying 30 seconds . . .
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7328
srun: error: mo0627: tasks 7328-7343: Exited with exit code 255
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7329
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7330
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7331
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7332
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7333

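For reference, the tuning step described above amounts to something like the following (a sketch; the fsname "lustre", the target MDT, and the value 64 are illustrative, not the exact commands used):

```shell
# Raise the inter-MDT modify-RPC limit on MDT0000 for all peer MDTs
lctl set_param 'osp.lustre-MDT*-osp-MDT0000.max_rpcs_in_progress=64'

# Verify the new settings
lctl get_param 'osp.*-osp-MDT0000.max_rpcs_in_progress'
```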
 
The mdtest log does not reveal a real root cause, and the client/MDS server logs don't provide any clues either. The interesting part is that if I use fewer clients, for example 16, I can tune that parameter as high as 4096 (and higher) without any issues.

It seems like I may have narrowed this down to commit 23028efcae LU-6864 osp: manage number of modify RPCs in flight. If I revert this commit, I cannot reproduce the error.



 Comments   
Comment by Olaf Faaland [ 30/Aug/22 ]

Hi Petros,
You state you tested with Lustre 2.15, and you mention that the problem seems to be commit:

  • 23028efcae LU-6864 osp: manage number of modify RPCs in flight

Since this commit landed to master after 2.15.0, that confused me and may confuse others (when someone says "2.15" I assume they mean something on branch b2_15). If you change the description and the summary of this ticket to say "2.15.50" or "master post-2.15.0" that would help - thanks.

Comment by James A Simmons [ 30/Aug/22 ]

Do you have kernel logs? I do wonder if this is related to LU-16065

Comment by Andreas Dilger [ 30/Aug/22 ]

Note that we don't regularly test with 64 MDTs, so there may be some kind of thread starvation happening if all of the MDTs are sending RPCs to each other and the service thread count is not high enough.

Even before debug logs, getting the console logs from the client and MDS nodes would be useful (e.g. if clients are being evicted, stacks are dumped, etc.)

Comment by Petros Koutoupis [ 07/Sep/22 ]

I went back to collect more data and unfortunately, I am unable to reproduce the original issue. Closing this ticket unless I can observe the problem again.

Comment by Nathan Rutman [ 16/Nov/22 ]

Another problem: you can't get it back to 0:

[root@cslmo4902 ~]# lctl get_param *.*.max_rpcs_in_progress
osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=0
osp.testfs-MDT0001-osp-MDT0000.max_rpcs_in_progress=0
[root@cslmo4902 ~]# lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=100
osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=100
[root@cslmo4902 ~]# lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=0
error: set_param: setting /sys/fs/lustre/osp/testfs-MDT0000-osp-MDT0001/max_rpcs_in_progress=0: Numerical result out of range
Comment by Petros Koutoupis [ 16/Nov/22 ]

Nathan, I did notice that but forgot about that quirk until now. Thank you for refreshing my memory.

Comment by James A Simmons [ 16/Nov/22 ]

That is the same behavior as for mdc.*.max_rpcs_in_progress. It's due to the test:

  if (max > OBD_MAX_RIF_MAX || max < 1)
          return -ERANGE;

in obd_set_max_rpcs_in_flight(). 1 is the default setting; zero would mean "don't send anything," which we don't want.

Comment by Andreas Dilger [ 16/Nov/22 ]

So maybe the issue here is that the max_rpcs_in_progress value shouldn't show "0" initially?

Comment by James A Simmons [ 16/Nov/22 ]

Yeah the initial value is wrong.

Comment by Gerrit Updater [ 17/Nov/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49182
Subject: LU-16126 ldlm: set default rpcs_in_flight to one
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ee2925431a7c2fb0559fd02a4e10a9b4a85f57aa

Comment by James A Simmons [ 23/Jan/23 ]

I took another look at this and in my testing I see:

/usr/src/lustre-2.15.52/lustre/tests# ../utils/lctl get_param *.*.max_rpcs_*
mdc.lustre-MDT0000-mdc-ffff997ccd58f000.max_rpcs_in_flight=8
osc.lustre-OST0000-osc-MDT0000.max_rpcs_in_flight=8
osc.lustre-OST0000-osc-MDT0000.max_rpcs_in_progress=4096
osc.lustre-OST0000-osc-ffff997ccd58f000.max_rpcs_in_flight=8
osc.lustre-OST0001-osc-MDT0000.max_rpcs_in_flight=8
osc.lustre-OST0001-osc-MDT0000.max_rpcs_in_progress=4096
osc.lustre-OST0001-osc-ffff997ccd58f000.max_rpcs_in_flight=8
osp.lustre-OST0000-osc-MDT0000.max_rpcs_in_flight=8
osp.lustre-OST0000-osc-MDT0000.max_rpcs_in_progress=4096
osp.lustre-OST0001-osc-MDT0000.max_rpcs_in_flight=8
osp.lustre-OST0001-osc-MDT0000.max_rpcs_in_progress=4096

Nathan, what version of Lustre are you running where you see a setting of zero?

 

Generated at Sat Feb 10 03:24:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.