[LU-16454] can't set max_mod_rpcs_in_flight > 8 Created: 07/Jan/23  Updated: 07/Jul/23  Resolved: 14/Feb/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Vitaliy Kuznetsov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Gantt End to End
has to be finished together with LU-16558 Can't set max_mod_rpcs_in_flight > 8 ... Resolved
Related
is related to LU-14144 get and set Lustre module parameters ... Open
is related to LU-13503 allow setting larger max_mod_rpcs_in_... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I am trying to increase mdc.*.max_mod_rpcs_in_flight to a value greater than 8, but I get an error.

 

# lctl set_param mdc.fs1-MDT0000-mdc-ffff902107a5e000.max_rpcs_in_flight=128
mdc.fs1-MDT0000-mdc-ffff902107a5e000.max_rpcs_in_flight=128

# lctl set_param mdc.fs1-MDT0000-mdc-ffff902107a5e000.max_mod_rpcs_in_flight=127
error: set_param: setting /sys/fs/lustre/mdc/fs1-MDT0000-mdc-ffff902107a5e000/max_mod_rpcs_in_flight=127: Numerical result out of range

# lctl get_param version
version=2.15.1



 Comments   
Comment by Mahmoud Hanafi [ 08/Jan/23 ]

Here is the error in debug logs

 

00000020:00020000:0.0F:1673137999.151510:0:2742:0:(genops.c:2175:obd_set_max_mod_rpcs_in_flight()) fs1-MDT0000-mdc-ffff902107a5e000: can't set max_mod_rpcs_in_flight=9 higher than ocd_maxmodrpcs=8 returned by the server at connection

Comment by Mahmoud Hanafi [ 08/Jan/23 ]

I figured out the issue. It was a module setting on the server. The documentation should be updated to state that the server-side module parameter must be updated first.
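
The constraint behind this error can be sketched as a small pre-check. This is a hedged shell sketch, not Lustre code: the helper name check_mod_rpcs_request is hypothetical. Only the comparison against the server-provided limit (ocd_maxmodrpcs) is confirmed by the debug log above. The second condition, that the value must stay strictly below max_rpcs_in_flight, is an assumption based on the commands in this ticket, which raise max_rpcs_in_flight first.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): predicts whether a requested
# client max_mod_rpcs_in_flight value would be accepted.
#   $1 = requested max_mod_rpcs_in_flight
#   $2 = limit returned by the server at connection (ocd_maxmodrpcs)
#   $3 = client's current max_rpcs_in_flight
check_mod_rpcs_request() {
    requested=$1
    server_limit=$2
    max_rpcs=$3

    # Confirmed by the debug log: the value may not exceed the server limit.
    if [ "$requested" -gt "$server_limit" ]; then
        echo "rejected: $requested > server limit $server_limit"
        return 1
    fi

    # Assumption: the value must stay strictly below max_rpcs_in_flight.
    if [ "$requested" -ge "$max_rpcs" ]; then
        echo "rejected: $requested >= max_rpcs_in_flight $max_rpcs"
        return 1
    fi

    echo "ok: $requested"
    return 0
}
```

With the values from the description, check_mod_rpcs_request 127 8 128 is rejected by the first check, which matches the "Numerical result out of range" error until the server-side limit is raised.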

 

Comment by Peter Jones [ 09/Jan/23 ]

Vitaliy

We discussed this during the triage call today. Andreas has some suggestions for how to address this issue, which he will share. Could you please then follow up and implement them?

Thanks

Peter

Comment by Andreas Dilger [ 10/Jan/23 ]

Vitaliy, in my previous investigation of a similar issue in LU-14144 I couldn't find any good reason in the code or commit history why max_mod_rpcs_per_client was specifically a module parameter on the server and not a regular sysfs parameter. There doesn't appear to be any runtime dependency on this value (i.e. it doesn't define a static number of slots for the per-client replies or anything), and the only thing it is used for is to pass the limit to the client. For the same reason, there also doesn't appear to be a particularly hard limitation why the client cannot change and exceed the server-provided parameter, except to avoid overloading the server with too many RPCs at once, but that may also be true of the current limit with a larger number of clients, no different than "max_rpcs_in_flight".

It seems reasonable to add a per-MDT "max_mod_rpcs_in_flight" tunable parameter to lustre/mdt/mdt_lproc.c so that it can be set with "lctl set_param" at runtime, for example like async_commit_count. The global max_mod_rpcs_per_client parameter should be used as the initial value, and "(deprecated)" should be added to its module description in mdt_handler.c.

Mahmoud, the console error message printed when the client limit is reached is "myth-MDT0000-mdc-ffff979380fc1800: can't set max_mod_rpcs_in_flight=32 higher than ocd_maxmodrpcs=8 returned by the server at connection" but I agree this isn't totally clear. Instead of reporting "ocd_maxmodrpcs" (which is an internal field name) it should report the new "mdt.myth-MDT0000.max_mod_rpcs_in_flight" parameter, which would steer the admin to the right location to change this value. However, in the current implementation it would still be necessary to unmount/remount (or at least force a client reconnection) if this parameter is changed.
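
The workflow described above could look roughly like the following once the per-MDT parameter exists. This is a sketch under assumptions, not a tested procedure: the helper plan_mod_rpcs_change is hypothetical and only prints commands without running them, the mount point and the <mgsnid> placeholder are illustrative, and a remount is used as the way to force a client reconnection. Setting max_rpcs_in_flight one higher than the new limit is also an assumption, taken from the commands shown in this ticket.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands an admin might run,
# in order, without executing anything.
#   $1 = file system name (for example, myth)
#   $2 = new modifying-RPC limit
plan_mod_rpcs_change() {
    fsname=$1
    limit=$2
    # Assumption: max_rpcs_in_flight must stay above the modifying-RPC limit.
    rpcs=$((limit + 1))

    echo "# on the MDS: raise the per-MDT limit at runtime"
    echo "lctl set_param mdt.${fsname}-MDT0000.max_mod_rpcs_in_flight=${limit}"
    echo "# on the client: reconnect so the new limit is picked up"
    echo "umount /mnt/${fsname} && mount -t lustre <mgsnid>:/${fsname} /mnt/${fsname}"
    echo "# on the client: raise max_rpcs_in_flight first, then the mod limit"
    echo "lctl set_param mdc.${fsname}-MDT0000-mdc-*.max_rpcs_in_flight=${rpcs}"
    echo "lctl set_param mdc.${fsname}-MDT0000-mdc-*.max_mod_rpcs_in_flight=${limit}"
}
```

Calling plan_mod_rpcs_change myth 32 prints the server-side step before the client-side steps, which is the ordering the error message is meant to steer the admin toward.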

The main question is whether there is any value in having the MDS "limit" the value that can be set by the client (which is not done for max_rpcs_in_flight or most other parameters), or whether the client should be able to set this larger than the default value the MDT returned (perhaps with some upper limit like 4x or 8x the MDT limit). That would allow something like "lctl set_param -P ..max_mod_rpcs_in_flight" to affect both the clients and servers.

Comment by Vitaliy Kuznetsov [ 11/Jan/23 ]

adilger, OK, I'll start working on a solution to this ticket.
Thanks

Comment by Gerrit Updater [ 24/Jan/23 ]

"Vitaliy Kuznetsov <vkuznetsov@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49749
Subject: LU-16454 component: Add a per-MDT "max_mod_rpcs_in_flight"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 23463fee16abd0821b95129b333c31f354cf8a94

Comment by Gerrit Updater [ 14/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49749/
Subject: LU-16454 mdt: Add a per-MDT "max_mod_rpcs_in_flight"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f16c31ccd91d66caba69d3ceea6a61c1682df59e

Comment by Alex Zhuravlev [ 14/Feb/23 ]

just got this locally:

== conf-sanity test 90b: check max_mod_rpcs_in_flight is enforced after update ========================================================== 08:08:05 (1676362085)
start mds service on tmp.8Qi9eBagDy
Loading modules from /mnt/build/lustre/tests/..
detected 2 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
Starting mds1: -o localrecov  lustre-mdt1/mdt1 /mnt/lustre-mds1
Started lustre-MDT0000
start mds service on tmp.8Qi9eBagDy
Starting mds2: -o localrecov  lustre-mdt2/mdt2 /mnt/lustre-mds2
Started lustre-MDT0001
tmp.8Qi9eBagDy: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
tmp.8Qi9eBagDy: Reading test skip list from /tmp/ltest.config
tmp.8Qi9eBagDy: EXCEPT="$EXCEPT 32 53 63 102 115 119 123F"
tmp.8Qi9eBagDy: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
tmp.8Qi9eBagDy: Reading test skip list from /tmp/ltest.config
tmp.8Qi9eBagDy: EXCEPT="$EXCEPT 32 53 63 102 115 119 123F"
start ost1 service on tmp.8Qi9eBagDy
Starting ost1: -o localrecov  lustre-ost1/ost1 /mnt/lustre-ost1
Started lustre-OST0000
tmp.8Qi9eBagDy: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
tmp.8Qi9eBagDy: Reading test skip list from /tmp/ltest.config
tmp.8Qi9eBagDy: EXCEPT="$EXCEPT 32 53 63 102 115 119 123F"
mount lustre  on /mnt/lustre.....
Starting client: tmp.8Qi9eBagDy:  -o user_xattr,flock tmp.8Qi9eBagDy@tcp:/lustre /mnt/lustre
mdc.lustre-MDT0000-mdc-ffff8a7866aad000.max_mod_rpcs_in_flight=1
max_mod_rpcs_in_flight set to 1
creating 2 files ...
fail_loc=0x159
launch 0 chmod in parallel ...
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity1/file-1 has perms 0600 OK
fail_loc=0x159
launch 1 chmod in parallel ...
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity1/file-2 has perms 0644 OK
mdc.lustre-MDT0001-mdc-ffff8a7866aad000.max_mod_rpcs_in_flight=5
max_mod_rpcs_in_flight set to 5
creating 6 files ...
fail_loc=0x159
launch 4 chmod in parallel ...
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity2/file-5 has perms 0600 OK
fail_loc=0x159
launch 5 chmod in parallel ...
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity2/file-6 has perms 0644 OK
mdt_max_mod_rpcs_in_flight is 8 8
umount lustre on /mnt/lustre.....
Stopping client tmp.8Qi9eBagDy /mnt/lustre (opts:)
error: set_param: setting /sys/fs/lustre/mdt/lustre-MDT0000/max_mod_rpcs_in_flight=16: Numerical result out of range
mount lustre  on /mnt/lustre.....
Starting client: tmp.8Qi9eBagDy:  -o user_xattr,flock tmp.8Qi9eBagDy@tcp:/lustre /mnt/lustre
mdc.lustre-MDT0000-mdc-ffff8a784a138000.max_rpcs_in_flight=17
error: set_param: setting /sys/fs/lustre/mdc/lustre-MDT0000-mdc-ffff8a784a138000/max_mod_rpcs_in_flight=16: Numerical result out of range
 conf-sanity test_90b: @@@@@@ FAIL: Unable to set max_mod_rpcs_in_flight to 16 
  Trace dump:
  = ./../tests/test-framework.sh:6549:error()
  = conf-sanity.sh:7139:check_max_mod_rpcs_in_flight()
  = conf-sanity.sh:7291:test_90b()
  = ./../tests/test-framework.sh:6887:run_one()
  = ./../tests/test-framework.sh:6937:run_one_logged()
  = ./../tests/test-framework.sh:6773:run_test()
  = conf-sanity.sh:7299:main()

Comment by Vitaliy Kuznetsov [ 15/Feb/23 ]

A minor fix for the limit is in LU-16558.
