[LU-16454] can't set max_mod_rpcs_in_flight > 8 Created: 07/Jan/23 Updated: 07/Jul/23 Resolved: 14/Feb/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Vitaliy Kuznetsov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I am trying to increase mdc.*.max_mod_rpcs_in_flight to greater than 8, but I get an error.
# lctl set_param mdc.fs1-MDT0000-mdc-ffff902107a5e000.max_rpcs_in_flight=128
# lctl set_param mdc.fs1-MDT0000-mdc-ffff902107a5e000.max_mod_rpcs_in_flight=127
# lctl get_param version |
| Comments |
| Comment by Mahmoud Hanafi [ 08/Jan/23 ] |
|
Here is the error in debug logs
00000020:00020000:0.0F:1673137999.151510:0:2742:0:(genops.c:2175:obd_set_max_mod_rpcs_in_flight()) fs1-MDT0000-mdc-ffff902107a5e000: can't set max_mod_rpcs_in_flight=9 higher than ocd_maxmodrpcs=8 returned by the server at connection |
| Comment by Mahmoud Hanafi [ 08/Jan/23 ] |
|
I figured out the issue: it was a module parameter setting on the server. The documentation should be updated to state that the server-side module parameter must be updated first.
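For reference, a sketch of the server-first procedure described above. This assumes the server-side module parameter is the mdt module's max_mod_rpcs_per_client (named later in this ticket); the modprobe config file name is illustrative, and the exact unmount/remount steps depend on the site.

```shell
# On the MDS: raise the server-side limit. It is a module parameter, read
# at module load time, so set it in modprobe config and then unmount the
# MDT, reload the Lustre modules, and remount for it to take effect.
cat > /etc/modprobe.d/lustre-mdt.conf <<'EOF'
options mdt max_mod_rpcs_per_client=127
EOF
# ... umount the MDT, reload modules, remount the MDT ...

# On the client, after it has reconnected: raise the client-side limits.
# max_mod_rpcs_in_flight must stay below max_rpcs_in_flight, so raise
# max_rpcs_in_flight first.
lctl set_param mdc.fs1-MDT0000-mdc-*.max_rpcs_in_flight=128
lctl set_param mdc.fs1-MDT0000-mdc-*.max_mod_rpcs_in_flight=127
```

Without the server-side change, the client-side set_param fails with the ocd_maxmodrpcs error shown in the debug log above.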
|
| Comment by Peter Jones [ 09/Jan/23 ] |
|
Vitaliy, we discussed this during the triage call today. Andreas has some suggestions on how to address this issue that he will share; could you please follow up and implement them? Thanks Peter |
| Comment by Andreas Dilger [ 10/Jan/23 ] |
|
Vitaliy, in my previous investigation of a similar issue in LU-14144 I couldn't find any good reason in the code or commit history why max_mod_rpcs_per_client was specifically a module parameter on the server and not a regular sysfs parameter. There doesn't appear to be any runtime dependency on this value (i.e. it doesn't define a static number of slots for the per-client replies or anything), and the only thing it is used for is to pass the limit to the client. For the same reason, there also doesn't appear to be a particularly hard limitation preventing the client from changing and exceeding the server-provided value, except to avoid overloading the server with too many RPCs at once; but that concern also applies to the current limit with a larger number of clients, no different from "max_rpcs_in_flight".

It seems reasonable to add a per-MDT "max_mod_rpcs_in_flight" tunable parameter to lustre/mdt/mdt_lproc.c so that it can be set with "lctl set_param" at runtime, for example like async_commit_count. The global max_mod_rpcs_per_client parameter should be used as the initial value, with "(deprecated)" added to the module parameter description in mdt_handler.c.

Mahmoud, the console error message printed when the client limit is reached is "myth-MDT0000-mdc-ffff979380fc1800: can't set max_mod_rpcs_in_flight=32 higher than ocd_maxmodrpcs=8 returned by the server at connection", but I agree this isn't totally clear. Instead of reporting "ocd_maxmodrpcs" (which is an internal field name) it should report the new "mdt.myth-MDT0000.max_mod_rpcs_in_flight" parameter, which would steer the admin to the right location to change this value. However, in the current implementation it would still be necessary to unmount/remount (or at least force a client reconnection) if this parameter is changed.

The main question is whether there is any value in having the MDS limit the value that can be set by the client (which is not done for max_rpcs_in_flight or most other parameters), or whether the client should be able to set this larger than the default value the MDT returned (perhaps with some upper limit like 4x or 8x the MDT limit). That would allow something like "lctl set_param -P ..max_mod_rpcs_in_flight" to affect both the clients and servers. |
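Under the proposal above, administration might look like the following sketch. This is hypothetical until the patch lands; the target name "myth-MDT0000" is borrowed from the quoted error message, and the reconnect requirement is per the comment above.

```shell
# On the MDS (proposed): adjust the per-MDT limit at runtime through the
# new lprocfs/sysfs parameter, instead of the max_mod_rpcs_per_client
# module option, which would only seed the initial value.
lctl set_param mdt.myth-MDT0000.max_mod_rpcs_in_flight=32

# Clients would still need to reconnect (or unmount/remount) to pick up
# the new server-advertised limit before raising their own mdc setting:
lctl set_param mdc.myth-MDT0000-mdc-*.max_mod_rpcs_in_flight=32
```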
| Comment by Vitaliy Kuznetsov [ 11/Jan/23 ] |
|
adilger Ok, I'll start working on a solution to this ticket. |
| Comment by Gerrit Updater [ 24/Jan/23 ] |
|
"Vitaliy Kuznetsov <vkuznetsov@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49749 |
| Comment by Gerrit Updater [ 14/Feb/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49749/ |
| Comment by Alex Zhuravlev [ 14/Feb/23 ] |
|
just got this locally:

== conf-sanity test 90b: check max_mod_rpcs_in_flight is enforced after update ========================================================== 08:08:05 (1676362085)
start mds service on tmp.8Qi9eBagDy
Loading modules from /mnt/build/lustre/tests/..
detected 2 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
Starting mds1: -o localrecov lustre-mdt1/mdt1 /mnt/lustre-mds1
Started lustre-MDT0000
start mds service on tmp.8Qi9eBagDy
Starting mds2: -o localrecov lustre-mdt2/mdt2 /mnt/lustre-mds2
Started lustre-MDT0001
tmp.8Qi9eBagDy: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
tmp.8Qi9eBagDy: Reading test skip list from /tmp/ltest.config
tmp.8Qi9eBagDy: EXCEPT="$EXCEPT 32 53 63 102 115 119 123F"
tmp.8Qi9eBagDy: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
tmp.8Qi9eBagDy: Reading test skip list from /tmp/ltest.config
tmp.8Qi9eBagDy: EXCEPT="$EXCEPT 32 53 63 102 115 119 123F"
start ost1 service on tmp.8Qi9eBagDy
Starting ost1: -o localrecov lustre-ost1/ost1 /mnt/lustre-ost1
Started lustre-OST0000
tmp.8Qi9eBagDy: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
tmp.8Qi9eBagDy: Reading test skip list from /tmp/ltest.config
tmp.8Qi9eBagDy: EXCEPT="$EXCEPT 32 53 63 102 115 119 123F"
mount lustre on /mnt/lustre.....
Starting client: tmp.8Qi9eBagDy: -o user_xattr,flock tmp.8Qi9eBagDy@tcp:/lustre /mnt/lustre
mdc.lustre-MDT0000-mdc-ffff8a7866aad000.max_mod_rpcs_in_flight=1
max_mod_rpcs_in_flight set to 1
creating 2 files ...
fail_loc=0x159
launch 0 chmod in parallel ...
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity1/file-1 has perms 0600 OK
fail_loc=0x159
launch 1 chmod in parallel ...
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity1/file-2 has perms 0644 OK
mdc.lustre-MDT0001-mdc-ffff8a7866aad000.max_mod_rpcs_in_flight=5
max_mod_rpcs_in_flight set to 5
creating 6 files ...
fail_loc=0x159
launch 4 chmod in parallel ...
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity2/file-5 has perms 0600 OK
fail_loc=0x159
launch 5 chmod in parallel ...
fail_loc=0
launch 1 additional chmod in parallel ...
/mnt/lustre/d90b.conf-sanity2/file-6 has perms 0644 OK
mdt_max_mod_rpcs_in_flight is 8 8
umount lustre on /mnt/lustre.....
Stopping client tmp.8Qi9eBagDy /mnt/lustre (opts:)
error: set_param: setting /sys/fs/lustre/mdt/lustre-MDT0000/max_mod_rpcs_in_flight=16: Numerical result out of range
mount lustre on /mnt/lustre.....
Starting client: tmp.8Qi9eBagDy: -o user_xattr,flock tmp.8Qi9eBagDy@tcp:/lustre /mnt/lustre
mdc.lustre-MDT0000-mdc-ffff8a784a138000.max_rpcs_in_flight=17
error: set_param: setting /sys/fs/lustre/mdc/lustre-MDT0000-mdc-ffff8a784a138000/max_mod_rpcs_in_flight=16: Numerical result out of range
conf-sanity test_90b: @@@@@@ FAIL: Unable to set max_mod_rpcs_in_flight to 16
Trace dump:
= ./../tests/test-framework.sh:6549:error()
= conf-sanity.sh:7139:check_max_mod_rpcs_in_flight()
= conf-sanity.sh:7291:test_90b()
= ./../tests/test-framework.sh:6887:run_one()
= ./../tests/test-framework.sh:6937:run_one_logged()
= ./../tests/test-framework.sh:6773:run_test()
= conf-sanity.sh:7299:main() |
| Comment by Vitaliy Kuznetsov [ 15/Feb/23 ] |
|
Minor fix for limit in |