Lustre / LU-16454

can't set max_mod_rpcs_in_flight > 8

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.15.1
    • Labels: None

    Description

      I am trying to increase mdc.*.max_mod_rpcs_in_flight to greater than 8 but I get an error.

       

      # lctl set_param mdc.fs1-MDT0000-mdc-ffff902107a5e000.max_rpcs_in_flight=128
      mdc.fs1-MDT0000-mdc-ffff902107a5e000.max_rpcs_in_flight=128

      # lctl set_param mdc.fs1-MDT0000-mdc-ffff902107a5e000.max_mod_rpcs_in_flight=127
      error: set_param: setting /sys/fs/lustre/mdc/fs1-MDT0000-mdc-ffff902107a5e000/max_mod_rpcs_in_flight=127: Numerical result out of range

      # lctl get_param version
      version=2.15.1
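      A hedged admin sketch of the resolution reached later in this ticket: the client-side value is capped by the server's mdt module parameter max_mod_rpcs_per_client (default 8), so the server limit must be raised first and clients reconnected. The modprobe.d path below is an assumption for illustration.

```shell
# On the MDS: raise the per-client cap via a module option
# (parameter name taken from the discussion in this ticket;
# the config file path is a hypothetical example).
cat > /etc/modprobe.d/lustre-mdt.conf <<'EOF'
options mdt max_mod_rpcs_per_client=127
EOF
# The mdt module must be reloaded (or the MDT restarted) to pick this up,
# and clients must reconnect before a larger value is accepted:
#   lctl set_param mdc.*.max_mod_rpcs_in_flight=127
```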

    Attachments

    Issue Links

    Activity

            [LU-16454] can't set max_mod_rpcs_in_flight > 8
            vkuznetsov Vitaliy Kuznetsov added a comment - - edited

            Minor fix for limit in LU-16558


            bzzz Alex Zhuravlev added a comment -

            just got this locally:

            == conf-sanity test 90b: check max_mod_rpcs_in_flight is enforced after update ========================================================== 08:08:05 (1676362085)
            start mds service on tmp.8Qi9eBagDy
            Loading modules from /mnt/build/lustre/tests/..
            detected 2 online CPUs by sysfs
            Force libcfs to create 2 CPU partitions
            ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
            gss/krb5 is not supported
            quota/lquota options: 'hash_lqs_cur_bits=3'
            Starting mds1: -o localrecov  lustre-mdt1/mdt1 /mnt/lustre-mds1
            Started lustre-MDT0000
            start mds service on tmp.8Qi9eBagDy
            Starting mds2: -o localrecov  lustre-mdt2/mdt2 /mnt/lustre-mds2
            Started lustre-MDT0001
            tmp.8Qi9eBagDy: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
            tmp.8Qi9eBagDy: Reading test skip list from /tmp/ltest.config
            tmp.8Qi9eBagDy: EXCEPT="$EXCEPT 32 53 63 102 115 119 123F"
            tmp.8Qi9eBagDy: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
            tmp.8Qi9eBagDy: Reading test skip list from /tmp/ltest.config
            tmp.8Qi9eBagDy: EXCEPT="$EXCEPT 32 53 63 102 115 119 123F"
            start ost1 service on tmp.8Qi9eBagDy
            Starting ost1: -o localrecov  lustre-ost1/ost1 /mnt/lustre-ost1
            Started lustre-OST0000
            tmp.8Qi9eBagDy: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
            tmp.8Qi9eBagDy: Reading test skip list from /tmp/ltest.config
            tmp.8Qi9eBagDy: EXCEPT="$EXCEPT 32 53 63 102 115 119 123F"
            mount lustre  on /mnt/lustre.....
            Starting client: tmp.8Qi9eBagDy:  -o user_xattr,flock tmp.8Qi9eBagDy@tcp:/lustre /mnt/lustre
            mdc.lustre-MDT0000-mdc-ffff8a7866aad000.max_mod_rpcs_in_flight=1
            max_mod_rpcs_in_flight set to 1
            creating 2 files ...
            fail_loc=0x159
            launch 0 chmod in parallel ...
            fail_loc=0
            launch 1 additional chmod in parallel ...
            /mnt/lustre/d90b.conf-sanity1/file-1 has perms 0600 OK
            fail_loc=0x159
            launch 1 chmod in parallel ...
            fail_loc=0
            launch 1 additional chmod in parallel ...
            /mnt/lustre/d90b.conf-sanity1/file-2 has perms 0644 OK
            mdc.lustre-MDT0001-mdc-ffff8a7866aad000.max_mod_rpcs_in_flight=5
            max_mod_rpcs_in_flight set to 5
            creating 6 files ...
            fail_loc=0x159
            launch 4 chmod in parallel ...
            fail_loc=0
            launch 1 additional chmod in parallel ...
            /mnt/lustre/d90b.conf-sanity2/file-5 has perms 0600 OK
            fail_loc=0x159
            launch 5 chmod in parallel ...
            fail_loc=0
            launch 1 additional chmod in parallel ...
            /mnt/lustre/d90b.conf-sanity2/file-6 has perms 0644 OK
            mdt_max_mod_rpcs_in_flight is 8 8
            umount lustre on /mnt/lustre.....
            Stopping client tmp.8Qi9eBagDy /mnt/lustre (opts:)
            error: set_param: setting /sys/fs/lustre/mdt/lustre-MDT0000/max_mod_rpcs_in_flight=16: Numerical result out of range
            mount lustre  on /mnt/lustre.....
            Starting client: tmp.8Qi9eBagDy:  -o user_xattr,flock tmp.8Qi9eBagDy@tcp:/lustre /mnt/lustre
            mdc.lustre-MDT0000-mdc-ffff8a784a138000.max_rpcs_in_flight=17
            error: set_param: setting /sys/fs/lustre/mdc/lustre-MDT0000-mdc-ffff8a784a138000/max_mod_rpcs_in_flight=16: Numerical result out of range
             conf-sanity test_90b: @@@@@@ FAIL: Unable to set max_mod_rpcs_in_flight to 16
              Trace dump:
              = ./../tests/test-framework.sh:6549:error()
              = conf-sanity.sh:7139:check_max_mod_rpcs_in_flight()
              = conf-sanity.sh:7291:test_90b()
              = ./../tests/test-framework.sh:6887:run_one()
              = ./../tests/test-framework.sh:6937:run_one_logged()
              = ./../tests/test-framework.sh:6773:run_test()
              = conf-sanity.sh:7299:main()

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49749/
            Subject: LU-16454 mdt: Add a per-MDT "max_mod_rpcs_in_flight"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f16c31ccd91d66caba69d3ceea6a61c1682df59e

            gerrit Gerrit Updater added a comment -

            "Vitaliy Kuznetsov <vkuznetsov@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49749
            Subject: LU-16454 component: Add a per-MDT "max_mod_rpcs_in_flight"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 23463fee16abd0821b95129b333c31f354cf8a94

            vkuznetsov Vitaliy Kuznetsov added a comment -

            adilger OK, I'll start working on a solution to this ticket.
            Thanks

            adilger Andreas Dilger added a comment -

            Vitaliy, in my previous investigation of a similar issue in LU-14144 I couldn't find any good reason in the code or commit history why max_mod_rpcs_per_client was specifically a module parameter on the server and not a regular sysfs parameter. There doesn't appear to be any runtime dependency on this value (i.e. it doesn't define a static number of slots for the per-client replies or anything), and the only thing it is used for is to pass the limit to the client. For the same reason, there also doesn't appear to be a particularly hard limitation why the client cannot change and exceed the server-provided parameter, except to avoid overloading the server with too many RPCs at once, but that may also be true of the current limit with a larger number of clients, no different than "max_rpcs_in_flight".

            It seems reasonable to add a per-MDT "max_mod_rpcs_in_flight" tunable parameter to lustre/mdt/mdt_lproc.c so that it can be set with "lctl set_param" at runtime, for example like async_commit_count. The global max_mod_rpcs_per_client parameter should be used as the initial value, and "(deprecated)" should be added to the module description in mdt_handler.c.

            Mahmoud, the console error message printed when the client limit is reached is "myth-MDT0000-mdc-ffff979380fc1800: can't set max_mod_rpcs_in_flight=32 higher than ocd_maxmodrpcs=8 returned by the server at connection" but I agree this isn't totally clear. Instead of reporting "ocd_maxmodrpcs" (which is an internal field name) it should report the new "mdt.myth-MDT0000.max_mod_rpcs_in_flight" parameter, which would steer the admin to the right location to change this value. However, in the current implementation it would still be necessary to unmount/remount (or at least force a client reconnection) if this parameter is changed.

            The main question is whether there is any value for the MDS to "limit" the value that can be set by the client (which is not done for max_rpcs_in_flight or most other parameters) or if the client should be able to set this larger than the default value the MDT returned (maybe some upper limit like 4x or 8x the MDT limit)? That would allow something like "lctl set_param -P ..max_mod_rpcs_in_flight" to affect both the clients and servers.
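            The relaxed policy floated above can be sketched as a small shell function. The function name and the 8x overcommit factor are hypothetical, purely to illustrate the idea of letting a client exceed the server default up to some multiple of it rather than rejecting the request outright.

```shell
# clamp_max_mod_rpcs REQUESTED SERVER_MAX [OVERCOMMIT]
# Prints the value a client would actually get under an "up to N x the
# MDT limit" policy instead of a hard ERANGE rejection.
clamp_max_mod_rpcs() {
    local requested=$1 server_max=$2 overcommit=${3:-8}
    local cap=$((overcommit * server_max))
    if [ "$requested" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$requested"
    fi
}

clamp_max_mod_rpcs 127 8   # capped at 8*8=64 under these assumptions
clamp_max_mod_rpcs 10 8    # allowed through unchanged
```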
            pjones Peter Jones added a comment -

            Vitaliy

            We discussed this during the triage call today. Andreas has some suggestions of how to address this issue that he will share and then could you please follow up and implement?

            Thanks

            Peter


            mhanafi Mahmoud Hanafi added a comment -

            I figured out the issue. It was the module setting on the server. The documentation should be updated to state that the server-side module parameter must be updated first.

            mhanafi Mahmoud Hanafi added a comment -

            Here is the error in the debug logs:

            00000020:00020000:0.0F:1673137999.151510:0:2742:0:(genops.c:2175:obd_set_max_mod_rpcs_in_flight()) fs1-MDT0000-mdc-ffff902107a5e000: can't set max_mod_rpcs_in_flight=9 higher than ocd_maxmodrpcs=8 returned by the server at connection
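            The check behind that log line can be modelled as a toy function (an illustration only, not the actual obd_set_max_mod_rpcs_in_flight() code): the client refuses any value above the ocd_maxmodrpcs the server returned at connect time, and userspace sees the failure as ERANGE ("Numerical result out of range").

```shell
# set_mod_rpcs REQUESTED [SERVER_MAX]
# Mimics the connect-time cap: returns 34 (ERANGE) when the request
# exceeds the server-advertised ocd_maxmodrpcs (default 8 here).
set_mod_rpcs() {
    local requested=$1 server_max=${2:-8}
    if [ "$requested" -gt "$server_max" ]; then
        echo "can't set max_mod_rpcs_in_flight=$requested higher than ocd_maxmodrpcs=$server_max" >&2
        return 34   # ERANGE
    fi
    return 0
}
```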

            People

              Assignee: vkuznetsov Vitaliy Kuznetsov
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 8
