
[LU-16126] Lustre 2.15.51 mdtest fails with MPI_Abort errors while adjusting max_rpcs_in_progress and using a large number of clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major

    Description

      I have been tuning the max_rpcs_in_progress parameter on the MDTs in my system. Environment:
      64 MDTs, one per MDS
      64 OSTs, one per OSS
      512 clients
      All using 2.15.51

      On the MDTs, the osp.lustre-MDT0001-osp-MDT0000.max_rpcs_in_progress parameter defaults to 0. If I change this value to anything higher (even 1) and run mdtest with 16 processes per node (PPN) across all 512 clients, I start to see MPI errors:

      V-1: Rank   0 Line  2565    Operation               Duration              Rate
      V-1: Rank   0 Line  2566    ---------               --------              ----
      V-1: Rank   0 Line  1957 main: * iteration 1 *
      V-2: Rank   0 Line  1966 main (for j loop): making o.testdir, '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0'
      V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0, currDepth = 0...
      V-2: Rank   0 Line  1889 Making directory '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/'
      V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/, currDepth = 1...
      V-1: Rank   0 Line  2033 V-1: main:   Tree creation     :          2.250 sec,          0.444 ops/sec
      delaying 30 seconds . . .
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7328
      srun: error: mo0627: tasks 7328-7343: Exited with exit code 255
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7329
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7330
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7331
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7332
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7333
      

       
      The mdtest log does not point to a real root cause, and the client and MDS server logs don't provide any clues either. The interesting part is that if I use fewer clients, for example 16, I can tune that parameter as high as 4096 (and beyond) with no issues.
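
      For reference, the adjustment being tested is just an lctl set_param on the MDS; a minimal sketch, with the value chosen purely for illustration:

      # run on the MDS hosting MDT0000; the target name is the one quoted above
      lctl set_param osp.lustre-MDT0001-osp-MDT0000.max_rpcs_in_progress=64
      # verify the change
      lctl get_param osp.*.max_rpcs_in_progress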

      I seem to have narrowed this down to commit 23028efcae ("LU-6864 osp: manage number of modify RPCs in flight"). If I revert this commit, I cannot reproduce the error.
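
      For completeness, the revert experiment amounts to something like the following against a lustre-release checkout (rebuilding and reinstalling the server packages before rerunning mdtest is implied):

      git revert 23028efcae   # "LU-6864 osp: manage number of modify RPCs in flight"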


          Activity


            simmonsja James A Simmons added a comment:

            I took another look at this and in my testing I see:

            /usr/src/lustre-2.15.52/lustre/tests# ../utils/lctl get_param *.*.max_rpcs_*
            mdc.lustre-MDT0000-mdc-ffff997ccd58f000.max_rpcs_in_flight=8
            osc.lustre-OST0000-osc-MDT0000.max_rpcs_in_flight=8
            osc.lustre-OST0000-osc-MDT0000.max_rpcs_in_progress=4096
            osc.lustre-OST0000-osc-ffff997ccd58f000.max_rpcs_in_flight=8
            osc.lustre-OST0001-osc-MDT0000.max_rpcs_in_flight=8
            osc.lustre-OST0001-osc-MDT0000.max_rpcs_in_progress=4096
            osc.lustre-OST0001-osc-ffff997ccd58f000.max_rpcs_in_flight=8
            osp.lustre-OST0000-osc-MDT0000.max_rpcs_in_flight=8
            osp.lustre-OST0000-osc-MDT0000.max_rpcs_in_progress=4096
            osp.lustre-OST0001-osc-MDT0000.max_rpcs_in_flight=8
            osp.lustre-OST0001-osc-MDT0000.max_rpcs_in_progress=4096

            Nathan, what version of Lustre are you running where you see a setting of zero?

             


            "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49182
            Subject: LU-16126 ldlm: set default rpcs_in_flight to one
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ee2925431a7c2fb0559fd02a4e10a9b4a85f57aa
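
            For anyone who wants to try the change locally, it can be fetched from Gerrit in the usual way (the ref below assumes the standard refs/changes layout and the patch set 1 listed above):

            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/82/49182/1
            git checkout FETCH_HEAD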


            simmonsja James A Simmons added a comment:

            Yeah, the initial value is wrong.


            adilger Andreas Dilger added a comment:

            So maybe the issue here is that the max_rpcs_in_progress value shouldn't show "0" initially?


            simmonsja James A Simmons added a comment:

            That is the same behavior for mdc.*.max_rpcs_in_progress. It's due to the test:

              if (max > OBD_MAX_RIF_MAX || max < 1)
                      return -ERANGE;

            in obd_set_max_rpcs_in_flight(). Setting it back to 1 restores the default; zero would mean never sending anything, which we don't want.
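
            In practice that means 1 is the smallest value the tunable will accept, so putting it back to the intended default looks like this (target name borrowed from Nathan's transcript below, purely for illustration):

            lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=1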


            koutoupis Petros Koutoupis added a comment:

            Nathan, I did notice that but forgot about that quirk until now. Thank you for refreshing my memory.


            nrutman Nathan Rutman added a comment:

            Another problem: you can't set it back to 0:

            [root@cslmo4902 ~]# lctl get_param *.*.max_rpcs_in_progress
            osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=0
            osp.testfs-MDT0001-osp-MDT0000.max_rpcs_in_progress=0
            [root@cslmo4902 ~]# lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=100
            osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=100
            [root@cslmo4902 ~]# lctl set_param osp.testfs-MDT0000-osp-MDT0001.max_rpcs_in_progress=0
            error: set_param: setting /sys/fs/lustre/osp/testfs-MDT0000-osp-MDT0001/max_rpcs_in_progress=0: Numerical result out of range
            

            koutoupis Petros Koutoupis added a comment:

            I went back to collect more data and, unfortunately, I am unable to reproduce the original issue. Closing this ticket unless I can observe the problem again.


            adilger Andreas Dilger added a comment:

            Note that we don't regularly test with 64 MDTs, so there may be some kind of thread starvation happening if all of the MDTs are sending RPCs to each other and the service thread count is not high enough.

            Even before debug logs, getting the console logs from the client and MDS nodes would be useful (e.g. if clients are being evicted, stacks are dumped, etc.).
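
            If thread starvation is suspected, a quick first check would be to compare the MDT service thread limits with the running thread count on each MDS (parameter names assume the standard mds.MDS.mdt.* tunables):

            lctl get_param mds.MDS.mdt.threads_min mds.MDS.mdt.threads_max mds.MDS.mdt.threads_started
            # if threads_started sits at threads_max, the ceiling can be raised, e.g.:
            lctl set_param mds.MDS.mdt.threads_max=1024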


            simmonsja James A Simmons added a comment:

            Do you have kernel logs? I do wonder if this is related to LU-16065.


            People

              Assignee: adilger Andreas Dilger
              Reporter: koutoupis Petros Koutoupis
              Votes: 0
              Watchers: 7
