Lustre / LU-16126

Lustre 2.15.51 mdtest fails with MPI_Abort errors while adjusting max_rpcs_in_progress and using a large number of clients


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: Lustre 2.16.0

    Description

      I have been trying to tune the max_rpcs_in_progress parameter on the MDTs in my system. Environment:
      64 MDTs, one per MDS
      64 OSTs, one per OSS
      512 clients
      All using 2.15.51

      On the MDTs, the osp.lustre-MDT0001-osp-MDT0000.max_rpcs_in_progress parameter defaults to 0. If I change this value to anything above 0 (including 1) and run mdtest with 16 PPN per client across all 512 clients, I start to see MPI errors:

      V-1: Rank   0 Line  2565    Operation               Duration              Rate
      V-1: Rank   0 Line  2566    ---------               --------              ----
      V-1: Rank   0 Line  1957 main: * iteration 1 *
      V-2: Rank   0 Line  1966 main (for j loop): making o.testdir, '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0'
      V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0, currDepth = 0...
      V-2: Rank   0 Line  1889 Making directory '/lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/'
      V-1: Rank   0 Line  1883 Entering create_remove_directory_tree on /lustre/pkoutoupis/testdir..1mdt.0/test-dir.0-0/mdtest_tree.0.0/, currDepth = 1...
      V-1: Rank   0 Line  2033 V-1: main:   Tree creation     :          2.250 sec,          0.444 ops/sec
      delaying 30 seconds . . .
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7328
      srun: error: mo0627: tasks 7328-7343: Exited with exit code 255
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7329
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7330
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7331
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7332
      application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7333
      

       
      The mdtest log does not reveal a root cause, and the client/MDS server logs don't provide any clues either. The interesting part is that if I use fewer clients, for example 16, I can tune that parameter as high as 4096 (and higher) without experiencing any issues.
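
      For reference, the kind of commands involved look roughly like the following; the exact test directory, item counts, and mdtest flags are assumptions for illustration rather than the precise invocation used:

        # On each MDS, raise the OSP modify-RPC limit toward the other MDTs
        # (example value; the default reported above is 0)
        lctl set_param osp.lustre-MDT*-osp-MDT*.max_rpcs_in_progress=16
        lctl get_param osp.lustre-MDT*-osp-MDT*.max_rpcs_in_progress

        # Launch mdtest across all 512 clients with 16 processes per node
        # (directory and per-task item count are assumed values)
        srun -N 512 --ntasks-per-node=16 mdtest -v -i 1 -n 1000 -d /lustre/pkoutoupis/testdir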

      It seems like I may have narrowed this down to commit 23028efcae "LU-6864 osp: manage number of modify RPCs in flight". If I revert this commit, I cannot reproduce the error.
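
      A revert of that commit for testing can be done roughly as follows; this is only a sketch, assuming a build from a checkout of the Lustre source tree (branch selection and rebuild steps are site-specific):

        # in a clone of the Lustre source tree on the matching branch
        git revert 23028efcae    # "LU-6864 osp: manage number of modify RPCs in flight"
        # rebuild and reinstall the server packages, then rerun the mdtest workload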


People

    Assignee: Andreas Dilger (adilger)
    Reporter: Petros Koutoupis (koutoupis)
    Votes: 0
    Watchers: 7
