Details

    • Improvement
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.9.0
    • DNE2 system with up to 16 MDS servers. Uses up to 400 client nodes spread across 20 physical nodes. All the results are based on mdtest 1.9.3 runs.
    • 9223372036854775807

    Description

      I did a detailed study of the client scaling behavior for 10k and 100k files per directory using 1, 2, 4, and 8 MDS servers, each having one MDT. I also attempted to collect data for 16 MDS servers, but the results were so bad that I didn't bother to finish collecting them, since it would take several months to finish the 16-node case.

      Attachments

        1. 1-mds-dne2-100k-dir-2.8.png (12 kB)
        2. 1-mds-dne2-100k-files-2.8.png (11 kB)
        3. 1-mds-dne2-10k-dir-2.8.png (13 kB)
        4. 1-mds-dne2-10k-files-2.8.png (12 kB)
        5. 2-mds-dne2-100k-dir-2.8.png (8 kB)
        6. 2-mds-dne2-100k-files-2.8.png (11 kB)
        7. 2-mds-dne2-10k-dir-2.8.png (10 kB)
        8. 2-mds-dne2-10k-files-2.8.png (11 kB)
        9. 4-mds-dne2-100k-dir-2.8.png (9 kB)
        10. 4-mds-dne2-100k-files-2.8.png (13 kB)
        11. 4-mds-dne2-10k-dir-2.8.png (11 kB)
        12. 4-mds-dne2-10k-files-2.8.png (14 kB)
        13. 8-mds-dne2-100k-dir-2.8.png (14 kB)
        14. 8-mds-dne2-100k-files-2.8.png (15 kB)
        15. 8-mds-dne2-10k-dir-2.8.png (13 kB)
        16. 8-mds-dne2-10k-files-2.8.png (14 kB)
        17. mdtest-scale.pbs (0.9 kB)

        Issue Links

          Activity

            [LU-7293] DNE2 performance analysis
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            closing ancient ticket

            adilger Andreas Dilger made changes -
            Link New: This issue is related to AEON-37 [ AEON-37 ]
            yujian Jian Yu added a comment -

            Hi James,

            Thank you very much for the report!


            simmonsja James A Simmons added a comment -

            Here is the final report of our results for our DNE2 performance analysis:

            http://info.ornl.gov/sites/publications/Files/Pub59510.pdf

            Enjoy the read. Perhaps we can link it to the wiki. If people want it linked to the wiki we can do that
            and then close this ticket. If not we can keep this ticket open for a few more weeks so people can
            have a chance to read this.
            adilger Andreas Dilger made changes -
            Labels Original: dne2 New: dne2 dne3
            pjones Peter Jones made changes -
            End date New: 21/Dec/15
            Start date New: 13/Oct/15

            simmonsja James A Simmons added a comment -

            Oh, this doesn't look right.

            ldlm.namespaces.sultan-MDT0000-mdc-ffff8803f3d12c00.lru_size=29
            ldlm.namespaces.sultan-MDT0001-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT0002-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT0003-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT0004-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT0005-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT0006-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT0007-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT0008-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT0009-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT000a-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT000b-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT000c-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT000d-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT000e-mdc-ffff8803f3d12c00.lru_size=0
            ldlm.namespaces.sultan-MDT000f-mdc-ffff8803f3d12c00.lru_size=0
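A listing like the one above is easier to read once it is reduced to a per-MDT summary. The following is a minimal Python sketch, assuming the output has the `ldlm.namespaces.<fsname>-MDT<hex index>-mdc-<instance>.lru_size=<count>` shape shown in the comment. The function names `parse_lru_sizes` and `summarize` are hypothetical helpers, not part of any Lustre tool, and the sample reproduces only three of the sixteen lines.

```python
import re

# Three lines copied from the listing in the comment above (client "sultan").
SAMPLE = """\
ldlm.namespaces.sultan-MDT0000-mdc-ffff8803f3d12c00.lru_size=29
ldlm.namespaces.sultan-MDT0001-mdc-ffff8803f3d12c00.lru_size=0
ldlm.namespaces.sultan-MDT000f-mdc-ffff8803f3d12c00.lru_size=0
"""

# Matches "<fsname>-MDT<4 hex digits>-mdc-<instance>.lru_size=<count>".
LINE_RE = re.compile(
    r"\.(?P<fs>[^.]+)-MDT(?P<idx>[0-9a-fA-F]{4})-mdc-[^.]+\.lru_size=(?P<n>\d+)$"
)


def parse_lru_sizes(text):
    """Return {mdt_index: lru_size} for every MDC lru_size line in text."""
    sizes = {}
    for line in text.splitlines():
        match = LINE_RE.search(line.strip())
        if match:
            # MDT indices are printed in hexadecimal (MDT000a is index 10).
            sizes[int(match.group("idx"), 16)] = int(match.group("n"))
    return sizes


def summarize(sizes):
    """Return (sum of all lru_size values, sorted indices whose value is 0)."""
    total = sum(sizes.values())
    empty = sorted(idx for idx, count in sizes.items() if count == 0)
    return total, empty


if __name__ == "__main__":
    total, empty = summarize(parse_lru_sizes(SAMPLE))
    print(f"sum of lru_size values: {total}")
    print("MDTs reporting 0:", ", ".join(f"MDT{idx:04x}" for idx in empty))
```

Feeding the full sixteen-line listing through the same functions would flag every MDT except MDT0000 as reporting 0, which is the imbalance the comment points at.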

            di.wang Di Wang (Inactive) added a comment -

            Hmm, according to the slab information from Nov 15th, "lustre_inode_cache" is much larger than "inode_cache", which means the client has more ll_inode_info objects than inodes, so maybe ll_inode_info is being leaked somewhere. Do you still have that client? Could you please get lru_size for me?

            lctl get_param ldlm.*.*MDT*.lru_size
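The slab comparison described in this comment can be repeated on any client by reading two rows out of `/proc/slabinfo`. Below is a minimal Python sketch, assuming the standard slabinfo layout in which each data row starts with the cache name followed by the active object count. The helper name `slab_active_objects` is hypothetical, and the numbers in the sample are illustrative only; they are not the Nov 15th data from this ticket.

```python
def slab_active_objects(slabinfo_text, names):
    """Return {cache_name: active_objs} for the requested slab caches."""
    wanted = set(names)
    counts = {}
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # Data rows look like: name active_objs num_objs objsize ...
        if len(fields) >= 3 and fields[0] in wanted and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


# Illustrative values only, NOT measurements from the ticket.
SAMPLE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
lustre_inode_cache 900000 900100 1216 26 8 : tunables 0 0 0 : slabdata 34620 34620 0
inode_cache 12000 12400 592 27 4 : tunables 0 0 0 : slabdata 460 460 0
"""

if __name__ == "__main__":
    counts = slab_active_objects(SAMPLE, ["lustre_inode_cache", "inode_cache"])
    gap = counts["lustre_inode_cache"] - counts["inode_cache"]
    print(f"lustre_inode_cache active objects: {counts['lustre_inode_cache']}")
    print(f"inode_cache active objects:        {counts['inode_cache']}")
    print(f"difference:                        {gap}")
```

On a live client the text would come from reading `/proc/slabinfo` (which usually requires root) instead of the `SAMPLE` string. Whether a large difference actually indicates a leak is the open question raised in the comment, not something this sketch decides.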

            simmonsja James A Simmons added a comment -

            I ran both file-only operations and runs that include directory operations, and I tracked down the issue with directory operations as well. What is happening in that case is that the Lustre inode cache consumes all the memory on the client, causing various timeouts, client evictions, and reconnects. This only happens when many directory operations are performed. When only doing file operations, the memory pressure issues go away. My latest testing has all been without the --index.

            People

              di.wang Di Wang (Inactive)
              simmonsja James A Simmons
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: