
[LU-13187] sanity test_129: current dir size 4096, previous limit 20480

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.5
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/2c10b9da-44b8-11ea-bffa-52540065bddc

      test_129 failed with the following error:

      current dir size 4096,  previous limit 20480
      

      It looks like this started on 2020-01-28 when a number of patches landed.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_129 - current dir size 4096, previous limit 20480
      sanity test_129 - dirsize 4096 < 32768 after 93 files

          Activity

            dongyang Dongyang Li added a comment -

            Great, I just could not reproduce the problem.

            If it's failing with ENOSPC from osd_ldiskfs_append(), I think we can add a new parameter to ldiskfs/ext4_append() to bypass the limit check for the OI-related code paths, like iam_new_node(), iam_lfix_create() and iam_lvar_create().

            The normal directory entries use a different code path: osd_ldiskfs_add_entry() -> __ldiskfs/ext4_add_entry().

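            A minimal sketch of the flag idea above, assuming the stock max_dir_size_kb check at the top of ldiskfs/ext4_append(); the parameter name and the exact shape of the function are illustrative, not taken from an actual patch:

            /* Sketch only: ldiskfs_append() grows a directory-like object by one
             * block.  A new "bypass" argument lets the IAM/OI callers
             * (iam_new_node(), iam_lfix_create(), iam_lvar_create()) skip the
             * max_dir_size_kb check while osd_ldiskfs_add_entry()'s path keeps
             * enforcing it.
             */
            static struct buffer_head *ldiskfs_append(handle_t *handle,
                                                      struct inode *inode,
                                                      ldiskfs_lblk_t *block,
                                                      bool bypass_size_limit)
            {
                    struct buffer_head *bh;

                    /* existing limit check, now skipped for OI/IAM objects */
                    if (!bypass_size_limit &&
                        unlikely(LDISKFS_SB(inode->i_sb)->s_max_dir_size_kb &&
                                 (inode->i_size >> 10) >=
                                 LDISKFS_SB(inode->i_sb)->s_max_dir_size_kb))
                            return ERR_PTR(-ENOSPC);

                    *block = inode->i_size >> inode->i_sb->s_blocksize_bits;
                    bh = ldiskfs_bread(handle, inode, *block,
                                       LDISKFS_GET_BLOCKS_CREATE);
                    if (IS_ERR(bh))
                            return bh;
                    inode->i_size += inode->i_sb->s_blocksize;
                    LDISKFS_I(inode)->i_disksize = inode->i_size;
                    /* ... journal write access and error handling as in the
                     * existing function ... */
                    return bh;
            }

            The OI/IAM call sites would then pass true for the new argument, while the regular directory-entry path keeps passing false.
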
            neilb Neil Brown added a comment -

            I think this problem is caused by some metadata directory trying to grow.

            I added some tracing and found that the call to osd_ldiskfs_append() in iam_new_node() was failing with ENOSPC.

            Maybe the best fix would be to add a test to ldiskfs_append() to check whether it is a special Lustre metadata directory, and if so bypass the dir limit.

            Is there an easy way to detect Lustre metadata directories?

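            For illustration, the in-function check being proposed could look roughly like the sketch below; lustre_meta_inode() is a deliberately hypothetical placeholder, since how to recognise an OI/IAM object from inside ldiskfs_append() is exactly the open question here:

            /* Hypothetical helper: some marker (an inode flag, an xattr, or a hint
             * passed down from osd-ldiskfs) would be needed to identify Lustre
             * metadata/IAM objects.  This predicate does not exist today.
             */
            static bool lustre_meta_inode(struct inode *inode)
            {
                    return false;   /* placeholder */
            }

            static bool ldiskfs_dir_over_limit(struct inode *inode)
            {
                    struct ldiskfs_sb_info *sbi = LDISKFS_SB(inode->i_sb);

                    /* never apply max_dir_size_kb to Lustre metadata objects */
                    if (lustre_meta_inode(inode))
                            return false;

                    return sbi->s_max_dir_size_kb &&
                           (inode->i_size >> 10) >= sbi->s_max_dir_size_kb;
            }

            /* ldiskfs_append() would then begin with:
             *
             *      if (unlikely(ldiskfs_dir_over_limit(inode)))
             *              return ERR_PTR(-ENOSPC);
             */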

            simmonsja James A Simmons added a comment -

            RHEL8 and Ubuntu overlap for the ldiskfs patches, so they will need to be updated at the same time. I can update Ubuntu, but I don't have a RHEL8 system to fix it up on.

            James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/39773
            Subject: LU-13187 ldiskfs: Fix max_dir_size_kb for RHEL7
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 43fe1051ee6cab1c9f8b85863ec91aec2c06b251

            adilger Andreas Dilger added a comment - - edited

            There were 55 failures of this subtest in the last week, which is about a 10% failure rate, but since sanity runs multiple times per patch, it is affecting patch landings even more than that rate suggests.

            adilger Andreas Dilger added a comment -

            +3 on master:
            https://testing.whamcloud.com/test_sessions/7b47cafb-4b4e-4cc3-ae57-971c31e4ce84
            https://testing.whamcloud.com/test_sessions/70e01f6b-f61c-4d82-a3c6-fa141eb170fe
            https://testing.whamcloud.com/test_sessions/27615c0d-2da3-42c0-8bb9-230da1f3acb2

            hornc Chris Horn added a comment - +1 on master: https://testing.whamcloud.com/test_sets/18024581-0159-4d24-84ee-9ae6554ced77

            simmonsja James A Simmons added a comment -

            Note Neil also ran into this problem on SUSE15 and pushed a fix here:

            https://review.whamcloud.com/#/c/39571/

            The same problem could exist on the RHEL platforms.

            hornc Chris Horn added a comment - +1 on master https://testing.whamcloud.com/test_sets/62ce778c-aac4-4504-a1dc-ecd559e78533
            hornc Chris Horn added a comment - +1 on master: https://testing.whamcloud.com/test_sets/1e95c770-b87b-4de6-9a58-08d40241c712
            emoly.liu Emoly Liu added a comment - - edited

            more on master:
            https://testing.whamcloud.com/test_sets/353838f4-221f-4336-accc-ccaea50e17e3
            https://testing.whamcloud.com/test_sets/629cec52-dd19-40c0-b0f2-0c22435f81df

            People

              Assignee: dongyang Dongyang Li
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 16
