sanity: test_27M Error: '(5) stripe count , should be 8 for append'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Lustre 2.16.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Frank Sehr <fsehr@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/46d1b101-8e4f-4415-bb49-ee39963275fe

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/95319 - 4.18.0-425.10.1.el8_7.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/95319 - 4.18.0-425.10.1.el8_lustre.x86_64

      == sanity test 27M: test O_APPEND striping =============== 03:40:09 (1685936409)
      CMD: trevis-129vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
      striped dir -i3 -c2 -H crush2 /mnt/lustre/d27M.sanity
      CMD: trevis-129vm4 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.append_pool
      CMD: trevis-129vm4 /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.append_stripe_count
      CMD: trevis-129vm4,trevis-129vm5 /usr/sbin/lctl set_param mdd.*.append_stripe_count=0
      mdd.lustre-MDT0000.append_stripe_count=0
      mdd.lustre-MDT0002.append_stripe_count=0
      mdd.lustre-MDT0001.append_stripe_count=0
      mdd.lustre-MDT0003.append_stripe_count=0
      CMD: trevis-129vm4,trevis-129vm5 /usr/sbin/lctl set_param mdd.*.append_stripe_count=2
      mdd.lustre-MDT0000.append_stripe_count=2
      mdd.lustre-MDT0002.append_stripe_count=2
      mdd.lustre-MDT0001.append_stripe_count=2
      mdd.lustre-MDT0003.append_stripe_count=2
      CMD: trevis-129vm4,trevis-129vm5 /usr/sbin/lctl set_param mdd.*.append_stripe_count=-1
      mdd.lustre-MDT0000.append_stripe_count=-1
      mdd.lustre-MDT0002.append_stripe_count=-1
      mdd.lustre-MDT0001.append_stripe_count=-1
      mdd.lustre-MDT0003.append_stripe_count=-1
      /usr/lib64/lustre/tests/sanity.sh: line 3101: /mnt/lustre/d27M.sanity/f27M.sanity.5: Invalid argument
      lfs: getstripe for '/mnt/lustre/d27M.sanity/f27M.sanity.5' failed: No such file or directory
      /usr/lib64/lustre/tests/sanity.sh: line 3103: [: -eq: unary operator expected
      sanity test_27M: @@@@@@ FAIL: (5) stripe count , should be 8 for append
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:6585:error()
      = /usr/lib64/lustre/tests/sanity.sh:3104:test_27M()
      = /usr/lib64/lustre/tests/test-framework.sh:6925:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:6974:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:6811:run_test()
      = /usr/lib64/lustre/tests/sanity.sh:3181:main()
      Dumping lctl log to /autotest/autotest-2/2023-06-05/lustre-reviews_review-ldiskfs-dne_95319_27_26b27a74-2421-4453-9c33-cd237feca413//sanity.test_27M.*.1685936415.log
      CMD: trevis-129vm1.trevis.whamcloud.com,trevis-129vm2,trevis-129vm3,trevis-129vm4,trevis-129vm5 /usr/sbin/lctl dk > /autotest/autotest-2/2023-06-05/lustre-reviews_review-ldiskfs-dne_95319_27_26b27a74-2421-4453-9c33-cd237feca413//sanity.test_27M.debug_log.$(hostname -s).1685936415.log;
      dmesg > /autotest/autotest-2/2023-06-05/lustre-reviews_review-ldiskfs-dne_95319_27_26b27a74-2421-4453-9c33-cd237feca413//sanity.test_27M.dmesg.$(hostname -s).1685936415.log
      CMD: trevis-129vm4,trevis-129vm5 /usr/sbin/lctl set_param mdd.*.append_stripe_count=1
      mdd.lustre-MDT0000.append_stripe_count=1
      mdd.lustre-MDT0002.append_stripe_count=1
      mdd.lustre-MDT0001.append_stripe_count=1
      mdd.lustre-MDT0003.append_stripe_count=1
      CMD: trevis-129vm4,trevis-129vm5 /usr/sbin/lctl set_param mdd.*.append_pool=none
      mdd.lustre-MDT0000.append_pool=none
      mdd.lustre-MDT0002.append_pool=none
      mdd.lustre-MDT0001.append_pool=none
      mdd.lustre-MDT0003.append_pool=none

          Activity

            [LU-16872] sanity: test_27M Error: '(5) stripe count , should be 8 for append'
            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51602/
            Subject: LU-16872 tests: exercise sanity test_27M more fully
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7bb1685048bf999df03ceadab39faa09b8a5560d

            Comment above added by Gerrit Updater (gerrit).

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51559/
            Subject: LU-16872 lod: reset llc_ostlist when using O_APPEND stripes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 766b35a9700f36aa08b652fa9d18b890d34bf4a5

            Comment above added by Gerrit Updater (gerrit).

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51602
            Subject: LU-16872 tests: exercise sanity test_27M more fully
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 221a2d05d5d4ec2b39c88c6a5d84df2ba3f177dc

            Comment above added by Gerrit Updater (gerrit).

            How to reproduce:

            # setup:
            $ lctl set_param mdd.*.append_stripe_count=-1
            $ lfs setstripe -o 1,3 /mnt/lustre
            
            # touch enough files with the default striping so that every mdt kernel thread probably has the defaults stored in its memory
            $ for i in {0..100}; do touch /mnt/lustre/x$i; done
            
            # now an append should return EINVAL as long as it gets handled by a kernel thread that previously did a create with default stripes
            $ echo 1 >> /mnt/lustre/f
            -bash: /mnt/lustre/f: Invalid argument

            A closely related problem occurs when an append_pool is set. In that case the create succeeds, but the append file is created with the default stripes rather than in the pool.

            I haven't identified which patch caused (or uncovered) the issue yet – I didn't see anything obvious in the patches merged shortly before the first test failure occurred. So I'll attempt a git bisect to try to find what caused this and will update if I get that answer.

            Comment above added by Thomas Bertschinger (bertschinger).
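
            The reproducer depends on an MDT service thread reusing per-thread state from an earlier create. A minimal C model of that failure mode follows. All names here (thread_scratch, prepare_create, prepare_append, uses_explicit_ost_list) are hypothetical and are not Lustre's actual structures or functions; the sketch only illustrates how a stale OST list could survive into an O_APPEND create.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OSTS 8

/* Hypothetical stand-in for per-thread layout scratch space. */
struct thread_scratch {
    unsigned int ost_list[MAX_OSTS]; /* explicit OST indices, e.g. from "lfs setstripe -o" */
    size_t       ost_count;          /* 0 means "no explicit OST list" */
    int          stripe_count;       /* -1 means "stripe over all OSTs" */
};

/* A create that inherits an explicit OST list leaves the list populated. */
void prepare_create(struct thread_scratch *ts, const unsigned int *osts, size_t n)
{
    assert(n <= MAX_OSTS);
    for (size_t i = 0; i < n; i++)
        ts->ost_list[i] = osts[i];
    ts->ost_count = n;
    ts->stripe_count = (int)n;
}

/*
 * An O_APPEND create overrides the stripe count. Whether it also
 * clears the list left behind by the previous request is the
 * behaviour being modelled.
 */
void prepare_append(struct thread_scratch *ts, int append_stripe_count,
                    bool reset_ost_list)
{
    ts->stripe_count = append_stripe_count;
    if (reset_ost_list)
        ts->ost_count = 0;
}

/* The allocator takes the explicit-list path whenever a list is present. */
bool uses_explicit_ost_list(const struct thread_scratch *ts)
{
    return ts->ost_count > 0;
}
```

            In this model, an append prepared on a scratch buffer that previously held the "-o 1,3" list still takes the explicit-list path unless the list is reset, which would explain why the reproducer only fails on threads that have already handled a default-striped create.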

            "Thomas Bertschinger <bertschinger@lanl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51559
            Subject: LU-16872 lod: do not stripe O_APPEND files on specific OSTs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 81e2439da705dd35dc2c8c687be21cf7dc952eba

            Comment above added by Gerrit Updater (gerrit).

            Oops, I wasn't thinking straight this morning but overflowing llc_pool wouldn't affect op_array and op_count since the buffer wouldn't be adjacent to these fields. So I'm still looking for what could cause op_array and op_count to have bad values. (I still think that's the most likely explanation for the issue.)

            Comment above added by Thomas Bertschinger (bertschinger).
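
            The layout argument in this correction can be checked with a small standalone C sketch. The struct names below are simplified stand-ins, not the real struct lod_layout_component or struct lu_tgt_pool definitions, and the field types are assumed. Only the relationship between a char * member and the fields that follow it is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the OST list (field types are assumed). */
struct tgt_pool_model {
    unsigned int *op_array;
    unsigned int  op_count;
};

/* Simplified stand-in for the layout component. */
struct comp_model {
    char                 *llc_pool;    /* points at a separately stored name buffer */
    struct tgt_pool_model llc_ostlist; /* follows the pointer, not the buffer */
};

/* Returns true if address p lies inside the struct's own storage. */
bool addr_inside_comp(const struct comp_model *c, const void *p)
{
    assert(c != NULL);

    uintptr_t start = (uintptr_t)c;
    uintptr_t addr = (uintptr_t)p;

    return addr >= start && addr < start + sizeof(*c);
}
```

            Because the struct holds only the pointer, writing past the end of the buffer that llc_pool points to lands in whatever memory neighbours that buffer, not in op_array or op_count.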

            Here's a status update since the bug is still causing failures. I'm still looking at this and have not found the root cause yet, but I do have a hypothesis.

            The origin of the EINVAL response to the open() call is lod_alloc_ost_list(). From the debug logs, it seems it must be one of these:

                            if (!test_bit(ost_idx, m->lod_ost_bitmap)) {
                                    rc = -EINVAL;
                                    break;
                            }
            ...
                            if (lod_qos_is_tgt_used(env, ost_idx, stripe_count) &&
                                !(lod_comp->llc_pattern & LOV_PATTERN_OVERSTRIPING)) {
                                    rc = -EINVAL;
                                    break;
                            } 

            However, I am fairly confident that lod_alloc_ost_list() should NOT be getting called in the append case at all, because this function appears to be for files with specifically set stripes, e.g., it gets called if a file inherits custom striping from its parent directory.

            Therefore, I believe the actual problem is that lod_comp->llc_ostlist.op_array and lod_comp->llc_ostlist.op_count are incorrectly non-zero when the failure occurs. My current hypothesis for the cause is this: the struct lu_tgt_pool inside struct lod_layout_component (where the op_array and op_count fields are) is preceded by char *llc_pool so it's possible that char * is being overflowed and putting garbage values into the array and count fields. So currently I'm looking at where llc_pool is set and if any of these spots could plausibly have an overflow.

            Comment above added by Thomas Bertschinger (bertschinger).

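            The two checks quoted in that comment can be modelled outside the kernel. The sketch below is a simplified user-space approximation, not the real lod_alloc_ost_list(): the function name, the plain bool arrays, and the round-robin walk over the list are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_OSTS 16

/*
 * Try to place stripe_count stripes on the OSTs named in ost_list.
 * Returns 0 on success or -EINVAL, mirroring the two quoted checks:
 *   1. the OST index must be one of the configured OSTs;
 *   2. an OST may not be used twice unless overstriping is allowed.
 */
int model_alloc_ost_list(const unsigned int *ost_list, size_t list_len,
                         size_t stripe_count, const bool *ost_present,
                         bool overstriping)
{
    bool used[MODEL_MAX_OSTS] = { false };

    assert(list_len > 0);
    for (size_t i = 0; i < stripe_count; i++) {
        /* Assumed: the list is walked round-robin when it is too short. */
        unsigned int ost_idx = ost_list[i % list_len];

        if (ost_idx >= MODEL_MAX_OSTS || !ost_present[ost_idx])
            return -EINVAL;
        if (used[ost_idx] && !overstriping)
            return -EINVAL;
        used[ost_idx] = true;
    }
    return 0;
}
```

            Under these assumptions, a leftover two-entry list such as 1,3 combined with an append stripe count covering all eight OSTs fails on the third stripe, which is consistent with the "Invalid argument" seen in the test log.
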
            arshad512 Arshad Hussain added a comment - +1 on master ( https://testing.whamcloud.com/sub_tests/399cef35-a321-43da-944a-84f8ce67c9f1 )

            Andreas - I'm away on vacation through June 25 so if this bug needs a quick resolution, you may want to have someone else look into it. Otherwise I'll continue to work on this when I get back next week.

            Comment above added by Thomas Bertschinger (bertschinger).

            People

              Assignee: bertschinger Thomas Bertschinger
              Reporter: maloo Maloo