LU-14377

parallel-scale test rr_alloc fails with "Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.4
    • Affects Version/s: Lustre 2.14.0, Lustre 2.12.4, Lustre 2.15.1, Lustre 2.15.3
    • Labels: None
    • Environment: ZFS
    • Severity: 3

    Description

      parallel-scale test_rr_alloc fails with "Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2". This test appears to have been failing since at least 22 NOV 2019, with
      Lustre 2.12.3.31 - https://testing.whamcloud.com/test_sets/b8fa5b22-0d76-11ea-98f1-52540065bddc
      Lustre 2.13.51.72 - https://testing.whamcloud.com/test_sets/32e5a306-44f4-11ea-8072-52540065bddc

      Since 2020-06-11, this test has been failing occasionally.

      Looking at the test_suite log for the failure at https://testing.whamcloud.com/test_sets/875e0375-cc23-4f0f-8291-f4f9034e340c, we see

      + su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -np 22 /usr/lib64/openmpi/bin/rr_alloc /tmp/rr_alloc_mntpt/lustre/drr_alloc.parallel-scale/ash 555 2 "
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n lod.lustre-MDT0000-mdtlov.qos_threshold_rr=17%
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0000-osc-MDT0000.create_count=1024
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0001-osc-MDT0000.create_count=1024
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0002-osc-MDT0000.create_count=1024
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0003-osc-MDT0000.create_count=2048
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0004-osc-MDT0000.create_count=2048
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0005-osc-MDT0000.create_count=2048
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0006-osc-MDT0000.create_count=1024
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0007-osc-MDT0000.create_count=2048
       parallel-scale test_rr_alloc: @@@@@@ FAIL: Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/functions.sh:1120:run_rr_alloc()
        = /usr/lib64/lustre/tests/parallel-scale.sh:163:test_rr_alloc()
      

      Apart from the error message, the command and the create_count values are the same as those seen when this test passes.
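As a rough sanity check on the numbers in the log, the sketch below works out what a perfectly even distribution would look like. It assumes that `-np 22` is the number of MPI processes, that `555` is the number of files each process creates, and that the files are spread over the 8 OSTs listed in the create_count commands. The meaning of the rr_alloc arguments is not stated in this ticket, so these are assumptions.

```shell
#!/bin/bash
# Assumptions (not confirmed by the ticket): each of the 22 MPI ranks
# creates 555 files, and the files land on the 8 OSTs named in the
# create_count commands.
nprocs=22
files_per_proc=555
num_osts=8

total_files=$((nprocs * files_per_proc))
even_share=$((total_files / num_osts))   # floor of the per-OST share
remainder=$((total_files % num_osts))    # OSTs that would hold one extra file

echo "total files:        $total_files"
echo "even share per OST: $even_share (remainder $remainder)"
```

Under these assumptions an ideal round-robin would leave 1526 or 1527 files on each OST, so the reported minimum (1525) and maximum (1528) are each within 2 of the ideal share even though their difference is 3.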

      From run_rr_alloc() in functions.sh, here is how we compute the difference in the number of stripes created per OST:

      1095         if [[ $total_MNTPTS -ne 0 ]]; then
      1096                 # Now start the actual file creation app.
      1097                 mpi_run "-np $total_MNTPTS" $cmd || return
      1098         else
      1099                 error "No mount point"
      1100         fi
      1101 
      1102         restore_lustre_params < $qos_prec_objs
      1103         rm -f $qos_prec_objs
      1104 
      1105         diff_max_min_arr=($($LFS getstripe -r $DIR/$tdir/ |
      1106                 grep "lmm_stripe_offset:" | awk '{print $2}' | sort -n |
      1107                 uniq -c | awk 'NR==1 {min=max=$1} \
      1108                 { $1<min ? min=$1 : min; $1>max ? max=$1 : max} \
      1109                 END {print max-min, max, min}'))
      1110 
      1111         rm -rf $DIR/$tdir
      1112 
      1113         # In-case of fairly large number of file creation using RR (round-robin)
      1114         # there can be two cases in which deviation will occur than the regular
      1115         # RR algo behaviour-
      1116         # 1- When rr_alloc does not start right with 'lqr_start_count' reseeded,
      1117         # 2- When rr_alloc does not finish with 'lqr_start_count == 0'.
      1118         # So the difference of files b/w any 2 OST should not be more than 2.
      1119         [[ ${diff_max_min_arr[0]} -le 2 ]] ||
      1120                 error "Uneven distribution detected: difference between" \
      1121                 "maximum files per OST (${diff_max_min_arr[1]}) and" \
      1122                 "minimum files per OST (${diff_max_min_arr[2]}) must not be" \
      1123                 "greater than 2"
      
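The counting stage of that pipeline can be tried without a Lustre file system. The sketch below is a stand-alone approximation: the function name `spread_per_ost` is hypothetical, and the sample input stands in for the OST indices that `lfs getstripe` would report as `lmm_stripe_offset`.

```shell
#!/bin/bash
# Print "<max-min> <max> <min>" for a stream of OST indices, one per line.
# Approximates the sort | uniq -c | awk stage of run_rr_alloc();
# spread_per_ost is a hypothetical name used only for this example.
spread_per_ost() {
    sort -n | uniq -c | awk '
        NR == 1 { min = max = $1 }
        { if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END { print max - min, max, min }'
}

# Sample input: OST 0 holds 3 files, OST 1 holds 5, OST 2 holds 4.
printf '%s\n' 0 0 0 1 1 1 1 1 2 2 2 2 | spread_per_ost
# prints: 2 5 3
```

An OST that received no files never appears in the `uniq -c` output, so this calculation only compares OSTs that hold at least one file.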

          Activity

            adilger Andreas Dilger added a comment - Patches already landed for this ticket, new ticket LU-17251 tracking new patches.

            adilger Andreas Dilger added a comment - This is still failing regularly, often with very large differences between the most and least used OSTs (i.e. 90% difference instead of just 2-3% difference). So it seems either that the test as written is unreliable (creating too many objects, running on an already-imbalanced system, etc.), or there is a new bug in the code.
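To compare the absolute limit of 2 files with the percentage figures in this comment, the gap between the most and least used OSTs can be expressed relative to the maximum. The sketch below is only an illustration: the helper name `imbalance_pct` and the choice of the maximum as the denominator are assumptions, and it is not the check used by the patches landed for this ticket.

```shell
#!/bin/bash
# Print the max/min gap as a percentage of max, to two decimal places.
# imbalance_pct is a hypothetical helper, not part of the Lustre test suite.
imbalance_pct() {
    local max=$1 min=$2
    awk -v max="$max" -v min="$min" 'BEGIN {
        if (max == 0) { print "0.00"; exit }   # avoid dividing by zero
        printf "%.2f\n", (max - min) * 100 / max
    }'
}

# Values from the failure in the description.
imbalance_pct 1528 1525
# prints: 0.20
```

For the failure in the description this gives about 0.20%, well below the 2-3% and 90% figures mentioned in the comment.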

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51142/
            Subject: LU-14377 tests: make parallel-scale/rr_alloc less strict
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: c0b60c0c79a2d5d5be651570564d6d0407457a5f

            gerrit Gerrit Updater added a comment -

            "Minh Diep <mdiep@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51142
            Subject: LU-14377 tests: make parallel-scale/rr_alloc less strict
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 1194fd9de1d3853482a5c51574cd4ba91f1ce9ab

            pjones Peter Jones added a comment - Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48914/
            Subject: LU-14377 tests: make parallel-scale/rr_alloc less strict
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b104c0a27713899a4d047f56fed57c30c39b8195

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48914
            Subject: LU-14377 tests: make parallel-scale/rr_alloc less strict
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e5456735dbb92dbb438bef45cdf8cbfc55ce99cc

            sarah Sarah Liu added a comment - +1 in 2.15.1 https://testing.whamcloud.com/test_sets/59410f79-0bed-4fd3-97fc-e80941e5a00c
            nangelinas Nikitas Angelinas added a comment - +1 on master: https://testing.whamcloud.com/test_sets/d09310fb-2944-46a6-84e3-67634a12f39d

            People

              Assignee: adilger Andreas Dilger
              Reporter: jamesanunez James Nunez (Inactive)