Lustre / LU-14377

parallel-scale test_rr_alloc fails with "Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2"


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.4
    • Affects Version/s: Lustre 2.14.0, Lustre 2.12.4, Lustre 2.15.1, Lustre 2.15.3
    • Labels: None
    • Environment: ZFS
    • Severity: 3

    Description

      parallel-scale test_rr_alloc fails with "Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2". This test appears to have been failing since at least 22 NOV 2019, with failures seen on:
      Lustre 2.12.3.31 - https://testing.whamcloud.com/test_sets/b8fa5b22-0d76-11ea-98f1-52540065bddc
      Lustre 2.13.51.72 - https://testing.whamcloud.com/test_sets/32e5a306-44f4-11ea-8072-52540065bddc

      Since 2020-06-11, this test has been failing occasionally.

      Looking at the test_suite log for the failure at https://testing.whamcloud.com/test_sets/875e0375-cc23-4f0f-8291-f4f9034e340c, we see:

      + su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -np 22 /usr/lib64/openmpi/bin/rr_alloc /tmp/rr_alloc_mntpt/lustre/drr_alloc.parallel-scale/ash 555 2 "
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n lod.lustre-MDT0000-mdtlov.qos_threshold_rr=17%
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0000-osc-MDT0000.create_count=1024
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0001-osc-MDT0000.create_count=1024
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0002-osc-MDT0000.create_count=1024
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0003-osc-MDT0000.create_count=2048
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0004-osc-MDT0000.create_count=2048
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0005-osc-MDT0000.create_count=2048
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0006-osc-MDT0000.create_count=1024
      CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0007-osc-MDT0000.create_count=2048
       parallel-scale test_rr_alloc: @@@@@@ FAIL: Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/functions.sh:1120:run_rr_alloc()
        = /usr/lib64/lustre/tests/parallel-scale.sh:163:test_rr_alloc()
      

      Apart from the error message, the command and the create_count values are the same as those seen when this test passes.
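
As a rough sanity check on the numbers in this log, the arithmetic below estimates what a perfectly even round-robin layout would look like. It assumes, which the ticket does not state, that each of the 22 MPI ranks creates 555 single-stripe files and that only the 8 OSTs listed above receive objects. `expected_spread` is a hypothetical helper and is not part of the Lustre test suite.

```shell
#!/bin/bash
# Hypothetical helper: estimate an ideal even distribution of files over OSTs.
# Assumptions (not confirmed by the ticket): every rank creates the same
# number of files, each file has one stripe, and all OSTs are eligible.
expected_spread() {
	local ranks=$1 files_per_rank=$2 osts=$3
	local total=$((ranks * files_per_rank))   # total files created
	local base=$((total / osts))              # files every OST would get
	local extra=$((total % osts))             # OSTs that would get one more
	echo "$total $base $extra"
}

# Values taken from the mpirun command line and the OST list in the log.
expected_spread 22 555 8
```

Under those assumptions, 12210 files over 8 OSTs gives 1526 per OST with 2 left over, so an ideal layout would have a maximum-minus-minimum difference of 1. The reported 1528 and 1525 differ by 3, one more than the tolerance of 2 that the test allows.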

      In run_rr_alloc() in functions.sh, the difference between the maximum and minimum number of files created per OST is computed as follows:

      1095         if [[ $total_MNTPTS -ne 0 ]]; then
      1096                 # Now start the actual file creation app.
      1097                 mpi_run "-np $total_MNTPTS" $cmd || return
      1098         else
      1099                 error "No mount point"
      1100         fi
      1101 
      1102         restore_lustre_params < $qos_prec_objs
      1103         rm -f $qos_prec_objs
      1104 
      1105         diff_max_min_arr=($($LFS getstripe -r $DIR/$tdir/ |
      1106                 grep "lmm_stripe_offset:" | awk '{print $2}' | sort -n |
      1107                 uniq -c | awk 'NR==1 {min=max=$1} \
      1108                 { $1<min ? min=$1 : min; $1>max ? max=$1 : max} \
      1109                 END {print max-min, max, min}'))
      1110 
      1111         rm -rf $DIR/$tdir
      1112 
      1113         # In-case of fairly large number of file creation using RR (round-robin)
      1114         # there can be two cases in which deviation will occur than the regular
      1115         # RR algo behaviour-
      1116         # 1- When rr_alloc does not start right with 'lqr_start_count' reseeded,
      1117         # 2- When rr_alloc does not finish with 'lqr_start_count == 0'.
      1118         # So the difference of files b/w any 2 OST should not be more than 2.
      1119         [[ ${diff_max_min_arr[0]} -le 2 ]] ||
      1120                 error "Uneven distribution detected: difference between" \
      1121                 "maximum files per OST (${diff_max_min_arr[1]}) and" \
      1122                 "minimum files per OST (${diff_max_min_arr[2]}) must not be" \
      1123                 "greater than 2"
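
The max/min calculation in the excerpt above can be tried without a Lustre mount by feeding it canned `lmm_stripe_offset:` lines. The following is a minimal sketch, not the test suite's code: `ost_spread` is a hypothetical name, the sample input is made up for illustration, and the counting is done in a single awk pass rather than with `sort | uniq -c`.

```shell
#!/bin/bash
# Hypothetical helper: read "lmm_stripe_offset: <ost_index>" lines on stdin
# and print "<max-min> <max> <min>" for the per-OST file counts.
# Like the original pipeline, it only sees OSTs that hold at least one file,
# so an OST with zero files would not lower the reported minimum.
ost_spread() {
	awk '/lmm_stripe_offset:/ { count[$2]++ }
	END {
		first = 1
		for (ost in count) {
			if (first || count[ost] < min) min = count[ost]
			if (first || count[ost] > max) max = count[ost]
			first = 0
		}
		print max - min, max, min
	}'
}

# Made-up sample: 3 files on OST 0, 2 on OST 1, 1 on OST 2.
printf 'lmm_stripe_offset: %s\n' 0 0 0 1 1 2 | ost_spread
```

On a real file system the input would come from `$LFS getstripe -r` on the test directory, as in the excerpt. The first field of the output is the value the test compares against 2.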
      

            People

              adilger Andreas Dilger
              jamesanunez James Nunez (Inactive)