
incorrect round robin object allocation

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • None
    • any Lustre from 1.6.0
    • 3
    • 24,194
    • 7276

    Description

      https://bugzilla.lustre.org/show_bug.cgi?id=24194

      The bug is caused by incorrect locking in the lov_qos code and can easily be reproduced with the following test:

      diff --git a/lustre/lov/lov_qos.c b/lustre/lov/lov_qos.c 
      index a101e9c..64ccefb 100644 
      --- a/lustre/lov/lov_qos.c 
      +++ b/lustre/lov/lov_qos.c 
      @@ -627,6 +627,8 @@ static int alloc_rr(struct lov_obd *lov, int *idx_arr, int *stripe_cnt, 
      
       repeat_find: 
               array_idx = (lqr->lqr_start_idx + lqr->lqr_offset_idx) % osts->op_count; 
      +        CFS_FAIL_TIMEOUT_MS(OBD_FAIL_MDS_LOV_CREATE_RACE, 100); 
      + 
               idx_pos = idx_arr; 
       #ifdef QOS_DEBUG 
               CDEBUG(D_QOS, "pool '%s' want %d startidx %d startcnt %d offset %d "
      
      test_51() {
              local obj1
              local obj2
              local old_rr
      
              mkdir -p $DIR1/$tfile-1/
              mkdir -p $DIR2/$tfile-2/
              old_rr=$(do_facet $SINGLEMDS lctl get_param -n 'lov.lustre-MDT*/qos_threshold_rr' | sed -e 's/%//')
              do_facet $SINGLEMDS lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' 100
      #define OBD_FAIL_MDS_LOV_CREATE_RACE     0x148
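              # 0x80000000 is the one-shot flag, so the injected delay fires only once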
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x80000148"
              touch $DIR1/$tfile-1/file1 &
              PID1=$!
              touch $DIR2/$tfile-2/file2 &
              PID2=$!
              wait $PID2
              wait $PID1
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x0"
              do_facet $SINGLEMDS "lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' $old_rr"
      
              obj1=$($GETSTRIPE -o $DIR1/$tfile-1/file1)
              obj2=$($GETSTRIPE -o $DIR1/$tfile-2/file2)
              [ $obj1 -eq $obj2 ] && error "objects must be allocated on different OSTs"
      }
      run_test 51 "alloc_rr should allocate objects in the correct order"
      

      The bug was found in 2.x but should exist in 1.8 as well.

      CFS_FAIL_TIMEOUT_MS can be replaced with CFS_RACE()
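
      To illustrate the problem outside of Lustre, below is a minimal user-space model of the race (names and structure are hypothetical; this is not the lov_qos code): two creator threads read and advance a shared round-robin start index without a lock, so both may compute the same OST index and lose increments, which skews the distribution. Taking the mutex (use_lock = 1) keeps the per-OST counts exactly even.

      /* rr_race.c - simplified model of an unprotected RR start index.
       * Hypothetical demo code, not the lov_qos/lod implementation. */
      #include <pthread.h>
      #include <stdio.h>

      #define OST_COUNT   4
      #define CREATES     100000

      static unsigned int start_idx;                  /* shared, like lqr_start_idx */
      static pthread_mutex_t idx_lock = PTHREAD_MUTEX_INITIALIZER;
      static int use_lock;                            /* 0 = racy, 1 = "fixed"      */

      /* One object allocation: pick the next OST in round-robin order. */
      static int alloc_rr_one(void)
      {
              int ost;

              if (use_lock)
                      pthread_mutex_lock(&idx_lock);
              ost = start_idx % OST_COUNT;            /* both threads may read the  */
              start_idx++;                            /* same value here            */
              if (use_lock)
                      pthread_mutex_unlock(&idx_lock);
              return ost;
      }

      static void *creator(void *arg)
      {
              int *counts = arg;                      /* per-thread counters        */
              int i;

              for (i = 0; i < CREATES; i++)
                      counts[alloc_rr_one()]++;
              return NULL;
      }

      int main(void)
      {
              int c1[OST_COUNT] = { 0 }, c2[OST_COUNT] = { 0 };
              pthread_t t1, t2;
              int i;

              pthread_create(&t1, NULL, creator, c1);
              pthread_create(&t2, NULL, creator, c2);
              pthread_join(t1, NULL);
              pthread_join(t2, NULL);

              /* Racy run: counts drift away from 2 * CREATES / OST_COUNT.
               * With use_lock = 1 every OST gets exactly the same count. */
              for (i = 0; i < OST_COUNT; i++)
                      printf("OST%d: %d objects\n", i, c1[i] + c2[i]);
              return 0;
      }

      Compile with cc -pthread; the racy run typically prints uneven per-OST counts, mirroring the imbalance that test_51 above detects.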

      Attachments

        Issue Links

          Activity

            [LU-977] incorrect round robin object allocation
            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14636/
            Subject: LU-977 lod: Patch to protect lqr_start_idx
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d9b4bc5476c779aaaee6797e5e148b5e0b771980
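
            The patch subject only states that lqr_start_idx is now protected; a minimal sketch of the idea, assuming a lock in the round-robin descriptor (the lock name below is made up, and this is not the actual patch), is to serialize the read-and-advance of the start index so that concurrent creates cannot compute the same array_idx:

            /* Sketch only: take a lock across the index read and advance.
             * lqr_lock is an assumed name, not necessarily what the patch adds. */
            spin_lock(&lqr->lqr_lock);
            array_idx = (lqr->lqr_start_idx + lqr->lqr_offset_idx) % osts->op_count;
            lqr->lqr_start_idx++;
            spin_unlock(&lqr->lqr_lock);

            The exact placement differs in the allocator itself, but the principle is the same: the read of the start index and its advance must happen atomically with respect to concurrent creates.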


            520557 Rahul Deshmukh (Inactive) added a comment -

            Adding testing output demonstrating the fix:

            NOTE: The tests were run on real hardware on Lustre 2.5.1, with the version of the fix in which a separate function was not created for the calls inside the loop. I hope that's fine.

            WITHOUT Fix
            =============
            For reproducing the issue, a cluster with 12 OSTs and 12 client machines was used. Each client is a 24-CPU-core machine.
            The IOR load is like the first example from CEAP-82, with smaller I/O transfer and block parameters.
            The number of threads is 12 * 24, i.e. 24 threads per client.
            To increase concurrency, each thread operated through its own Lustre mount point, so each client had 24 Lustre mounts.
            The main script:

            [CEAP-82]$ cat ceap-82.sh 
            #! /bin/bash
            
            
            MPIEXEC=/home/bloewe/libs/mpich-3.1/install/bin/mpiexec
            # EXE=/home/bloewe/benchmarks/mdtest-1.9.3/mdtest
            EXE=/home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR
            NPROCS=$((12 * 24))
            TARGET=/mnt/lustre/ceap-82
            MPISCRIPT=/home/bloewe/CEAP-82/mpi-script.sh
            
            HOSTS=$PWD/hostfile
            
            # OPTS="-a POSIX -B -C -E -F -e -g -k -b 4g -t 32m -vvv -o /lustre/crayadm/tmp/testdir.12403/IOR_POSIX"
            # OPTS="-a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -vvv -o /mnt/lustre/ceap-82/IOR_POSIX"
            OPTS="-a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -v"
            
            
            # mkdir -p distr
            # 
            # for n in {0..100}; do 
            # 	rm -f /mnt/lustre/ceap-82/IOR_POSIX*
            # 	$MPIEXEC -f $HOSTS -np $NPROCS $EXE $OPTS
            # 	for x in /mnt/lustre/ceap-82/IOR_POSIX*; do
            # 		lfs getstripe -i $x
            # 	done | sort -n | uniq -c > distr/d-$n
            # done
            
            rm -f /mnt/lustre/ceap-82/IOR_POSIX*
            $MPIEXEC -f $HOSTS -np $NPROCS $MPISCRIPT $OPTS
            for x in /mnt/lustre/ceap-82/IOR_POSIX*; do
                 lfs getstripe -i $x
            done | sort -n | uniq -c
            [CEAP-82]$ 
            

            and an IOR wrapper, needed to map 24 IOR instances to 24 lustre mount points:

            [CEAP-82]$ cat mpi-script.sh 
            #! /bin/bash
            
            LMOUNT=/mnt/lustre$(($PMI_RANK % 24))
            exec /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR "$@" -o $LMOUNT/ceap-82/IOR_POSIX
            [CEAP-82]$ 
            

            This way we simulate client load from 24 * 12 = 288 clients.
            The result should be an equal distribution of the IOR working files across all OSTs, 24 files per OST, if the RR algorithm works correctly.

            However, the result of the test run shows an uneven file distribution:

            Max Write: 5988.51 MiB/sec (6279.41 MB/sec)
            Max Read: 10008.34 MiB/sec (10494.51 MB/sec)

            Run finished: Sat Mar 21 14:10:12 2015
            26 0
            25 1
            24 2
            29 3
            20 4
            23 5
            24 6
            25 7
            25 8
            24 9
            23 10
            20 11
            [CEAP-82]$

            The number of files per OST varies from 20 to 29, which cannot be explained by "reseeds" in lod_rr_alloc().

            The log from the MDS ssh session, with qos_threshold_rr being set:

            [root@orange00 ~]# pdsh -g mds lctl set_param lod.*.qos_threshold_rr=100
            orange03: error: set_param: /proc/{fs,sys}/{lnet,lustre}/lod/*/qos_threshold_rr: Found no match
            pdsh@orange00: orange03: ssh exited with exit code 3
            orange10: lod.orangefs-MDT0001-mdtlov.qos_threshold_rr=100
            orange12: lod.orangefs-MDT0003-mdtlov.qos_threshold_rr=100
            orange11: lod.orangefs-MDT0002-mdtlov.qos_threshold_rr=100
            orange13: lod.orangefs-MDT0004-mdtlov.qos_threshold_rr=100
            orange02: lod.orangefs-MDT0000-mdtlov.qos_threshold_rr=100
            [root@orange00 ~]# slogin orange02
            Last login: Sat Mar 21 14:07:41 PDT 2015 from 172.16.2.3 on ssh
            [root@orange02 ~]# lctl set_param lctl set_param debug=-1 subsystem_debug=lov debug_mb=1200
            error: set_param: /proc/{fs,sys}/{lnet,lustre}/lctl: Found no match
            debug=-1
            subsystem_debug=lov
            debug_mb=1200
            [root@orange02 ~]# lctl set_param debug=-1 subsystem_debug=lov debug_mb=1200
            debug=-1
            subsystem_debug=lov
            debug_mb=1200
            [root@orange02 ~]# lctl dk > /dev/null
            [root@orange02 ~]# lctl dk /tmp/ceap-82.txt
            Debug log: 5776 lines, 5776 kept, 0 dropped, 0 bad.
            [root@orange02 ~]# lctl get_param lod.*.qos_threshold_rr
            lod.orangefs-MDT0000-mdtlov.qos_threshold_rr=100%
            [root@orange02 ~]# less /tmp/ceap-82.txt 
            [root@orange02 ~]# logout
            

            WITH Fix
            =======

            The following 5 runs showed an equal file distribution across all OSTs:
            [CEAP-82]$ for n in {1..5} ; do sh ceap-82.sh; done
            IOR-2.10.3: MPI Coordinated Test of Parallel I/O
            
            Run began: Mon Mar 23 02:27:28 2015
            Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
            Machine: Linux sjsc-321
            
            Summary:
            	api                = POSIX
            	test filename      = /mnt/lustre0/ceap-82/IOR_POSIX
            	access             = file-per-process
            	ordering in a file = sequential offsets
            	ordering inter file=constant task offsets = 1
            	clients            = 288 (24 per node)
            	repetitions        = 1
            	xfersize           = 2 MiB
            	blocksize          = 200 MiB
            	aggregate filesize = 56.25 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write        4310.41    4310.41     4310.41      0.00    2155.21    2155.21     2155.21      0.00  13.36299   EXCEL
            read        10076.77   10076.77    10076.77      0.00    5038.38    5038.38     5038.38      0.00   5.71612   EXCEL
            
            Max Write: 4310.41 MiB/sec (4519.80 MB/sec)
            Max Read:  10076.77 MiB/sec (10566.26 MB/sec)
            
            Run finished: Mon Mar 23 02:27:48 2015
                 24 0
                 24 1
                 24 2
                 24 3
                 24 4
                 24 5
                 24 6
                 24 7
                 24 8
                 24 9
                 24 10
                 24 11
            IOR-2.10.3: MPI Coordinated Test of Parallel I/O
            
            Run began: Mon Mar 23 02:27:50 2015
            Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
            Machine: Linux sjsc-321
            
            Summary:
            	api                = POSIX
            	test filename      = /mnt/lustre0/ceap-82/IOR_POSIX
            	access             = file-per-process
            	ordering in a file = sequential offsets
            	ordering inter file=constant task offsets = 1
            	clients            = 288 (24 per node)
            	repetitions        = 1
            	xfersize           = 2 MiB
            	blocksize          = 200 MiB
            	aggregate filesize = 56.25 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write        4269.62    4269.62     4269.62      0.00    2134.81    2134.81     2134.81      0.00  13.49066   EXCEL
            read        10078.78   10078.78    10078.78      0.00    5039.39    5039.39     5039.39      0.00   5.71498   EXCEL
            
            Max Write: 4269.62 MiB/sec (4477.02 MB/sec)
            Max Read:  10078.78 MiB/sec (10568.37 MB/sec)
            
            Run finished: Mon Mar 23 02:28:10 2015
                 24 0
                 24 1
                 24 2
                 24 3
                 24 4
                 24 5
                 24 6
                 24 7
                 24 8
                 24 9
                 24 10
                 24 11
            IOR-2.10.3: MPI Coordinated Test of Parallel I/O
            
            Run began: Mon Mar 23 02:28:11 2015
            Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
            Machine: Linux sjsc-321
            
            Summary:
            	api                = POSIX
            	test filename      = /mnt/lustre0/ceap-82/IOR_POSIX
            	access             = file-per-process
            	ordering in a file = sequential offsets
            	ordering inter file=constant task offsets = 1
            	clients            = 288 (24 per node)
            	repetitions        = 1
            	xfersize           = 2 MiB
            	blocksize          = 200 MiB
            	aggregate filesize = 56.25 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write        4343.86    4343.86     4343.86      0.00    2171.93    2171.93     2171.93      0.00  13.26011   EXCEL
            read        10091.59   10091.59    10091.59      0.00    5045.80    5045.80     5045.80      0.00   5.70772   EXCEL
            
            Max Write: 4343.86 MiB/sec (4554.86 MB/sec)
            Max Read:  10091.59 MiB/sec (10581.80 MB/sec)
            
            Run finished: Mon Mar 23 02:28:31 2015
                 24 0
                 24 1
                 24 2
                 24 3
                 24 4
                 24 5
                 24 6
                 24 7
                 24 8
                 24 9
                 24 10
                 24 11
            IOR-2.10.3: MPI Coordinated Test of Parallel I/O
            
            Run began: Mon Mar 23 02:28:33 2015
            Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
            Machine: Linux sjsc-321
            
            Summary:
            	api                = POSIX
            	test filename      = /mnt/lustre0/ceap-82/IOR_POSIX
            	access             = file-per-process
            	ordering in a file = sequential offsets
            	ordering inter file=constant task offsets = 1
            	clients            = 288 (24 per node)
            	repetitions        = 1
            	xfersize           = 2 MiB
            	blocksize          = 200 MiB
            	aggregate filesize = 56.25 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write        4169.55    4169.55     4169.55      0.00    2084.78    2084.78     2084.78      0.00  13.81443   EXCEL
            read         9997.98    9997.98     9997.98      0.00    4998.99    4998.99     4998.99      0.00   5.76116   EXCEL
            
            Max Write: 4169.55 MiB/sec (4372.09 MB/sec)
            Max Read:  9997.98 MiB/sec (10483.64 MB/sec)
            
            Run finished: Mon Mar 23 02:28:53 2015
                 24 0
                 24 1
                 24 2
                 24 3
                 24 4
                 24 5
                 24 6
                 24 7
                 24 8
                 24 9
                 24 10
                 24 11
            IOR-2.10.3: MPI Coordinated Test of Parallel I/O
            
            Run began: Mon Mar 23 02:28:56 2015
            Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
            Machine: Linux sjsc-321
            
            Summary:
            	api                = POSIX
            	test filename      = /mnt/lustre0/ceap-82/IOR_POSIX
            	access             = file-per-process
            	ordering in a file = sequential offsets
            	ordering inter file=constant task offsets = 1
            	clients            = 288 (24 per node)
            	repetitions        = 1
            	xfersize           = 2 MiB
            	blocksize          = 200 MiB
            	aggregate filesize = 56.25 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write        4427.20    4427.20     4427.20      0.00    2213.60    2213.60     2213.60      0.00  13.01049   EXCEL
            read        10026.62   10026.62    10026.62      0.00    5013.31    5013.31     5013.31      0.00   5.74471   EXCEL
            
            Max Write: 4427.20 MiB/sec (4642.25 MB/sec)
            Max Read:  10026.62 MiB/sec (10513.67 MB/sec)
            
            Run finished: Mon Mar 23 02:29:15 2015
                 24 0
                 24 1
                 24 2
                 24 3
                 24 4
                 24 5
                 24 6
                 24 7
                 24 8
                 24 9
                 24 10
                 24 11
            [CEAP-82]$ 
            

            520557 Rahul Deshmukh (Inactive) added a comment -

            Posted the new patch to address this issue. Please review.


            gerrit Gerrit Updater added a comment -

            Rahul Deshmukh (rahul.deshmukh@seagate.com) uploaded a new patch: http://review.whamcloud.com/14636
            Subject: LU-977 lod: Patch to protect lqr_start_idx
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 86aa10e5b8c6b944c10ce224078ba4f6aafbe6eb


            bzzz Alex Zhuravlev added a comment -

            Cory, notice "totally": totally precise RR is possible only with very strong locking around allocation, IMHO, which is virtually not possible with DNE and not very good on multi-core either.


            shadow Alexey Lyashkov added a comment -

            If I understand Alex correctly, he means that different MDTs may allocate on the same OST, so an OST may end up with a different number of allocated objects. But that is not the case if each MDT has its own OST pool assigned.

            spitzcor Cory Spitz added a comment -

            I don't see how 'precise' RR is not possible with DNE. If an application wants evenly balanced stripe allocation, that should still be possible as the allocators aren't linked in DNE. So then if the one MDS allocator hasn't switched to the space-based allocator, then round-robin should still be (mostly) 'precise', correct?


            bzzz Alex Zhuravlev added a comment -

            Totally precise RR is not possible with DNE, for example.

            shadow Alexey Lyashkov added a comment -

            Eugene Birkine added a comment - 06/Dec/11 9:05 PM
            Debug log file from MDS with qos_threshold_rr=100 during 16 file writes. The file distribution was:
            testfs-OST0000: 2
            testfs-OST0001: 3
            testfs-OST0002: 2
            testfs-OST0003: 1
            testfs-OST0004: 2
            testfs-OST0005: 3
            testfs-OST0006: 1
            testfs-OST0007: 2

            People

              bogl Bob Glossman (Inactive)
              shadow Alexey Lyashkov
              Votes: 0
              Watchers: 18
