Details

    • Type: Improvement
    • Resolution: Not a Bug
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.3.0
    • 6208

    Description

      (I'm not sure about the issue type for this ticket, please adjust as appropriate.)

      As discussed with Peter Jones, we are trying to implement a file system where single clients can achieve >900MB/s write throughput over 10GigE connections. Ideally the clients would each use a single 10GigE link, but 2x10GigE LACP bonding might be an option. The OSSes will initially have 4x10GigE LACP-bonded links, though for some initial testing we might start with fewer links.
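      For reference, LACP (802.3ad) bonding on RHEL6 is normally configured along the lines of the sketch below; the device names and addresses here are placeholders, not our actual configuration. Note that with per-flow transmit hashing a single TCP stream still uses only one physical link, so bonding mainly helps aggregate throughput over several connections.

      # /etc/sysconfig/network-scripts/ifcfg-bond0 (placeholder address)
      DEVICE=bond0
      BOOTPROTO=none
      ONBOOT=yes
      IPADDR=192.168.1.10
      NETMASK=255.255.255.0
      BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

      # /etc/sysconfig/network-scripts/ifcfg-eth2 (one such file per slave interface)
      DEVICE=eth2
      MASTER=bond0
      SLAVE=yes
      ONBOOT=yes
      BOOTPROTO=none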

      The disk backend has now arrived, and this is a sample obdfilter-survey result using all 41 OSTs on the 4 OSSes, without much tuning on the OSS nodes yet. The OSSes are all running Lustre 2.3.0 on RHEL6.

      Sat Jan 19 15:49:23 GMT 2013 Obdfilter-survey for case=disk from cs04r-sc-oss05-03.diamond.ac.uk
      ost 41 sz 687865856K rsz 1024K obj   41 thr   41 write 2975.14 [  40.00, 105.99] rewrite 2944.84 [  22.00, 118.99] read 8104.33 [  40.99, 231.98]
      ost 41 sz 687865856K rsz 1024K obj   41 thr   82 write 5231.39 [  49.99, 167.98] rewrite 4984.58 [  29.98, 171.89] read 13807.08 [ 161.99, 514.92]
      ost 41 sz 687865856K rsz 1024K obj   41 thr  164 write 9445.93 [  82.99, 293.98] rewrite 9722.32 [ 149.98, 324.96] read 17851.10 [ 191.97, 869.92]
      ost 41 sz 687865856K rsz 1024K obj   41 thr  328 write 15872.41 [ 265.96, 533.94] rewrite 16682.58 [ 245.97, 526.97] read 19312.61 [ 184.98, 794.93]
      ost 41 sz 687865856K rsz 1024K obj   41 thr  656 write 18704.47 [ 222.98, 651.94] rewrite 18733.29 [ 252.90, 634.83] read 21040.28 [ 260.98, 808.92]
      ost 41 sz 687865856K rsz 1024K obj   41 thr 1312 write 18291.71 [ 161.99, 740.93] rewrite 18443.63 [  47.00, 704.91] read 20683.56 [ 178.99, 908.91]
      ost 41 sz 687865856K rsz 1024K obj   41 thr 2624 write 18704.50 [  19.00, 684.92] rewrite 18583.81 [  25.00, 729.92] read 20400.08 [ 110.99, 982.88]
      ost 41 sz 687865856K rsz 1024K obj   82 thr   82 write 5634.08 [  62.99, 176.98] rewrite 4640.45 [  55.00, 162.98] read 9459.26 [ 114.98, 320.99]
      ost 41 sz 687865856K rsz 1024K obj   82 thr  164 write 9615.85 [  95.99, 308.98] rewrite 8329.19 [ 122.99, 275.99] read 13967.03 [ 150.99, 430.97]
      ost 41 sz 687865856K rsz 1024K obj   82 thr  328 write 13846.63 [ 229.99, 461.97] rewrite 12576.55 [ 186.98, 390.97] read 18166.27 [ 130.99, 557.94]
      ost 41 sz 687865856K rsz 1024K obj   82 thr  656 write 18558.35 [ 268.98, 624.93] rewrite 16821.93 [ 246.85, 542.95] read 19645.73 [ 235.85, 676.92]
      ost 41 sz 687865856K rsz 1024K obj   82 thr 1312 write 18885.19 [ 117.99, 690.92] rewrite 16501.04 [ 115.99, 617.95] read 19255.26 [ 180.97, 832.89]
      ost 41 sz 687865856K rsz 1024K obj   82 thr 2624 write 18991.31 [ 127.51, 784.92] rewrite 18111.05 [  31.00, 763.88] read 20333.42 [ 124.48, 997.82]
      ost 41 sz 687865856K rsz 1024K obj  164 thr  164 write 7513.17 [  69.99, 236.95] rewrite 5611.77 [  65.00, 198.96] read 12950.03 [  80.99, 383.96]
      ost 41 sz 687865856K rsz 1024K obj  164 thr  328 write 13191.77 [ 216.99, 361.98] rewrite 10104.73 [ 129.99, 313.98] read 18380.92 [ 149.98, 529.97]
      ost 41 sz 687865856K rsz 1024K obj  164 thr  656 write 16442.83 [ 168.98, 494.91] rewrite 14155.27 [ 213.98, 452.97] read 19564.97 [ 238.85, 616.95]
      ost 41 sz 687865856K rsz 1024K obj  164 thr 1312 write 18070.58 [ 152.96, 612.91] rewrite 15744.41 [  62.99, 540.96] read 18846.31 [ 160.99, 660.84]
      ost 41 sz 687865856K rsz 1024K obj  164 thr 2624 write 18664.83 [ 138.97, 767.93] rewrite 16648.63 [  81.28, 603.93] read 19319.91 [  79.97, 864.90]
      ost 41 sz 687865856K rsz 1024K obj  328 thr  328 write 9028.81 [  66.00, 277.97] rewrite 6807.19 [  42.99, 228.98] read 14799.75 [ 123.98, 491.92]
      ost 41 sz 687865856K rsz 1024K obj  328 thr  656 write 14471.67 [ 155.98, 427.97] rewrite 11632.72 [ 130.99, 375.98] read 19137.29 [ 127.79, 595.92]
      ost 41 sz 687865856K rsz 1024K obj  328 thr 1312 write 17084.20 [ 179.98, 533.95] rewrite 13810.96 [  64.00, 449.96] read 18405.80 [ 182.98, 616.95]
      ost 41 sz 687865856K rsz 1024K obj  328 thr 2624 write 18583.14 [  24.99, 684.92] rewrite 15588.87 [  68.99, 579.93] read 18857.33 [ 160.98, 706.96]
      ost 41 sz 687865856K rsz 1024K obj  656 thr  656 write 9861.09 [ 121.98, 312.96] rewrite 7540.60 [  70.00, 258.96] read 15160.96 [ 193.96, 483.94]
      ost 41 sz 687865856K rsz 1024K obj  656 thr 1312 write 15021.83 [ 175.97, 450.95] rewrite 11641.17 [  97.99, 389.98] read 18470.04 [ 205.99, 597.91]
      ost 41 sz 687865856K rsz 1024K obj  656 thr 2624 write 17202.58 [  84.98, 589.90] rewrite 14483.38 [ 143.98, 491.91] read 18475.50 [ 179.98, 631.94]
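
      For reference, results like the above come from lustre-iokit's obdfilter-survey. A sketch of the kind of invocation involved, with the sweep parameters inferred from the output above rather than copied from the actual run, would be:

      # run on the OSS(es) as root; size is MB written per OST,
      # rszlo/rszhi, nobjhi and thrhi bound the record-size, object and thread sweeps
      # (a targets= list may also be needed to cover OSTs spread over several OSSes)
      size=16384 rszlo=1024 rszhi=1024 nobjhi=16 thrhi=64 case=disk sh obdfilter-survey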
      

      We have not yet done any tests with clients (in fact, the 10GigE network still needs to be configured), but I would like to ask whether there is any reason why we should not achieve our goal with this storage hardware.

      I will also update the ticket once we've done some tests with clients.

      Attachments

        Activity

          [LU-2659] single client throughput for 10GigE

          ferner Frederik Ferner (Inactive) added a comment -

          here you go (this seems slower?):

          [bnh65367@cs04r-sc-serv-68 frederik1]$ mkdir stripe-20-1
          [bnh65367@cs04r-sc-serv-68 frederik1]$ lfs setstripe -c 20 stripe-20-1
          [bnh65367@cs04r-sc-serv-68 frederik1]$ export IORTESTDIR=/mnt/lustre-test/frederik1/stripe-20-1
          [bnh65367@cs04r-sc-serv-68 frederik1]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t1m -b 20g -i 1 -e
          IOR-3.0.0: MPI Coordinated Test of Parallel I/O
          
          Began: Fri Feb 15 17:01:36 2013
          Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/stripe-20-1/ior_dat -w -k -t1m -b 20g -i 1 -e
          Machine: Linux cs04r-sc-serv-68.diamond.ac.uk
          
          Test 0 started: Fri Feb 15 17:01:36 2013
          Summary:
                  api                = POSIX
                  test filename      = /mnt/lustre-test/frederik1/stripe-20-1/ior_dat
                  access             = single-shared-file
                  ordering in a file = sequential offsets
                  ordering inter file= no tasks offsets
                  clients            = 1 (1 per node)
                  repetitions        = 1
                  xfersize           = 1 MiB
                  blocksize          = 20 GiB
                  aggregate filesize = 20 GiB
          
          access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
          ------    ---------  ---------- ---------  --------   --------   --------   --------   ----
          write     412.76     20971520   1024.00    0.000905   49.62      0.000229   49.62      0   
          
          Max Write: 412.76 MiB/sec (432.81 MB/sec)
          
          Summary of all tests:
          Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
          write         412.76     412.76     412.76       0.00   49.61738 0 1 1 1 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0
          
          Finished: Fri Feb 15 17:02:25 2013
          [bnh65367@cs04r-sc-serv-68 frederik1]$ 
          
          mdiep Minh Diep added a comment -

          Hi Frederik,

          Thanks for the quick response. Could you try again with lfs setstripe -c 20? Thanks.


          ferner Frederik Ferner (Inactive) added a comment -

          Also, the reason I tried the lnet selftest with concurrency one is that my feeling is this might be close to what happens for single-process writes. Looking at the two throughput numbers (concurrency=1 lnet selftest and single-process ior), they seem very close to each other.

          Also the other day I did a test while watching /proc/sys/lnet/peers every 1/10th second and there was only ever one of the two nids with anything reported as queued. Not sure if this is relevant or not...
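
          For reference, watching this at that interval can be done with something like the sketch below; the exact command used is not recorded here.

          # sample the LNet peer state (including the per-peer queue column) ten times a second
          watch -n 0.1 cat /proc/sys/lnet/peers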


          ferner Frederik Ferner (Inactive) added a comment -

          Minh,

          570MiB/s see below (for just one iteration)

          [bnh65367@cs04r-sc-serv-68 frederik1]$ mkdir single-oss
          [bnh65367@cs04r-sc-serv-68 frederik1]$ lfs setstripe -c 10 -o 1 single-oss/
          [bnh65367@cs04r-sc-serv-68 frederik1]$ export IORTESTDIR=/mnt/lustre-test/frederik1/single-oss
          [bnh65367@cs04r-sc-serv-68 frederik1]$ export NSLOTS=1
          [bnh65367@cs04r-sc-serv-68 frederik1]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t1m -b 20g -i 1 -e
          IOR-3.0.0: MPI Coordinated Test of Parallel I/O
          
          Began: Fri Feb 15 16:37:50 2013
          Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/single-oss/ior_dat -w -k -t1m -b 20g -i 1 -e
          Machine: Linux cs04r-sc-serv-68.diamond.ac.uk
          
          Test 0 started: Fri Feb 15 16:37:50 2013
          Summary:
                  api                = POSIX
                  test filename      = /mnt/lustre-test/frederik1/single-oss/ior_dat
                  access             = single-shared-file
                  ordering in a file = sequential offsets
                  ordering inter file= no tasks offsets
                  clients            = 1 (1 per node)
                  repetitions        = 1
                  xfersize           = 1 MiB
                  blocksize          = 20 GiB
                  aggregate filesize = 20 GiB
          
          access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
          ------    ---------  ---------- ---------  --------   --------   --------   --------   ----
          write     570.96     20971520   1024.00    0.000590   35.87      0.000212   35.87      0   
          
          Max Write: 570.96 MiB/sec (598.69 MB/sec)
          
          Summary of all tests:
          Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
          write         570.96     570.96     570.96       0.00   35.86953 0 1 1 1 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0
          
          Finished: Fri Feb 15 16:38:26 2013
          [bnh65367@cs04r-sc-serv-68 frederik1]$ 
          
          mdiep Minh Diep added a comment -

          Frederik,

          I suggest we measure how much an OSS can deliver to one client. You can achieve this by lfs setstripe -c 10 -o 1 <ior dir>. Please try ior with -np 1 and let me know.

          Thanks


          ferner Frederik Ferner (Inactive) added a comment -

          Minh,

          Not sure if I mentioned this, but I had to reduce my file system to only 20 OSTs on 2 OSSes, as I had to start investigating alternatives on the rest of the hardware. Here is the requested lctl dl -t output.

          [bnh65367@cs04r-sc-serv-68 frederik1]$ lctl dl -t
          0 UP mgc MGC172.23.66.29@tcp 59376037-38a3-b21b-689c-217c3f9bd463 5
          1 UP lov spfs1-clilov-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 4
          2 UP lmv spfs1-clilmv-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 4
          3 UP mdc spfs1-MDT0000-mdc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          4 UP osc spfs1-OST0001-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          5 UP osc spfs1-OST0002-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          6 UP osc spfs1-OST0003-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          7 UP osc spfs1-OST0004-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          8 UP osc spfs1-OST0005-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          9 UP osc spfs1-OST0006-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          10 UP osc spfs1-OST0007-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          11 UP osc spfs1-OST0008-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          12 UP osc spfs1-OST0009-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          13 UP osc spfs1-OST0000-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
          14 UP osc spfs1-OST000a-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          15 UP osc spfs1-OST000b-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          16 UP osc spfs1-OST000c-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          17 UP osc spfs1-OST000d-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          18 UP osc spfs1-OST000e-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          19 UP osc spfs1-OST000f-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          20 UP osc spfs1-OST0010-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          21 UP osc spfs1-OST0011-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          22 UP osc spfs1-OST0012-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          23 UP osc spfs1-OST0013-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
          [bnh65367@cs04r-sc-serv-68 frederik1]$

          mdiep Minh Diep added a comment -

          Hi, could you print out the lctl dl -t output from your client? Thanks.

          mdiep Minh Diep added a comment -

          Yes, it's because you set concurrency=1; it's like running a single thread.


          ferner Frederik Ferner (Inactive) added a comment -

          The throughput in the previous test was good, though I've noticed that it seems to drop to about 550MB/s if I use one client, one server and reduce the concurrency to 1. I wonder if that is related to the single-stream performance that we experience? Is the client effectively only ever writing to one server at a time, or something similar?

          [bnh65367@cs04r-sc-serv-68 bin]$ sudo ./lnet-selftest-wc.sh -k 111 -r start -s cs04r-sc-oss05-03-10g -C 1
          CONCURRENCY=1
          session is ended
          SESSION: hh FEATURES: 0 TIMEOUT: 100000 FORCE: No
          cs04r-sc-serv-68-10g are added to session
          cs04r-sc-oss05-03-10g are added to session
          Test was added successfully
          Test was added successfully
          b is running now
          Batch: b Tests: 2 State: 177
                  ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
          client  1       0       0       0       1
          server  1       0       0       0       1
                  Test 1(brw) (loop: 1800000, concurrency: 1)
                  ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
          client  1       0       0       0       1
          server  1       0       0       0       1
                  Test 2(brw) (loop: 1800000, concurrency: 1)
                  ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
          client  1       0       0       0       1
          server  1       0       0       0       1
          
          [LNet Rates of c]
          [R] Avg: 2245     RPC/s Min: 2245     RPC/s Max: 2245     RPC/s
          [W] Avg: 1688     RPC/s Min: 1688     RPC/s Max: 1688     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 557.79   MB/s  Min: 557.79   MB/s  Max: 557.79   MB/s
          [W] Avg: 565.19   MB/s  Min: 565.19   MB/s  Max: 565.19   MB/s
          [LNet Rates of s]
          [R] Avg: 1688     RPC/s Min: 1688     RPC/s Max: 1688     RPC/s
          [W] Avg: 2246     RPC/s Min: 2246     RPC/s Max: 2246     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 564.91   MB/s  Min: 564.91   MB/s  Max: 564.91   MB/s
          [W] Avg: 557.47   MB/s  Min: 557.47   MB/s  Max: 557.47   MB/s
          [LNet Rates of c]
          [R] Avg: 2246     RPC/s Min: 2246     RPC/s Max: 2246     RPC/s
          [W] Avg: 1689     RPC/s Min: 1689     RPC/s Max: 1689     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 556.52   MB/s  Min: 556.52   MB/s  Max: 556.52   MB/s
          [W] Avg: 566.62   MB/s  Min: 566.62   MB/s  Max: 566.62   MB/s
          [LNet Rates of s]
          [R] Avg: 1690     RPC/s Min: 1690     RPC/s Max: 1690     RPC/s
          [W] Avg: 2246     RPC/s Min: 2246     RPC/s Max: 2246     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 566.36   MB/s  Min: 566.36   MB/s  Max: 566.36   MB/s
          [W] Avg: 556.22   MB/s  Min: 556.22   MB/s  Max: 556.22   MB/s
          [LNet Rates of c]
          [R] Avg: 2250     RPC/s Min: 2250     RPC/s Max: 2250     RPC/s
          [W] Avg: 1690     RPC/s Min: 1690     RPC/s Max: 1690     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 559.59   MB/s  Min: 559.59   MB/s  Max: 559.59   MB/s
          [W] Avg: 565.44   MB/s  Min: 565.44   MB/s  Max: 565.44   MB/s
          [LNet Rates of s]
          [R] Avg: 1691     RPC/s Min: 1691     RPC/s Max: 1691     RPC/s
          [W] Avg: 2250     RPC/s Min: 2250     RPC/s Max: 2250     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 565.11   MB/s  Min: 565.11   MB/s  Max: 565.11   MB/s
          [W] Avg: 559.31   MB/s  Min: 559.31   MB/s  Max: 559.31   MB/s
          [LNet Rates of c]
          [R] Avg: 2248     RPC/s Min: 2248     RPC/s Max: 2248     RPC/s
          [W] Avg: 1688     RPC/s Min: 1688     RPC/s Max: 1688     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 560.44   MB/s  Min: 560.44   MB/s  Max: 560.44   MB/s
          [W] Avg: 563.74   MB/s  Min: 563.74   MB/s  Max: 563.74   MB/s
          [LNet Rates of s]
          [R] Avg: 1687     RPC/s Min: 1687     RPC/s Max: 1687     RPC/s
          [W] Avg: 2247     RPC/s Min: 2247     RPC/s Max: 2247     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 563.46   MB/s  Min: 563.46   MB/s  Max: 563.46   MB/s
          [W] Avg: 560.16   MB/s  Min: 560.16   MB/s  Max: 560.16   MB/s
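
          One way to check whether the client actually keeps RPCs in flight to more than one OST at a time during a single-process write is to look at the per-OSC tunables and RPC statistics on the client. A sketch, assuming these standard parameters are exposed as usual on this client version:

          # how many concurrent RPCs each OSC may issue, and how much dirty data it may cache
          lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb
          # per-OSC histogram of "rpcs in flight" observed while the write is running
          lctl get_param osc.*.rpc_stats
          # the rpc_stats counters can be cleared before a run with:
          lctl set_param osc.*.rpc_stats=0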
          

          ferner Frederik Ferner (Inactive) added a comment -

          Thanks for the lnet_selftest script, though it was a bit hard to read with the formatting etc.

          I've quickly run that now on one client and two servers, output below:

          [bnh65367@cs04r-sc-serv-68 bin]$ sudo ./lnet-selftest-wc.sh -k 111 -r start
          SESSION: hh FEATURES: 0 TIMEOUT: 100000 FORCE: No
          cs04r-sc-serv-68-10g are added to session
          cs04r-sc-oss05-03-10g are added to session
          cs04r-sc-oss05-04-10g are added to session
          Test was added successfully
          Test was added successfully
          b is running now
          Batch: b Tests: 2 State: 177
                  ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
          client  1       0       0       0       1
          server  2       0       0       0       2
                  Test 1(brw) (loop: 1800000, concurrency: 32)
                  ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
          client  1       0       0       0       1
          server  2       0       0       0       2
                  Test 2(brw) (loop: 1800000, concurrency: 32)
                  ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
          client  1       0       0       0       1
          server  2       0       0       0       2
          
          [LNet Rates of c]
          [R] Avg: 7704     RPC/s Min: 7704     RPC/s Max: 7704     RPC/s
          [W] Avg: 5695     RPC/s Min: 5695     RPC/s Max: 5695     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 2011.80  MB/s  Min: 2011.80  MB/s  Max: 2011.80  MB/s
          [W] Avg: 1841.79  MB/s  Min: 1841.79  MB/s  Max: 1841.79  MB/s
          [LNet Rates of s]
          [R] Avg: 2849     RPC/s Min: 2208     RPC/s Max: 3490     RPC/s
          [W] Avg: 3853     RPC/s Min: 3045     RPC/s Max: 4661     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 921.26   MB/s  Min: 683.45   MB/s  Max: 1159.07  MB/s
          [W] Avg: 1005.93  MB/s  Min: 840.24   MB/s  Max: 1171.62  MB/s
          [LNet Rates of c]
          [R] Avg: 7634     RPC/s Min: 7634     RPC/s Max: 7634     RPC/s
          [W] Avg: 5634     RPC/s Min: 5634     RPC/s Max: 5634     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 1998.43  MB/s  Min: 1998.43  MB/s  Max: 1998.43  MB/s
          [W] Avg: 1819.24  MB/s  Min: 1819.24  MB/s  Max: 1819.24  MB/s
          [LNet Rates of s]
          [R] Avg: 2818     RPC/s Min: 2137     RPC/s Max: 3499     RPC/s
          [W] Avg: 3816     RPC/s Min: 2961     RPC/s Max: 4672     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 909.21   MB/s  Min: 656.14   MB/s  Max: 1162.28  MB/s
          [W] Avg: 998.65   MB/s  Min: 823.79   MB/s  Max: 1173.52  MB/s
          [LNet Rates of c]
          [R] Avg: 7322     RPC/s Min: 7322     RPC/s Max: 7322     RPC/s
          [W] Avg: 5409     RPC/s Min: 5409     RPC/s Max: 5409     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 1914.47  MB/s  Min: 1914.47  MB/s  Max: 1914.47  MB/s
          [W] Avg: 1747.85  MB/s  Min: 1747.85  MB/s  Max: 1747.85  MB/s
          [LNet Rates of s]
          [R] Avg: 2704     RPC/s Min: 1897     RPC/s Max: 3510     RPC/s
          [W] Avg: 3660     RPC/s Min: 2636     RPC/s Max: 4685     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 873.47   MB/s  Min: 579.29   MB/s  Max: 1167.64  MB/s
          [W] Avg: 956.83   MB/s  Min: 738.93   MB/s  Max: 1174.73  MB/s
          [LNet Rates of c]
          [R] Avg: 7580     RPC/s Min: 7580     RPC/s Max: 7580     RPC/s
          [W] Avg: 5594     RPC/s Min: 5594     RPC/s Max: 5594     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 1988.69  MB/s  Min: 1988.69  MB/s  Max: 1988.69  MB/s
          [W] Avg: 1803.03  MB/s  Min: 1803.03  MB/s  Max: 1803.03  MB/s
          [LNet Rates of s]
          [R] Avg: 2796     RPC/s Min: 2112     RPC/s Max: 3480     RPC/s
          [W] Avg: 3789     RPC/s Min: 2927     RPC/s Max: 4650     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 901.00   MB/s  Min: 647.69   MB/s  Max: 1154.30  MB/s
          [W] Avg: 993.80   MB/s  Min: 817.02   MB/s  Max: 1170.58  MB/s
          [LNet Rates of c]
          [R] Avg: 8064     RPC/s Min: 8064     RPC/s Max: 8064     RPC/s
          [W] Avg: 5957     RPC/s Min: 5957     RPC/s Max: 5957     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 2105.40  MB/s  Min: 2105.40  MB/s  Max: 2105.40  MB/s
          [W] Avg: 1926.91  MB/s  Min: 1926.91  MB/s  Max: 1926.91  MB/s
          [LNet Rates of s]
          [R] Avg: 2973     RPC/s Min: 2468     RPC/s Max: 3479     RPC/s
          [W] Avg: 4026     RPC/s Min: 3403     RPC/s Max: 4648     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 961.77   MB/s  Min: 768.23   MB/s  Max: 1155.32  MB/s
          [W] Avg: 1050.98  MB/s  Min: 932.80   MB/s  Max: 1169.15  MB/s
          [LNet Rates of c]
          [R] Avg: 7601     RPC/s Min: 7601     RPC/s Max: 7601     RPC/s
          [W] Avg: 5624     RPC/s Min: 5624     RPC/s Max: 5624     RPC/s
          [LNet Bandwidth of c]
          [R] Avg: 1977.67  MB/s  Min: 1977.67  MB/s  Max: 1977.67  MB/s
          [W] Avg: 1824.20  MB/s  Min: 1824.20  MB/s  Max: 1824.20  MB/s
          [LNet Rates of s]
          [R] Avg: 2814     RPC/s Min: 2173     RPC/s Max: 3454     RPC/s
          [W] Avg: 3802     RPC/s Min: 2993     RPC/s Max: 4610     RPC/s
          [LNet Bandwidth of s]
          [R] Avg: 912.18   MB/s  Min: 676.17   MB/s  Max: 1148.19  MB/s
          [W] Avg: 988.94   MB/s  Min: 820.77   MB/s  Max: 1157.10  MB/s
          No session exists
          

          This was running until I terminated it in another window:

          [bnh65367@cs04r-sc-serv-68 bin]$ sudo ./lnet-selftest-wc.sh -k 111 -r stop
          c:
          Total 0 error nodes in c
          s:
          Total 0 error nodes in s
          1 batch in stopping
          Batch is stopped
          session is ended
          [bnh65367@cs04r-sc-serv-68 bin]$
          

          This was done using Lustre 2.3.59 on the client and 2.3.0 on the servers. The client is the same hardware and network configuration as in the previous tests.

          mdiep Minh Diep added a comment -

          Here is a sample script to run the brw test that I have used before. You can edit it to fit your environment.

          #!/bin/sh

          PATH=$PATH:/usr/sbin
          SIZE=1M
          USAGE="usage: $0 -s server_list -c client_list -k session_key -r start|stop -S size"

          while getopts :s:c:k:r:S: opt_char
          do
              case $opt_char in
                  s) S=$OPTARG;;
                  c) C=$OPTARG;;
                  k) KEY=$OPTARG;;
                  r) STATE=$OPTARG;;
                  S) SIZE=$OPTARG;;
                  :) echo "The $OPTARG option requires an argument."
                     exit 1;;
                  ?) echo "$OPTARG is not a valid option."
                     echo "$USAGE"
                     exit 1;;
              esac
          done

          # NIDs are hard-coded here and override the -c/-s options above;
          # edit (or remove) these two lines for your environment
          C=xxx@o2ib1
          S=xxx@o2ib1

          C_COUNT=`echo $C | wc -w`
          S_COUNT=`echo $S | wc -w`

          case "$STATE" in
          start)
              # try to clear the old session, if any
              export LST_SESSION=`lst show_session 2>/dev/null | awk -F " " '{print $5}'`
              [ "$LST_SESSION" != "" ] && lst end_session
              export LST_SESSION=$KEY
              lst new_session --timeo 100000 hh
              lst add_group c $C
              lst add_group s $S
              lst add_batch b
              lst add_test --batch b --loop 1800000 --concurrency 32 \
                  --distribute $C_COUNT:$S_COUNT --from c \
                  --to s brw read check=full size=$SIZE
              lst add_test --batch b --loop 1800000 --concurrency 32 \
                  --distribute $C_COUNT:$S_COUNT --from c \
                  --to s brw write check=full size=$SIZE
              lst run b
              sleep 5
              lst list_batch b
              echo ""
              lst stat --delay 20 c s
              ;;
          stop)
              export LST_SESSION=$KEY
              lst show_error c s
              lst stop b
              lst end_session
              ;;
          esac
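
          Saved as, say, lnet_selftest.sh (a hypothetical name), and with the hard-coded C=/S= lines edited for the local NIDs, usage following the script's own USAGE string would look something like:

          # start a read+write brw test between a client NID and a server NID with 1M I/Os
          ./lnet_selftest.sh -c <client_nid> -s <server_nid> -k 111 -r start -S 1M
          # later, stop the batch and end the session
          ./lnet_selftest.sh -k 111 -r stop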


          People

            Assignee: mdiep Minh Diep
            Reporter: ferner Frederik Ferner (Inactive)
            Votes: 0
            Watchers: 10
