LU-2649: slow IO at NOAA

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor
    • Affects Version/s: Lustre 1.8.7

    Description

      (Sorry, this is kind of a book.)

      For the past couple of weeks, NOAA has been having significantly reduced performance on one of their filesystems, scratch2. Both scratch2 and scratch1 have the same type of hardware and are connected to the same IB switches and network. The problem was first reported by a user doing a lot of opens and reads. The streaming rates appear to be normal, but the open rates have dropped significantly.

      We ran mdtest to confirm the drop:

      [Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8
      /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256
      -d /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      -- started at 01/14/2013 15:34:17 --

      mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
      Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n
      2000 -i 3 -u -w 256 -d
      /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      Path: /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson
      FS: 2504.0 TiB Used FS: 33.7% Inodes: 2250.0 Mi Used Inodes: 5.1%

      8 tasks, 15912 files/directories

      SUMMARY: (of 3 iterations)
      Operation            Max         Min         Mean        Std Dev
      ---------            ---         ---         ----        -------
      Directory creation:  3567.963    3526.380    3544.592    17.364
      Directory stat:      38138.970   32977.207   36170.723   2278.437
      Directory removal:   2184.674    2171.584    2178.737    5.412
      File creation:       3251.627    2585.204    3024.967    311.008
      File stat:           14236.137   13053.559   13622.039   483.862
      File removal:        3274.005    3046.231    3138.367    97.944
      Tree creation:       511.472     407.753     460.822     42.378
      Tree removal:        290.478     282.657     286.577     3.193

      -- finished at 01/14/2013 15:35:29 --

      [Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8
      /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256
      -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      -- started at 01/14/2013 15:25:24 --

      mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
      Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n
      2000 -i 3 -u -w 256 -d
      /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      Path: /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson
      FS: 3130.0 TiB Used FS: 39.7% Inodes: 2250.0 Mi Used Inodes: 14.2%

      8 tasks, 15912 files/directories

      SUMMARY: (of 3 iterations)
      Operation            Max         Min        Mean       Std Dev
      ---------            ---         ---        ----       -------
      Directory creation:  2327.187    1901.660   2094.272   176.043
      Directory stat:      10265.979   8306.610   9315.476   800.973
      Directory removal:   1600.981    1301.208   1407.570   136.989
      File creation:       1592.205    1426.700   1528.690   72.839
      File stat:           913.205     581.097    740.446    135.914
      File removal:        1733.900    1288.562   1492.555   183.717
      Tree creation:       303.718     241.506    266.777    26.705
      Tree removal:        159.793     66.400     122.692    40.470

      -- finished at 01/14/2013 15:28:37 --

      We've investigated a few different areas without much luck. First we looked at the storage to make sure it was healthy, and it looks remarkably clean; drive and LUN latencies are all low. We also checked iostat on the MDS to make sure the average wait time was in line with what we see on scratch1, and it looks normal as well.

      We took a look at the average request wait time on the client, and found that requests to the OSTs and MDTs are taking significantly longer to complete on scratch2 vs scratch1:
      [Kit.Westneat@bmem5 ~]$ for x in /proc/fs/lustre/osc/scratch2-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
      4326.63
      [Kit.Westneat@bmem5 ~]$ for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
      883.17

      scratch1 mdc avg req_waittime 191.04
      scratch2 mdc avg req_waittime 1220.99
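
      For context on how these numbers are derived: if I'm reading the 1.8 stats files right, the req_waittime line in /proc/fs/lustre/osc/*/stats has the form "req_waittime <count> samples [usec] <min> <max> <sum> <sumsq>", so $7/$2 in the one-liners above is the mean RPC wait per OSC in microseconds, and the /220 presumably just averages that over the 220 OSTs. A slightly tidier sketch of the same calculation (the default filesystem name is only an example) might look like:

      #!/bin/bash
      # Sketch: per-OSC mean req_waittime (usec) plus the overall mean across
      # all OSCs for one filesystem. Assumes the Lustre 1.8 stats line format:
      #   req_waittime <count> samples [usec] <min> <max> <sum> <sumsq>
      # so $7/$2 is the mean wait. The default filesystem name is just an example.
      fsname=${1:-scratch2}

      awk '/^req_waittime/ {
              avg  = $7 / $2                  # mean wait for this OSC, usec
              name = FILENAME
              sub(/^.*\/osc\//, "", name)     # strip the /proc prefix
              sub(/\/stats$/,   "", name)
              printf "%-40s %12.2f\n", name, avg
              total += avg; n++
           }
           END { if (n) printf "%-40s %12.2f\n", "mean over " n " OSCs", total / n }
      ' /proc/fs/lustre/osc/"${fsname}"-OST*/stats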

      We investigated the IB network some, but couldn't find anything obviously wrong. One interesting thing to note is that some of the OSSes on scratch2 use asymmetric routes between themselves and the test client (bmem5). For example:
      [root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2 62 31
      From ca {0x0002c903000faa4a} portnum 1 lid 62-62 "bmem5 mlx4_1"
      [1] -> switch port {0x00066a00e3002e7c}[2] lid 4-4 "QLogic 12300 GUID=0x00066a00e3002e7c"
      [20] -> switch port {0x00066a00e30032de}[2] lid 2-2 "QLogic 12300 GUID=0x00066a00e30032de"
      [28] -> switch port {0x00066a00e3003338}[10] lid 6-6 "QLogic 12300 GUID=0x00066a00e3003338"
      [32] -> ca port {0x0002c903000f9194}[2] lid 31-31 "lfs-oss-2-01 HCA-2"
      To ca {0x0002c903000f9192} portnum 2 lid 31-31 "lfs-oss-2-01 HCA-2"

      [root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2 31 62
      From ca {0x0002c903000f9192} portnum 2 lid 31-31 "lfs-oss-2-01 HCA-2"
      [2] -> switch port {0x00066a00e3003338}[32] lid 6-6 "QLogic 12300 GUID=0x00066a00e3003338"
      [1] -> switch port {0x00066a00e3003339}[19] lid 3-3 "QLogic 12300 GUID=0x00066a00e3003339"
      [10] -> switch port {0x00066a00e3002e7c}[28] lid 4-4 "QLogic 12300 GUID=0x00066a00e3002e7c"
      [2] -> ca port {0x0002c903000faa4b}[1] lid 62-62 "bmem5 mlx4_1"
      To ca {0x0002c903000faa4a} portnum 1 lid 62-62 "bmem5 mlx4_1"

      The path from the client to the OSS uses switch 0x...32de, but the return path uses switch 0x...3339. I'm not sure whether this has any effect on latency, but it's worth noting.
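
      If it would help to see how widespread the asymmetry is, something along these lines could walk the forward and reverse ibtracert paths for every OSS and flag pairs that traverse different switches. The client LID (62), CA, and port match the traces above; the OSS LID list is a placeholder:

      #!/bin/bash
      # Sketch: flag asymmetric routes between the test client and each OSS.
      # CLIENT_LID matches bmem5 above; OSS_LIDS is a placeholder list that
      # should be filled in with the real scratch2 OSS LIDs.
      CLIENT_LID=62
      OSS_LIDS="31 32 33"

      for lid in $OSS_LIDS; do
          # Collect the QLogic switch GUIDs (0x00066a...) seen in each direction.
          fwd=$(ibtracert -C mlx4_1 -P2 "$CLIENT_LID" "$lid" | grep -o '0x00066a[0-9a-f]*' | sort -u)
          rev=$(ibtracert -C mlx4_1 -P2 "$lid" "$CLIENT_LID" | grep -o '0x00066a[0-9a-f]*' | sort -u)
          if [ "$fwd" != "$rev" ]; then
              echo "LID $lid: forward and reverse paths use different switches"
          fi
      done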

      The slow average request wait time also holds from the MDS nodes to the OSTs:
      [root@lfs-mds-1-1 ~]# for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
      2445.21
      [root@lfs-mds-2-2 ~]# for x in /proc/fs/lustre/osc/scratch2-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
      42442.6

      One individual OST had an average request wait time of 61s:
      61061.2 OST0053

      We checked the brw_stats for that OST and found that the vast majority of I/Os complete in under 1s: less than 1% of all I/O took 1s or longer, and most I/Os (51%) finished in <= 16ms. This leads me to believe we can rule out the storage.
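
      (For reference, those percentages come from the "I/O time (1/1000s)" histogram in brw_stats; the commands below, run on the OSS serving that target, dump it. The target name is inferred from OST0053 above, the path layout is as on Lustre 1.8, and writing to the file should clear the counters so the histogram can be re-read over a known interval.)

      # On the OSS serving OST0053 (path layout as on Lustre 1.8):
      cat /proc/fs/lustre/obdfilter/scratch2-OST0053/brw_stats

      # Optionally clear the counters first, run the workload, and re-read so the
      # "I/O time (1/1000s)" histogram reflects only the interval of interest:
      echo 0 > /proc/fs/lustre/obdfilter/scratch2-OST0053/brw_stats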

      We were looking at loading the OST metadata using debugfs and realized that scratch2 has its OSS read and write caches enabled on the OSTs, while scratch1 does not. We disabled both caches (see the note after the output below), and performance increased, though still not to the level of scratch1:
      [root@fe1 Dennis.Nelson]# mpirun -np 8 /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      -- started at 01/18/2013 22:33:08 --

      mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
      Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      Path: /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson
      FS: 3130.0 TiB Used FS: 37.8% Inodes: 2250.0 Mi Used Inodes: 15.2%

      8 tasks, 15912 files/directories

      SUMMARY: (of 3 iterations)
      Operation            Max         Min         Mean        Std Dev
      ---------            ---         ---         ----        -------
      Directory creation:  2551.993    2263.761    2403.099    117.864
      Directory stat:      13057.811   10732.722   11987.916   958.214
      Directory removal:   1635.204    1550.824    1602.331    36.882
      File creation:       2396.830    2291.217    2330.893    46.945
      File stat:           9299.840    6609.414    7960.092    1098.389
      File removal:        2082.180    1886.251    1966.489    83.824
      Tree creation:       392.155     276.295     330.162     47.647
      Tree removal:        223.157     185.368     203.008     15.529

      -- finished at 01/18/2013 22:34:53 --
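
      For reference, disabling the OSS read and write caches amounts to toggling the standard obdfilter tunables, roughly as below (parameter names as in Lustre 1.8.x; run on each scratch2 OSS; set_param alone does not persist across a remount):

      lctl set_param obdfilter.scratch2-OST*.read_cache_enable=0
      lctl set_param obdfilter.scratch2-OST*.writethrough_cache_enable=0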

      The next thing we're going to try is small-message LNET testing to verify that latency isn't worse on the scratch2 paths than on the scratch1 paths; a sketch of the planned test is below. The OSSes themselves are not reporting abnormal req_waittime values: oss-2-08 reports an average wait time of 185.83ms and oss-1-01 reports 265.72ms (checked with llstat ost).
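
      For the LNET check, a minimal lnet_selftest session along the lines below should give small-message RPC rates between the client and one scratch2 server. The NIDs are placeholders and would need to be replaced with the real o2ib NIDs for bmem5 and the OSS/MDS under test:

      #!/bin/bash
      # Sketch: small-message LNET test with lnet_selftest (lst). NIDs are
      # placeholders; substitute the real client and server NIDs.
      modprobe lnet_selftest
      export LST_SESSION=$$

      lst new_session lat_check
      lst add_group clients 10.0.0.62@o2ib      # placeholder: bmem5
      lst add_group servers 10.0.0.31@o2ib      # placeholder: lfs-oss-2-01
      lst add_batch small_msg
      lst add_test --batch small_msg --from clients --to servers ping
      lst run small_msg

      # show server-side stats for 30 seconds, then tear down
      lst stat servers & sleep 30; kill $!
      lst stop small_msg
      lst end_session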

      I was wondering if you all had any suggestions for what we could check next. Is there any information I can get you? Also, what exactly do the req_waittimes measure, and am I reading them correctly?

      Thanks,
      Kit

      Attachments

        1. lnet_data.tgz
          53 kB
        2. processed_trace.scratch1
          31 kB
        3. processed_trace.scratch2
          30 kB
        4. dlm_trace.scratch1
          36 kB
        5. dlm_trace.scratch2
          36 kB
        6. scratch_1_2.sd_iostats
          3 kB
        7. scratch_1_2.deltas
          0.9 kB
        8. test_plan.txt
          2 kB

          People

            Assignee: Cliff White
            Reporter: Kit Westneat
