Details

    • Bug
    • Resolution: Not a Bug
    • Minor
    • None
    • Lustre 1.8.7
    • 3
    • 6189

    Description

      (Sorry, this is kind of a book.)

      For the past couple of weeks, NOAA has been seeing significantly reduced performance on one of their filesystems, scratch2. Both scratch2 and scratch1 use the same type of hardware and are connected to the same IB switches and network. The problem was first reported by a user doing a lot of opens and reads. The streaming rates appear to be normal, but the open rates have dropped significantly.

      We ran mdtest to confirm the drop (rates below are in operations per second):

      [Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8 /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      -- started at 01/14/2013 15:34:17 --

      mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
      Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      Path: /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson
      FS: 2504.0 TiB Used FS: 33.7% Inodes: 2250.0 Mi Used Inodes: 5.1%

      8 tasks, 15912 files/directories

      SUMMARY: (of 3 iterations)
      Operation             Max         Min         Mean       Std Dev
      ---------             ---         ---         ----       -------
      Directory creation:   3567.963    3526.380    3544.592   17.364
      Directory stat:       38138.970   32977.207   36170.723  2278.437
      Directory removal:    2184.674    2171.584    2178.737   5.412
      File creation:        3251.627    2585.204    3024.967   311.008
      File stat:            14236.137   13053.559   13622.039  483.862
      File removal:         3274.005    3046.231    3138.367   97.944
      Tree creation:        511.472     407.753     460.822    42.378
      Tree removal:         290.478     282.657     286.577    3.193

      -- finished at 01/14/2013 15:35:29 --

      [Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8 /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      -- started at 01/14/2013 15:25:24 --

      mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
      Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      Path: /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson
      FS: 3130.0 TiB Used FS: 39.7% Inodes: 2250.0 Mi Used Inodes: 14.2%

      8 tasks, 15912 files/directories

      SUMMARY: (of 3 iterations)
      Operation             Max         Min         Mean       Std Dev
      ---------             ---         ---         ----       -------
      Directory creation:   2327.187    1901.660    2094.272   176.043
      Directory stat:       10265.979   8306.610    9315.476   800.973
      Directory removal:    1600.981    1301.208    1407.570   136.989
      File creation:        1592.205    1426.700    1528.690   72.839
      File stat:            913.205     581.097     740.446    135.914
      File removal:         1733.900    1288.562    1492.555   183.717
      Tree creation:        303.718     241.506     266.777    26.705
      Tree removal:         159.793     66.400      122.692    40.470

      -- finished at 01/14/2013 15:28:37 --

      We've investigated a few different areas without much luck. First we looked at the storage to make sure it was healthy, and it looks remarkably clean: drive and LUN latencies are all low. We also looked at iostat on the MDS to make sure the average wait times were in line with what we see on scratch1, and those also look normal.

      We took a look at the average request wait time on the client, and found that requests to the OSTs and MDTs are taking significantly longer to complete on scratch2 vs scratch1:
      [Kit.Westneat@bmem5 ~]$ for x in /proc/fs/lustre/osc/scratch2-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
      4326.63
      [Kit.Westneat@bmem5 ~]$ for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
      883.17

      scratch1 mdc avg req_waittime 191.04
      scratch2 mdc avg req_waittime 1220.99
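      For reference, the mdc averages above were presumably computed the same way as the OSC numbers; a minimal sketch, assuming the usual client mdc stats layout where $7/$2 is sum/count of req_waittime in usec (the mdc path glob is an assumption):

      # Average MDC req_waittime (usec) per filesystem, computed on the client.
      for fs in scratch1 scratch2; do
          awk -v fs=$fs '/^req_waittime/ {print fs" mdc avg req_waittime "$7/$2}' \
              /proc/fs/lustre/mdc/${fs}-MDT0000-mdc-*/stats
      done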

      We also investigated the IB network, but couldn't find anything obviously wrong. One interesting thing to note is that some of the OSSes on scratch2 use asymmetric routes between themselves and the test client (bmem5). For example:
      [root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2 62 31
      From ca {0x0002c903000faa4a} portnum 1 lid 62-62 "bmem5 mlx4_1"
      [1] -> switch port {0x00066a00e3002e7c}[2] lid 4-4 "QLogic 12300 GUID=0x00066a00e3002e7c"
      [20] -> switch port {0x00066a00e30032de}[2] lid 2-2 "QLogic 12300 GUID=0x00066a00e30032de"
      [28] -> switch port {0x00066a00e3003338}[10] lid 6-6 "QLogic 12300 GUID=0x00066a00e3003338"
      [32] -> ca port {0x0002c903000f9194}[2] lid 31-31 "lfs-oss-2-01 HCA-2"
      To ca {0x0002c903000f9192} portnum 2 lid 31-31 "lfs-oss-2-01 HCA-2"

      [root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2 31 62
      From ca {0x0002c903000f9192} portnum 2 lid 31-31 "lfs-oss-2-01 HCA-2"
      [2] -> switch port {0x00066a00e3003338}[32] lid 6-6 "QLogic 12300 GUID=0x00066a00e3003338"
      [1] -> switch port {0x00066a00e3003339}[19] lid 3-3 "QLogic 12300 GUID=0x00066a00e3003339"
      [10] -> switch port {0x00066a00e3002e7c}[28] lid 4-4 "QLogic 12300 GUID=0x00066a00e3002e7c"
      [2] -> ca port {0x0002c903000faa4b}[1] lid 62-62 "bmem5 mlx4_1"
      To ca {0x0002c903000faa4a} portnum 1 lid 62-62 "bmem5 mlx4_1"

      The path from the client to the OSS uses switch 0x...32de, but the return path uses switch 0x...3339. I'm not sure whether this has any effect on latency, but it's worth noting.
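      One way to survey this across the scratch2 OSSes is to trace both directions for each server and compare the switch hops; a minimal sketch (the OSS LID list is hypothetical, the -C/-P values match the traces above):

      # Compare forward and reverse IB routes between the client (LID 62) and each OSS.
      # A symmetric route shows the same switch GUIDs in both listings.
      CLIENT_LID=62
      for OSS_LID in 31 32 33; do                      # hypothetical OSS LIDs
          echo "== client -> LID $OSS_LID =="
          ibtracert -C mlx4_1 -P2 $CLIENT_LID $OSS_LID | grep switch
          echo "== LID $OSS_LID -> client =="
          ibtracert -C mlx4_1 -P2 $OSS_LID $CLIENT_LID | grep switch
      done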

      The slow average request time also holds for requests from the MDS to the OSTs:
      [root@lfs-mds-1-1 ~]# for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
      2445.21
      [root@lfs-mds-2-2 ~]# for x in /proc/fs/lustre/osc/scratch2-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
      42442.6

      One individual OST had an average request wait time of 61s:
      61061.2 OST0053

      Checking the brw_stats for that OST, we found that the vast majority of IOs completed in less than 1s. Less than 1% of all IOs took 1s or more, and most IOs (51%) completed in <= 16ms. This leads me to believe that we can rule out the storage.
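      For reference, that histogram can be pulled with something like the following (a sketch; it assumes the standard obdfilter brw_stats layout, with sections separated by blank lines, on the OSS hosting OST0053):

      # Print just the "I/O time" section of brw_stats for the suspect OST.
      sed -n '/I\/O time/,/^$/p' /proc/fs/lustre/obdfilter/scratch2-OST0053/brw_stats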

      We were looking at loading the OST metadata using debugfs and realized that scratch2 has its read and write caches enabled on the OSTs, while scratch1 does not. We disabled them (see the sketch after the mdtest output below), and performance increased, though still not to the levels of scratch1:
      [root@fe1 Dennis.Nelson]# mpirun -np 8 /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      -- started at 01/18/2013 22:33:08 --

      mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
      Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
      Path: /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson
      FS: 3130.0 TiB Used FS: 37.8% Inodes: 2250.0 Mi Used Inodes: 15.2%

      8 tasks, 15912 files/directories

      SUMMARY: (of 3 iterations)
      Operation             Max         Min         Mean       Std Dev
      ---------             ---         ---         ----       -------
      Directory creation:   2551.993    2263.761    2403.099   117.864
      Directory stat:       13057.811   10732.722   11987.916  958.214
      Directory removal:    1635.204    1550.824    1602.331   36.882
      File creation:        2396.830    2291.217    2330.893   46.945
      File stat:            9299.840    6609.414    7960.092   1098.389
      File removal:         2082.180    1886.251    1966.489   83.824
      Tree creation:        392.155     276.295     330.162    47.647
      Tree removal:         223.157     185.368     203.008    15.529

      -- finished at 01/18/2013 22:34:53 --
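      For reference, disabling the OST read and write caches is normally a server-side toggle of the obdfilter tunables; a minimal sketch (run on each scratch2 OSS; we don't record the exact commands used here, so treat this as illustrative, and setting the values back to 1 re-enables the caches):

      # Disable the OSS read cache and writethrough (write) cache for all scratch2 OSTs.
      lctl set_param obdfilter.scratch2-OST*.read_cache_enable=0
      lctl set_param obdfilter.scratch2-OST*.writethrough_cache_enable=0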

      The next thing we're going to try is small-message LNET testing to verify that latency isn't worse on the scratch2 paths than on the scratch1 paths. The OSSes themselves are not reporting abnormal req_waittimes: oss-2-08 reports an average wait time of 185.83ms, and oss-1-01 reports 265.72ms (checked using llstat ost).
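      A minimal LNet selftest session for that kind of small-message check might look like the following (a sketch; the NIDs are placeholders, lnet_selftest must also be loaded on the nodes under test, and the ping rates reported by lst stat serve only as a rough proxy for per-RPC latency):

      # Run from a console node; drives a ping (small-message) test between two groups.
      modprobe lnet_selftest
      export LST_SESSION=$$
      lst new_session scratch2_lat
      lst add_group clients 10.175.31.242@o2ib          # placeholder client NID
      lst add_group servers 10.175.30.1@o2ib            # placeholder OSS NID
      lst add_batch smallmsg
      lst add_test --batch smallmsg --from clients --to servers ping
      lst run smallmsg
      lst stat clients servers                          # Ctrl-C after a few samples
      lst stop smallmsg
      lst end_session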

      I was wondering if you all had any suggestions for what we could check next. Is there any information I can get you? Also, I was wondering what exactly the req_waittimes measure. Am I reading them correctly?

      Thanks,
      Kit

      Attachments

        1. dlm_trace.scratch1
          36 kB
        2. dlm_trace.scratch2
          36 kB
        3. lnet_data.tgz
          53 kB
        4. processed_trace.scratch1
          31 kB
        5. processed_trace.scratch2
          30 kB
        6. scratch_1_2.deltas
          0.9 kB
        7. scratch_1_2.sd_iostats
          3 kB
        8. test_plan.txt
          2 kB

        Activity

          [LU-2649] slow IO at NOAA

          kitwestneat Kit Westneat (Inactive) added a comment -

          It seems like credit changes need to be made on all systems (clients and servers).

          I got these errors trying to ping from a node with credit changes to one without:
          LustreError: 9688:0:(o2iblnd.c:806:kiblnd_create_conn()) Can't create QP: -22, send_wr: 16448, recv_wr: 256
          LustreError: 7641:0:(o2iblnd_cb.c:2190:kiblnd_passive_connect()) Can't accept 10.175.31.242@o2ib: incompatible queue depth 8 (128 wanted)
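          (For context, the "incompatible queue depth" message above is what o2iblnd logs when two peers disagree on queue depth, which, as I understand it, is derived from peer_credits; a sketch of the kind of uniform module option that would avoid it, with the values purely illustrative and the file path an assumption:)

          # /etc/modprobe.d/ko2iblnd.conf (or modprobe.conf on 1.8-era nodes) -- a sketch.
          # The log above suggests a node left at the default peer_credits=8 rejects a
          # peer asking for 128, so the same values need to be rolled out to every node
          # on the o2ib fabric before the LNet modules are reloaded.
          options ko2iblnd peer_credits=128 credits=256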
          green Oleg Drokin added a comment -

          Note that if you enable any extra debug flags, you'll likely see (significantly, depending on the flags) slower metadata performance, particularly for flags like +inode or +dlmtrace.

          It looks like scratch1 is almost 2x less loaded than scratch2 at times, and I suspect that on a fully quiet system you'll see the same level of performance.

          dirstat is only affected by network latency and by how busy the MDS/disk is (and even then I suspect that for the mdtest run the created directories are fully in cache, so the disk plays no role other than when it slows down other threads).
          I suspect that if you forcefully increase the number of MDT threads (a module parameter for the mds module) then dirstat should go up, if you are running with a small number by default.
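          (A sketch of what forcing the MDS thread count looks like on Lustre 1.8, assuming the standard mds module parameter; the value is only illustrative:)

          # /etc/modprobe.conf on the MDS -- a sketch; mds_num_threads pins the number of
          # MDS service threads instead of letting the default heuristic choose it.
          # Takes effect after the mds module is reloaded / the MDT is remounted.
          options mds mds_num_threads=512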

          doug Doug Oucharek (Inactive) added a comment -

          Credit changes affect the "throttling" on the system they are set on. I suspect the important systems to change here are just the servers. The clients (assuming there are many of them) should be fine as is. The changes don't need to be the same on both sides of a connection.

          kitwestneat Kit Westneat (Inactive) added a comment -

          OK, the reason I ask is that any changes have to be approved through a change review meeting, and I'm sure they will ask me those questions.

          Actually, another question I have is: can those be set only on the servers, or do they need to be set on the clients too? I remember that the last time I played around with ko2iblnd settings, there were some that had to be set on both the clients and the servers, or they would stop communicating.

          cliffw Cliff White (Inactive) added a comment -

          I think the risks are minimal; it shouldn't hurt anything. I am not certain how likely it is that it will help this issue.

          kitwestneat Kit Westneat (Inactive) added a comment -

          What are the risks in making that change and how likely do you think it is that those changes will affect performance?

          cliffw Cliff White (Inactive) added a comment -

          What about the peer credits tuning mentioned earlier?

          kitwestneat Kit Westneat (Inactive) added a comment -

          I've attached the test plan for the downtime next week; can you review it and see if there is anything else we should be looking at or trying?

          kitwestneat Kit Westneat (Inactive) added a comment -

          Differences in /proc/fs/lustre/mds/scratch2-MDT0000/stats over the run of an mdtest. You can see that scratch2 was much more loaded during this time. The opens, closes, and unlinks stick out as major differences too.

          kitwestneat Kit Westneat (Inactive) added a comment -

          IO statistics for the sd devices; they look very similar between the FSes.

          kitwestneat Kit Westneat (Inactive) added a comment -

          I looked at the /proc/fs/lustre/mds/*/stats and /proc/fs/lustre/mdt/MDS/mds/stats files to see how loaded the system was. scratch2 does seem more loaded, but that might just have been when I was looking. I tried to find a moment when scratch2 was less loaded in order to run the benchmark. The mdtest showed some improvement, but the dirstats were still significantly lower than on scratch1. Clearly the load on the system has some effect, and there is a downtime next week we'll use to test.

          What sorts of tests would be useful on a quiet system? I was going to enable various debug flags in order to hopefully trace out what is going on. Is there anything else you would like to look at?

          What could cause dirstat performance to be so much worse than the other metadata operations?

          Here is some example req_* llstat MDS output:
          scratch1:
          /proc/fs/lustre/mdt/MDS/mds/stats @ 1359617590.869072
          Name Cur.Count Cur.Rate #Events Unit last min avg max stddev
          req_waittime 14886 7443 4161976 [usec] 300812 3 18.38 3017 14.82
          req_qdepth 14886 7443 4161976 [reqs] 1870 0 0.18 14 0.44
          req_active 14886 7443 4161976 [reqs] 22011 1 1.89 55 1.59
          req_timeout 14886 7443 4161976 [sec] 44658 3 26.59 31 10.20
          reqbuf_avail 31265 15632 9080719 [bufs] 96035468 3044 3071.47 3072 1.07
          ldlm_flock_enqueue 5 2 3190 [reqs] 5 1 1.00 1 0.00
          ldlm_ibits_enqueue 14199 7099 3808598 [reqs] 14199 1 1.00 1 0.00

          scratch2:
          /proc/fs/lustre/mdt/MDS/mds/stats @ 1359617591.258278
          Name Cur.Count Cur.Rate #Events Unit last min avg max stddev
          req_waittime 18790 9395 22080943 [usec] 389637 3 21.72 103629 171.90
          req_qdepth 18790 9395 22080943 [reqs] 5206 0 0.33 147 0.92
          req_active 18790 9395 22080943 [reqs] 64994 1 5.17 127 12.89
          req_timeout 18790 9395 22080943 [sec] 93950 2 15.56 33 14.32
          reqbuf_avail 44790 22395 53001699 [bufs] 894286323 19850 19965.41 19968 8.94
          ldlm_flock_enqueue 38 19 19939 [reqs] 38 1 1.00 1 0.00
          ldlm_ibits_enqueue 13264 6632 15674307 [reqs] 13264 1 1.00 1 0.00

          I was wondering why the reqbuf_avail on scratch2 was so much higher?
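          (For reference, snapshots like the above can be collected by pointing llstat at the MDS stats file; a sketch, with a 2-second interval:)

          # Print per-interval counts/rates from the MDS service stats every 2 seconds.
          llstat -i 2 /proc/fs/lustre/mdt/MDS/mds/stats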


          People

            cliffw Cliff White (Inactive)
            kitwestneat Kit Westneat (Inactive)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: