[LU-2649] slow IO at NOAA - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Not a Bug
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 1.8.7
Labels:
- ptr

Severity:
3
Rank (Obsolete):
6189

Description

(Sorry, this is kind of a book.)

For the past couple of weeks, NOAA has been having significantly reduced performance on one of their filesystems, scratch2. Both scratch2 and scratch1 have the same type of hardware and are connected the same IB switches and network. The problem was first reported by a user doing a lot of opens and reads. The streaming rates appear to be normal, but the open rates have dropped significantly.

We ran mdtest to confirm the drop:

[Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8
/home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256
-d /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
– started at 01/14/2013 15:34:17 –

mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n
2000 -i 3 -u -w 256 -d
/scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
Path: /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson
FS: 2504.0 TiB Used FS: 33.7% Inodes: 2250.0 Mi Used Inodes: 5.1%

8 tasks, 15912 files/directories

SUMMARY: (of 3 iterations)
Operation Max Min Mean Std Dev
--------- — — ---- -------
Directory creation: 3567.963 3526.380 3544.592 17.364
Directory stat : 38138.970 32977.207 36170.723 2278.437
Directory removal : 2184.674 2171.584 2178.737 5.412
File creation : 3251.627 2585.204 3024.967 311.008
File stat : 14236.137 13053.559 13622.039 483.862
File removal : 3274.005 3046.231 3138.367 97.944
Tree creation : 511.472 407.753 460.822 42.378
Tree removal : 290.478 282.657 286.577 3.193

– finished at 01/14/2013 15:35:29 –

[Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8
/home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256
-d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
– started at 01/14/2013 15:25:24 –

mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n
2000 -i 3 -u -w 256 -d
/scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
Path: /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson
FS: 3130.0 TiB Used FS: 39.7% Inodes: 2250.0 Mi Used Inodes: 14.2%

8 tasks, 15912 files/directories

SUMMARY: (of 3 iterations)
Operation Max Min Mean Std Dev
--------- — — ---- -------
Directory creation: 2327.187 1901.660 2094.272 176.043
Directory stat : 10265.979 8306.610 9315.476 800.973
Directory removal : 1600.981 1301.208 1407.570 136.989
File creation : 1592.205 1426.700 1528.690 72.839
File stat : 913.205 581.097 740.446 135.914
File removal : 1733.900 1288.562 1492.555 183.717
Tree creation : 303.718 241.506 266.777 26.705
Tree removal : 159.793 66.400 122.692 40.470

– finished at 01/14/2013 15:28:37 –

We've investigated a few different areas without much luck. First we looked at the storage to make sure it looked ok, and it does look remarkably clean. Drive and LUN latencies are all low. We also looked at iostat on the MDS to make sure that the average wait time was in line with what was seen on scratch1, and it also all looks normal.

We took a look at the average request wait time on the client, and found that requests to the OSTs and MDTs are taking significantly longer to complete on scratch2 vs scratch1:
[Kit.Westneat@bmem5 ~]$ for x in /proc/fs/lustre/osc/scratch2-OST*/stats; do cat $x | awk '/req_wait/

{print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}'
4326.63
[Kit.Westneat@bmem5 ~]$ for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }

'; done | awk '

{s+=$1/220} END {print s}'
883.17

scratch1 mdc avg req_waittime 191.04
scratch2 mdc avg req_waittime 1220.99

We investigated the IB network some, but couldn't find anything obviously wrong. One interesting thing to note is that some of the OSSes on scratch2 use asymmetric routes between themselves and the test client (bmem5). For example:
[root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2 62 31
From ca {0x0002c903000faa4a} portnum 1 lid 62-62 "bmem5 mlx4_1"
[1] -> switch port {0x00066a00e3002e7c}[2] lid 4-4 "QLogic 12300 GUID=0x00066a00e3002e7c"
[20] -> switch port {0x00066a00e30032de}[2] lid 2-2 "QLogic 12300 GUID=0x00066a00e30032de"
[28] -> switch port {0x00066a00e3003338}[10] lid 6-6 "QLogic 12300 GUID=0x00066a00e3003338"
[32] -> ca port {0x0002c903000f9194}[2] lid 31-31 "lfs-oss-2-01 HCA-2"
To ca {0x0002c903000f9192} portnum 2 lid 31-31 "lfs-oss-2-01 HCA-2"

[root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2 31 62
From ca {0x0002c903000f9192} portnum 2 lid 31-31 "lfs-oss-2-01 HCA-2"
[2] -> switch port {0x00066a00e3003338}[32] lid 6-6 "QLogic 12300 GUID=0x00066a00e3003338"
[1] -> switch port {0x00066a00e3003339}[19] lid 3-3 "QLogic 12300 GUID=0x00066a00e3003339"
[10] -> switch port {0x00066a00e3002e7c}[28] lid 4-4 "QLogic 12300 GUID=0x00066a00e3002e7c"
[2] -> ca port {0x0002c903000faa4b}[1] lid 62-62 "bmem5 mlx4_1"
To ca {0x0002c903000faa4a} portnum 1 lid 62-62 "bmem5 mlx4_1"

The path from the client to the oss uses switch 0x...32de, but the return path uses switch 0x...3339. I'm not sure if this has any effect on latency, but it's interesting to note.

The slow average request time also holds from the MDTs to the OST:
[root@lfs-mds-1-1 ~]# for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220}

END

{print s}'
2445.21
[root@lfs-mds-2-2 ~]# for x in /proc/fs/lustre/osc/scratch2-OST*/stats; do cat $x | awk '/req_wait/ {print $7 / $2" '$(echo $x | grep -o OST....)'" }'; done | awk '{s+=$1/220} END {print s}

'
42442.6

One individual OSS had an average response time of 61s:
61061.2 OST0053

Checking the brw_stats for that OST, and found that the vast majority of IOs are completed in less than 1s. Less than 1% of all IO took 1s+, most IOs (51%) were done in <= 16ms. This leads me to believe that we can rule out the storage.

We were looking at loading the OST metadata using debugfs and realized that scratch2 has its caches enabled on the OSTs, while scratch1 does not. We disabled the read and write caches, and performance increased, though still not to the levels of scratch1:
[root@fe1 Dennis.Nelson]# mpirun -np 8 /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
– started at 01/18/2013 22:33:08 –

mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes
Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest
Path: /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson
FS: 3130.0 TiB Used FS: 37.8% Inodes: 2250.0 Mi Used Inodes: 15.2%

8 tasks, 15912 files/directories

SUMMARY: (of 3 iterations)
Operation Max Min Mean Std Dev
--------- — — ---- -------
Directory creation: 2551.993 2263.761 2403.099 117.864
Directory stat : 13057.811 10732.722 11987.916 958.214
Directory removal : 1635.204 1550.824 1602.331 36.882
File creation : 2396.830 2291.217 2330.893 46.945
File stat : 9299.840 6609.414 7960.092 1098.389
File removal : 2082.180 1886.251 1966.489 83.824
Tree creation : 392.155 276.295 330.162 47.647
Tree removal : 223.157 185.368 203.008 15.529

– finished at 01/18/2013 22:34:53 –

The next thing we're going to try is doing small message LNET testing to verify that latency isn't worse on the scratch2 paths versus the scratch1 paths. The OSSes themselves are not reporting abnormal reqwait times. oss-2-08 reports an average wait time of 185.83ms, and oss-1-01 reports 265.72ms (checking using llstat ost).

I was wondering if you all had any suggestions of what we could check next. Is there any information I can get you? Also I was wondering what exactly the req_waitimes measured. Am I reading them correctly?

Thanks,
Kit

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

dlm_trace.scratch1
36 kB
25/Jan/13 2:25 PM
dlm_trace.scratch2
36 kB
25/Jan/13 2:25 PM
lnet_data.tgz
53 kB
21/Jan/13 9:24 PM
processed_trace.scratch1
31 kB
25/Jan/13 11:37 AM
processed_trace.scratch2
30 kB
25/Jan/13 11:37 AM
scratch_1_2.deltas
0.9 kB
31/Jan/13 2:40 AM
scratch_1_2.sd_iostats
3 kB
31/Jan/13 2:38 AM
test_plan.txt
2 kB
31/Jan/13 4:54 PM

slow IO at NOAA

Details

Description

Attachments

Attachments

Activity

People

Dates