[LU-11957] Slow clients Created: 12/Feb/19 Updated: 01/Apr/22 Resolved: 12/Feb/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6. Sherlock cluster clients: kernel 3.10.0-957.5.1.el7.x86_64, lustre-client 2.12.0 (from wc). Server: Fir running Lustre 2.12.0, kernel 3.10.0-957.1.3.el7_lustre.x86_64 |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Hello! We started production on 2.12 clients and 2.12 servers (scratch filesystem) last week (we still have Oak servers on 2.10, also mounted on Sherlock). The cluster has stabilized, but we now have a major issue with slow clients. Some clients are slow, and we've been trying all day to figure out why, without success. Other clients run just fine; only some of them are slow. Hopefully someone will have a clue, as this is leading to many unhappy users at the moment...

Let's take two Lustre 2.12 clients on the same IB fabric (we have two separate fabrics on this cluster), using the same Lustre routers, the same hardware and the same OS image.

sh-ln05 is very slow at the moment; a simple dd to /fir gives:

[root@sh-ln05 sthiell]# dd if=/dev/zero of=seqddout1M bs=1M count=1000 conv=fsync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 131.621 s, 8.0 MB/s

sh-ln06 on the same file (same striping) runs just fine:

[root@sh-ln06 sthiell]# dd if=/dev/zero of=seqddout1M bs=1M count=1000 conv=fsync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.52442 s, 688 MB/s

Both of these nodes are in production with a medium load. On some other, less loaded nodes, I get 1.2 GB/s with the same dd. We started with Large Bulk I/O (16MB) and tried to revert to 4MB, but this didn't change anything. On those clients we tried random things like clearing the LRU, dropping caches and swapoff, with no luck. NRS is off, so FIFO is used. We use DoM + PFL. Example for this file seqddout1M:

[root@sh-ln06 sthiell]# lfs getstripe seqddout1M
seqddout1M
lcm_layout_gen: 9
lcm_mirror_count: 1
lcm_entry_count: 6
lcme_id: 1
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 131072
lmm_stripe_count: 0
lmm_stripe_size: 131072
lmm_pattern: mdt
lmm_layout_gen: 0
lmm_stripe_offset: 0
lcme_id: 2
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 131072
lcme_extent.e_end: 16777216
lmm_stripe_count: 1
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 41
lmm_objects:
- 0: { l_ost_idx: 41, l_fid: [0x100290000:0xb3f46:0x0] }
lcme_id: 3
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 16777216
lcme_extent.e_end: 1073741824
lmm_stripe_count: 2
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 26
lmm_objects:
- 0: { l_ost_idx: 26, l_fid: [0x1001a0000:0xb3f5c:0x0] }
- 1: { l_ost_idx: 19, l_fid: [0x100130000:0xb401e:0x0] }
lcme_id: 4
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 1073741824
lcme_extent.e_end: 34359738368
lmm_stripe_count: 4
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 9
lmm_objects:
- 0: { l_ost_idx: 9, l_fid: [0x100090000:0xb41eb:0x0] }
- 1: { l_ost_idx: 43, l_fid: [0x1002b0000:0xb3f4a:0x0] }
- 2: { l_ost_idx: 42, l_fid: [0x1002a0000:0xb408a:0x0] }
- 3: { l_ost_idx: 2, l_fid: [0x100020000:0xb3f50:0x0] }
lcme_id: 5
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 34359738368
lcme_extent.e_end: 274877906944
lmm_stripe_count: 8
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
lcme_id: 6
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 274877906944
lcme_extent.e_end: EOF
lmm_stripe_count: 16
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
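For reference, a DoM + PFL layout matching the one shown above would typically be set as a directory default with an lfs setstripe composite layout along these lines (a sketch reconstructed from the getstripe output; the target directory is a placeholder and the options should be double-checked against the 2.12 lfs manpage):

# sketch only: component boundaries and stripe counts taken from the layout above
lfs setstripe \
  -E 128K -L mdt \
  -E 16M  -c 1  -S 4M \
  -E 1G   -c 2  -S 4M \
  -E 32G  -c 4  -S 4M \
  -E 256G -c 8  -S 4M \
  -E -1   -c 16 -S 4M \
  /fir/<directory>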
Other client config:

I have a full lustre log captured during a slow dd on sh-ln05, attached to this ticket.

Note: we are seeing the same behavior when using /fir (2.12 servers) and /oak (2.10 servers). This definitely seems to originate from the Lustre client itself. We also checked the state of IB and everything looks good.

Any idea on how to find the root cause of this major, random 2.12 client slowness? Thanks!! |
| Comments |
| Comment by Stephane Thiell [ 12/Feb/19 ] |
|
I've also performed some lnet_selftest benchmarks between Fir and different Lustre clients, impacted by the slow I/Os or not, and there is no significant difference between them.

This is terrible:

[root@sh-ln05 sthiell]# uptime
 20:15:14 up 5 days, 23:41, 21 users,  load average: 1.03, 1.12, 0.93
[root@sh-ln05 sthiell]# strace -T -ewrite -f dd if=/dev/zero of=seqddout1M bs=4M count=1000 conv=fsync
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.067278>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.061963>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.063141>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.062815>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.395379>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.309614>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.278361>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.377247>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.227987>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.221452>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.261344>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.195638>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.195916>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.868780>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.257614>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.300075>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.226142>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.584341>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.195298>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.299334>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.376611>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.226417>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.199998>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.277209>

A working client (sh-ln06):

[root@sh-ln06 ~]# uptime
 20:15:31 up 3 days, 12:52, 25 users,  load average: 2.43, 2.43, 2.10
[root@sh-ln06 sthiell]# strace -T -ewrite -f dd if=/dev/zero of=seqddout1M bs=4M count=1000 conv=fsync
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.004063>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003959>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003832>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003740>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003756>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003621>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003753>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003889>

There are no Lustre logs on either the clients or the servers, so this is super weird. I've checked different rpc_stats but didn't find anything obvious (but maybe you will). Example with fir-OST0000:

[root@sh-ln05 ~]# lctl get_param osc.fir-OST0000*.rpc_stats
osc.fir-OST0000-osc-ffff91bec5338800.rpc_stats=
snapshot_time:         1549945049.570089887 (secs.nsecs)
read RPCs in flight:  0
write RPCs in flight: 0
pending write pages:  0
pending read pages:   0

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                      12   0   0   |        332   6   6
2:                       4   0   0   |        266   5  12
4:                      57   2   3   |        282   5  18
8:                       1   0   3   |        123   2  20
16:                      2   0   3   |         26   0  21
32:                     14   0   4   |         25   0  22
64:                      6   0   4   |         44   0  22
128:                    36   1   6   |         74   1  24
256:                   326  15  21   |        190   3  28
512:                   242  11  33   |        584  12  40
1024:                 1387  66 100   |       2844  59 100

                        read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                    1550  74  74   |       3376  70  70
2:                     432  20  94   |        800  16  87
3:                      94   4  99   |        316   6  93
4:                      11   0 100   |        157   3  97
5:                       0   0 100   |         71   1  98
6:                       0   0 100   |         30   0  99
7:                       0   0 100   |         18   0  99
8:                       0   0 100   |         14   0  99
9:                       0   0 100   |          8   0 100

                        read                    write
offset                rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          0   0   0
32:                    255  12  12   |       1764  36  36
64:                      5   0  12   |          3   0  36
128:                     2   0  12   |          4   0  36
256:                    62   2  15   |         10   0  37
512:                    50   2  17   |         15   0  37
1024:                  101   4  22   |        126   2  40
2048:                  330  15  38   |        468   9  49
4096:                  237  11  49   |        565  11  61
8192:                  315  15  65   |        558  11  73
16384:                 197   9  74   |        626  13  86
32768:                 189   9  83   |        244   5  91
65536:                 344  16 100   |        219   4  96
131072:                  0   0 100   |        188   3 100

Working client:

[root@sh-ln06 ~]# lctl get_param osc.fir-OST0000*.rpc_stats
osc.fir-OST0000-osc-ffff9bad01395000.rpc_stats=
snapshot_time:         1549945096.503096308 (secs.nsecs)
read RPCs in flight:  0
write RPCs in flight: 0
pending write pages:  0
pending read pages:   0

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                      21   0   0   |         53   0   0
2:                     136   1   2   |         82   1   2
4:                     169   2   4   |         79   1   3
8:                     155   1   6   |         30   0   3
16:                     70   0   7   |         32   0   4
32:                    891  11  18   |        103   1   5
64:                     60   0  19   |        150   2   7
128:                   273   3  22   |        190   2  10
256:                  1248  16  38   |        168   2  13
512:                  1315  16  55   |        247   3  17
1024:                 3461  44 100   |       5514  82  99
2048:                    0   0 100   |          1   0 100

                        read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                    7205  92  92   |       5751  86  86
2:                     517   6  99   |        508   7  94
3:                      58   0  99   |        118   1  95
4:                      14   0  99   |         68   1  96
5:                       3   0  99   |         49   0  97
6:                       2   0 100   |         41   0  98
7:                       0   0 100   |         37   0  98
8:                       0   0 100   |         35   0  99
9:                       0   0 100   |         42   0 100

                        read                    write
offset                rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          0   0   0
32:                    308   3   3   |        303   4   4
64:                      6   0   4   |          1   0   4
128:                     9   0   4   |         12   0   4
256:                   118   1   5   |          8   0   4
512:                   124   1   7   |         24   0   5
1024:                  192   2   9   |        236   3   8
2048:                 1370  17  27   |        893  13  22
4096:                  970  12  39   |        549   8  30
8192:                 1246  15  55   |        789  11  42
16384:                 595   7  63   |        652   9  52
32768:                 305   3  67   |        835  12  64
65536:                 959  12  79   |        919  13  78
131072:                215   2  82   |        146   2  80
262144:                190   2  84   |        258   3  84
524288:                630   8  92   |        515   7  92
1048576:               562   7 100   |        509   7 100 |
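As a side note, the per-write latencies that strace -T reports above can be summarized quickly with awk, for instance like this (a sketch; the strace output file is a placeholder, and the fd 1 filter matches the traces above):

strace -T -ewrite -f -o /tmp/dd.strace dd if=/dev/zero of=seqddout1M bs=4M count=100 conv=fsync
# average and worst per-write syscall time, parsed from the trailing <seconds> field
awk -F'[<>]' '/write\(1,/ { n++; t = $(NF-1); s += t; if (t > max) max = t }
  END { if (n) printf "writes=%d  avg=%.3fs  max=%.3fs\n", n, s/n, max }' /tmp/dd.strace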
| Comment by Peter Jones [ 12/Feb/19 ] |
|
Patrick is investigating |
| Comment by Stephane Thiell [ 12/Feb/19 ] |
|
Thanks Peter! Patrick, we've just dumped a compute node that was in that "slow state", perhaps that could help?

Also, this is the result of a simple dd (in MB/s) to /fir on all our nodes, just to give an idea of the scope of the issue (=> most of the cluster is running fine):

# awk '{gsub(/\/s/,""); if ($NF=="GB") {print $(NF-1)*1000 } else {print $(NF-1) } }' stateall | histogram.py
# NumSamples = 1306; Min = 7.20; Max = 1500.00
# Mean = 1079.677106; Variance = 95975.997401; SD = 309.799931; Median 1200.000000
# each ∎ represents a count of 6
7.2000 - 156.4800 [ 20]: ∎∎∎
156.4800 - 305.7600 [ 35]: ∎∎∎∎∎
305.7600 - 455.0400 [ 37]: ∎∎∎∎∎∎
455.0400 - 604.3200 [ 27]: ∎∎∎∎
604.3200 - 753.6000 [ 72]: ∎∎∎∎∎∎∎∎∎∎∎∎
753.6000 - 902.8800 [ 93]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
902.8800 - 1052.1600 [ 126]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1052.1600 - 1201.4400 [ 484]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1201.4400 - 1350.7200 [ 245]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1350.7200 - 1500.0000 [ 167]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
[root@sh-hn01 tmp]#
We still have one login node that is super slow, sh-ln05. It has been closed to new connections since we discovered the problem but users are still connected. Let me know what would be useful to have from this one. dd still shows a speed of ~10MB/s max...
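For reference, per-node numbers like the ones summarized above can be collected with something along these lines (a sketch; the clush nodeset, fanout, test path and output file name are assumptions):

# run a short dd on every node, keep only the throughput line, and gather into "stateall"
clush -w @compute -f 64 \
  'dd if=/dev/zero of=/fir/$USER/ddtest.$(hostname -s) bs=1M count=1000 conv=fsync 2>&1 | tail -1' \
  | tee stateall
# then feed "stateall" to the awk + histogram.py pipeline shown above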
|
| Comment by Patrick Farrell (Inactive) [ 12/Feb/19 ] |
|
Stephane,

The RPC stats are very interesting... Could you clear them and collect again? It's hard to believe you've got ~700 MB/s on one client and 8 MB/s on another when both have only 1 RPC in flight... Maybe you've just got super fast OSTs. In any case, the stats don't quite bear it out, but this still looks a lot like a grant problem.

Can you run lctl get_param osc.[the OST we're focusing on].grant on a good client and a not-good client? If you can, I'd also suggest remounting the "not good" client(s). |
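Concretely, the grant state could be compared between a good and a slow client with something like the following (a sketch; parameter names are from 2.12-era clients and may differ between versions):

# run on both sh-ln05 (slow) and sh-ln06 (good) for the OST being tested
lctl get_param osc.fir-OST0000*.cur_grant_bytes   # grant currently held by this client
lctl get_param osc.fir-OST0000*.cur_dirty_bytes   # dirty data not yet written back
lctl get_param osc.fir-OST0000*.max_dirty_mb      # per-OSC dirty limit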
| Comment by Stephane Thiell [ 12/Feb/19 ] |
|
Remounting the slow clients seems to fix the grant issue as they get their full speed back. It's worth noting that in our case, due to

An easy way to detect clients having grant issues is to check the osc rpc_stats on the clients:

$ lctl set_param osc.fir-OST00*.rpc_stats=clear
# perform I/Os
$ lctl get_param osc.fir-OST00*.rpc_stats

Then check the "rpcs in flight" histogram. For instance, on a slow client we had this:

rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       0   0   0   |         39 100 100

And on a good/normal/fast client:

rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       0   0   0   |          6   2   2
2:                       0   0   0   |         86  33  36
3:                       0   0   0   |         15   5  42
4:                       0   0   0   |         16   6  48
5:                       0   0   0   |         15   5  54
6:                       0   0   0   |         18   7  61
7:                       0   0   0   |         29  11  73
8:                       0   0   0   |         27  10  83
9:                       0   0   0   |         38  15  98
10:                      0   0   0   |          3   1 100

We're currently in the process of detecting all clients having this issue (we have ~1,500 to check) and we'll remount Lustre on these. Thanks Patrick!

Stephane |
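To scan a whole cluster for affected clients, a rough sweep based on the rpc_stats check above could look like this (a sketch; the clush nodeset, fanout, OSC pattern and probe file path are assumptions):

# clear the stats, generate a little write traffic, then collect the histograms per node
clush -w @compute -f 64 '
  lctl set_param osc.fir-OST00*.rpc_stats=clear
  dd if=/dev/zero of=/fir/$USER/grantprobe.$(hostname -s) bs=1M count=64 conv=fsync 2>/dev/null
  lctl get_param osc.fir-OST00*.rpc_stats
' > rpc_stats.all
# clush prefixes each output line with the node name, so look in rpc_stats.all for nodes
# whose write "rpcs in flight" histogram (for the OSCs that received writes) puts 100% of
# RPCs in the "1:" bucket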
| Comment by Patrick Farrell (Inactive) [ 12/Feb/19 ] |
|
Thanks Stephane. |