Details
Type: Bug
Resolution: Not a Bug
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.12.0
Component/s: None
Environment: CentOS 7.6; Sherlock cluster clients: kernel 3.10.0-957.5.1.el7.x86_64, lustre-client 2.12.0 (from wc); server: Fir running 2.12.0, kernel 3.10.0-957.1.3.el7_lustre.x86_64
Severity: 3
Description
Hello! We started production on 2.12 clients and 2.12 servers (scratch filesystem) last week (we still have the Oak servers on 2.10, also mounted on Sherlock). The cluster has stabilized, but we now have a major issue with slow clients. Some clients are slow and we've been trying all day to figure out why, without success. Other clients run just fine; only some of them are slow. Hopefully someone will have a clue, as this is leaving many users unhappy at the moment...
Let's take two Lustre 2.12 clients on the same IB fabric (we have two separate fabrics on this cluster): they use the same Lustre routers, the same hardware, and the same OS image.
sh-ln05 is very slow at the moment; a simple dd to /fir gives:
[root@sh-ln05 sthiell]# dd if=/dev/zero of=seqddout1M bs=1M count=1000 conv=fsync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 131.621 s, 8.0 MB/s
sh-ln06, writing to the same file (same striping), runs just fine:
[root@sh-ln06 sthiell]# dd if=/dev/zero of=seqddout1M bs=1M count=1000 conv=fsync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.52442 s, 688 MB/s
Both of these nodes are in production under a medium load. On some other, less loaded nodes, I get 1.2 GB/s with the same dd.
We started with Large Bulk I/O (16 MB RPCs) and tried reverting to 4 MB, but it didn't change anything. On the slow clients we also tried various things such as clearing the LDLM LRU, dropping caches, and swapoff, with no luck. NRS is off, so we're using fifo.
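For reference, the steps above were roughly along these lines (a sketch of the standard lctl/proc commands, run as root on a slow client; exact parameter globs may differ from what we actually typed):
# revert bulk RPC size from 16 MB back to 4 MB (1024 x 4 KiB pages per RPC)
lctl set_param osc.fir-OST*.max_pages_per_rpc=1024
# clear cached LDLM locks on the client
lctl set_param ldlm.namespaces.*.lru_size=clear
# drop page/dentry/inode caches and disable swap
echo 3 > /proc/sys/vm/drop_caches
swapoff -a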
We use DoM + PFL. Here is the layout of the file seqddout1M used above:
[root@sh-ln06 sthiell]# lfs getstripe seqddout1M
seqddout1M
lcm_layout_gen: 9
lcm_mirror_count: 1
lcm_entry_count: 6
lcme_id: 1
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 131072
lmm_stripe_count: 0
lmm_stripe_size: 131072
lmm_pattern: mdt
lmm_layout_gen: 0
lmm_stripe_offset: 0
lcme_id: 2
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 131072
lcme_extent.e_end: 16777216
lmm_stripe_count: 1
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 41
lmm_objects:
- 0: { l_ost_idx: 41, l_fid: [0x100290000:0xb3f46:0x0] }
lcme_id: 3
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 16777216
lcme_extent.e_end: 1073741824
lmm_stripe_count: 2
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 26
lmm_objects:
- 0: { l_ost_idx: 26, l_fid: [0x1001a0000:0xb3f5c:0x0] }
- 1: { l_ost_idx: 19, l_fid: [0x100130000:0xb401e:0x0] }
lcme_id: 4
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 1073741824
lcme_extent.e_end: 34359738368
lmm_stripe_count: 4
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 9
lmm_objects:
- 0: { l_ost_idx: 9, l_fid: [0x100090000:0xb41eb:0x0] }
- 1: { l_ost_idx: 43, l_fid: [0x1002b0000:0xb3f4a:0x0] }
- 2: { l_ost_idx: 42, l_fid: [0x1002a0000:0xb408a:0x0] }
- 3: { l_ost_idx: 2, l_fid: [0x100020000:0xb3f50:0x0] }
lcme_id: 5
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 34359738368
lcme_extent.e_end: 274877906944
lmm_stripe_count: 8
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
lcme_id: 6
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 274877906944
lcme_extent.e_end: EOF
lmm_stripe_count: 16
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
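For reference, a DoM + PFL layout like the one above would typically be created with an lfs setstripe command along these lines (a sketch only; the actual default layout on /fir was set on directories and the exact options may have differed):
lfs setstripe -E 128K -L mdt -E 16M -c 1 -S 4M -E 1G -c 2 -S 4M -E 32G -c 4 -S 4M -E 256G -c 8 -S 4M -E -1 -c 16 -S 4M <dir>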
Other client config:
osc.fir-OST*.max_dirty_mb=256
osc.fir-OST*.max_pages_per_rpc=1024
osc.fir-OST*.max_rpcs_in_flight=8
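To compare a slow client against a fast one, the per-OSC import state and RPC statistics can also be collected with lctl, e.g. (a sketch; parameter names as in 2.12):
# import state and in-flight RPC counts for each OST
lctl get_param osc.fir-OST*.import
# reset the per-OSC RPC histograms, run the dd, then read them back
lctl set_param osc.fir-OST*.rpc_stats=0
lctl get_param osc.fir-OST*.rpc_stats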
A full Lustre debug log captured during a slow dd on sh-ln05 is attached to this ticket.
Note: we are seeing the same behavior when using /fir (2.12 servers) and /oak (2.10 servers), so this really does seem to originate from the Lustre client itself.
We also checked the state of the IB fabric and everything looks good.
Any idea how to find the root cause of this major, seemingly random 2.12 client slowness?
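(For completeness, one way to rule out the fabric/LNet path independently of the filesystem is an LNet self-test between a slow client and a server; a sketch with placeholder NIDs, run from any node with lnet_selftest loaded:
modprobe lnet_selftest
export LST_SESSION=$$
lst new_session read_test
lst add_group clients 10.9.0.5@o2ib4     # placeholder NID of the slow client
lst add_group servers 10.9.100.1@o2ib4   # placeholder NID of one OSS
lst add_batch bulk_read
lst add_test --batch bulk_read --from clients --to servers brw read size=1M
lst run bulk_read
lst stat clients servers                 # watch bandwidth for a while, then Ctrl-C
lst end_session
We have not pasted those numbers here, but they can be provided if useful.)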
Thanks!!
Stephane