[LU-11957] Slow clients Created: 12/Feb/19  Updated: 01/Apr/22  Resolved: 12/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Patrick Farrell (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

CentOS 7.6, Sherlock Cluster. Clients: kernel 3.10.0-957.5.1.el7.x86_64, lustre-client 2.12.0 (from wc). Server: Fir running 2.12.0, kernel 3.10.0-957.1.3.el7_lustre.x86_64


Attachments: File lustre-log-sh-ln05.log.gz    
Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hello! We started production on 2.12 clients and 2.12 servers (scratch filesystem) last week (we still have the Oak servers on 2.10, also mounted on Sherlock). The cluster has stabilized, but we now have a major issue with slow clients. Some clients are slow, and we have been trying all day to figure out why, without success. Other clients run just fine; only some of them are slow. Hopefully someone will have a clue, as this is leading to many unhappy users at the moment...

Let's take two Lustre 2.12 clients on the same IB fabric (we have two separate fabrics on this cluster), using the same Lustre routers, the same hardware, and the same OS image:

sh-ln05 is very slow at the moment; a simple dd to /fir gives:

[root@sh-ln05 sthiell]# dd if=/dev/zero of=seqddout1M bs=1M count=1000 conv=fsync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 131.621 s, 8.0 MB/s

sh-ln06, writing to the same file (same striping), runs just fine:

[root@sh-ln06 sthiell]# dd if=/dev/zero of=seqddout1M bs=1M count=1000 conv=fsync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.52442 s, 688 MB/s

Both of these nodes are in production under medium load. On some other, less loaded nodes, I get 1.2 GB/s with the same dd.

We started with 16MB large bulk I/O and tried reverting to 4MB, but this didn't change anything. On those clients, we tried various things like clearing the LDLM LRU, dropping caches, and turning off swap, with no luck. NRS is off, so we are using FIFO.
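
For reference, roughly the kind of commands involved (a sketch using the standard tunables; exact values may have differed):

# Large bulk I/O: 16MB = 4096 pages per RPC; reverting to 4MB on the clients
lctl set_param osc.fir-OST*.max_pages_per_rpc=1024

# Other things tried on the slow clients, with no effect
lctl set_param ldlm.namespaces.*osc*.lru_size=clear   # clear the client LDLM lock LRU
echo 3 > /proc/sys/vm/drop_caches                     # drop page/dentry/inode caches
swapoff -a                                            # disable swap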

We use DoM + PFL. Example for this file seqddout1M:

[root@sh-ln06 sthiell]# lfs getstripe seqddout1M 
seqddout1M
  lcm_layout_gen:    9
  lcm_mirror_count:  1
  lcm_entry_count:   6
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   131072
      lmm_stripe_count:  0
      lmm_stripe_size:   131072
      lmm_pattern:       mdt
      lmm_layout_gen:    0
      lmm_stripe_offset: 0

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 131072
    lcme_extent.e_end:   16777216
      lmm_stripe_count:  1
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 41
      lmm_objects:
      - 0: { l_ost_idx: 41, l_fid: [0x100290000:0xb3f46:0x0] }

    lcme_id:             3
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 16777216
    lcme_extent.e_end:   1073741824
      lmm_stripe_count:  2
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 26
      lmm_objects:
      - 0: { l_ost_idx: 26, l_fid: [0x1001a0000:0xb3f5c:0x0] }
      - 1: { l_ost_idx: 19, l_fid: [0x100130000:0xb401e:0x0] }

    lcme_id:             4
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   34359738368
      lmm_stripe_count:  4
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 9
      lmm_objects:
      - 0: { l_ost_idx: 9, l_fid: [0x100090000:0xb41eb:0x0] }
      - 1: { l_ost_idx: 43, l_fid: [0x1002b0000:0xb3f4a:0x0] }
      - 2: { l_ost_idx: 42, l_fid: [0x1002a0000:0xb408a:0x0] }
      - 3: { l_ost_idx: 2, l_fid: [0x100020000:0xb3f50:0x0] }

    lcme_id:             5
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 34359738368
    lcme_extent.e_end:   274877906944
      lmm_stripe_count:  8
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

    lcme_id:             6
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 274877906944
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  16
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
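
(For reference, a DoM + PFL layout with these components could be created with an lfs setstripe command roughly like the sketch below; this is an illustration rather than the exact command used on /fir, and the directory path is a placeholder:)

# DoM component to 128K on the MDT, then 1/2/4/8 stripes at the 16M/1G/32G/256G
# boundaries and 16 stripes to EOF, with a 4M stripe size throughout
lfs setstripe -E 128K -L mdt -E 16M -c 1 -S 4M -E 1G -c 2 -S 4M \
    -E 32G -c 4 -S 4M -E 256G -c 8 -S 4M -E -1 -c 16 -S 4M /fir/some/directory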

Other client config:
osc.fir-OST*.max_dirty_mb=256
osc.fir-OST*.max_pages_per_rpc=1024
osc.fir-OST*.max_rpcs_in_flight=8

I have attached to this ticket a full Lustre debug log captured during a slow dd on sh-ln05.
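
(The log was collected with the usual debug procedure, roughly:)

lctl set_param debug=-1                                      # enable all debug flags
lctl clear                                                   # reset the kernel debug buffer
dd if=/dev/zero of=seqddout1M bs=1M count=1000 conv=fsync    # the slow dd
lctl dk > /tmp/lustre-log-sh-ln05.log                        # dump the debug buffer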

Note: we are seeing the same behavior when using /fir (2.12 servers) and /oak (2.10 servers). This definitely seems to originate from the Lustre client itself.

We also checked the state of IB and everything looks good.

Any idea how to find the root cause of this major, random 2.12 client slowness?

Thanks!!
Stephane



 Comments   
Comment by Stephane Thiell [ 12/Feb/19 ]

I've also run some lnet_selftest benchmarks between Fir and different Lustre clients, both affected and unaffected by the slow I/O, and there is no significant difference between them.
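
(These were simple brw write tests, roughly of the following form; the NIDs and session name are placeholders:)

modprobe lnet_selftest
export LST_SESSION=$$
lst new_session bw_test
lst add_group clients <client_nid>@o2ib
lst add_group servers <fir_server_nid>@o2ib
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers      # prints rates until interrupted with Ctrl-C
lst end_session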

This is terrible:

[root@sh-ln05 sthiell]# uptime
 20:15:14 up 5 days, 23:41, 21 users,  load average: 1.03, 1.12, 0.93

[root@sh-ln05 sthiell]# strace -T -ewrite -f dd if=/dev/zero of=seqddout1M bs=4M count=1000 conv=fsync
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.067278>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.061963>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.063141>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.062815>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.395379>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.309614>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.278361>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.377247>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.227987>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.221452>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.261344>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.195638>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.195916>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.868780>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.257614>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.300075>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.226142>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.584341>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.195298>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.299334>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.376611>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.226417>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.199998>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.277209>

A working client (sh-ln06):

[root@sh-ln06 ~]# uptime
 20:15:31 up 3 days, 12:52, 25 users,  load average: 2.43, 2.43, 2.10
[root@sh-ln06 sthiell]# strace -T -ewrite -f dd if=/dev/zero of=seqddout1M bs=4M count=1000 conv=fsync
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.004063>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003959>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003832>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003740>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003756>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003621>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003753>
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4194304) = 4194304 <0.003889>

There are no Lustre logs on either the clients or the servers, so this is super weird.

I've checked various rpc_stats but didn't find anything obvious (but maybe you will...). Example with fir-OST0000:

[root@sh-ln05 ~]# lctl get_param osc.fir-OST0000*.rpc_stats
osc.fir-OST0000-osc-ffff91bec5338800.rpc_stats=
snapshot_time:         1549945049.570089887 (secs.nsecs)
read RPCs in flight:  0
write RPCs in flight: 0
pending write pages:  0
pending read pages:   0

			read			write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:		        12   0   0   |        332   6   6
2:		         4   0   0   |        266   5  12
4:		        57   2   3   |        282   5  18
8:		         1   0   3   |        123   2  20
16:		         2   0   3   |         26   0  21
32:		        14   0   4   |         25   0  22
64:		         6   0   4   |         44   0  22
128:		        36   1   6   |         74   1  24
256:		       326  15  21   |        190   3  28
512:		       242  11  33   |        584  12  40
1024:		      1387  66 100   |       2844  59 100

			read			write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:		         0   0   0   |          0   0   0
1:		      1550  74  74   |       3376  70  70
2:		       432  20  94   |        800  16  87
3:		        94   4  99   |        316   6  93
4:		        11   0 100   |        157   3  97
5:		         0   0 100   |         71   1  98
6:		         0   0 100   |         30   0  99
7:		         0   0 100   |         18   0  99
8:		         0   0 100   |         14   0  99
9:		         0   0 100   |          8   0 100

			read			write
offset                rpcs   % cum % |       rpcs   % cum %
0:		         0   0   0   |          0   0   0
1:		         0   0   0   |          0   0   0
2:		         0   0   0   |          0   0   0
4:		         0   0   0   |          0   0   0
8:		         0   0   0   |          0   0   0
16:		         0   0   0   |          0   0   0
32:		       255  12  12   |       1764  36  36
64:		         5   0  12   |          3   0  36
128:		         2   0  12   |          4   0  36
256:		        62   2  15   |         10   0  37
512:		        50   2  17   |         15   0  37
1024:		       101   4  22   |        126   2  40
2048:		       330  15  38   |        468   9  49
4096:		       237  11  49   |        565  11  61
8192:		       315  15  65   |        558  11  73
16384:		       197   9  74   |        626  13  86
32768:		       189   9  83   |        244   5  91
65536:		       344  16 100   |        219   4  96
131072:		         0   0 100   |        188   3 100

Working client:

[root@sh-ln06 ~]# lctl get_param osc.fir-OST0000*.rpc_stats
osc.fir-OST0000-osc-ffff9bad01395000.rpc_stats=
snapshot_time:         1549945096.503096308 (secs.nsecs)
read RPCs in flight:  0
write RPCs in flight: 0
pending write pages:  0
pending read pages:   0

			read			write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:		        21   0   0   |         53   0   0
2:		       136   1   2   |         82   1   2
4:		       169   2   4   |         79   1   3
8:		       155   1   6   |         30   0   3
16:		        70   0   7   |         32   0   4
32:		       891  11  18   |        103   1   5
64:		        60   0  19   |        150   2   7
128:		       273   3  22   |        190   2  10
256:		      1248  16  38   |        168   2  13
512:		      1315  16  55   |        247   3  17
1024:		      3461  44 100   |       5514  82  99
2048:		         0   0 100   |          1   0 100

			read			write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:		         0   0   0   |          0   0   0
1:		      7205  92  92   |       5751  86  86
2:		       517   6  99   |        508   7  94
3:		        58   0  99   |        118   1  95
4:		        14   0  99   |         68   1  96
5:		         3   0  99   |         49   0  97
6:		         2   0 100   |         41   0  98
7:		         0   0 100   |         37   0  98
8:		         0   0 100   |         35   0  99
9:		         0   0 100   |         42   0 100

			read			write
offset                rpcs   % cum % |       rpcs   % cum %
0:		         0   0   0   |          0   0   0
1:		         0   0   0   |          0   0   0
2:		         0   0   0   |          0   0   0
4:		         0   0   0   |          0   0   0
8:		         0   0   0   |          0   0   0
16:		         0   0   0   |          0   0   0
32:		       308   3   3   |        303   4   4
64:		         6   0   4   |          1   0   4
128:		         9   0   4   |         12   0   4
256:		       118   1   5   |          8   0   4
512:		       124   1   7   |         24   0   5
1024:		       192   2   9   |        236   3   8
2048:		      1370  17  27   |        893  13  22
4096:		       970  12  39   |        549   8  30
8192:		      1246  15  55   |        789  11  42
16384:		       595   7  63   |        652   9  52
32768:		       305   3  67   |        835  12  64
65536:		       959  12  79   |        919  13  78
131072:		       215   2  82   |        146   2  80
262144:		       190   2  84   |        258   3  84
524288:		       630   8  92   |        515   7  92
1048576:		       562   7 100   |        509   7 100
Comment by Peter Jones [ 12/Feb/19 ]

Patrick is investigating

Comment by Stephane Thiell [ 12/Feb/19 ]

Thanks Peter!

Patrick, we've just dumped a compute node that was in that "slow state"; perhaps that could help? Also, here are the results of a simple dd to /fir (in MB/s) on all our nodes, just to give an idea of the scope of the issue (most of the cluster is running fine):

# awk '{gsub(/\/s/,""); if ($NF=="GB") {print $(NF-1)*1000 } else {print $(NF-1) } }' stateall | histogram.py
# NumSamples = 1306; Min = 7.20; Max = 1500.00
# Mean = 1079.677106; Variance = 95975.997401; SD = 309.799931; Median 1200.000000
# each ∎ represents a count of 6
    7.2000 -   156.4800 [    20]: ∎∎∎
  156.4800 -   305.7600 [    35]: ∎∎∎∎∎
  305.7600 -   455.0400 [    37]: ∎∎∎∎∎∎
  455.0400 -   604.3200 [    27]: ∎∎∎∎
  604.3200 -   753.6000 [    72]: ∎∎∎∎∎∎∎∎∎∎∎∎
  753.6000 -   902.8800 [    93]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  902.8800 -  1052.1600 [   126]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1052.1600 -  1201.4400 [   484]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1201.4400 -  1350.7200 [   245]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1350.7200 -  1500.0000 [   167]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
[root@sh-hn01 tmp]#

We still have one login node that is super slow, sh-ln05. It has been closed to new connections since we discovered the problem, but users are still connected. Let me know what would be useful to collect from this one. dd still shows a speed of ~10 MB/s max...

 

Comment by Patrick Farrell (Inactive) [ 12/Feb/19 ]

Stephane,

The RPC stats are very interesting...

Could you clear them and collect them again? It's hard to believe you've got ~700 MB/s on one client and 8 MB/s on another while both have only 1 RPC in flight... Maybe you've just got super fast OSTs.

In any case, the stats don't quite bear it out, but this still looks a lot like a grant problem.

Can you run lctl get_param osc.[the OST we're focusing on].grant on a good client and a not-so-good client?
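
Concretely, something along these lines on both a slow and a fast client (cur_grant_bytes is the per-OSC grant counter):

lctl get_param osc.fir-OST0000*.cur_grant_bytes    # compare slow vs. fast client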

If you can, I'd also suggest remounting the "not good" client(s).

Comment by Stephane Thiell [ 12/Feb/19 ]

Remounting the slow clients seems to fix the grant issue, as they get their full speed back. It's worth noting that in our case, due to LU-11939, we had to change max_dirty_mb live (reducing it from 2000 to 256 for now). This could have led to the grant issues we're seeing.
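
(For the record, the live change was of this form, and the fix is a plain client remount; the MGS NID below is a placeholder:)

lctl set_param osc.fir-OST*.max_dirty_mb=256     # live change made because of LU-11939
umount /fir                                      # remount restores normal grants and speed
mount -t lustre <mgsnode>@o2ib:/fir /fir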

 

An easy way to detect clients having grant issues is to check the osc rpc_stats on the clients:

 $ lctl set_param osc.fir-OST00*.rpc_stats=clear
 # perform I/Os
 $ lctl get_param osc.fir-OST00*.rpc_stats
 

Then check the "rpcs in flight" histogram; for instance, on a slow client we had this:

                      read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       0   0   0   |         39 100 100
 

And on a good/normal/fast client:

                      read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       0   0   0   |          6   2   2
2:                       0   0   0   |         86  33  36
3:                       0   0   0   |         15   5  42
4:                       0   0   0   |         16   6  48
5:                       0   0   0   |         15   5  54
6:                       0   0   0   |         18   7  61
7:                       0   0   0   |         29  11  73
8:                       0   0   0   |         27  10  83
9:                       0   0   0   |         38  15  98
10:                      0   0   0   |          3   1 100
 

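(At our scale, a sketch of running this check on many nodes at once with clush; the nodegroup and scratch path are placeholders:)

clush -bw @compute 'lctl set_param osc.fir-OST00*.rpc_stats=clear'
clush -bw @compute 'dd if=/dev/zero of=/fir/tmp/rpccheck.$(hostname) bs=1M count=256 conv=fsync 2>&1 | tail -1'
clush -bw @compute 'lctl get_param osc.fir-OST00*.rpc_stats | grep -A3 "rpcs in flight"'
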
We're currently in the process of detecting all clients with this issue (we have ~1,500 to check) and will remount Lustre on those. Thanks Patrick!

Stephane

Comment by Patrick Farrell (Inactive) [ 12/Feb/19 ]

Thanks Stephane.
