|
Michael
You have opened this ticket as severity 1, which means that your whole filesystem is out of service - is that the case? From the description it sounds like you are in production but want to root-cause what has happened to the impacted files.
Peter
|
|
Hi Peter,
It has gone up and down at least 2 times since 5pm today. We had it running, but shortly after we lost 2 OSSs. So basically it's not stable or usable at this point, but it's not completely down either. Sorry for not selecting the correct value; your help page didn't have info on that selection when I looked.
Mike
|
|
We are seeing errors like this on an OST, see attached.
|
|
No problem Michael - I just wanted to be clear as to what the priority was at this point. It sounds like getting things stable is the first priority.
Shilong - could you please advise?
|
|
Also, we continue to have issues with OSTs showing these errors: the OSTs won't stay mounted, and this is causing the OSS nodes to consistently reboot
|
|
Update: we haven't had an OST failover in about 30 minutes, so it seems stable, but we have OSTs that won't fail back without failing.
|
|
Michael,
can you run
lctl get_param osc.*.checksum_type
on one of the clients seeing the error, e.g. 10.31.164.172 or 10.31.163.222
and gather the output?
Thanks
|
|
Hi Dongyang,
The output of that command is attached.
holy2b11102.out
|
|
Hi mre64,
Are the clients, LNet routers, and servers all using ECC memory? Are there any memory errors in the logs on the path from client to server? It would also be good to rule out any network hardware errors.
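For reference, one way to check for memory and link errors on CentOS nodes (assuming the edac-utils, mcelog, and infiniband-diags packages are available; adjust to whatever tooling you use) might be:
edac-util -v                  # per-DIMM corrected/uncorrected error counts from the EDAC driver
mcelog --client               # machine-check events logged by the mcelog daemon
grep -iE 'EDAC|Hardware Error|Machine Check' /var/log/messages
ibqueryerrors                 # IB ports with non-zero error counters on the fabric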
|
|
Hi John,
Yes they are using ECC memory. I checked the server side Lustre nodes, the lnet routers, and the client nodes that were referenced by the BAD WRITE CHECKSUM and none of them have memory errors.
|
|
Thanks Michael.
Some questions to help us isolate this:
- Are the checksum errors associated with multiple applications?
- Did they start soon after the router upgrade?
- Do they only occur on routed clients?
- Would it be possible to revert the router upgrade and see if the checksum errors stop for RPCs that use the downgraded router?
- Would it be possible to bring some clients to 2.12.4 and see if they occur?
|
|
Hi John,
1. It seems the CHECKSUM error was coming from the same type of application and user. However, that user has been running the same type of jobs for months on the same group of nodes.
2. We have been changing the lnet routers to be exactly the same over the past 2-3 weeks, so the changes were not all done within a short period of time.
3. We saw the BAD WRITE CHECKSUM on both routed and directly connected (via HDR IB) nodes.
4. I don't think we are in the position to change the lnet routers back to 2.13.0. Also since we saw the CHECKSUM error coming from nodes that are not using the lnet routers, it seems the routers are not the problem.
5. We are running a specific kernel and OS, so it may be possible to update the client on some of the compute nodes if 2.12.4 installs without issues or errors. Our compute nodes are running CentOS 7.6.1810 with the 3.10.0-957.12.1.el7.x86_64 kernel and the inbox OFED drivers, not MLNX OFED. We would rather not do this, to be honest.
|
|
Thanks Michael,
from the output we can see the checksum type is crc32c:
osc.scratch1-OST0009-osc-ffff9ff17a698000.checksum_type=crc32 adler [crc32c]
osc.scratch1-OST001a-osc-ffff9ff17a698000.checksum_type=crc32 adler [crc32c]
You mentioned the clients are using the inbox OFED drivers -
have they always been using the inbox drivers, and which version?
What about the lnet routers and OSS servers - are they using MOFED, and is it the same version as the clients?
|
|
Hi Dongyang,
The compute clients have always been using the OFED INBOX drivers, for example:
[root@holy7c02108 ~]# modinfo mlx5_ib
filename: /lib/modules/3.10.0-957.12.1.el7.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz
license: Dual BSD/GPL
description: Mellanox Connect-IB HCA IB driver
author: Eli Cohen <eli@mellanox.com>
retpoline: Y
rhelversion: 7.6
srcversion: 3B27ACD7C17E508C4D27B18
depends: mlx5_core,ib_core
intree: Y
vermagic: 3.10.0-957.12.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: 2C:7C:17:70:5C:86:D4:20:80:50:D3:F5:54:56:9A:7B:D3:BF:D1:BF
sig_hashalgo: sha256
The lnet routers are using MLNX_OFED_LINUX-4.7-1.0.0.1 (OFED-4.7-1.0.0).
The lustre server is using 4.7.1.0.0.1:
-bash-4.2$ rpm -qa |grep -i mlnx
mlnx-ofa_kernel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64
libibumad-static-43.1.1.MLNX20190905.1080879-0.1.47329.x86_64
mlnx-ofa_kernel-devel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64
ibutils2-2.1.1-0.113.MLNX20191121.g1c29603.47329.x86_64
libibumad-devel-43.1.1.MLNX20190905.1080879-0.1.47329.x86_64
libibumad-43.1.1.MLNX20190905.1080879-0.1.47329.x86_64
kmod-mlnx-ofa_kernel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64
|
|
Also, the lnet routers have the MLNX OFED "upstream" drivers which are the latest/greatest while it appears the server side is using the MLNX legacy driver:
Server side:
-bash-4.2$ modinfo mlx5_ib
filename: /lib/modules/3.10.0-1062.1.1.el7_lustre.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
license: Dual BSD/GPL
description: Mellanox Connect-IB HCA IB driver
author: Eli Cohen <eli@mellanox.com>
retpoline: Y
rhelversion: 7.7
srcversion: 706E928F4D4ECF2659B961F
depends: mlx5_core,ib_core,ib_uverbs,mlx_compat
vermagic: 3.10.0-1062.1.1.el7_lustre.x86_64 SMP mod_unload modversions
parm: dc_cnak_qp_depth:DC CNAK QP depth (uint)
-bash-4.2$ rpm -qf /lib/modules/3.10.0-1062.1.1.el7_lustre.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
kmod-mlnx-ofa_kernel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64
However, as I mentioned before, we have some compute nodes that are connected to the same HDR fabric as the lustre server (holyscratch01) and don't go through the lnet routers. Those nodes were also giving the BAD WRITE CHECKSUM error.
|
|
I would also like to mention that I'm seeing a lot of these server_bulk_callback errors every 3-5 seconds on oss03. I have also seen these errors on oss06 (which has been flaky lately); we ended up failing all of oss06's OSTs over to oss05 because of its instability:
Jun 4 21:49:48 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ecbd649ae00
Jun 4 21:49:48 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec19c3c1000
Jun 4 21:49:48 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9eb2c6e59000
Jun 4 21:49:52 holyscratch01oss03 kernel: LustreError: 10895:0:(ldlm_lib.c:3262:target_bulk_io()) @@@ network error on bulk READ req@ffff9ea2b5496050 x1661185873568080/t0(0) o3->64c2d217-765e-c934-68ad-1480c3b9eac2@10.31.130.246@tcp:2/0 lens 608/440 e 0 to 0 dl 1591321807 ref 1 fl Interpret:/0/0 rc 0/0
Jun 4 21:49:52 holyscratch01oss03 kernel: LustreError: 10895:0:(ldlm_lib.c:3262:target_bulk_io()) Skipped 134 previous similar messages
Jun 4 21:49:54 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec449cf6c00
Jun 4 21:50:00 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec19c3c2400
Jun 4 21:50:00 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ef420aeb200
Jun 4 21:50:07 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9eb2c6e58600
Jun 4 21:50:07 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec449cf0600
Jun 4 21:50:13 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec31bdc2c00
Jun 4 21:50:19 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ebbd9c8f200
Jun 4 21:50:19 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ef075f1a800
Jun 4 21:50:26 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec54f615a00
Jun 4 21:50:32 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ecbd6499000
Jun 4 21:50:38 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9e9ac410d600
Jun 4 21:50:38 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9e98ca39bc00
Jun 4 21:50:43 holyscratch01oss03 systemd: Starting IML Swap Emitter...
Jun 4 21:50:43 holyscratch01oss03 systemd: Started IML Swap Emitter.
Jun 4 21:50:45 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9eb49b8b9a00
Jun 4 21:50:57 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ecd6d92ba00
Jun 4 21:51:04 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9ee813e90000
Jun 4 21:51:04 holyscratch01oss03 kernel: Lustre: scratch1-OST0014: Bulk IO write error with a10910a5-f0a8-e30e-b540-a728a40bd183 (at 10.31.163.219@o2ib), client will retry: rc = -110
Jun 4 21:51:04 holyscratch01oss03 kernel: Lustre: Skipped 73 previous similar messages
Jun 4 21:51:07 holyscratch01oss03 kernel: LustreError: 11313:0:(sec.c:2485:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 3145728(4194304) req@ffff9eaf5d2a8850 x1644618841213632/t0(0) o4->a5f73f03-2a7d-38af-559e-bc8ad5f84416@10.31.167.138@o2ib:182/0 lens 608/448 e 0 to 0 dl 1591321987 ref 1 fl Interpret:/0/0 rc 0/0
Jun 4 21:51:07 holyscratch01oss03 kernel: LustreError: 11313:0:(sec.c:2485:sptlrpc_svc_unwrap_bulk()) Skipped 68 previous similar messages
Jun 4 21:51:10 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9eb161ddb200
Jun 4 21:51:10 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ee83c77a600
Jun 4 21:51:10 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ebbd9c8f400
Jun 4 21:51:23 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ebf48b49a00
Jun 4 21:51:23 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ecc98b41c00
Jun 4 21:51:29 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9edc45d7bc00
Jun 4 21:51:29 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ee83c77e800
|
|
Michael,
can you create a file using lfs setstripe, placing the stripes on the OSTs running on oss03 (or whichever OSS is showing the errors),
and then do a simple dd (both read and write) from the client and see if you are hitting the checksum errors?
|
|
Hi Dongyang,
I tried that and I am not seeing any checksum errors on the oss03 /var/log/messages file when I run the dd.
|
|
Hi Dongyang, this is what I did exactly:
local OSTs on oss03:
[root@holyscratch01oss03 ~]# df -hl
Filesystem Size Used Avail Use% Mounted on
devtmpfs 189G 0 189G 0% /dev
tmpfs 189G 39M 189G 1% /dev/shm
tmpfs 189G 43M 189G 1% /run
tmpfs 189G 0 189G 0% /sys/fs/cgroup
/dev/mapper/centos_holyscratch01oss03-root 218G 7.7G 211G 4% /
/dev/sda1 1014M 200M 815M 20% /boot
/dev/mapper/mpathg 85T 46T 35T 57% /mnt/scratch1-OST0014
/dev/mapper/mpathe 85T 45T 36T 56% /mnt/scratch1-OST000e
/dev/mapper/mpathi 85T 44T 37T 55% /mnt/scratch1-OST0002
/dev/mapper/mpathj 85T 44T 38T 54% /mnt/scratch1-OST0008
/dev/mapper/mpathm 85T 47T 35T 58% /mnt/scratch1-OST001a
/dev/mapper/mpathc 85T 47T 35T 58% /mnt/scratch1-OST002c
/dev/mapper/mpathn 85T 44T 37T 55% /mnt/scratch1-OST0020
/dev/mapper/mpatha 85T 45T 36T 56% /mnt/scratch1-OST0026
List the OSTs on the holyscratch01 lustre FS:
[root@holy2c24108 ~]# lfs osts /n/holyscratch01
OBDS:
0: scratch1-OST0000_UUID ACTIVE
1: scratch1-OST0001_UUID ACTIVE
2: scratch1-OST0002_UUID ACTIVE
3: scratch1-OST0003_UUID ACTIVE
4: scratch1-OST0004_UUID ACTIVE
5: scratch1-OST0005_UUID ACTIVE
6: scratch1-OST0006_UUID ACTIVE
7: scratch1-OST0007_UUID ACTIVE
8: scratch1-OST0008_UUID ACTIVE
9: scratch1-OST0009_UUID ACTIVE
10: scratch1-OST000a_UUID ACTIVE
11: scratch1-OST000b_UUID ACTIVE
12: scratch1-OST000c_UUID ACTIVE
13: scratch1-OST000d_UUID ACTIVE
14: scratch1-OST000e_UUID ACTIVE
15: scratch1-OST000f_UUID ACTIVE
16: scratch1-OST0010_UUID ACTIVE
17: scratch1-OST0011_UUID ACTIVE
18: scratch1-OST0012_UUID ACTIVE
19: scratch1-OST0013_UUID ACTIVE
20: scratch1-OST0014_UUID ACTIVE
21: scratch1-OST0015_UUID ACTIVE
22: scratch1-OST0016_UUID ACTIVE
23: scratch1-OST0017_UUID ACTIVE
24: scratch1-OST0018_UUID ACTIVE
25: scratch1-OST0019_UUID ACTIVE
26: scratch1-OST001a_UUID ACTIVE
27: scratch1-OST001b_UUID ACTIVE
28: scratch1-OST001c_UUID ACTIVE
29: scratch1-OST001d_UUID ACTIVE
30: scratch1-OST001e_UUID ACTIVE
31: scratch1-OST001f_UUID ACTIVE
32: scratch1-OST0020_UUID ACTIVE
33: scratch1-OST0021_UUID ACTIVE
34: scratch1-OST0022_UUID ACTIVE
35: scratch1-OST0023_UUID ACTIVE
36: scratch1-OST0024_UUID ACTIVE
37: scratch1-OST0025_UUID ACTIVE
38: scratch1-OST0026_UUID ACTIVE
39: scratch1-OST0027_UUID ACTIVE
40: scratch1-OST0028_UUID ACTIVE
41: scratch1-OST0029_UUID ACTIVE
42: scratch1-OST002a_UUID ACTIVE
43: scratch1-OST002b_UUID ACTIVE
44: scratch1-OST002c_UUID ACTIVE
45: scratch1-OST002d_UUID ACTIVE
46: scratch1-OST002e_UUID ACTIVE
47: scratch1-OST002f_UUID ACTIVE
Execute the lfs setstripe:
lfs setstripe --ost-list 2,8,14,20,26,32,38,44 /n/holyscratch01/rc_admin/methier/teststripe
The dd write:
[root@holy2c24108 ~]# dd if=/dev/urandom of=/n/holyscratch01/rc_admin/methier/teststripe count=1024 bs=10M
For the read I just swapped if and of:
[root@holy2c24108 ~]# dd of=/dev/urandom if=/n/holyscratch01/rc_admin/methier/teststripe count=1024 bs=10M
|
|
[root@holy2c24108 ~]# lfs getstripe /n/holyscratch01/rc_admin/methier/teststripe
/n/holyscratch01/rc_admin/methier/teststripe
lmm_stripe_count: 8
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 2
obdidx objid objid group
2 78461277 0x4ad395d 0
8 79385285 0x4bb52c5 0
14 77802814 0x4a32d3e 0
20 77643661 0x4a0bf8d 0
26 77265238 0x49af956 0
32 79283352 0x4b9c498 0
38 78144208 0x4a862d0 0
44 77354743 0x49c56f7 0
|
|
Just to clarify, you are still running 2.12.3 on the servers, and not 2.12.4? There was a known issue with 2.12.3 (LU-13020, LU-13145) that affected LNet under load on larger systems and may be contributing to the problem here. It appears that you are seeing the checksum errors because the bulk data transfers are being interrupted.
A workaround to get behavior equivalent to 2.12.4 without upgrading or applying a patch is to run the following commands on all of the 2.12.3 nodes, in the shown order:
echo 150 > /sys/module/lnet/parameters/lnet_transaction_timeout
echo 3 > /sys/module/lnet/parameters/lnet_retry_count
This only temporarily changes these values, but they can be set permanently by adding the following line in /etc/modprobe.d/lnet.conf on all 2.12.3 nodes:
options lnet lnet_transaction_timeout=150 lnet_retry_count=3
Another thing to try to reduce the severity of the problem, if the above does not help, would be setting "lctl set_param osc.*.max_pages_per_rpc=1M" on the clients. This reduces the number of bulk transfers per RPC, which should at least avoid the checksum errors being reported, and also avoid network traffic congestion if there are still transfer errors, since smaller RPCs need less data resent each time an error is hit. This only affects the currently mounted clients, so you could also set "lctl set_param obdfilter.*.brw_size=1" on all OSS nodes to limit this for future client mounts as well.
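For reference, a minimal sketch of applying and verifying these (assuming the scratch1 filesystem name seen in the outputs above):
# on the clients (1M is converted to 256x 4KB pages):
lctl set_param osc.scratch1-*.max_pages_per_rpc=1M
lctl get_param osc.scratch1-*.max_pages_per_rpc
# on the OSS nodes, to limit the size offered to future client mounts:
lctl set_param obdfilter.scratch1-*.brw_size=1
lctl get_param obdfilter.scratch1-*.brw_size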
|
|
Hi Andreas,
Thanks for the useful info. Below is what we have set on the lnet routers; do you see an issue with these settings?
Thanks,
Mike
[root@cannonlnet06 ~]# more /etc/modprobe.d/lustre.conf
options lnet networks="o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)"
options lnet forwarding="enabled"
options lnet lnet_peer_discovery_disabled=1
options lnet lnet_health_sensitivity=0
[root@cannonlnet06 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 0
drop_asym_route: 0
retry_count: 0
transaction_timeout: 50
health_sensitivity: 0
recovery_interval: 1
|
|
Also, the lustre FS we have checksum errors on is lustre 2.12.3, yes. We have another lustre FS nearby running 2.12.4 and we have not seen checksum errors or instability on it recently. It's most likely not getting hit with as much file I/O, as it's a lab data storage server. This scratch01 server (where the checksum errors occur) gets hammered by 2000 or so compute nodes as temporary disk space where they write/read their compute output.
|
|
Serguei, could you review the LNet parameters in comment-272090?
I would also recommend trying the newer 2.12.4 client on at least some of the clients to determine if this resolves the issue.
|
|
Hi,
LNet parameters look fine to me. Perhaps ashehata can take a quick look to double-check. Because there appear to be multiple interfaces of the same kind, I'd also recommend checking if Linux routing is set up as outlined here: http://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node.
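For reference, a minimal sketch of the kind of sysctl settings commonly used to avoid ARP flux on multi-homed nodes (these are generic examples, not a copy of the guide; please use the exact settings from the link above):
sysctl -w net.ipv4.conf.all.arp_ignore=1    # only answer ARP for addresses configured on the receiving interface
sysctl -w net.ipv4.conf.all.arp_announce=2  # use the best local address for the target when sending ARP requests
sysctl -w net.ipv4.conf.all.rp_filter=2     # loose reverse-path filtering if replies may return on a different interface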
Thanks,
Serguei.
|
|
Hi Andreas,
I tried setting:
echo 150 > /sys/module/lnet/parameters/lnet_transaction_timeout
echo 3 > /sys/module/lnet/parameters/lnet_retry_count
But it's not letting me change retry_count:
[root@salt ~]# salt 'holyscratch01*' cmd.run "echo 3 > /sys/module/lnet/parameters/lnet_retry_count"
holyscratch01oss01.rc.fas.harvard.edu:
/bin/sh: line 0: echo: write error: Invalid argument
holyscratch01oss06.rc.fas.harvard.edu:
/bin/sh: /sys/module/lnet/parameters/lnet_retry_count: No such file or directory
holyscratch01oss04.rc.fas.harvard.edu:
/bin/sh: line 0: echo: write error: Invalid argument
holyscratch01oss02.rc.fas.harvard.edu:
holyscratch01oss03.rc.fas.harvard.edu:
/bin/sh: /sys/module/lnet/parameters/lnet_retry_count: No such file or directory
holyscratch01mds02.rc.fas.harvard.edu:
/bin/sh: line 0: echo: write error: Invalid argument
holyscratch01oss05.rc.fas.harvard.edu:
/bin/sh: line 0: echo: write error: Invalid argument
holyscratch01mds01.rc.fas.harvard.edu:
What do you recommend? Is the order correct? I changed transaction_timeout to 150 and then tried to change it to a lower value, and it doesn't let me now. There must be some kind of dependency between these two parameters that doesn't allow you to change them.
Thanks,
Mike
|
|
[root@holyscratch01oss04 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 0
drop_asym_route: 0
retry_count: 0
transaction_timeout: 150
health_sensitivity: 0
recovery_interval: 1
[root@holyscratch01oss04 ~]# lnetctl set retry_count 3
add:
- retry_count:
errno: -5
descr: "cannot configure retry count: Invalid argument"
|
|
Hi mre64,
If you have health_sensitivity set to 0, you are prevented from setting non-zero retry_count as the health feature is off. You should still be able to change transaction_timeout though.
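For example, a sketch of the ordering when making the changes at runtime with lnetctl:
lnetctl set health_sensitivity 100   # turn the health feature on first
lnetctl set retry_count 3            # accepted once health is enabled; rejected with "Invalid argument" while it is 0, as seen above
lnetctl set transaction_timeout 150  # can be changed regardless of the health setting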
Thanks,
Serguei
|
|
Hi Serguei,
Thanks. So in our case, setting lnet_transaction_timeout=150 will be the only setting we should change, per Andreas' suggestion? Does that mean that if retry_count is set to 3 or some other value, it's ignored when health is turned off?
Mike
|
|
Right, if health_sensitivity=0 then these parameters are ignored, which is fine. It means that LNet will not interrupt incomplete RPCs to retry sending them to the server. Sorry, I didn't realize this was the case.
|
|
Hi Andreas, I went ahead and set this on all the hosts that make up the scratch01 filesystem in /etc/modprobe.d/lustre.conf:
options lnet lnet_health_sensitivity=100 lnet_retry_count=3 lnet_transaction_timeout=150
The reason we had health turned off is that it seemed to cause issues for us in the past. Do you recommend we turn health on or off?
Mike
|
|
Hi Andreas,
I set the following settings, and we still have the Bulk IO and server_bulk_callback messages on 2 of the OSS machines:
options lnet lnet_health_sensitivity=100 lnet_retry_count=3 lnet_transaction_timeout=150 lnet_peer_discovery_disabled=1
[root@holyscratch01oss06 ~]# lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 0
drop_asym_route: 0
retry_count: 3
transaction_timeout: 150
health_sensitivity: 100
recovery_interval: 1
I also checked lctl get_param osc.*.max_pages_per_rpc from one of the clients that gives the bulk I/O and callback messages, and it's all 1M:
[root@holy2c18214 ~]# lctl get_param osc.*.max_pages_per_rpc |grep scratch
osc.scratch1-OST0000-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0001-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0002-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0003-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0004-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0005-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0006-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0007-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0008-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0009-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST000a-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST000b-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST000c-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST000d-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST000e-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST000f-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0010-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0011-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0012-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0013-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0014-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0015-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0016-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0017-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0018-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0019-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST001a-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST001b-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST001c-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST001d-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST001e-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST001f-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0020-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0021-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0022-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0023-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0024-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0025-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0026-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0027-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0028-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST0029-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST002a-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST002b-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST002c-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST002d-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST002e-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
osc.scratch1-OST002f-osc-ffff957cc6b6c800.max_pages_per_rpc=1024
We are going to see if we can update the lustre client to 2.12.4. It does build on our older CentOS 7.6 OS; we just have to set options lnet lnet_peer_discovery_disabled=1 in order for it to mount all the lustre filesystems properly. Can you update 2.10.7 to 2.12.4 while jobs are running and the filesystems are mounted, and then reboot to get the new version, or do you have to remove the older version completely and install the new one with no lustre mounts?
Any other suggestions ?
Thanks,
Mike
|
|
This is what I see on the OSS nodes:
[root@salt ~]# salt 'holyscratch01oss*' cmd.run 'lctl get_param obdfilter.*.brw_size'
holyscratch01oss05.rc.fas.harvard.edu:
obdfilter.scratch1-OST0000.brw_size=4
obdfilter.scratch1-OST0006.brw_size=4
obdfilter.scratch1-OST000c.brw_size=4
obdfilter.scratch1-OST0012.brw_size=4
obdfilter.scratch1-OST0018.brw_size=4
obdfilter.scratch1-OST001e.brw_size=4
obdfilter.scratch1-OST0024.brw_size=4
obdfilter.scratch1-OST002a.brw_size=4
holyscratch01oss02.rc.fas.harvard.edu:
obdfilter.scratch1-OST0004.brw_size=4
obdfilter.scratch1-OST000a.brw_size=4
obdfilter.scratch1-OST0011.brw_size=4
obdfilter.scratch1-OST0016.brw_size=4
obdfilter.scratch1-OST001d.brw_size=4
obdfilter.scratch1-OST0023.brw_size=4
obdfilter.scratch1-OST0029.brw_size=4
obdfilter.scratch1-OST002f.brw_size=4
holyscratch01oss06.rc.fas.harvard.edu:
obdfilter.scratch1-OST0001.brw_size=4
obdfilter.scratch1-OST0007.brw_size=4
obdfilter.scratch1-OST000d.brw_size=4
obdfilter.scratch1-OST0013.brw_size=4
obdfilter.scratch1-OST0019.brw_size=4
obdfilter.scratch1-OST001f.brw_size=4
obdfilter.scratch1-OST0025.brw_size=4
obdfilter.scratch1-OST002b.brw_size=4
holyscratch01oss04.rc.fas.harvard.edu:
obdfilter.scratch1-OST0003.brw_size=4
obdfilter.scratch1-OST0009.brw_size=4
obdfilter.scratch1-OST000f.brw_size=4
obdfilter.scratch1-OST0015.brw_size=4
obdfilter.scratch1-OST001b.brw_size=4
obdfilter.scratch1-OST0021.brw_size=4
obdfilter.scratch1-OST0027.brw_size=4
obdfilter.scratch1-OST002d.brw_size=4
holyscratch01oss01.rc.fas.harvard.edu:
obdfilter.scratch1-OST0005.brw_size=4
obdfilter.scratch1-OST000b.brw_size=4
obdfilter.scratch1-OST0010.brw_size=4
obdfilter.scratch1-OST0017.brw_size=4
obdfilter.scratch1-OST001c.brw_size=4
obdfilter.scratch1-OST0022.brw_size=4
obdfilter.scratch1-OST0028.brw_size=4
obdfilter.scratch1-OST002e.brw_size=4
holyscratch01oss03.rc.fas.harvard.edu:
obdfilter.scratch1-OST0002.brw_size=4
obdfilter.scratch1-OST0008.brw_size=4
obdfilter.scratch1-OST000e.brw_size=4
obdfilter.scratch1-OST0014.brw_size=4
obdfilter.scratch1-OST001a.brw_size=4
obdfilter.scratch1-OST0020.brw_size=4
obdfilter.scratch1-OST0026.brw_size=4
obdfilter.scratch1-OST002c.brw_size=4
|
|
Note that "osc.*.max_pages_per_rpc=1024" is 4MB (with PAGE_SIZE=4096 on x86 clients). This matches with "obdfilter.*.brw_size=4" on the OSTs. If you set "...max_pages_per_rpc=1M" it is internally converted to 256x 4KB pages.
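For reference, the arithmetic (assuming 4 KiB pages): 1024 pages x 4 KiB/page = 4 MiB per RPC, which matches brw_size=4 (the brw_size value is in MiB), while "1M" works out to 1 MiB / 4 KiB = 256 pages.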
|
|
Hi,
So far nothing has worked to solve our problem with the bulk errors, and we can only run on 4 of the 6 OSS nodes for now. We are going to try to update the lustre client side to 2.12.4 as soon as we can. Also, the scratch01 lustre server that has the issues is running 2.12.3, and at some point we will update that to 2.12.4.
Thanks,
Mike
|
|
Hello,
We upgraded the lustre client on all 2150 of our compute nodes to 2.12.4 this past Monday, and it seems to have greatly stabilized things for us. Apparently this was the main fix we needed. So we have most devices (not all) on 2.12.4 (clients, lnet routers, and Lustre FS). We still have some main lustre filesystems on v2.12.3.
We are planning on upgrading our compute nodes to the CentOS 7.8 OS in October. Do you recommend we stay on 2.12.4 or update the lustre clients to 2.12.5? Our lustre storage will most likely stay on 2.12.3 or 2.12.4 for a while longer.
Thanks,
Mike
|
|
Mike
I think that you'll need to move to 2.12.5 in order to get support for CentOS 7.8
Peter
|