Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.12.2
-
None
-
L2.12.2 server & L2.10.8 Server
L2.12.2 client & L2.11 Client
-
2
-
9223372036854775807
Description
We are seeing
Oct 14 07:51:06 nbp7-oss7 kernel: [1110415.124675] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.738283] LustreError: 48242:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336) req@ffff8bd7711d9850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:520/0 lens 488/440 e 0 to 0 dl 1571064920 ref 1 fl Interpret:/2/0 rc 0/0 Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.820414] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110 Oct 14 07:58:40 nbp7-oss7 kernel: [1110869.753867] LNet: 72764:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.747557] LustreError: 7338:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336) req@ffff8bd4ef6c3050 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:220/0 lens 488/440 e 0 to 0 dl 1571065375 ref 1 fl Interpret:/2/0 rc 0/0 Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.829410] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110 Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.765855] Lustre: nbp7-OST0003: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.800504] Lustre: Skipped 5 previous similar messages Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818286] Lustre: nbp7-OST0003: Connection restored to 2ee3c4e1-9fd7-3338-5bf8-d1f02bcd8a20 (at 10.151.53.134@o2ib) Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818288] Lustre: Skipped 5 previous similar messages Oct 14 08:09:47 nbp7-oss7 kernel: [1111536.849491] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.759009] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336) req@ffff8bd724b11850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:132/0 lens 488/440 e 0 to 0 dl 1571066042 ref 1 fl Interpret:/2/0 rc 0/0 Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.841189] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110 Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.231189] Lustre: nbp7-OST000f: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.265819] Lustre: Skipped 5 previous similar messages
We have the 3 patch reported in LU-12385 and LU-12772 applied to both client and server.
Client Configs
r593i6n16 ~ # lnetctl global show global: numa_range: 0 max_intf: 200 discovery: 1 drop_asym_route: 0 retry_count: 0 transaction_timeout: 100 health_sensitivity: 0 recovery_interval: 1 ---- options ko2iblnd require_privileged_port=0 use_privileged_port=0 options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32 options lnet networks=o2ib(ib1) options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192 options lnet lnet_transaction_timeout=100 options ptlrpc at_max=600 at_min=200
Server Configs
nbp7-oss7 ~ # lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 100
health_sensitivity: 0
recovery_interval: 1
----
options ko2iblnd require_privileged_port=0 use_privileged_port=0
options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769
options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
#lnet
options lnet networks=o2ib(ib1)
options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[195-197,202-205,222]@o2ib 10.151.26.[60,127,140-144,150-154]@o2ib; o2ib417 10.151.26.[148,149]@o2ib 10.151.25.[167-170]@o2ib"
options lnet dead_router_check_interval=60 live_router_check_interval=30
options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192options lnet lnet_transaction_timeout=100
options ptlrpc at_max=600 at_min=200
These bulk errors are a major issue. We though matching client and server to 2.12.2 would solve it but doesn't look like it.
I tested ({{lctl set_param osc.*.short_io_bytes=0) }}on 100 clients it did not reproduce the bulk timeout issue.
So looks like we don't need to revert the patch. What debug info would you like to further diag the issue.