Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.12.2
-
None
-
L2.12.2 server & L2.10.8 Server
L2.12.2 client & L2.11 Client
-
2
-
9223372036854775807
Description
We are seeing
Oct 14 07:51:06 nbp7-oss7 kernel: [1110415.124675] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.738283] LustreError: 48242:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336) req@ffff8bd7711d9850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:520/0 lens 488/440 e 0 to 0 dl 1571064920 ref 1 fl Interpret:/2/0 rc 0/0 Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.820414] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110 Oct 14 07:58:40 nbp7-oss7 kernel: [1110869.753867] LNet: 72764:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.747557] LustreError: 7338:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336) req@ffff8bd4ef6c3050 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:220/0 lens 488/440 e 0 to 0 dl 1571065375 ref 1 fl Interpret:/2/0 rc 0/0 Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.829410] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110 Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.765855] Lustre: nbp7-OST0003: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.800504] Lustre: Skipped 5 previous similar messages Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818286] Lustre: nbp7-OST0003: Connection restored to 2ee3c4e1-9fd7-3338-5bf8-d1f02bcd8a20 (at 10.151.53.134@o2ib) Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818288] Lustre: Skipped 5 previous similar messages Oct 14 08:09:47 nbp7-oss7 kernel: [1111536.849491] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.759009] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336) req@ffff8bd724b11850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:132/0 lens 488/440 e 0 to 0 dl 1571066042 ref 1 fl Interpret:/2/0 rc 0/0 Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.841189] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110 Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.231189] Lustre: nbp7-OST000f: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.265819] Lustre: Skipped 5 previous similar messages
We have the 3 patch reported in LU-12385 and LU-12772 applied to both client and server.
Client Configs
r593i6n16 ~ # lnetctl global show global: numa_range: 0 max_intf: 200 discovery: 1 drop_asym_route: 0 retry_count: 0 transaction_timeout: 100 health_sensitivity: 0 recovery_interval: 1 ---- options ko2iblnd require_privileged_port=0 use_privileged_port=0 options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32 options lnet networks=o2ib(ib1) options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192 options lnet lnet_transaction_timeout=100 options ptlrpc at_max=600 at_min=200
Server Configs
nbp7-oss7 ~ # lnetctl global show
global:
numa_range: 0
max_intf: 200
discovery: 1
drop_asym_route: 0
retry_count: 0
transaction_timeout: 100
health_sensitivity: 0
recovery_interval: 1
----
options ko2iblnd require_privileged_port=0 use_privileged_port=0
options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769
options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
#lnet
options lnet networks=o2ib(ib1)
options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[195-197,202-205,222]@o2ib 10.151.26.[60,127,140-144,150-154]@o2ib; o2ib417 10.151.26.[148,149]@o2ib 10.151.25.[167-170]@o2ib"
options lnet dead_router_check_interval=60 live_router_check_interval=30
options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192options lnet lnet_transaction_timeout=100
options ptlrpc at_max=600 at_min=200
These bulk errors are a major issue. We though matching client and server to 2.12.2 would solve it but doesn't look like it.