Details
- Type: Bug
- Resolution: Won't Fix
- Priority: Major
- Affects Version: Lustre 2.11.0
- Environment: Lustre 2.11.0, ZFS 0.7.9, CentOS 7.5, Omni-Path
Description
Hi everyone,
on a v2.11.0 installation, there is an ongoing Lustre file transfer running on one of the client nodes. Every now and then the process stops with I/O errors and the file system stops responding. After a while access to the FS resumes, but by that time the file transfer has already aborted with an error.
`lctl get_param *.*state` shows a history of evictions of targets:
```
mdc.scratch-MDT0000-mdc-ffff8ecec4bf2800.state=
current_state: FULL
state_history:
 - [ 1546444758, CONNECTING ]
 - [ 1546444758, FULL ]
<...>
 - [ 1546594365, DISCONN ]
 - [ 1546594388, CONNECTING ]
 - [ 1546594388, RECOVER ]
 - [ 1546594388, FULL ]
mgc.MGC10.149.0.183@o2ib.state=
current_state: FULL
state_history:
 - [ 1546444756, CONNECTING ]
 - [ 1546444756, FULL ]
<...>
 - [ 1546594354, DISCONN ]
 - [ 1546594363, CONNECTING ]
 - [ 1546594363, EVICTED ]
 - [ 1546594363, RECOVER ]
 - [ 1546594363, FULL ]
osc.scratch-OST0000-osc-ffff8ecec4bf2800.state=
current_state: CONNECTING
state_history:
 - [ 1548316546, DISCONN ]
 - [ 1548316551, CONNECTING ]
<...>
 - [ 1548316851, CONNECTING ]
 - [ 1548316921, DISCONN ]
 - [ 1548316926, CONNECTING ]
 - [ 1548316996, DISCONN ]
 - [ 1548317001, CONNECTING ]
 - [ 1548317071, DISCONN ]
 - [ 1548317076, CONNECTING ]
```
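To see at a glance which imports flap most often, the state histories can be summarized with a small shell sketch. The `summarize_evictions` helper name is mine, not part of Lustre; it simply counts DISCONN/EVICTED entries per import in `lctl get_param '*.*.state'` output read from stdin:

```shell
# summarize_evictions: reads `lctl get_param '*.*.state'` output on stdin
# and prints, per import, how many DISCONN/EVICTED events its
# state_history records, most-affected import first.
summarize_evictions() {
  awk '
    /\.state=/        { imp = $1 }   # start of a new import section
    /DISCONN|EVICTED/ { n[imp]++ }   # count connection flaps
    END { for (i in n) print n[i], i }
  ' | sort -rn
}

# Usage on a live client (assumed invocation):
#   lctl get_param '*.*.state' | summarize_evictions
```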
and so on for all the OSTs affected by the file transfer.
The OSS logs report all sorts of connectivity problems, both from Lustre:
```
ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from 10.149.255.238@o2ib arrived at 1548314992 with bad export cookie 13579970295878280357
ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) Skipped 919 previous similar messages
ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 10.149.255.238@o2ib  ns: filter-scratch-OST0003_UUID lock: ffff888fa2924800/0xbc75bf08ca486964 lrc: 4/0,0 mode: PW/PW res: [0x19b6e36:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 716800->720895) flags: 0x60000400020020 nid: 10.149.255.238@o2ib remote: 0xe347df981d48a71c expref: 55885 pid: 38849 timeout: 3689597 lvb_type: 0
ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) Skipped 6 previous similar messages
ods01: Jan 24 08:33:15 ods01 kernel: LustreError: dumping log to /tmp/lustre-log.1548315195.186010
ods01: Jan 24 08:44:41 ods01 kernel: Lustre: 186010:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548314995/real 1548315881]  req@ffff88975cc9d100 x1619663992901664/t0(0) o104->scratch-OST0002@10.149.255.238@o2ib:15/16 lens 296/224 e 0 to 1 dl 1548315006 ref 1 fl Rpc:eXSN/0/ffffffff rc -11/-1
```
and LNet:
```
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Skipped 13 previous similar messages
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Timed out RDMA with 10.149.255.254@o2ib (52): c: 0, oc: 0, rc: 63
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Skipped 13 previous similar messages
```
while `opareport` doesn't detect any Omni-Path errors at all.
From the client side:
```
Jan 24 05:43:03 admin kernel: LustreError: 133-1: scratch-OST0001-osc-ffff8ecec4bf2800: BAD READ CHECKSUM: from 10.149.0.185@o2ib via 0@<0:0> inode [0x200002366:0x160f1:0x0] object 0x0:4243598 extent [0-28671], client 0, server ac9ffb9, cksum_type 4
Jan 24 05:43:03 admin kernel: LustreError: 3867:0:(osc_request.c:1681:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff8ecb19cd0000 x1621600181397872/t0(0) o3->scratch-OST0001-osc-ffff8ecec4bf2800@10.149.0.185@o2ib:6/4 lens 608/408 e 1 to 1 dl 1548304253 ref 2 fl Interpret:ReXM/0/0 rc 26860/26860
```
and on another client:
```
Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1548316966/real 1548316966]  req@ffff995a6dab4e00 x1623278997809600/t0(0) o101->scratch-OST0024-osc-ffff995226fca000@10.149.0.191@o2ib:28/4 lens 3584/400 e 0 to 1 dl 1548317011 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 2324 previous similar messages
```
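The bracketed times in these messages (e.g. `sent 1548316966`) and in the state histories are Unix epoch seconds; converting them to UTC makes it easier to line up the client, OSS and fabric logs against each other. For example, with GNU `date`:

```shell
# Convert an epoch timestamp from a Lustre log line to a readable UTC date
# (GNU coreutils date syntax; the timestamp is taken from the log above).
date -u -d @1548316966 +'%Y-%m-%d %H:%M:%S UTC'
# → 2019-01-24 08:02:46 UTC
```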
Kind regards,
Konstantin