[LU-11886] OSTs and MDTs become unreachable under load, 2.11 Created: 25/Jan/19  Updated: 25/Jan/19  Resolved: 25/Jan/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: cv_eng (Inactive) Assignee: Peter Jones
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Lustre 2.11.0, ZFS 0.7.9, CentOS 7.5, Omni-Path


Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Hi everyone,
 
On a v2.11.0 installation, there is an ongoing file transfer in Lustre running on one of the client nodes. Every now and then the process stops with I/O errors and the file system stops responding. After a while access to the FS resumes, but by then the file transfer has already aborted with an error.
 
`lctl get_param *.*.state` shows a history of evictions of targets:
 

mdc.scratch-MDT0000-mdc-ffff8ecec4bf2800.state=
current_state: FULL
state_history:
 - [ 1546444758, CONNECTING ]
 - [ 1546444758, FULL ]
 <...>
 - [ 1546594365, DISCONN ]
 - [ 1546594388, CONNECTING ]
 - [ 1546594388, RECOVER ]
 - [ 1546594388, FULL ]

mgc.MGC10.149.0.183@o2ib.state=
current_state: FULL
state_history:
 - [ 1546444756, CONNECTING ]
 - [ 1546444756, FULL ]
 <...>
 - [ 1546594354, DISCONN ]
 - [ 1546594363, CONNECTING ]
 - [ 1546594363, EVICTED ]
 - [ 1546594363, RECOVER ]
 - [ 1546594363, FULL ]

osc.scratch-OST0000-osc-ffff8ecec4bf2800.state=
current_state: CONNECTING
state_history:
 - [ 1548316546, DISCONN ]
 - [ 1548316551, CONNECTING ]
 <...>
 - [ 1548316851, CONNECTING ]
 - [ 1548316921, DISCONN ]
 - [ 1548316926, CONNECTING ]
 - [ 1548316996, DISCONN ]
 - [ 1548317001, CONNECTING ]
 - [ 1548317071, DISCONN ]
 - [ 1548317076, CONNECTING ] 

and so on for all the OSTs affected by the file transfer.
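
For reference, a minimal sketch for keeping an eye on these import states from the client, assuming the device naming shown above:

# Print each import device together with its current state; anything
# other than FULL points at a target that is disconnected or recovering.
lctl get_param '*.*.state' | grep -E '\.state=|current_state'

# Poll every 10 seconds to catch the DISCONN/CONNECTING flapping live:
while sleep 10; do
    date
    lctl get_param '*.*.state' | grep -E '\.state=|current_state'
done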

 

In the OSS logs, all sorts of connectivity problems are reported, both by Lustre:

ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from 10.149.255.238@o2ib arrived at 1548314992 with bad export cookie 13579970295878280357
ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) Skipped 919 previous similar messages
ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 10.149.255.238@o2ib  ns: filter-scratch-OST0003_UUID lock: ffff888fa2924800/0xbc75bf08ca486964 lrc: 4/0,0 mode: PW/PW res: [0x19b6e36:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 716800->720895) flags: 0x60000400020020 nid: 10.149.255.238@o2ib remote: 0xe347df981d48a71c expref: 55885 pid: 38849 timeout: 3689597 lvb_type: 0
ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) Skipped 6 previous similar messages
ods01: Jan 24 08:33:15 ods01 kernel: LustreError: dumping log to /tmp/lustre-log.1548315195.186010
ods01: Jan 24 08:44:41 ods01 kernel: Lustre: 186010:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548314995/real 1548315881]  req@ffff88975cc9d100 x1619663992901664/t0(0) o104->scratch-OST0002@10.149.255.238@o2ib:15/16 lens 296/224 e 0 to 1 dl 1548315006 ref 1 fl Rpc:eXSN/0/ffffffff rc -11/-1
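
A minimal sketch for capturing more context around these evictions on the OSS, using the standard Lustre debug controls (mask, buffer size, and path are only examples):

# Widen the debug mask and enlarge the kernel debug buffer on the OSS
lctl set_param debug="+dlmtrace +rpctrace +neterror"
lctl set_param debug_mb=512
# After the next "lock callback timer expired" eviction, dump the buffer
lctl debug_kernel /tmp/lustre-debug.$(date +%s)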

 

 and LNet:

ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Skipped 13 previous similar messages
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Timed out RDMA with 10.149.255.254@o2ib (52): c: 0, oc: 0, rc: 63
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Skipped 13 previous similar messages

while `opareport` doesn't detect any Omni-Path errors at all.
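
Some basic LNet-level checks that can be run alongside `opareport`, using the client NID from the timeouts above as an example (a sketch; interpretation depends on the o2iblnd settings in use):

# Round-trip check over LNet itself, independent of the Lustre layer
lctl ping 10.149.255.238@o2ib
# LNet message/drop counters and the o2ib network configuration
lnetctl stats show
lnetctl net show --verbose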

 

From the client side:

Jan 24 05:43:03 admin kernel: LustreError: 133-1: scratch-OST0001-osc-ffff8ecec4bf2800: BAD READ CHECKSUM: from 10.149.0.185@o2ib via 0@<0:0> inode [0x200002366:0x160f1:0x0] object 0x0:4243598 extent [0-28671], client 0, server ac9ffb9, cksum_type 4
Jan 24 05:43:03 admin kernel: LustreError: 3867:0:(osc_request.c:1681:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff8ecb19cd0000 x1621600181397872/t0(0) o3->scratch-OST0001-osc-ffff8ecec4bf2800@10.149.0.185@o2ib:6/4 lens 608/408 e 1 to 1 dl 1548304253 ref 2 fl Interpret:ReXM/0/0 rc 26860/26860
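
The client-side checksum settings can be cross-checked against the BAD READ CHECKSUM message with something like the following (a sketch; the device name is taken from the log line above):

# 1 means bulk checksums are enabled on this OSC
lctl get_param osc.scratch-OST0001*.checksums
# Checksum algorithm currently in use for this OST (e.g. crc32, adler, crc32c)
lctl get_param osc.scratch-OST0001*.checksum_type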

 
and on another client:

Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1548316966/real 1548316966]  req@ffff995a6dab4e00 x1623278997809600/t0(0) o101->scratch-OST0024-osc-ffff995226fca000@10.149.0.191@o2ib:28/4 lens 3584/400 e 0 to 1 dl 1548317011 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 2324 previous similar messages
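
These slow-reply expirations are bounded by the obd timeout and adaptive timeout settings; a sketch of how to read the values in effect on a client:

lctl get_param timeout
lctl get_param at_min at_max at_history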

 

 

Kind regards,

Konstantin



 Comments   
Comment by cv_eng (Inactive) [ 25/Jan/19 ]

The problem persists when using different client nodes and OSTs (attached to different OSSs) as well.

Also reproducible with the mdtest benchmark, about 1 hour into the test:

 

-- started at 01/24/2019 08:24:19 --
mdtest-1.9.3 was launched with 96 total task(s) on 3 node(s)
Command line used: /home/kku/iotests/benchs/ior-mdtest/src/mdtest "-i" "3" "-I" "128" "-z" "3" "-b" "5" "-u" "-w" "512k" "-d" "/scratch/mdtest2/test"
Path: /scratch/mdtest2
FS: 1753.2 TiB Used FS: 6.0% Inodes: 911.7 Mi Used Inodes: 8.0%
96 tasks, 1916928 files/directories
01/24/2019 09:28:49: Process 9: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 11: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 19: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 23: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 25: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 1: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 24: FAILED in mdtest_stat, unable to stat file: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 11 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

 

Kind regards,

Konstantin

Comment by Peter Jones [ 25/Jan/19 ]

Hi there

Is this a test system that you are experimenting with or a customer deployment? If the latter, which site?

Peter

Comment by cv_eng (Inactive) [ 25/Jan/19 ]

Hi Peter,

 

that's a production system without a maintenance contract.

 

Konstantin

Comment by Peter Jones [ 25/Jan/19 ]

OK, then you are better off enquiring on the mailing lists than the community issue tracker.
