[LU-11886] OSTs and MDTs become unreachable under load, 2.11 Created: 25/Jan/19 Updated: 25/Jan/19 Resolved: 25/Jan/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | cv_eng (Inactive) | Assignee: | Peter Jones |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre 2.11.0, ZFS 0.7.9, CentOS 7.5, Omni-Path |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi everyone,

under sustained load (e.g. a large file transfer) the OSTs and at times the MDT become unreachable from the clients. The MDC and MGC imports eventually recover to FULL, but the OSC imports for the affected OSTs keep cycling between DISCONN and CONNECTING:

mdc.scratch-MDT0000-mdc-ffff8ecec4bf2800.state=
current_state: FULL
state_history:
 - [ 1546444758, CONNECTING ]
 - [ 1546444758, FULL ]
<...>
 - [ 1546594365, DISCONN ]
 - [ 1546594388, CONNECTING ]
 - [ 1546594388, RECOVER ]
 - [ 1546594388, FULL ]

mgc.MGC10.149.0.183@o2ib.state=
current_state: FULL
state_history:
 - [ 1546444756, CONNECTING ]
 - [ 1546444756, FULL ]
<...>
 - [ 1546594354, DISCONN ]
 - [ 1546594363, CONNECTING ]
 - [ 1546594363, EVICTED ]
 - [ 1546594363, RECOVER ]
 - [ 1546594363, FULL ]

osc.scratch-OST0000-osc-ffff8ecec4bf2800.state=
current_state: CONNECTING
state_history:
 - [ 1548316546, DISCONN ]
 - [ 1548316551, CONNECTING ]
<...>
 - [ 1548316851, CONNECTING ]
 - [ 1548316921, DISCONN ]
 - [ 1548316926, CONNECTING ]
 - [ 1548316996, DISCONN ]
 - [ 1548317001, CONNECTING ]
 - [ 1548317071, DISCONN ]
 - [ 1548317076, CONNECTING ]

and so on for all the OSTs affected by the file transfer.
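For reference, import snapshots like the ones above can be gathered on a client with lctl; a minimal sketch, where the instance suffix (e.g. ffff8ecec4bf2800) differs per client mount:

lctl get_param mdc.*.state mgc.*.state osc.*.state
# or only the OSC imports that are stuck reconnecting:
lctl get_param osc.scratch-OST*.state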
In the OSS's logs there are all sorts of connectivity problems reported, both by Lustre:

ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from 10.149.255.238@o2ib arrived at 1548314992 with bad export cookie 13579970295878280357
ods01: Jan 24 08:30:17 ods01 kernel: LustreError: 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) Skipped 919 previous similar messages
ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 10.149.255.238@o2ib ns: filter-scratch-OST0003_UUID lock: ffff888fa2924800/0xbc75bf08ca486964 lrc: 4/0,0 mode: PW/PW res: [0x19b6e36:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 716800->720895) flags: 0x60000400020020 nid: 10.149.255.238@o2ib remote: 0xe347df981d48a71c expref: 55885 pid: 38849 timeout: 3689597 lvb_type: 0
ods01: Jan 24 08:32:25 ods01 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) Skipped 6 previous similar messages
ods01: Jan 24 08:33:15 ods01 kernel: LustreError: dumping log to /tmp/lustre-log.1548315195.186010
ods01: Jan 24 08:44:41 ods01 kernel: Lustre: 186010:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548314995/real 1548315881] req@ffff88975cc9d100 x1619663992901664/t0(0) o104->scratch-OST0002@10.149.255.238@o2ib:15/16 lens 296/224 e 0 to 1 dl 1548315006 ref 1 fl Rpc:eXSN/0/ffffffff rc -11/-1
and LNet:

ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 1 seconds
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Skipped 13 previous similar messages
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Timed out RDMA with 10.149.255.254@o2ib (52): c: 0, oc: 0, rc: 63
ods01: Jan 24 05:49:26 ods01 kernel: LNetError: 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Skipped 13 previous similar messages

while `opareport` doesn't detect any Omni-Path errors at all.
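The LNet-level counters can be checked alongside opareport; a sketch, assuming lnetctl from the 2.11 packages is installed on the OSS and the default ko2iblnd module parameters are in use:

lnetctl net show --verbose                     # per-NI status and credits on the o2ib interface
lnetctl stats show                             # global LNet send/receive/drop counters
cat /sys/module/ko2iblnd/parameters/timeout    # o2iblnd transmit timeout (module parameter)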
From the client side:

Jan 24 05:43:03 admin kernel: LustreError: 133-1: scratch-OST0001-osc-ffff8ecec4bf2800: BAD READ CHECKSUM: from 10.149.0.185@o2ib via 0@<0:0> inode [0x200002366:0x160f1:0x0] object 0x0:4243598 extent [0-28671], client 0, server ac9ffb9, cksum_type 4
Jan 24 05:43:03 admin kernel: LustreError: 3867:0:(osc_request.c:1681:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8ecb19cd0000 x1621600181397872/t0(0) o3->scratch-OST0001-osc-ffff8ecec4bf2800@10.149.0.185@o2ib:6/4 lens 608/408 e 1 to 1 dl 1548304253 ref 2 fl Interpret:ReXM/0/0 rc 26860/26860
Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1548316966/real 1548316966] req@ffff995a6dab4e00 x1623278997809600/t0(0) o101->scratch-OST0024-osc-ffff995226fca000@10.149.0.191@o2ib:28/4 lens 3584/400 e 0 to 1 dl 1548317011 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jan 24 09:03:31 c055 kernel: Lustre: 16641:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 2324 previous similar messages
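The BAD READ CHECKSUM message above means the bulk data returned for the read failed the client-side wire checksum. The client tunables controlling those checksums can be inspected like this (a sketch; parameter names as in 2.11):

lctl get_param osc.*.checksums        # 1 = wire checksums enabled on the import
lctl get_param osc.*.checksum_type    # available algorithms, the active one in brackets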
Kind regards, Konstantin |
| Comments |
| Comment by cv_eng (Inactive) [ 25/Jan/19 ] |
|
The problem persists when using different client nodes and OSTs (attached to different OSSs) as well. It is also reproducible with the mdtest benchmark, about an hour into the test:
-- started at 01/24/2019 08:24:19 --

mdtest-1.9.3 was launched with 96 total task(s) on 3 node(s)
Command line used: /home/kku/iotests/benchs/ior-mdtest/src/mdtest "-i" "3" "-I" "128" "-z" "3" "-b" "5" "-u" "-w" "512k" "-d" "/scratch/mdtest2/test"
Path: /scratch/mdtest2
FS: 1753.2 TiB   Used FS: 6.0%   Inodes: 911.7 Mi   Used Inodes: 8.0%

96 tasks, 1916928 files/directories

01/24/2019 09:28:49: Process 9: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 11: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 19: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 23: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 25: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 1: FAILED in mdtest_stat, unable to stat file: Input/output error
01/24/2019 09:28:49: Process 24: FAILED in mdtest_stat, unable to stat file: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 11 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
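Extra Lustre debug data can be captured around a reproduction run like this if needed; a minimal sketch only — the debug flags and output path are examples, not something that has actually been collected here:

lctl set_param debug=+rpctrace
lctl set_param debug=+dlmtrace
lctl clear                                   # empty the debug buffer before the run
# ... run mdtest until the I/O errors appear ...
lctl dk /tmp/lustre-debug.$(hostname).log    # dump the kernel debug buffer after the failure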
Kind regards, Konstantin |
| Comment by Peter Jones [ 25/Jan/19 ] |
|
Hi there. Is this a test system that you are experimenting with or a customer deployment? If the latter, which site? Peter |
| Comment by cv_eng (Inactive) [ 25/Jan/19 ] |
|
Hi Peter,
that's a production system without a maintenance contract.
– Konstantin |
| Comment by Peter Jones [ 25/Jan/19 ] |
|
OK, then you are better off enquiring on the mailing lists than on the community issue tracker