[LU-11818] ERROR in aiori-POSIX.c (line 256): transfer failed
| Created: | 20/Dec/18 | Updated: | 10/Apr/19 |
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Zhenyu Xu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | soak |
| Environment: | soak runs on 2.12-RC3 lustre-master-ib #177 EL7.6 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Ran soak for over 15 hours with no hard crash, but many applications failed. IOR testing hit many failures like the following:
Summary:
api = POSIX
test filename = /mnt/soaked/soaktest/test/iorssf/219922/ssf
access = single-shared-file
pattern = segmented (1 segment)
ordering in a file = sequential offsets
ordering inter file=random task offsets >= 1, seed=0
clients = 23 (2 per node)
repetitions = 1
xfersize = 31.49 MiB
blocksize = 27.34 GiB
aggregate filesize = 628.83 GiB
ParseCommandLine: unknown option `--'.
task 1 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
WARNING: Task 1 requested transfer of 33021952 bytes,
but transferred 7806976 bytes at offset 29356515328
WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
WARNING: Requested xfer of 33021952 bytes, but xferred 7806976 bytes
Only transferred 7806976 of 33021952 bytes
** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Input/output error
** exiting **
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
ParseCommandLine: unknown option `--'.
task 3 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
WARNING: Task 3 requested transfer of 33021952 bytes,
but transferred 15032320 bytes at offset 88069545984
WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
slurmstepd: error: *** STEP 219922.0 ON soak-17 CANCELLED AT 2018-12-20T10:48:58 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: soak-43: task 22: Killed
srun: Terminating job step 219922.0
WARNING: Requested xfer of 33021952 bytes, but xferred 15032320 bytes
Only transferred 15032320 of 33021952 bytes
** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Input/output error
mdtest also hit many failures, like /mnt/soaked/soaktest/test/mdtestfpp/220334:
lcm_layout_gen: 0
lcm_mirror_count: 1
lcm_entry_count: 2
lcme_id: N/A
lcme_mirror_id: N/A
lcme_flags: 0
lcme_extent.e_start: 0
lcme_extent.e_end: 1048576
stripe_count: 0 stripe_size: 1048576 pattern: mdt stripe_offset: -1
lcme_id: N/A
lcme_mirror_id: N/A
lcme_flags: 0
lcme_extent.e_start: 1048576
lcme_extent.e_end: EOF
stripe_count: -1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
mdtidx FID[seq:oid:ver]
3 [0x2c001aa0c:0x1fd24:0x0]
0 [0x20001d918:0x1fd22:0x0]
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 22 with the number of requested nodes 21. Ignoring --ntasks-per-node.
srun: error: soak-22: task 5: Exited with exit code 2
srun: Terminating job step 220334.0
srun: Job step 220334.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: *** STEP 220334.0 ON soak-18 CANCELLED AT 2018-12-20T18:06:52 ***
srun: error: soak-40: task 18: Exited with exit code 2
srun: error: soak-20: task 3: Killed
On OSS, found the following after OSS restart:
Dec 20 10:43:14 soak-5 systemd: Started Session 28 of user root.
Dec 20 10:43:14 soak-5 systemd-logind: Removed session 28.
Dec 20 10:43:14 soak-5 systemd: Removed slice User Slice of root.
Dec 20 10:43:22 soak-5 kernel: Lustre: soaked-OST0001: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
Dec 20 10:43:22 soak-5 kernel: Lustre: Skipped 10 previous similar messages
Dec 20 10:44:23 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 192.168.1.122@o2ib ns: filter-soaked-OST000d_UUID lock: ffff935683ba1680/0xdb15693e77e2b726 lrc: 3/0,0 mode: PW/PW res: [0x81a037:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.122@o2ib remote: 0xc6594446a74d696b expref: 6 pid: 43079 timeout: 498 lvb_type: 0
Dec 20 10:44:28 soak-5 kernel: Lustre: soaked-OST000d: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
Dec 20 10:44:28 soak-5 kernel: Lustre: Skipped 24 previous similar messages
Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.117@o2ib ns: filter-soaked-OST0009_UUID lock: ffff93569b9e9200/0xdb15693e77de431b lrc: 3/0,0 mode: PW/PW res: [0x812696:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a02b1c expref: 11 pid: 25268 timeout: 502 lvb_type: 0
Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar messages
Dec 20 10:44:37 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.140@o2ib ns: filter-soaked-OST0005_UUID lock: ffff9356b8cb2880/0xdb15693e77e2b7f1 lrc: 3/0,0 mode: PW/PW res: [0x81e6fa:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.140@o2ib remote: 0x5f346f6b7e7eb6b6 expref: 6 pid: 35334 timeout: 508 lvb_type: 0
Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 164s: evicting client at 192.168.1.117@o2ib ns: filter-soaked-OST0001_UUID lock: ffff93524f16e780/0xdb15693e77e079e3 lrc: 3/0,0 mode: PW/PW res: [0x80bd1d:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a024a1 expref: 7 pid: 28096 timeout: 502 lvb_type: 0
Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 3 previous similar messages
Dec 20 10:44:43 soak-5 sshd[222751]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
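For context, the mdtest directory layout dumped above (a 1 MiB Data-on-MDT component followed by a fully striped raid0 component, inside a two-stripe DNE directory hashed with fnv_1a_64) is the kind of layout that would typically be set up with commands along these lines. This is only an illustrative sketch; the actual soak setup commands are not part of this ticket, and the target path below is a placeholder:

# striped (DNE) directory with two MDT stripes and fnv_1a_64 hashing (hypothetical path)
lfs mkdir -c 2 -H fnv_1a_64 /mnt/soaked/soaktest/test/mdtestfpp/NNNNNN
# default PFL layout: 1 MiB DoM component for [0, 1M), then raid0 over all OSTs for [1M, EOF)
lfs setstripe -E 1M -L mdt -E -1 -c -1 -S 1M /mnt/soaked/soaktest/test/mdtestfpp/NNNNNN
# verify the result (compare with the getdirstripe/getstripe output above)
lfs getdirstripe /mnt/soaked/soaktest/test/mdtestfpp/NNNNNN
lfs getstripe -d /mnt/soaked/soaktest/test/mdtestfpp/NNNNNN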
| Comments |
| Comment by James Nunez (Inactive) [ 20/Dec/18 ] |
|
From Sarah: Command line used: /mnt/soaked/bin/IOR -vvv -a POSIX -F -w -W -r -Z -q-R -W -b 3043736k -t 7534k -o /mnt/soaked/soaktest/test/iorfpp/219975/iorfpp_file |
| Comment by Peter Jones [ 21/Dec/18 ] |
|
Bobijam, the current theory is that this may be a PFL issue. Peter |
| Comment by Andreas Dilger [ 21/Dec/18 ] |
|
Sarah wrote:
My theory is that this happens when the client is writing to the file under load and needs to instantiate PFL components later in the file. Something like: the client is in the middle of an MDS layout intent RPC to instantiate a component, the MDS is waiting for an offline OST to create an object, and then a second OST sends a lock callback; the client may be blocked holding an extent lock with dirty data that it can't flush while it is waiting on the layout lock. I think what is needed to debug this is some dlmtrace and rpctrace logs from the client, OSS, and MDS, with dump_on_eviction and dump_on_timeout set, and a larger debug buffer to hold enough logs. We may need to get a crash dump of the client at the time of the eviction to see what it is doing. Then, when the client is evicted, trace the OST DLM callback to the client and see why it isn't cancelling the lock in time. |
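A minimal sketch of the debug settings described above, assuming the standard lctl tunables on the client, OSS, and MDS nodes (the buffer size here is just an example value):

# enable DLM and RPC tracing and enlarge the kernel debug buffer
lctl set_param debug="+dlmtrace +rpctrace"
lctl set_param debug_mb=1024
# dump the debug log automatically on client eviction / lock callback timeout
lctl set_param dump_on_eviction=1
lctl set_param dump_on_timeout=1
# after an eviction, look for the automatic dumps (default location)
ls -lt /tmp/lustre-log.* | head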
| Comment by Sarah Liu [ 21/Dec/18 ] |
|
A quick update: in the latest run (no PFL/DoM), I also fixed the IOR parameter by adding the missing space. And this is the wiki page of the soak quick start |
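For reference, the fused option in the command line quoted earlier is "-q-R"; presumably the corrected invocation separates the two flags, something like:

/mnt/soaked/bin/IOR -vvv -a POSIX -F -w -W -r -Z -q -R -W -b 3043736k -t 7534k -o /mnt/soaked/soaktest/test/iorfpp/219975/iorfpp_file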
| Comment by Sarah Liu [ 15/Jan/19 ] |
|
Moved soak to lustre-master-ib build #183, with only regular stripes (no PFL nor DoM). It ran for over 3 days with no hard crash, but saw similar IOR and mdtest failures; the failure rate is about 50%
| Comment by Andreas Dilger [ 07/Feb/19 ] |
|
Sarah, are the IOR and mdtest failures related to MDS/OSS failover, or are there errors even when there are no failovers active? Are the errors always related to ldlm timeouts/eviction? The mdtest errors look to be related to a configuration problem and not related to the filesystem. |
| Comment by Sarah Liu [ 07/Feb/19 ] |
|
Andreas, I just opened . To see if this issue is only related to failover, I need to try disabling failover and run again. |