
ERROR in aiori-POSIX.c (line 256): transfer failed

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Environment: soak runs on 2.12-RC3 lustre-master-ib #177 EL7.6
    • Severity: 3
    • 9223372036854775807

    Description

      Soak ran for over 15 hours with no hard crash, but many applications failed.

      IOR testing hit many failures like:
      IOR version: IOR-2.10.3: MPI Coordinated Test of Parallel I/O

      Summary:
              api                = POSIX
              test filename      = /mnt/soaked/soaktest/test/iorssf/219922/ssf
              access             = single-shared-file
              pattern            = segmented (1 segment)
              ordering in a file = sequential offsets
              ordering inter file=random task offsets >= 1, seed=0
              clients            = 23 (2 per node)
              repetitions        = 1
              xfersize           = 31.49 MiB
              blocksize          = 27.34 GiB
              aggregate filesize = 628.83 GiB
      
      ParseCommandLine: unknown option `--'.
      task 1 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
      WARNING: Task 1 requested transfer of 33021952 bytes,
               but transferred 7806976 bytes at offset 29356515328
      WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
      WARNING: Requested xfer of 33021952 bytes, but xferred 7806976 bytes
      Only transferred 7806976 of 33021952 bytes
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      ** exiting **
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
      with errorcode -1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      ParseCommandLine: unknown option `--'.
      task 3 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
      WARNING: Task 3 requested transfer of 33021952 bytes,
               but transferred 15032320 bytes at offset 88069545984
      WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
      slurmstepd: error: *** STEP 219922.0 ON soak-17 CANCELLED AT 2018-12-20T10:48:58 ***
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      srun: error: soak-43: task 22: Killed
      srun: Terminating job step 219922.0
      WARNING: Requested xfer of 33021952 bytes, but xferred 15032320 bytes
      Only transferred 15032320 of 33021952 bytes
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      

      For mdtest, also hit many failures like:

      /mnt/soaked/soaktest/test/mdtestfpp/220334
        lcm_layout_gen:    0
        lcm_mirror_count:  1
        lcm_entry_count:   2
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 0
          lcme_extent.e_end:   1048576
            stripe_count:  0       stripe_size:   1048576       pattern:       mdt       stripe_offset: -1
      
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 1048576
          lcme_extent.e_end:   EOF
            stripe_count:  -1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1
      
      lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
      mdtidx           FID[seq:oid:ver]
           3           [0x2c001aa0c:0x1fd24:0x0]              
           0           [0x20001d918:0x1fd22:0x0]              
      lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
      srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 22 with the number of requested nodes 21. Ignoring --ntasks-per-node.
      srun: error: soak-22: task 5: Exited with exit code 2
      srun: Terminating job step 220334.0
      srun: Job step 220334.0 aborted before step completely launched.
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: *** STEP 220334.0 ON soak-18 CANCELLED AT 2018-12-20T18:06:52 ***
      srun: error: soak-40: task 18: Exited with exit code 2
      srun: error: soak-20: task 3: Killed
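
      The layout dump above is a composite (PFL) layout: a Data-on-MDT first component covering [0, 1MiB) and a raid0 component from 1MiB to EOF striped across all OSTs, on a directory striped over two MDTs. For reference, a hedged sketch of the lfs commands that would create and display such a file layout (the file path here is illustrative, not taken from the soak configuration):

          # create a DoM + PFL layout: first 1MiB on the MDT, the remainder
          # striped over all OSTs (stripe_count -1); the path is hypothetical
          lfs setstripe -E 1M -L mdt -E -1 -c -1 /mnt/soaked/example_file
          # print the composite layout (lcm_*/lcme_* fields as in the dump above)
          lfs getstripe /mnt/soaked/example_file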
      

      On the OSS, found the following after the OSS restart:

      Dec 20 10:43:14 soak-5 systemd: Started Session 28 of user root.
      Dec 20 10:43:14 soak-5 systemd-logind: Removed session 28.
      Dec 20 10:43:14 soak-5 systemd: Removed slice User Slice of root.
      Dec 20 10:43:22 soak-5 kernel: Lustre: soaked-OST0001: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
      Dec 20 10:43:22 soak-5 kernel: Lustre: Skipped 10 previous similar messages
      Dec 20 10:44:23 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 192.168.1.122@o2ib  ns: filter-soaked-OST000d_UUID lock: ffff935683ba1680/0xdb15693e77e2b726 lrc: 3/0,0 mode: PW/PW res: [0x81a037:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.122@o2ib remote: 0xc6594446a74d696b expref: 6 pid: 43079 timeout: 498 lvb_type: 0
      Dec 20 10:44:28 soak-5 kernel: Lustre: soaked-OST000d: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
      Dec 20 10:44:28 soak-5 kernel: Lustre: Skipped 24 previous similar messages
      Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.117@o2ib  ns: filter-soaked-OST0009_UUID lock: ffff93569b9e9200/0xdb15693e77de431b lrc: 3/0,0 mode: PW/PW res: [0x812696:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a02b1c expref: 11 pid: 25268 timeout: 502 lvb_type: 0
      Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar messages
      Dec 20 10:44:37 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.140@o2ib  ns: filter-soaked-OST0005_UUID lock: ffff9356b8cb2880/0xdb15693e77e2b7f1 lrc: 3/0,0 mode: PW/PW res: [0x81e6fa:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.140@o2ib remote: 0x5f346f6b7e7eb6b6 expref: 6 pid: 35334 timeout: 508 lvb_type: 0
      Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 164s: evicting client at 192.168.1.117@o2ib  ns: filter-soaked-OST0001_UUID lock: ffff93524f16e780/0xdb15693e77e079e3 lrc: 3/0,0 mode: PW/PW res: [0x80bd1d:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a024a1 expref: 7 pid: 28096 timeout: 502 lvb_type: 0
      Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 3 previous similar messages
      Dec 20 10:44:43 soak-5 sshd[222751]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      

      Attachments

      Issue Links

      Activity

            [LU-11818] ERROR in aiori-POSIX.c (line 256): transfer failed
            sarah Sarah Liu added a comment -

            Andreas, I just opened LU-11943 to describe the issue I saw on 2.10, which may be related to this one. From that ticket, soak ran well for several hours (there were MDS/OSS failovers during that time), then applications started to fail a lot. Checking around the time things went bad, I found OSS lock callback errors like the ones in this ticket, and the OSS had just finished restarting.

            To see if this issue is only related to failover, I need to try disabling failover and running again.


            adilger Andreas Dilger added a comment -

            Sarah, are the IOR and mdtest failures related to MDS/OSS failover, or are there errors even when there are no failovers active? Are the errors always related to ldlm timeouts/eviction?

            The mdtest errors look to be related to a configuration problem and not related to the filesystem.

            sarah Sarah Liu added a comment -

            Moved soak to lustre-master-ib build #183, with only regular stripes (no PFL nor DoM). It ran for over 3 days with no hard crash, but saw similar IOR and mdtest failures; the failure rate is about 50%.

            sarah Sarah Liu added a comment - - edited

            A quick update: in the latest run (no PFL/DoM), I also fixed the IOR parameter by adding the missing space.

            And this is the wiki page of the soak quick start:
            https://wiki.whamcloud.com/display/Releases/System+Administration+Update


            adilger Andreas Dilger added a comment -

            Sarah wrote:

            So far soak has been running for 14 hours (without PFL/DoM stripe), completed multiple MDS failover and restart, multiple OSS failover, no IOR nor mdtest failures.

            My theory is that this happens when the client is writing to the file under load and needs to instantiate the PFL components later in the file. Something like the client is in the middle of an MDS layout intent RPC to instantiate a component, and the MDS is waiting for an offline OST to create an object, then a second OST sends a lock callback, the client may be blocked holding an extent lock with dirty data that it can't flush while it is waiting on the layout lock.

            I think what is needed to debug this is some dlmtrace and rpctrace logs from the client, OSS, and MDS, with dump_on_eviction and dump_on_timeout set, and a larger debug buffer to hold enough logs. We may need to get a crash dump of the client at the time of the eviction to see what it is doing. Then, when the client is evicted, trace the OST DLM callback to the client, and see why it isn't cancelling the lock in time.

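            For reference, a hedged sketch of lctl settings matching the debug collection Andreas describes above; the debug buffer size is an illustrative value, not one taken from the soak configuration:

                # run on the clients, OSS, and MDS nodes (values illustrative)
                lctl set_param debug_mb=1024          # larger debug buffer
                lctl set_param debug=+dlmtrace        # DLM lock tracing
                lctl set_param debug=+rpctrace        # RPC tracing
                lctl set_param dump_on_eviction=1     # dump debug log when a client is evicted
                lctl set_param dump_on_timeout=1      # dump debug log on RPC timeout
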
            pjones Peter Jones added a comment -

            Bobijam

            The current theory is that this may be a PFL issue

            Peter


            jamesanunez James Nunez (Inactive) added a comment -

            From Sarah: Command line used: /mnt/soaked/bin/IOR -vvv -a POSIX -F -w -W -r -Z -q-R -W -b 3043736k -t 7534k -o /mnt/soaked/soaktest/test/iorfpp/219975/iorfpp_file

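            The "-q-R" in that command line appears to be the missing space Sarah mentions fixing above; presumably the corrected invocation simply separates the two options, for example:

                /mnt/soaked/bin/IOR -vvv -a POSIX -F -w -W -r -Z -q -R -W \
                    -b 3043736k -t 7534k \
                    -o /mnt/soaked/soaktest/test/iorfpp/219975/iorfpp_file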

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated: