Details
- Bug
- Resolution: Unresolved
- Minor
- None
- Lustre 2.12.0
- soak runs on 2.12-RC3 lustre-master-ib #177 EL7.6
- 3
- 9223372036854775807
Description
Soak ran for over 15 hours with no hard crash, but many applications failed.
IOR testing hit many failures like:
IOR version: IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Summary:
    api = POSIX
    test filename = /mnt/soaked/soaktest/test/iorssf/219922/ssf
    access = single-shared-file
    pattern = segmented (1 segment)
    ordering in a file = sequential offsets
    ordering inter file=random task offsets >= 1, seed=0
    clients = 23 (2 per node)
    repetitions = 1
    xfersize = 31.49 MiB
    blocksize = 27.34 GiB
    aggregate filesize = 628.83 GiB
ParseCommandLine: unknown option `--'.
task 1 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
WARNING: Task 1 requested transfer of 33021952 bytes,
         but transferred 7806976 bytes at offset 29356515328
WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
WARNING: Requested xfer of 33021952 bytes, but xferred 7806976 bytes
Only transferred 7806976 of 33021952 bytes
** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Input/output error
** exiting **
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
ParseCommandLine: unknown option `--'.
task 3 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
WARNING: Task 3 requested transfer of 33021952 bytes,
         but transferred 15032320 bytes at offset 88069545984
WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
slurmstepd: error: *** STEP 219922.0 ON soak-17 CANCELLED AT 2018-12-20T10:48:58 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: soak-43: task 22: Killed
srun: Terminating job step 219922.0
WARNING: Requested xfer of 33021952 bytes, but xferred 15032320 bytes
Only transferred 15032320 of 33021952 bytes
** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Input/output error
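For triage, it can help to work out where in the shared file the two short writes landed. The sketch below is only an arithmetic check on the numbers printed above. It assumes that the IOR block size is an exact multiple of the transfer size (so the exact byte count can be recovered from the rounded "27.34 GiB"), and that in the segmented single-shared-file layout task N's segment starts at N * blocksize. The names `locate`, `BLOCK_BYTES`, and `failures` are illustrative and do not come from IOR.

```python
GIB = 1 << 30

XFER_BYTES = 33021952        # "requested transfer of 33021952 bytes" (31.49 MiB)
REPORTED_BLOCK_GIB = 27.34   # rounded block size from the IOR summary
CLIENTS = 23                 # "clients = 23 (2 per node)"

# Assumption: blocksize is a whole number of transfers, so round to the
# nearest multiple of XFER_BYTES to recover the exact block size in bytes.
xfers_per_block = round(REPORTED_BLOCK_GIB * GIB / XFER_BYTES)
BLOCK_BYTES = xfers_per_block * XFER_BYTES


def locate(offset):
    """Return (segment index, byte position inside that segment)."""
    return divmod(offset, BLOCK_BYTES)


# task -> (offset of the failed write, bytes actually transferred),
# copied from the WARNING lines in the IOR output.
failures = {1: (29356515328, 7806976), 3: (88069545984, 15032320)}

for task, (offset, written) in failures.items():
    segment, position = locate(offset)
    print(f"task {task}: segment {segment}, position {position}, "
          f"wrote {written} of {XFER_BYTES} bytes")

print(f"aggregate = {CLIENTS * BLOCK_BYTES / GIB:.2f} GiB")
```

Under these assumptions both offsets fall at position 0 of the reporting task's own segment, which would mean each task failed on the first transfer of its block rather than partway through it.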
mdtest also hit many failures like:
/mnt/soaked/soaktest/test/mdtestfpp/220334
  lcm_layout_gen: 0
  lcm_mirror_count: 1
  lcm_entry_count: 2
    lcme_id: N/A
    lcme_mirror_id: N/A
    lcme_flags: 0
    lcme_extent.e_start: 0
    lcme_extent.e_end: 1048576
      stripe_count: 0 stripe_size: 1048576 pattern: mdt stripe_offset: -1
    lcme_id: N/A
    lcme_mirror_id: N/A
    lcme_flags: 0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end: EOF
      stripe_count: -1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
mdtidx FID[seq:oid:ver]
3 [0x2c001aa0c:0x1fd24:0x0]
0 [0x20001d918:0x1fd22:0x0]
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 22 with the number of requested nodes 21. Ignoring --ntasks-per-node.
srun: error: soak-22: task 5: Exited with exit code 2
srun: Terminating job step 220334.0
srun: Job step 220334.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: *** STEP 220334.0 ON soak-18 CANCELLED AT 2018-12-20T18:06:52 ***
srun: error: soak-40: task 18: Exited with exit code 2
srun: error: soak-20: task 3: Killed
On the OSS, the following was found after an OSS restart:
Dec 20 10:43:14 soak-5 systemd: Started Session 28 of user root.
Dec 20 10:43:14 soak-5 systemd-logind: Removed session 28.
Dec 20 10:43:14 soak-5 systemd: Removed slice User Slice of root.
Dec 20 10:43:22 soak-5 kernel: Lustre: soaked-OST0001: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
Dec 20 10:43:22 soak-5 kernel: Lustre: Skipped 10 previous similar messages
Dec 20 10:44:23 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 192.168.1.122@o2ib ns: filter-soaked-OST000d_UUID lock: ffff935683ba1680/0xdb15693e77e2b726 lrc: 3/0,0 mode: PW/PW res: [0x81a037:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.122@o2ib remote: 0xc6594446a74d696b expref: 6 pid: 43079 timeout: 498 lvb_type: 0
Dec 20 10:44:28 soak-5 kernel: Lustre: soaked-OST000d: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
Dec 20 10:44:28 soak-5 kernel: Lustre: Skipped 24 previous similar messages
Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.117@o2ib ns: filter-soaked-OST0009_UUID lock: ffff93569b9e9200/0xdb15693e77de431b lrc: 3/0,0 mode: PW/PW res: [0x812696:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a02b1c expref: 11 pid: 25268 timeout: 502 lvb_type: 0
Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar messages
Dec 20 10:44:37 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.140@o2ib ns: filter-soaked-OST0005_UUID lock: ffff9356b8cb2880/0xdb15693e77e2b7f1 lrc: 3/0,0 mode: PW/PW res: [0x81e6fa:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.140@o2ib remote: 0x5f346f6b7e7eb6b6 expref: 6 pid: 35334 timeout: 508 lvb_type: 0
Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 164s: evicting client at 192.168.1.117@o2ib ns: filter-soaked-OST0001_UUID lock: ffff93524f16e780/0xdb15693e77e079e3 lrc: 3/0,0 mode: PW/PW res: [0x80bd1d:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a024a1 expref: 7 pid: 28096 timeout: 502 lvb_type: 0
Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 3 previous similar messages
Dec 20 10:44:43 soak-5 sshd[222751]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
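To correlate these evictions with the failing IOR clients, the eviction messages can be summarized per client NID and per OST. The following is a minimal sketch, not an existing Lustre tool: the regular expression is written against the message text shown above, the function names are hypothetical, and `SAMPLE` holds truncated excerpts of the four eviction messages from this log.

```python
import re
from collections import Counter

# Matches the "lock callback timer expired" messages as they appear in the
# OSS syslog above; other Lustre versions may word the message differently.
EVICT_RE = re.compile(
    r"lock callback timer expired after (\d+)s: "
    r"evicting client at (\S+)\s+ns: filter-(\S+?)_UUID"
)


def parse_evictions(log_text):
    """Return a list of (seconds waited, client NID, OST name) tuples."""
    return [(int(secs), nid, ost) for secs, nid, ost in EVICT_RE.findall(log_text)]


def evictions_per_client(log_text):
    """Count logged evictions per client NID."""
    return Counter(nid for _, nid, _ in parse_evictions(log_text))


# Truncated excerpts of the eviction messages in this ticket.
SAMPLE = """
### lock callback timer expired after 149s: evicting client at 192.168.1.122@o2ib ns: filter-soaked-OST000d_UUID lock: ffff935683ba1680/0xdb15693e77e2b726
### lock callback timer expired after 153s: evicting client at 192.168.1.117@o2ib ns: filter-soaked-OST0009_UUID lock: ffff93569b9e9200/0xdb15693e77de431b
### lock callback timer expired after 153s: evicting client at 192.168.1.140@o2ib ns: filter-soaked-OST0005_UUID lock: ffff9356b8cb2880/0xdb15693e77e2b7f1
### lock callback timer expired after 164s: evicting client at 192.168.1.117@o2ib ns: filter-soaked-OST0001_UUID lock: ffff93524f16e780/0xdb15693e77e079e3
"""

for nid, count in evictions_per_client(SAMPLE).most_common():
    print(nid, count)
```

Because the kernel also logs "Skipped N previous similar messages", any count produced this way is a lower bound on the real number of evictions.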