Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Labels: None
- Affects Version/s: Lustre 2.12.0
- Environment: soak runs on 2.12-RC3 lustre-master-ib #177 EL7.6
- Severity: 3
Description
Running soak for over 15 hours: no hard crash, but many applications failed.
IOR testing hit many failures like:
IOR version: IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Summary: api = POSIX test filename = /mnt/soaked/soaktest/test/iorssf/219922/ssf access = single-shared-file pattern = segmented (1 segment) ordering in a file = sequential offsets ordering inter file=random task offsets >= 1, seed=0 clients = 23 (2 per node) repetitions = 1 xfersize = 31.49 MiB blocksize = 27.34 GiB aggregate filesize = 628.83 GiB ParseCommandLine: unknown option `--'. task 1 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf WARNING: Task 1 requested transfer of 33021952 bytes, but transferred 7806976 bytes at offset 29356515328 WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272). WARNING: Requested xfer of 33021952 bytes, but xferred 7806976 bytes Only transferred 7806976 of 33021952 bytes ** error ** ERROR in aiori-POSIX.c (line 256): transfer failed. ERROR: Input/output error ** exiting ** -------------------------------------------------------------------------- MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode -1. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. -------------------------------------------------------------------------- ParseCommandLine: unknown option `--'. task 3 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf WARNING: Task 3 requested transfer of 33021952 bytes, but transferred 15032320 bytes at offset 88069545984 WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272). slurmstepd: error: *** STEP 219922.0 ON soak-17 CANCELLED AT 2018-12-20T10:48:58 *** srun: Job step aborted: Waiting up to 32 seconds for job step to finish. srun: error: soak-43: task 22: Killed srun: Terminating job step 219922.0 WARNING: Requested xfer of 33021952 bytes, but xferred 15032320 bytes Only transferred 15032320 of 33021952 bytes ** error ** ERROR in aiori-POSIX.c (line 256): transfer failed. ERROR: Input/output error
mdtest also hit many failures like:
/mnt/soaked/soaktest/test/mdtestfpp/220334
  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   2
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      stripe_count:  0
      stripe_size:   1048576
      pattern:       mdt
      stripe_offset: -1
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      stripe_count:  -1
      stripe_size:   1048576
      pattern:       raid0
      stripe_offset: -1
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
mdtidx           FID[seq:oid:ver]
     3           [0x2c001aa0c:0x1fd24:0x0]
     0           [0x20001d918:0x1fd22:0x0]
lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 22 with the number of requested nodes 21. Ignoring --ntasks-per-node.
srun: error: soak-22: task 5: Exited with exit code 2
srun: Terminating job step 220334.0
srun: Job step 220334.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: execve(): mdtest: No such file or directory
slurmstepd: error: *** STEP 220334.0 ON soak-18 CANCELLED AT 2018-12-20T18:06:52 ***
srun: error: soak-40: task 18: Exited with exit code 2
srun: error: soak-20: task 3: Killed
On the OSS, found the following after the OSS restart:
Dec 20 10:43:14 soak-5 systemd: Started Session 28 of user root.
Dec 20 10:43:14 soak-5 systemd-logind: Removed session 28.
Dec 20 10:43:14 soak-5 systemd: Removed slice User Slice of root.
Dec 20 10:43:22 soak-5 kernel: Lustre: soaked-OST0001: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
Dec 20 10:43:22 soak-5 kernel: Lustre: Skipped 10 previous similar messages
Dec 20 10:44:23 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 192.168.1.122@o2ib ns: filter-soaked-OST000d_UUID lock: ffff935683ba1680/0xdb15693e77e2b726 lrc: 3/0,0 mode: PW/PW res: [0x81a037:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.122@o2ib remote: 0xc6594446a74d696b expref: 6 pid: 43079 timeout: 498 lvb_type: 0
Dec 20 10:44:28 soak-5 kernel: Lustre: soaked-OST000d: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
Dec 20 10:44:28 soak-5 kernel: Lustre: Skipped 24 previous similar messages
Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.117@o2ib ns: filter-soaked-OST0009_UUID lock: ffff93569b9e9200/0xdb15693e77de431b lrc: 3/0,0 mode: PW/PW res: [0x812696:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a02b1c expref: 11 pid: 25268 timeout: 502 lvb_type: 0
Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar messages
Dec 20 10:44:37 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.140@o2ib ns: filter-soaked-OST0005_UUID lock: ffff9356b8cb2880/0xdb15693e77e2b7f1 lrc: 3/0,0 mode: PW/PW res: [0x81e6fa:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.140@o2ib remote: 0x5f346f6b7e7eb6b6 expref: 6 pid: 35334 timeout: 508 lvb_type: 0
Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 164s: evicting client at 192.168.1.117@o2ib ns: filter-soaked-OST0001_UUID lock: ffff93524f16e780/0xdb15693e77e079e3 lrc: 3/0,0 mode: PW/PW res: [0x80bd1d:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a024a1 expref: 7 pid: 28096 timeout: 502 lvb_type: 0
Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 3 previous similar messages
Dec 20 10:44:43 soak-5 sshd[222751]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Andreas, I just opened LU-11943 to describe the issue I saw on 2.10, which may be related to this one. In that run, soak ran well for several hours (with MDS/OSS failovers during that time), then applications started to fail heavily. Checking around the time things went bad, I found the same OSS lock callback timeout errors as in this ticket, and the OSS had just finished restarting. To see whether this issue is related only to failover, I need to try disabling failover and running again.