[LU-15457] IOR MPIIO job abort - file handling issue (EAGAIN) Created: 18/Jan/22 Updated: 27/Apr/22 Resolved: 27/Apr/22 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Prasannakumar Nagasubramani | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
client cluster: Walleye-P5
20 x Lustre 2.15 clients
4 x LNet routers
Lustre client: 2.14.56_78_ga7e1d9f
storage cluster: kjcf05
NEO build: 6.0-010.14-cm-22.01.12-ga7e1d9f
Lustre server: 2.14.56_78_ga7e1d9f
Model SSUs 1xE1000D 1xE1000F
4 x OSS nodes with 2 x HDD OSTs and 2 x flash OSTs |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
An IOR MPIIO job aborted due to a file handling issue during the current 24-hour FOFB run with Regression:

write 582.27 524288 1024.00 0.005614 56.27 0.000848 56.28 3 XXCEL
Verifying contents of the file(s) just written.
Mon Jan 10 08:15:21 2022
delaying 1 seconds . . .
** error **
** error **
** error **
** error **
** error **
** error **
** error **
** error **
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
** exiting **
Rank 1 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
Rank 7 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7
Rank 2 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
Rank 4 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 4
Rank 6 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 6
Rank 5 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 5
Rank 3 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
MPI No MPI error
MPI No MPI error
MPI No MPI error
** exiting **

Test tag summary :
aptrun -n 64 -N 8 /cray/css/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR -o /lus/kjcf05/flash/ostest.vers/alsorun.20220110080202.31077.walleye-p5/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m.3.d0DIRA.1641823358/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m/IORfile_1m -w -r -W -i 8 -t 1m -a MPIIO -b 512m -C -k -u -vv -q -d 1 -c
Summary:
api = MPIIO (version=3, subversion=1)
test filename = /lus/kjcf05/flash/ostest.vers/alsorun.20220110080202.31077.walleye-p5/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m.3.d0DIRA.1641823358/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m/IORfile_1m
access = single-shared-file, collective
pattern = segmented (1 segment)
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 64 (8 per node)
repetitions = 8
xfersize = 1 MiB
blocksize = 512 MiB
aggregate filesize = 32 GiB

I will upload the dk logs and other logs from the Lustre nodes to FTP. |
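As a sanity check on the IOR summary above, the reported sizes appear mutually consistent: 64 clients each writing one 512 MiB block into a single shared file gives the 32 GiB aggregate, and the 524288 / 1024.00 columns on the write result line look like the same block and transfer sizes expressed in KiB (an assumption about the column units, not something stated in this ticket). A minimal sketch of that arithmetic follows; the helper names are hypothetical and only the input values come from the summary.

```python
# Hypothetical helpers; the numeric inputs are copied from the IOR summary in this ticket.
MIB_PER_GIB = 1024
KIB_PER_MIB = 1024


def aggregate_filesize_gib(clients: int, blocksize_mib: int, segments: int = 1) -> float:
    """Size of a single shared file when every task writes one block per segment."""
    return clients * blocksize_mib * segments / MIB_PER_GIB


def transfers_per_task(blocksize_mib: int, xfersize_mib: int, segments: int = 1) -> int:
    """Number of xfersize-sized I/O calls each task issues per iteration."""
    return (blocksize_mib // xfersize_mib) * segments


if __name__ == "__main__":
    # clients = 64, blocksize = 512 MiB, xfersize = 1 MiB, 1 segment
    print(aggregate_filesize_gib(64, 512))   # 32.0
    print(transfers_per_task(512, 1))        # 512
    # Assumed KiB columns from the write line: 524288 and 1024.00
    print(524288 / KIB_PER_MIB, 1024.00 / KIB_PER_MIB)  # 512.0 1.0
```

With these values each of the 64 ranks issues 512 one-MiB transfers per iteration, so the failing open happens on a file of the expected size rather than on a truncated or oversized one.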
| Comments |
| Comment by Prasannakumar Nagasubramani [ 18/Jan/22 ] |
|
Logs are uploaded to FTP:
ftp> pwd
257 "/uploads/LU15457"
logs uploaded:
total 855668
-rw-r--r-- 1 guest users 1067164 Jan 10 10:04 console-20220110
-rw-r--r-- 1 guest users 12973306 Jan 10 10:15 dklog.c3-0c0s10n2.log
-rw------- 1 guest users 2726490 Jan 10 09:59 dklog.kjcf05n02.20220110104556.log
-rw------- 1 guest users 1677197 Jan 10 09:59 dklog.kjcf05n03.20220110104556.log
-rw------- 1 guest users 142662635 Jan 10 09:59 dklog.kjcf05n04.20220110104556.log
-rw------- 1 guest users 143533831 Jan 10 09:59 dklog.kjcf05n05.20220110104556.log
-rw------- 1 guest users 144005713 Jan 10 09:59 dklog.kjcf05n06.20220110104556.log
-rw------- 1 guest users 144228205 Jan 10 09:59 dklog.kjcf05n07.20220110104556.log
-rw------- 1 guest users 2166765 Jan 10 10:01 ha.log
-rw------- 1 guest users 136543876 Jan 10 10:00 kern
-rw-r--r-- 1 guest users 539815 Jan 10 10:02 lustre-failover-log_202201100814
-rw------- 1 guest users 125377316 Jan 10 10:00 messages
-rw-r--r-- 1 guest users 18650793 Jan 10 10:04 messages-20220110
-rw-rw-r-- 1 guest users 29755 Jan 10 10:06 tag_output.txt |