[LU-15457] IOR MPIIO job abort - file handling issue (EAGAIN) Created: 18/Jan/22  Updated: 27/Apr/22  Resolved: 27/Apr/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: Prasannakumar Nagasubramani
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

client cluster: Walleye-P5

20 x Lustre 2.15 clients

4 x LNet routers

Lustre client:

2.14.56_78_ga7e1d9f

storage cluster: kjcf05

NEO build: 6.0-010.14-cm-22.01.12-ga7e1d9f

Lustre server:

2.14.56_78_ga7e1d9f

Model SSUs

1xE1000D 1xE1000F

4 x OSS nodes with 2 x HDD OSTs and 2 x flash OSTs


Issue Links:
Duplicate
duplicates LU-15788 lazystatfs + FOFB + mpich problems Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An IOR MPIIO job aborted due to a file handling issue: the file could not be opened (EAGAIN). This happened during the current 24-hour FOFB run with Regression Write Verify Dne2 DOM SEL OVS.

write     582.27     524288     1024.00    0.005614   56.27      0.000848   56.28      3    XXCEL
Verifying contents of the file(s) just written.
Mon Jan 10 08:15:21 2022delaying 1 seconds . . .
** error **
** error **
** error **
** error **
** error **
** error **
** error **
** error **
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
** exiting **
Rank 1 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
Rank 7 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7
Rank 2 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
Rank 4 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 4
Rank 6 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 6
Rank 5 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 5
Rank 3 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
MPI No MPI error
MPI No MPI error
MPI No MPI error
** exiting ** 
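
For triage, the abort lines above can be grouped by node to see which ranks called MPI_Abort. The following is a minimal sketch, assuming the log line layout shown in this excerpt; the helper name `aborted_ranks_by_node` and the regular expression are illustrative and are not part of IOR or MPICH.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpt above:
# "Rank <n> [<timestamp>] [<node>] application called MPI_Abort(...)"
ABORT_RE = re.compile(
    r"^Rank (\d+) \[([^\]]+)\] \[([^\]]+)\] application called MPI_Abort"
)


def aborted_ranks_by_node(lines):
    """Return {node: sorted list of ranks that called MPI_Abort}."""
    by_node = defaultdict(list)
    for line in lines:
        match = ABORT_RE.match(line)
        if match:
            rank, _timestamp, node = match.groups()
            by_node[node].append(int(rank))
    return {node: sorted(ranks) for node, ranks in by_node.items()}


# Two abort lines copied from the log above, plus one non-matching line.
sample = [
    "Rank 1 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called "
    "MPI_Abort(MPI_COMM_WORLD, -1) - process 1",
    "Rank 7 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called "
    "MPI_Abort(MPI_COMM_WORLD, -1) - process 7",
    "MPI No MPI error",
]
print(aborted_ranks_by_node(sample))
```

Run against the full job output, this would show whether the aborts are confined to a single node, as they are in the excerpt above (ranks 1 through 7 on c3-0c0s8n1).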

Test tag summary :

aptrun -n 64 -N 8 /cray/css/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR -o /lus/kjcf05/flash/ostest.vers/alsorun.20220110080202.31077.walleye-p5/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m.3.d0DIRA.1641823358/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m/IORfile_1m -w -r -W -i 8 -t 1m -a MPIIO -b 512m -C -k -u -vv -q -d 1 -c  
Summary:
	api                = MPIIO (version=3, subversion=1)
	test filename      = /lus/kjcf05/flash/ostest.vers/alsorun.20220110080202.31077.walleye-p5/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m.3.d0DIRA.1641823358/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m/IORfile_1m
	access             = single-shared-file, collective
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file=constant task offsets = 1
	clients            = 64 (8 per node)
	repetitions        = 8
	xfersize           = 1 MiB
	blocksize          = 512 MiB
	aggregate filesize = 32 GiB  
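
The summary figures follow from the command-line flags (`-n 64`, `-b 512m`, `-t 1m`, one segment). The sketch below recomputes them as a sanity check; `ior_sizes` is a hypothetical helper written for this note, not an IOR function.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def ior_sizes(num_tasks, block_mib, xfer_mib, segments=1):
    """Recompute aggregate file size and transfers per task.

    Hypothetical helper: assumes every task writes `segments` blocks of
    `block_mib` MiB each, split into transfers of `xfer_mib` MiB.
    """
    block = block_mib * MIB
    xfer = xfer_mib * MIB
    aggregate = num_tasks * segments * block
    transfers_per_task = segments * block // xfer
    return aggregate, transfers_per_task


# Values from the test tag: 64 tasks, 512 MiB blocks, 1 MiB transfers.
aggregate, transfers = ior_sizes(num_tasks=64, block_mib=512, xfer_mib=1)
print(f"aggregate filesize = {aggregate // GIB} GiB")  # 32 GiB
print(f"transfers per task = {transfers}")             # 512
```

The 32 GiB result matches the `aggregate filesize` line reported by IOR above.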

I will upload the dk logs and other logs from the Lustre nodes to FTP.



 Comments   
Comment by Prasannakumar Nagasubramani [ 18/Jan/22 ]

Logs are uploaded to FTP:

ftp> pwd
257 "/uploads/LU15457" 

Logs uploaded:

total 855668
-rw-r--r-- 1 guest users   1067164 Jan 10 10:04 console-20220110
-rw-r--r-- 1 guest users  12973306 Jan 10 10:15 dklog.c3-0c0s10n2.log
-rw------- 1 guest users   2726490 Jan 10 09:59 dklog.kjcf05n02.20220110104556.log
-rw------- 1 guest users   1677197 Jan 10 09:59 dklog.kjcf05n03.20220110104556.log
-rw------- 1 guest users 142662635 Jan 10 09:59 dklog.kjcf05n04.20220110104556.log
-rw------- 1 guest users 143533831 Jan 10 09:59 dklog.kjcf05n05.20220110104556.log
-rw------- 1 guest users 144005713 Jan 10 09:59 dklog.kjcf05n06.20220110104556.log
-rw------- 1 guest users 144228205 Jan 10 09:59 dklog.kjcf05n07.20220110104556.log
-rw------- 1 guest users   2166765 Jan 10 10:01 ha.log
-rw------- 1 guest users 136543876 Jan 10 10:00 kern
-rw-r--r-- 1 guest users    539815 Jan 10 10:02 lustre-failover-log_202201100814
-rw------- 1 guest users 125377316 Jan 10 10:00 messages
-rw-r--r-- 1 guest users  18650793 Jan 10 10:04 messages-20220110
-rw-rw-r-- 1 guest users     29755 Jan 10 10:06 tag_output.txt
 