Lustre / LU-15457

IOR MPIIO job abort - file handling issue (EAGAIN)


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version: Lustre 2.15.0
    • Severity: 3

    Description

      MPIIO job aborted due to a file handling issue during the current 24-hour FOFB run with
      Regression Write Verify Dne2 DOM SEL OVS.

      write     582.27     524288     1024.00    0.005614   56.27      0.000848   56.28      3    XXCEL
      Verifying contents of the file(s) just written.
      Mon Jan 10 08:15:21 2022 delaying 1 seconds . . .
      ** error **
      ** error **
      ** error **
      ** error **
      ** error **
      ** error **
      ** error **
      ** error **
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ** exiting **
      Rank 1 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
      Rank 7 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7
      Rank 2 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
      Rank 4 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 4
      Rank 6 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 6
      Rank 5 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 5
      Rank 3 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
      MPI No MPI error
      MPI No MPI error
      MPI No MPI error
      ** exiting ** 
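      For context, the "cannot open file" messages come from IOR's MPIIO backend when the collective
      file open fails. Below is a minimal sketch of that failure path, not the actual aiori-MPIIO.c
      source; the filename and error handling are illustrative assumptions.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          MPI_File fh;
          int rc;

          MPI_Init(&argc, &argv);

          /* Collective open of one shared file, as in the single-shared-file MPIIO run above. */
          rc = MPI_File_open(MPI_COMM_WORLD, "IORfile_1m",
                             MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
          if (rc != MPI_SUCCESS) {
              /* A failing open on the underlying filesystem (e.g. EAGAIN) surfaces here;
               * the job then aborts, matching the MPI_Abort lines in the log. */
              fprintf(stderr, "cannot open file\n");
              MPI_Abort(MPI_COMM_WORLD, -1);
          }

          MPI_File_close(&fh);
          MPI_Finalize();
          return 0;
      }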

      Test tag summary :

      aprun -n 64 -N 8 /cray/css/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR -o /lus/kjcf05/flash/ostest.vers/alsorun.20220110080202.31077.walleye-p5/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m.3.d0DIRA.1641823358/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m/IORfile_1m -w -r -W -i 8 -t 1m -a MPIIO -b 512m -C -k -u -vv -q -d 1 -c
      Summary:
      	api                = MPIIO (version=3, subversion=1)
      	test filename      = /lus/kjcf05/flash/ostest.vers/alsorun.20220110080202.31077.walleye-p5/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m.3.d0DIRA.1641823358/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m/IORfile_1m
      	access             = single-shared-file, collective
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file=constant task offsets = 1
      	clients            = 64 (8 per node)
      	repetitions        = 8
      	xfersize           = 1 MiB
      	blocksize          = 512 MiB
      	aggregate filesize = 32 GiB  
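      For reference, the reported aggregate filesize follows from the parameters above: 64 clients × 512 MiB blocksize × 1 segment = 32768 MiB = 32 GiB, written and verified in 1 MiB transfers over 8 repetitions.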

      I will upload the dk logs and other logs from the Lustre nodes to FTP.

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Prasannakumar Nagasubramani (prasannakumar)