Lustre / LU-15457

IOR MPIIO job abort - file handling issue (EAGAIN)

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      An IOR MPIIO job aborted due to a file handling issue: the file could not be opened during
      the verify phase. This occurred in the current 24-hour FOFB run with Regression Write Verify
      Dne2 DOM SEL OVS.

      write     582.27     524288     1024.00    0.005614   56.27      0.000848   56.28      3    XXCEL
      Verifying contents of the file(s) just written.
      Mon Jan 10 08:15:21 2022delaying 1 seconds . . .
      ** error **
      ** error **
      ** error **
      ** error **
      ** error **
      ** error **
      ** error **
      ** error **
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ERROR in aiori-MPIIO.c (line 128): cannot open file.
      ** exiting **
      Rank 1 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
      Rank 7 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7
      Rank 2 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
      Rank 4 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 4
      Rank 6 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 6
      Rank 5 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 5
      Rank 3 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
      MPI No MPI error
      MPI No MPI error
      MPI No MPI error
      ** exiting ** 
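For triage, the output above can be tallied by error site and by aborting rank. The following is only a minimal sketch: the function name `summarize` and the two regular expressions are assumptions written against the line formats visible in this excerpt, not part of IOR or Lustre tooling.

```python
import re
from collections import Counter, defaultdict

# Patterns written against the lines shown in this report (assumed formats).
ERROR_RE = re.compile(r"ERROR in (\S+) \(line (\d+)\): (.+)")
ABORT_RE = re.compile(
    r"Rank (\d+) \[[^\]]*\] \[([^\]]+)\] application called MPI_Abort"
)


def summarize(log_text):
    """Count IOR error sites and group aborted MPI ranks by node name."""
    errors = Counter()
    aborted = defaultdict(list)
    for raw in log_text.splitlines():
        line = raw.strip()
        m = ERROR_RE.match(line)
        if m:
            # Key: (source file, line number, message text)
            errors[(m.group(1), int(m.group(2)), m.group(3))] += 1
            continue
        m = ABORT_RE.match(line)
        if m:
            aborted[m.group(2)].append(int(m.group(1)))
    return errors, {node: sorted(ranks) for node, ranks in aborted.items()}


# Two-rank excerpt copied from the job output above.
SAMPLE = """\
ERROR in aiori-MPIIO.c (line 128): cannot open file.
ERROR in aiori-MPIIO.c (line 128): cannot open file.
Rank 1 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
Rank 7 [Mon Jan 10 08:17:46 2022] [c3-0c0s8n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7
"""

errors, aborted = summarize(SAMPLE)
print(errors)
print(aborted)
```

Run against the full excerpt, this would show that every reported error comes from the same open call site and that all listed aborts are on node c3-0c0s8n1.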

      Test tag summary :

      aptrun -n 64 -N 8 /cray/css/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR -o /lus/kjcf05/flash/ostest.vers/alsorun.20220110080202.31077.walleye-p5/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m.3.d0DIRA.1641823358/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m/IORfile_1m -w -r -W -i 8 -t 1m -a MPIIO -b 512m -C -k -u -vv -q -d 1 -c  
      Summary:
      	api                = MPIIO (version=3, subversion=1)
      	test filename      = /lus/kjcf05/flash/ostest.vers/alsorun.20220110080202.31077.walleye-p5/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m.3.d0DIRA.1641823358/CL_IOR_pfl_ssf_mpiioc_wr_8iter_n8x8_1m/IORfile_1m
      	access             = single-shared-file, collective
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file=constant task offsets = 1
      	clients            = 64 (8 per node)
      	repetitions        = 8
      	xfersize           = 1 MiB
      	blocksize          = 512 MiB
      	aggregate filesize = 32 GiB  
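The figures in this summary are consistent with the command line (`-n 64 -N 8 -b 512m -t 1m`). As a quick cross-check, the sketch below recomputes them; the helper name `ior_geometry` is hypothetical and the formula assumes a single shared file in which every client writes one block per segment.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def ior_geometry(clients, clients_per_node, block_size, transfer_size, segments=1):
    """Derive node count, transfers per block, and aggregate file size."""
    return {
        "nodes": clients // clients_per_node,
        "transfers_per_block": block_size // transfer_size,
        # Each client contributes one block per segment to the shared file.
        "aggregate_bytes": clients * segments * block_size,
    }


geo = ior_geometry(
    clients=64,            # -n 64
    clients_per_node=8,    # -N 8
    block_size=512 * MIB,  # -b 512m
    transfer_size=1 * MIB, # -t 1m
)
print(geo["nodes"])                   # 8
print(geo["transfers_per_block"])     # 512
print(geo["aggregate_bytes"] // GIB)  # 32
```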

      I will upload the dk logs and other logs from the Lustre nodes to FTP.

          Activity

            Logs are uploaded to FTP:

            ftp> pwd
            257 "/uploads/LU15457" 

            Logs uploaded:

            total 855668
            -rw-r--r-- 1 guest users   1067164 Jan 10 10:04 console-20220110
            -rw-r--r-- 1 guest users  12973306 Jan 10 10:15 dklog.c3-0c0s10n2.log
            -rw------- 1 guest users   2726490 Jan 10 09:59 dklog.kjcf05n02.20220110104556.log
            -rw------- 1 guest users   1677197 Jan 10 09:59 dklog.kjcf05n03.20220110104556.log
            -rw------- 1 guest users 142662635 Jan 10 09:59 dklog.kjcf05n04.20220110104556.log
            -rw------- 1 guest users 143533831 Jan 10 09:59 dklog.kjcf05n05.20220110104556.log
            -rw------- 1 guest users 144005713 Jan 10 09:59 dklog.kjcf05n06.20220110104556.log
            -rw------- 1 guest users 144228205 Jan 10 09:59 dklog.kjcf05n07.20220110104556.log
            -rw------- 1 guest users   2166765 Jan 10 10:01 ha.log
            -rw------- 1 guest users 136543876 Jan 10 10:00 kern
            -rw-r--r-- 1 guest users    539815 Jan 10 10:02 lustre-failover-log_202201100814
            -rw------- 1 guest users 125377316 Jan 10 10:00 messages
            -rw-r--r-- 1 guest users  18650793 Jan 10 10:04 messages-20220110
            -rw-rw-r-- 1 guest users     29755 Jan 10 10:06 tag_output.txt
             
            prasannakumar Prasannakumar Nagasubramani added a comment

            People

              Assignee: wc-triage WC Triage
              Reporter: prasannakumar Prasannakumar Nagasubramani