Lustre / LU-12315

parallel-scale test write_disjoint_tiny fails with 'rank N, loop 0: error stating /mnt/lustre/d0.write_disjoint/file: Input/output error'

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2
    • None
    • 3
    • 9223372036854775807

    Description

      parallel-scale test_write_disjoint_tiny fails with 'write_disjoint failed! 1 '.

      The suite_log from https://testing.whamcloud.com/test_sets/5c85dfd6-7980-11e9-869c-52540065bddc shows the following error:

      random seed: 1558184945
      loop 0: chunk_size 11263
      rank 1, loop 0: error stating /mnt/lustre/d0.write_disjoint/file: Input/output error
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
      with errorcode -1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      [trevis-49vm3.trevis.whamcloud.com][[48158,1],0][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      [trevis-49vm4.trevis.whamcloud.com][[48158,1],3][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      --------------------------------------------------------------------------
      mpirun has exited due to process rank 1 with PID 2742 on
      node trevis-49vm4 exiting improperly. There are two reasons this could occur:
      
      1. this process did not call "init" before exiting, but others in
      the job did. This can cause a job to hang indefinitely while it waits
      for all processes to call "init". By rule, if one process calls "init",
      then ALL processes must call "init" prior to termination.
      
      2. this process called "init", but exited without calling "finalize".
      By rule, all processes that call "init" MUST call "finalize" prior to
      exiting or it will be considered an "abnormal termination"
      
      This may have caused other processes in the application to be
      terminated by signals sent by mpirun (as reported here).
      --------------------------------------------------------------------------
      [trevis-49vm4.trevis.whamcloud.com][[48158,1],5][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
       parallel-scale test_write_disjoint_tiny: @@@@@@ FAIL: write_disjoint failed! 1 
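
      When comparing several failed test sets, it can help to pull the failing rank, loop, and error string out of each suite_log mechanically. The following is a minimal sketch, assuming the error line always has the shape shown above ("rank N, loop M: <message>: <strerror text>"). The function name parse_rank_error is hypothetical and is not part of the Lustre test framework.

```python
import re

# Matches lines such as:
#   rank 1, loop 0: error stating /mnt/lustre/d0.write_disjoint/file: Input/output error
# Assumption: the text after the last ": " is the strerror() string.
RANK_ERR_RE = re.compile(
    r"rank (?P<rank>\d+), loop (?P<loop>\d+): (?P<what>.+?): (?P<error>[^:]+)$"
)


def parse_rank_error(line):
    """Return a dict describing a write_disjoint rank error line, or None."""
    m = RANK_ERR_RE.search(line.strip())
    if m is None:
        return None
    return {
        "rank": int(m.group("rank")),
        "loop": int(m.group("loop")),
        "what": m.group("what"),
        "error": m.group("error"),
    }
```

      Applied to the log above, this would report rank 1, loop 0, and the error string "Input/output error"; lines such as "loop 0: chunk_size 11263" are ignored.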
      

      The client 2 console log shows the following error just after the test starts:

      [128950.499985] Lustre: DEBUG MARKER: == parallel-scale test write_disjoint_tiny: write_disjoint_tiny ====================================== 13:09:04 (1558184944)
      [128951.005270] LustreError: 17740:0:(client.c:1168:ptlrpc_import_delay_req()) @@@ Uninitialized import.  req@ffff906f886fc900 x1633800442682688/t0(0) o101->lustre-OST0001-osc-ffff906fe3280800@10.9.6.59@tcp:28/4 lens 328/400 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      [128951.312994] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale test_write_disjoint_tiny: @@@@@@ FAIL: write_disjoint failed! 1 
      

      There is nothing out of the ordinary in any other console log.
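
      Checking each node's console log for errors inside one test's window can be scripted. The sketch below is illustrative only: it assumes console lines carry a "[seconds.micros]" prefix and that each test begins with a "DEBUG MARKER: == <suite> test <name>:" line, as in the excerpt above. The helper name errors_for_test is hypothetical.

```python
import re

# Kernel console line: "[timestamp] message"
LINE_RE = re.compile(r"^\s*\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
# Test start marker, in the form seen in the console log excerpt
START_RE = re.compile(r"DEBUG MARKER: == (?P<suite>\S+) test (?P<test>\S+):")


def errors_for_test(lines, test_name):
    """Collect (timestamp, message) for LustreError lines logged between
    the start marker of test_name and the start marker of the next test."""
    in_test = False
    found = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a timestamped console line
        msg = m.group("msg")
        start = START_RE.search(msg)
        if start is not None:
            # A new test begins; track whether it is the one we want.
            in_test = start.group("test") == test_name
            continue
        if in_test and msg.startswith("LustreError:"):
            found.append((float(m.group("ts")), msg))
    return found
```

      Run against the client 2 console log with test_name set to "write_disjoint_tiny", this should return the single ptlrpc_import_delay_req() "Uninitialized import" message; on the other nodes it should return an empty list.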

      We've seen this test fail with this error just a few times:
      https://testing.whamcloud.com/test_sets/864f7288-3e3d-11e9-9646-52540065bddc
      https://testing.whamcloud.com/test_sets/848d6ae4-68dc-11e9-a6f2-52540065bddc
      https://testing.whamcloud.com/test_sets/919aecce-6bb7-11e9-aeec-52540065bddc

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez, Inactive)