Lustre / LU-14292

metadata-updates test 3 fails with 'mpi_run failed'


Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.14.0
    • None
    • 3
    • 9223372036854775807

    Description

      metadata-updates test_3 fails with 'mpi_run failed', starting on 2020-08-08 in the Lustre 2.13.55.9 full-patchless test session at https://testing.whamcloud.com/test_sets/abe45967-5b59-4042-88eb-ff34e5e658ad. Since then, metadata-updates test 3 has failed 27 times, with the latest failure at https://testing.whamcloud.com/test_sets/f550175e-d2d7-4ba2-a9fc-0316726010be.

      In the suite_log for the failure, write() on rank 0 returns 'Disk quota exceeded' before MPI aborts:

      == metadata-updates test 3: write_disjoint test ====================================================== 07:33:16 (1608881596)
      + chmod 0777 /mnt/lustre
      drwxrwxrwx 6 root root 65536 Dec 25 07:32 /mnt/lustre
      + su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -machinefile /tmp/auster.machines -np 2 /usr/lib64/openmpi/bin/write_disjoint -f /mnt/lustre/d0.metadata-updates/f3.metadata-updates -n 1000 "
      Warning: Permanently added 'trevis-202vm2,10.9.7.130' (ECDSA) to the list of known hosts.
      random seed: 1608881598
      loop 0: chunk_size 15794568
      rank 0, loop 8: write() returned Disk quota exceeded
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
      with errorcode -1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      ------------------------------------------------------------
      A process or daemon was unable to complete a TCP connection
      to another process:
        Local host:    trevis-202vm1
        Remote host:   trevis-202vm2
      This is usually caused by a firewall on the remote host. Please
      check that any firewall (e.g., iptables) has been disabled and
      try again.
      ------------------------------------------------------------
       metadata-updates test_3: @@@@@@ FAIL: mpi_run failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/metadata-updates.sh:269:test_3()
      

      In the dmesg log on client1 (vm1), we see:

      [111671.914784] Lustre: DEBUG MARKER: == metadata-updates test 3: write_disjoint test ====================================================== 07:33:16 (1608881596)
      [111673.143081] hugetlbfs: write_disjoint (1566296): Using mlock ulimits for SHM_HUGETLB is deprecated
      [111678.973977] systemd-coredump[1566303]: Not enough arguments passed by the kernel (0, expected 7).
      [111679.381003] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  metadata-updates test_3: @@@@@@ FAIL: mpi_run failed 
      

      There’s nothing else that indicates a problem in the console and dmesg logs.
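
For triage across the repeated failures, a small script can pull the underlying I/O error out of a suite_log instead of relying on the generic 'mpi_run failed' message. The following is only a sketch: the function name `find_io_errors` and the regular expression are hypothetical, and the pattern is based solely on the single error line quoted above, so other write_disjoint error formats are not covered.

```python
import re

# Assumed line format, taken from the suite_log excerpt in this ticket:
#   rank 0, loop 8: write() returned Disk quota exceeded
ERROR_LINE = re.compile(
    r"rank (?P<rank>\d+), loop (?P<loop>\d+): "
    r"(?P<call>\w+)\(\) returned (?P<reason>.+)"
)


def find_io_errors(log_text):
    """Return a (rank, loop, call, reason) tuple for each matching log line."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            errors.append((
                int(match.group("rank")),
                int(match.group("loop")),
                match.group("call"),
                match.group("reason").strip(),
            ))
    return errors


# Sample input copied from the suite_log excerpt above.
sample = """\
random seed: 1608881598
loop 0: chunk_size 15794568
rank 0, loop 8: write() returned Disk quota exceeded
 metadata-updates test_3: @@@@@@ FAIL: mpi_run failed
"""

for rank, loop, call, reason in find_io_errors(sample):
    print(f"rank {rank} failed in {call}() at loop {loop}: {reason}")
```

A possible follow-up check, not part of the original report, would be to inspect the limits for the test user on an affected client with `lfs quota -u mpiuser /mnt/lustre`.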

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)