performance-sanity test_4: FAIL: test_4 failed with 1

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.16.0
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/06d7916c-1d43-4976-ae97-a956a99f9115

      test_4 failed with the following error:

      + su mpiuser bash -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -machinefile /tmp/auster.machines -np 1 -npernode 2 /usr/lib64/openmpi/bin/mdtest -d=/mnt/lustre/mdtest -I=1488 -n=148893 -r "
      onyx-41vm1.onyx.whamcloud.com:rank0.mdtest: Failed to get eth0 (unit 0) cpu set
      onyx-41vm1.onyx.whamcloud.com:rank0: PSM3 can't open nic unit: 0 (err=23)
      --------------------------------------------------------------------------
      Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
      unusual; your job may behave unpredictably (and/or abort) after this.
      
        Local host: onyx-41vm1
        Location: mtl_ofi_component.c:513
        Error: Invalid argument (22)
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4587 - 5.14.0-427.31.1.el9_4.x86_64
      servers: https://build.whamcloud.com/job/lustre-master/4587 - 5.14.0-427.31.1_lustre.el9.x86_64

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      performance-sanity test_4 - test_4 failed with 1

          Activity

            [LU-18393] performance-sanity test_4: FAIL: test_4 failed with 1
            pjones Peter Jones added a comment -

            Merged for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56785/
            Subject: LU-18393 tests: $num_files should be multiple of $num_entries
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5936b1cb17ffec5fba384bba2c915e93657862da
            yujian Jian Yu added a comment - Lustre 2.16.0 RC5: https://testing.whamcloud.com/test_sets/1d904ddb-50a9-4bc7-8ee7-d13c9fbf76c7

            gerrit Gerrit Updater added a comment -

            "Emoly Liu <emoly@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56785
            Subject: LU-18393 tests: $num_files should be multiple of $num_entries
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 77bbabeaa72bbd1aac36fe1e3c8df8a16cb8237f
            emoly.liu Emoly Liu added a comment -

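            This failure happened because there are not enough available inodes. The test reduces the number of files to the number of available inodes, and the reduced number of items is no longer a multiple of the items per directory. For example, 157824 is not a multiple of 1578 in the following logs (100 * 1578 = 157800, leaving a remainder of 24).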

            Directory lookup rate 100 directories, 2000 files each
            change the number of files 200000 to the number of available inodes 157824
            split 157824 files to 100 with 1578 files each
            ...
            Command line used: /usr/lib64/openmpi/bin/mdtest '-d=/mnt/lustre/mdtest' '-I=1578' '-n=157824' '-C' '-D' '-E' '-k' '-r'
            10/22/2024 22:13:14: Process 0: FAILED in md_validate_tests, items must be a multiple of items per directory
            --------------------------------------------------------------------------
            MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
            with errorcode 1.
            

            I will push a patch to fix this, so that the number of files is a multiple of the number of files per directory.
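
            The adjustment can be sketched as follows. This is only an illustration of the arithmetic, not the merged patch: the helper name round_down_to_multiple and the variables num_dirs and free_inodes are hypothetical, and the values are taken from the log above.

```shell
#!/bin/bash
# Hypothetical helper (not from the merged patch): round a file count down
# to the nearest multiple of the per-directory entry count, which is what
# mdtest requires of -n relative to -I.
round_down_to_multiple() {
	local num_files=$1
	local num_entries=$2

	# Refuse a zero or negative divisor instead of failing in arithmetic.
	if (( num_entries <= 0 )); then
		echo "num_entries must be positive" >&2
		return 1
	fi

	# Integer division truncates, so multiplying back drops the remainder.
	echo $(( num_files / num_entries * num_entries ))
}

num_dirs=100          # directories requested by the test (from the log)
free_inodes=157824    # available inodes reported in the log

# Files per directory after the split: 157824 / 100 = 1578
num_entries=$(( free_inodes / num_dirs ))

# Total files, trimmed so that it is a multiple of num_entries
num_files=$(round_down_to_multiple "$free_inodes" "$num_entries")

echo "mdtest -I=$num_entries -n=$num_files"
```

            With the values from the log this yields -I=1578 and -n=157800, which satisfies the "items must be a multiple of items per directory" check. The same remainder problem is present in the command in the description, where 148893 is not a multiple of 1488.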

            lixi_wc Li Xi added a comment -

            emoly.liu, would you please check this issue?

            yujian Jian Yu added a comment -

            The failure occurred 74 times in the past 6 months.


            People

              emoly.liu Emoly Liu
              maloo Maloo