[LU-157] metabench failed on parallel-scale test

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Fix Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Labels: None
    • Environment: separated MDS and OSS, 3 clients
    • Severity: 3
    • Rank (Obsolete): 10083

    Description

      The metabench test failed on the Lustre client; the failure is reproducible.

      test log
      -----------
      [03/23/2011 23:15:53] Leaving time_file_creation with proc_id = 11
      [03/23/2011 23:15:53] Entering par_create_multidir to create 910 files in 1 dirs
      Removed 10000 files in 8.325 seconds
      [client-5.lab.whamcloud.com:6909] *** An error occurred in MPI_Gather
      [client-5.lab.whamcloud.com:6909] *** on communicator MPI COMMUNICATOR 14 CREATE FROM 0
      [client-5.lab.whamcloud.com:6909] *** MPI_ERR_TRUNCATE: message truncated
      [client-5.lab.whamcloud.com:6909] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
      --------------------------------------------------------------------------
      mpirun has exited due to process rank 0 with PID 6909 on
      node client-5.lab.whamcloud.com exiting without calling "finalize". This may
      have caused other processes in the application to be
      terminated by signals sent by mpirun (as reported here).
      --------------------------------------------------------------------------

          Activity


            mjmac Michael MacDonald (Inactive) added a comment -

            I'm going to resolve this, as the original issue with bad code in metabench has been fixed. Please open new tickets for the other problems (e.g. cascading_rw).
            sarah Sarah Liu added a comment -

            I verified this bug on the latest lustre-master build for RHEL5/x86_64; metabench passes on NFSv3 but fails on NFSv4. I think the failure is not related to MPI, so I opened a new ticket for tracking, LU-344.

            Here are both results:
            https://maloo.whamcloud.com/test_sets/0df23b78-81de-11e0-b4df-52540025f9af
            https://maloo.whamcloud.com/test_sets/3c243272-81e2-11e0-b4df-52540025f9af

            sarah Sarah Liu added a comment - - edited

            Hi Mike, I ran cascading_rw on build lustre-master/rhel6-x86_64/#118; it has been running for almost two days and hasn't finished yet. Does this build contain the latest openmpi?


            mjmac Michael MacDonald (Inactive) added a comment -

            The real problem is that the receive buffer was defined as MPI_INT but the send buffer was defined as MPI_UNSIGNED_LONG. When compiled with gcc on x86_64, longs (8 bytes) don't fit into ints (4 bytes), hence the MPI_ERR_TRUNCATE error.

            I've committed a small patch which corrects this, and I'm waiting for RPMs to build across all platforms. I've already verified this on EL6/x86_64; please resolve the ticket when other platforms are verified in the normal course of testing. I'm confident that this issue is fixed, though, as it was a simple problem with a simple solution, once I understood the problem!

            For reference, here is an excerpt from the patch:

                 MPI_SAFE(MPI_Gather(&count,1,MPI_UNSIGNED_LONG,
            -           count_buf,1,MPI_INT,proc0,*my_comm));
            +           count_buf,1,MPI_UNSIGNED_LONG,proc0,*my_comm));
            

            I've applied this fix to all MPI_Gather instances with mismatched send/receive datatypes.
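
            For context, here is a minimal standalone sketch of the same mismatch and its fix. It is not taken from the metabench source; the variable names and values below are illustrative (compile with mpicc, run under mpirun):

                /* Each rank sends an unsigned long (8 bytes on x86_64) and rank 0
                 * gathers them. If the receive datatype were MPI_INT (4 bytes),
                 * as in the old metabench code, Open MPI would abort with
                 * MPI_ERR_TRUNCATE. Matching the datatypes avoids that. */
                #include <mpi.h>
                #include <stdio.h>
                #include <stdlib.h>

                int main(int argc, char **argv)
                {
                    int rank, size, i;
                    unsigned long count;
                    unsigned long *count_buf = NULL;

                    MPI_Init(&argc, &argv);
                    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                    MPI_Comm_size(MPI_COMM_WORLD, &size);

                    count = 10000UL * (rank + 1);   /* per-rank value to gather */
                    if (rank == 0)
                        count_buf = malloc(size * sizeof(unsigned long));

                    /* Broken form (old metabench code):
                     *   MPI_Gather(&count, 1, MPI_UNSIGNED_LONG,
                     *              count_buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
                     * Fixed form: send and receive datatypes match. */
                    MPI_Gather(&count, 1, MPI_UNSIGNED_LONG,
                               count_buf, 1, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

                    if (rank == 0) {
                        for (i = 0; i < size; i++)
                            printf("rank %d reported count %lu\n", i, count_buf[i]);
                        free(count_buf);
                    }
                    MPI_Finalize();
                    return 0;
                }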

            yujian Jian Yu added a comment -

            Branch: b1_8 (Revision: c5c2986be490b2fbceb4b38d6c983d279f4bbcf8)
            Distro/Arch: RHEL6/x86_64 (patchless client), RHEL5/x86_64 (server)
            Network: tcp

            # rpm -qf /usr/lib64/openmpi/bin/mpirun
            openmpi-1.4.1-4.3.el6.x86_64

            # rpm -qf /usr/bin/metabench
            metabench-1.0-1.wc1.x86_64

            The same failure occurred while running the metabench test:

            [client-13:9663] *** An error occurred in MPI_Gather
            [client-13:9663] *** on communicator MPI COMMUNICATOR 10 CREATE FROM 0
            [client-13:9663] *** MPI_ERR_TRUNCATE: message truncated
            [client-13:9663] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
            --------------------------------------------------------------------------
            mpirun has exited due to process rank 0 with PID 9663 on
            node client-13 exiting without calling "finalize". This may
            have caused other processes in the application to be
            terminated by signals sent by mpirun (as reported here).
            --------------------------------------------------------------------------
            

            Maloo report: https://maloo.whamcloud.com/test_sets/4673d0c8-6cb3-11e0-b32b-52540025f9af


            yong.fan nasf (Inactive) added a comment -

            Currently, the openmpi installed on the Toro nodes is openmpi-1.4-4.el5 and the metabench is metabench-1.0-1.wc1; there is some incompatibility between them. I do not know the detailed reason, but you can try the following:

            1) Install openmpi-devel on your test node.
            2) Compile metabench from source code.
            3) Run parallel-scale with the new metabench; it will report "Invalid Arg ?".
            4) Fix metabench.c to ignore the unknown parameter "?" and recompile (see the sketch below).
            5) Run parallel-scale again; it should pass.

            I have put the workable metabench under /tmp/metabench on the Brent node, which can run on 2.6.18-194.17.1.el5. I am not sure how to fix this easily; maybe use MPICH or fix metabench.
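
            To illustrate step 4, here is a minimal, self-contained sketch of an option parser that warns about and skips unrecognized options such as "?" instead of aborting. It is not the actual metabench.c; the "-n" option is a placeholder:

                #include <stdio.h>
                #include <stdlib.h>
                #include <unistd.h>

                int main(int argc, char **argv)
                {
                    int opt;
                    long nfiles = 0;

                    opterr = 0;   /* suppress getopt's own error message */
                    while ((opt = getopt(argc, argv, "n:")) != -1) {
                        switch (opt) {
                        case 'n':   /* placeholder option: number of files */
                            nfiles = strtol(optarg, NULL, 10);
                            break;
                        case '?':   /* unknown option: warn and keep going */
                            fprintf(stderr, "ignoring unknown option '-%c'\n", optopt);
                            break;
                        }
                    }

                    printf("would create %ld files\n", nfiles);
                    return 0;
                }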


            People

              Assignee: mjmac Michael MacDonald (Inactive)
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 6
