
[LU-157] metabench failed on parallel-scale test

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Fix Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Labels: None
    • Environment: separated MDS and OSS, 3 clients
    • Severity: 3
    • 10083

    Description

      The metabench test failed on the Lustre client; it can be reproduced.

      test log
      -----------
      [03/23/2011 23:15:53] Leaving time_file_creation with proc_id = 11
      [03/23/2011 23:15:53] Entering par_create_multidir to create 910 files in 1 dirs
      Removed 10000 files in 8.325 seconds
      [client-5.lab.whamcloud.com:6909] *** An error occurred in MPI_Gather
      [client-5.lab.whamcloud.com:6909] *** on communicator MPI COMMUNICATOR 14 CREATE FROM 0
      [client-5.lab.whamcloud.com:6909] *** MPI_ERR_TRUNCATE: message truncated
      [client-5.lab.whamcloud.com:6909] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
      --------------------------------------------------------------------------
      mpirun has exited due to process rank 0 with PID 6909 on
      node client-5.lab.whamcloud.com exiting without calling "finalize". This may
      have caused other processes in the application to be
      terminated by signals sent by mpirun (as reported here).
      --------------------------------------------------------------------------

      Attachments

        Issue Links

          Activity


            mjmac Michael MacDonald (Inactive) added a comment -

            I'm going to resolve this, as the original issue with bad code in metabench has been fixed. Please open new tickets for the other problems (e.g. cascading_rw).
            sarah Sarah Liu added a comment -

            I verified this bug on the latest lustre-master build for RHEL5/x86_64; metabench passes on NFS3 but fails on NFS4. I think the failure is not related to MPI, so I opened a new ticket for tracking: LU-344.

            Here are both results:
            https://maloo.whamcloud.com/test_sets/0df23b78-81de-11e0-b4df-52540025f9af
            https://maloo.whamcloud.com/test_sets/3c243272-81e2-11e0-b4df-52540025f9af

            sarah Sarah Liu added a comment - - edited

            Hi Mike, I ran cascading_rw on build lustre-master/rhel6-x86_64/#118; it has been running for almost two days and hasn't finished yet. Does this build contain the latest openmpi?


            mjmac Michael MacDonald (Inactive) added a comment -

            The real problem is that the receive buffer was defined as MPI_INT but the send buffer was defined as MPI_UNSIGNED_LONG. When compiled with gcc on x86_64, longs (8 bytes) don't fit into ints (4 bytes), hence the MPI_ERR_TRUNCATE error.

            I've committed a small patch which corrects this, and I'm waiting for RPMs to build across all platforms. I've already verified this on EL6/x86_64; please resolve the ticket when other platforms are verified in the normal course of testing. I'm confident that this issue is fixed, though, as it was a simple problem with a simple solution, once I understood the problem!

            For reference, here is an excerpt from the patch:

                 MPI_SAFE(MPI_Gather(&count,1,MPI_UNSIGNED_LONG,
            -           count_buf,1,MPI_INT,proc0,*my_comm));
            +           count_buf,1,MPI_UNSIGNED_LONG,proc0,*my_comm));
            

            I've applied this fix to all MPI_Gather instances with mismatched send/receive datatypes.
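
            For illustration, here is a minimal self-contained sketch (not metabench source; the file name and per-rank values are assumptions) of an MPI_Gather call whose send and receive datatypes match. Switching the receive type back to MPI_INT on x86_64 reproduces the MPI_ERR_TRUNCATE abort shown in the logs above.

                 /* gather_demo.c -- illustrative sketch only, not metabench code.
                  * Each rank contributes one unsigned long; rank 0 gathers them.
                  * The send and receive datatypes must describe the same size:
                  * receiving as MPI_INT (4 bytes) while sending MPI_UNSIGNED_LONG
                  * (8 bytes on x86_64) is what triggered MPI_ERR_TRUNCATE. */
                 #include <mpi.h>
                 #include <stdio.h>
                 #include <stdlib.h>

                 int main(int argc, char **argv)
                 {
                     int rank, size, i;
                     unsigned long count;
                     unsigned long *count_buf = NULL;

                     MPI_Init(&argc, &argv);
                     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                     MPI_Comm_size(MPI_COMM_WORLD, &size);

                     count = (unsigned long)rank * 1000UL;   /* per-rank value to gather */
                     if (rank == 0)
                         count_buf = malloc(size * sizeof(unsigned long));

                     /* Correct: both sides use MPI_UNSIGNED_LONG. */
                     MPI_Gather(&count, 1, MPI_UNSIGNED_LONG,
                                count_buf, 1, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

                     if (rank == 0) {
                         for (i = 0; i < size; i++)
                             printf("rank %d sent %lu\n", i, count_buf[i]);
                         free(count_buf);
                     }

                     MPI_Finalize();
                     return 0;
                 }

            Compiled with mpicc and run under mpirun -np 4, this completes cleanly; with the mismatched receive type it aborts with the same "message truncated" error seen in the logs.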

            yujian Jian Yu added a comment -

            Branch: b1_8 (Revision: c5c2986be490b2fbceb4b38d6c983d279f4bbcf8)
            Distro/Arch: RHEL6/x86_64 (patchless client), RHEL5/x86_64 (server)
            Network: tcp

            # rpm -qf /usr/lib64/openmpi/bin/mpirun
            openmpi-1.4.1-4.3.el6.x86_64

            # rpm -qf /usr/bin/metabench
            metabench-1.0-1.wc1.x86_64

            The same failure occurred while running the metabench test:

            [client-13:9663] *** An error occurred in MPI_Gather
            [client-13:9663] *** on communicator MPI COMMUNICATOR 10 CREATE FROM 0
            [client-13:9663] *** MPI_ERR_TRUNCATE: message truncated
            [client-13:9663] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
            --------------------------------------------------------------------------
            mpirun has exited due to process rank 0 with PID 9663 on
            node client-13 exiting without calling "finalize". This may
            have caused other processes in the application to be
            terminated by signals sent by mpirun (as reported here).
            --------------------------------------------------------------------------
            

            Maloo report: https://maloo.whamcloud.com/test_sets/4673d0c8-6cb3-11e0-b32b-52540025f9af


            yong.fan nasf (Inactive) added a comment -

            Currently, the openmpi installed on the Toro nodes is openmpi-1.4-4.el5 and the metabench is metabench-1.0-1.wc1; there is some incompatibility between them. I do not know the detailed reason, but you can try the following:

            1) install openmpi-devel on your test node
            2) compile metabench from source code
            3) run parallel-scale with the new metabench; it will report "Invalid Arg ?"
            4) fix metabench.c to ignore the unknown parameter "?" and recompile (see the sketch below)
            5) run parallel-scale again; it should pass

            I have put a working metabench under /tmp/metabench on the Brent node, which can run on 2.6.18-194.17.1.el5. I am not sure how to fix this properly; maybe use MPICH or fix metabench.
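
            For step 4, here is a hedged sketch of what "ignore the unknown parameter" could look like (metabench's real option parser is not shown in this ticket, so the option string and function below are hypothetical): a getopt() loop that warns about an unrecognized option and keeps parsing instead of exiting.

                 /* Hypothetical sketch for step 4 above -- not actual metabench.c code.
                  * getopt() returns '?' for an option it does not recognize; instead of
                  * treating that as a fatal "Invalid Arg" error, warn and keep parsing. */
                 #include <stdio.h>
                 #include <unistd.h>

                 static void parse_args(int argc, char **argv)
                 {
                     int c;

                     /* "d:n:" is an assumed option string, for illustration only. */
                     while ((c = getopt(argc, argv, "d:n:")) != -1) {
                         switch (c) {
                         case 'd':
                             printf("work dir: %s\n", optarg);
                             break;
                         case 'n':
                             printf("file count: %s\n", optarg);
                             break;
                         case '?':
                         default:
                             /* Unknown option (e.g. one injected by the MPI launcher):
                              * warn and ignore it rather than aborting the run. */
                             fprintf(stderr, "ignoring unknown option '-%c'\n", optopt);
                             break;
                         }
                     }
                 }

                 int main(int argc, char **argv)
                 {
                     parse_args(argc, argv);
                     return 0;
                 }

            This only works around the startup symptom; the underlying MPI_Gather datatype problem is addressed by the patch described elsewhere in this ticket.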


            mjmac Michael MacDonald (Inactive) added a comment -

            fanyong, can you please provide more detail as to what the problem was and how you fixed it? I would like to update the toolkit build so that future test installs work correctly.
            pjones Peter Jones added a comment -

            Thanks, Fan Yong. Let's reassign this ticket to mjmac to sort out the Toro config.

            yong.fan nasf (Inactive) added a comment - - edited

            After some painful debugging (I am not familiar with MPI), I eventually found a useful clue. The failure was caused by an incompatibility between the openmpi and metabench packages installed on the Toro nodes. With openmpi-devel installed, I compiled metabench from source (with a tiny fix to the metabench code because of an unknown parameter passed by the MPI library at startup; I am not sure why), and then the metabench test ran successfully. So the failure is not related to Lustre. We need a new MPI library or a new metabench when deploying test nodes on Toro, but that is out of my control.

            Thanks to yujian for helping to build the test environment.


            yong.fan nasf (Inactive) added a comment -

            I will investigate it.

            People

              Assignee: mjmac Michael MacDonald (Inactive)
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 6
