Lustre / LU-157

metabench failed on parallel-scale test

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Fix Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Labels: None
    • Environment: separated MDS and OSS, 3 clients
    • Severity: 3
    • Rank (Obsolete): 10083

    Description

      The metabench test failed on the Lustre client; the failure is reproducible.

      test log
      -----------
      [03/23/2011 23:15:53] Leaving time_file_creation with proc_id = 11
      [03/23/2011 23:15:53] Entering par_create_multidir to create 910 files in 1 dirs
      Removed 10000 files in 8.325 seconds
      [client-5.lab.whamcloud.com:6909] *** An error occurred in MPI_Gather
      [client-5.lab.whamcloud.com:6909] *** on communicator MPI COMMUNICATOR 14 CREATE FROM 0
      [client-5.lab.whamcloud.com:6909] *** MPI_ERR_TRUNCATE: message truncated
      [client-5.lab.whamcloud.com:6909] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
      --------------------------------------------------------------------------
      mpirun has exited due to process rank 0 with PID 6909 on
      node client-5.lab.whamcloud.com exiting without calling "finalize". This may
      have caused other processes in the application to be
      terminated by signals sent by mpirun (as reported here).
      --------------------------------------------------------------------------
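
      For context, a minimal standalone example (not metabench code) of the usual cause of this error: MPI_Gather aborts with MPI_ERR_TRUNCATE when the root's per-rank receive count is smaller than the count each rank sends. Everything below is illustrative, not taken from metabench:

      /* gather_truncate.c -- hypothetical reproducer for MPI_ERR_TRUNCATE.
       * Each rank sends 4 ints, but the root only expects 2 per rank, so
       * the incoming messages are truncated and, under the default
       * MPI_ERRORS_ARE_FATAL handler, the job aborts as in the log above. */
      #include <mpi.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          int sendbuf[4] = { rank, rank, rank, rank };
          int *recvbuf = NULL;
          if (rank == 0)
              recvbuf = malloc((size_t)size * 2 * sizeof(int));

          /* recvcount (2) < sendcount (4): truncation error at the root */
          MPI_Gather(sendbuf, 4, MPI_INT, recvbuf, 2, MPI_INT, 0, MPI_COMM_WORLD);

          free(recvbuf);
          MPI_Finalize();
          return 0;
      }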

    Activity


            yong.fan nasf (Inactive) added a comment -

            Currently, the openmpi installed on the Toro nodes is openmpi-1.4-4.el5 and the metabench is metabench-1.0-1.wc1; there is some incompatibility between them. I do not know the detailed reason, but you can try the following:

            1) install openmpi-devel on your test node
            2) compile metabench from source code
            3) run parallel-scale with the new metabench; it will report "Invalid Arg ?"
            4) fix metabench.c to ignore the unknown parameter "?" and recompile (a sketch of this change follows below)
            5) run parallel-scale again; it should pass

            I have put the working metabench under /tmp/metabench on the Brent node, which can run on 2.6.18-194.17.1.el5. I am not sure how to fix this cleanly; maybe use MPICH or fix metabench.
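
            For illustration, a minimal sketch of the change described in step 4, assuming metabench parses its command line with getopt(); the option string and function name below are hypothetical, not copied from metabench.c:

            /* Hypothetical sketch: getopt() returns '?' for any option it
             * does not recognize (such as extra arguments injected by the
             * MPI launcher), so skip those instead of treating them as fatal. */
            #include <stdio.h>
            #include <unistd.h>

            static void parse_args(int argc, char **argv)
            {
                int c;
                opterr = 0;  /* suppress getopt's own diagnostic */
                while ((c = getopt(argc, argv, "d:n:")) != -1) {  /* illustrative options */
                    switch (c) {
                    case 'd':
                        /* handle a real metabench option here */
                        break;
                    case 'n':
                        /* handle another real option here */
                        break;
                    case '?':
                        /* was: report "Invalid Arg ?" and abort; now: ignore it */
                        fprintf(stderr, "ignoring unknown option '-%c'\n", optopt);
                        break;
                    }
                }
            }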

            mjmac Michael MacDonald (Inactive) added a comment -

            fanyong, can you please provide more detail as to what the problem was, and how you fixed it? I would like to update the toolkit build so that future test installs work correctly.
            pjones Peter Jones added a comment -

            Thanks Fan Yong. Let's reassign this ticket to mjmac to sort out the Toro config.

            yong.fan nasf (Inactive) added a comment - edited

            After some painful debugging (I am not familiar with MPI), I eventually found a useful clue: the failure is caused by an incompatibility between the openmpi and metabench packages installed on the Toro nodes. With openmpi-devel installed, I compiled metabench from source code (with a tiny fix to the metabench code, because the MPI library passes an unknown parameter at startup; I am not sure why), and the metabench test then ran successfully. So this is not related to Lustre at all. We need a new MPI library or metabench when deploying test nodes on Toro, but that is out of my control.

            Thanks to yujian for helping to build the test environment.

            yong.fan nasf (Inactive) added a comment -

            I will investigate it.
            pjones Peter Jones added a comment -

            Fan Yong

            Are you able to look into this one or should I reassign it?

            Peter

            sarah Sarah Liu added a comment -

            Closed by mistake; reopening. This is not a duplicate of LU-161/LU-142.
            sarah Sarah Liu added a comment -

            It seems this is a duplicate of LU-142, so closing it.
            pjones Peter Jones added a comment -

            Apparently Fan Yong is working on this one.
            pjones Peter Jones added a comment -

            Ah, sorry Yu Jian. This was the ticket that I meant to assign to you when I assigned LU-158 to you. Could you please see what you can uncover about the failure? Thanks.

    People

      Assignee: mjmac Michael MacDonald (Inactive)
      Reporter: sarah Sarah Liu
      Votes: 0
      Watchers: 6
