Lustre / LU-157

metabench failed on parallel-scale test

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Fix Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Labels: None
    • Environment: separated MDS and OSS, 3 clients
    • Severity: 3
    • Rank (Obsolete): 10083

    Description

      The metabench test failed on the Lustre client; the failure is reproducible.

      test log
      -----------
      [03/23/2011 23:15:53] Leaving time_file_creation with proc_id = 11
      [03/23/2011 23:15:53] Entering par_create_multidir to create 910 files in 1 dirs
      Removed 10000 files in 8.325 seconds
      [client-5.lab.whamcloud.com:6909] *** An error occurred in MPI_Gather
      [client-5.lab.whamcloud.com:6909] *** on communicator MPI COMMUNICATOR 14 CREATE FROM 0
      [client-5.lab.whamcloud.com:6909] *** MPI_ERR_TRUNCATE: message truncated
      [client-5.lab.whamcloud.com:6909] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
      --------------------------------------------------------------------------
      mpirun has exited due to process rank 0 with PID 6909 on
      node client-5.lab.whamcloud.com exiting without calling "finalize". This may
      have caused other processes in the application to be
      terminated by signals sent by mpirun (as reported here).
      --------------------------------------------------------------------------
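
      For context, a minimal standalone example (not metabench code) of the usual cause of this error: MPI_Gather aborts with MPI_ERR_TRUNCATE when the root's per-rank receive count is smaller than the count each rank sends. Everything below is illustrative, not taken from metabench:

      /* gather_truncate.c -- hypothetical reproducer for MPI_ERR_TRUNCATE.
       * Each rank sends 4 ints, but the root only expects 2 per rank, so
       * the incoming messages are truncated and, under the default
       * MPI_ERRORS_ARE_FATAL handler, the job aborts as in the log above. */
      #include <mpi.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          int sendbuf[4] = { rank, rank, rank, rank };
          int *recvbuf = NULL;
          if (rank == 0)
              recvbuf = malloc((size_t)size * 2 * sizeof(int));

          /* recvcount (2) < sendcount (4): truncation error at the root */
          MPI_Gather(sendbuf, 4, MPI_INT, recvbuf, 2, MPI_INT, 0, MPI_COMM_WORLD);

          free(recvbuf);
          MPI_Finalize();
          return 0;
      }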

    Activity


            yong.fan nasf (Inactive) added a comment -

            Currently, the openmpi installed on the Toro nodes is openmpi-1.4-4.el5 and the metabench is metabench-1.0-1.wc1; there is some incompatibility between them. I do not know the detailed reason, but you can try the following:

            1) install openmpi-devel on your test node
            2) compile metabench from source code
            3) run parallel-scale with the new metabench; it will report "Invalid Arg ?"
            4) fix metabench.c to ignore the unknown parameter "?" and recompile (a sketch of this change follows below)
            5) run parallel-scale again; it should pass

            I have put the working metabench under /tmp/metabench on the Brent node, which can run on 2.6.18-194.17.1.el5. I am not sure how to fix this cleanly; maybe use MPICH or fix metabench.
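
            For illustration, a minimal sketch of the change described in step 4, assuming metabench parses its command line with getopt(); the option string and function name below are hypothetical, not copied from metabench.c:

            /* Hypothetical sketch: getopt() returns '?' for any option it
             * does not recognize (such as extra arguments injected by the
             * MPI launcher), so skip those instead of treating them as fatal. */
            #include <stdio.h>
            #include <unistd.h>

            static void parse_args(int argc, char **argv)
            {
                int c;
                opterr = 0;  /* suppress getopt's own diagnostic */
                while ((c = getopt(argc, argv, "d:n:")) != -1) {  /* illustrative options */
                    switch (c) {
                    case 'd':
                        /* handle a real metabench option here */
                        break;
                    case 'n':
                        /* handle another real option here */
                        break;
                    case '?':
                        /* was: report "Invalid Arg ?" and abort; now: ignore it */
                        fprintf(stderr, "ignoring unknown option '-%c'\n", optopt);
                        break;
                    }
                }
            }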

            mjmac Michael MacDonald (Inactive) added a comment -

            fanyong, can you please provide more detail as to what the problem was, and how you fixed it? I would like to update the toolkit build so that future test installs work correctly.
            pjones Peter Jones added a comment -

            Thanks Fan Yong. Let's reassign this ticket to mjmac to sort out the Toro config.

            yong.fan nasf (Inactive) added a comment - edited

            After some painful debugging (I am not familiar with MPI), I eventually found a useful clue: the failure is caused by an incompatibility between the openmpi and metabench packages installed on the Toro nodes. With openmpi-devel installed, I compiled metabench from source code (with a tiny fix to the metabench code, because the MPI library passes an unknown parameter at startup; I am not sure why), and the metabench test then ran successfully. So this is not related to Lustre at all. We need a new MPI library or metabench when deploying test nodes on Toro, but that is out of my control.

            Thanks to yujian for helping to build the test environment.

            yong.fan nasf (Inactive) added a comment -

            I will investigate it.
            pjones Peter Jones added a comment -

            Fan Yong

            Are you able to look into this one or should I reassign it?

            Peter

            sarah Sarah Liu added a comment -

            Closed by mistake; reopening. This is not a duplicate of LU-161/LU-142.
            sarah Sarah Liu added a comment -

            It seems this is a duplicate of LU-142, so closing it.
            pjones Peter Jones added a comment -

            Apparently Fan Yong is working on this one.
            pjones Peter Jones added a comment -

            Ah, sorry Yu Jian. This was the ticket that I meant to assign to you when I assigned LU-158 to you. Could you please see what you can uncover about the failure? Thanks.

    People

      Assignee: mjmac Michael MacDonald (Inactive)
      Reporter: sarah Sarah Liu
      Votes: 0
      Watchers: 6
