[LU-157] metabench failed on parallel-scale test Created: 24/Mar/11 Updated: 20/May/11 Resolved: 20/May/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0, Lustre 1.8.6 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Michael MacDonald (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | separated MDS and OSS, 3 clients |
| Severity: | 3 |
| Rank (Obsolete): | 10083 |
| Description |
|
The metabench test failed on the Lustre client and can be reproduced. Test log: |
| Comments |
| Comment by Peter Jones [ 25/Mar/11 ] |
|
Ah, sorry Yu Jian. This was the ticket that I meant to assign to you when I assigned LU-158 to you. Could you please see what you can uncover about the failure? Thanks. |
| Comment by Peter Jones [ 25/Mar/11 ] |
|
Apparently Fan Yong is working on this one |
| Comment by Sarah Liu [ 25/Mar/11 ] |
|
It seems this is a duplicate issue with |
| Comment by Sarah Liu [ 25/Mar/11 ] |
|
Closed by mistake; reopening. This is not a duplicate of 161/142. |
| Comment by Peter Jones [ 26/Mar/11 ] |
|
Fan Yong, are you able to look into this one, or should I reassign it? Peter |
| Comment by nasf (Inactive) [ 27/Mar/11 ] |
|
I will investigate it. |
| Comment by nasf (Inactive) [ 29/Mar/11 ] |
|
After some painful debugging (I am not familiar with MPI), I eventually found a useful clue: the failure is caused by an incompatibility between the openmpi and metabench packages installed on the Toro nodes. With openmpi-devel installed, I compiled metabench from source (with a tiny fix to the metabench code because of an unknown parameter from the MPI library at startup; I am not sure why), and the metabench test then ran successfully. So this is not related to Lustre. We need a new MPI library or a new metabench when deploying test nodes on Toro, but that is out of my control. Thanks to yujian for helping to build the test environment. |
| Comment by Peter Jones [ 29/Mar/11 ] |
|
Thanks, Fan Yong. Let's reassign this ticket to mjmac to sort out the Toro config. |
| Comment by Michael MacDonald (Inactive) [ 29/Mar/11 ] |
|
fanyong, can you please provide more detail as to what the problem was, and how you fixed it? I would like to update the toolkit build so that future test installs work correctly. |
| Comment by nasf (Inactive) [ 29/Mar/11 ] |
|
Currently, the openmpi installed on the Toro nodes is openmpi-1.4-4.el5 and the metabench is metabench-1.0-1.wc1; there is some incompatibility between them. I do not know the detailed reason, but you can try the following: install openmpi-devel on your test node, then rebuild metabench from source as described above. I have put the working metabench under /tmp/metabench on the Brent node, which can run on 2.6.18-194.17.1.el5. I am not sure how to fix this easily; maybe use "MPICH" or fix metabench. |
| Comment by Jian Yu [ 25/Apr/11 ] |
|
Branch: b1_8 (Revision: c5c2986be490b2fbceb4b38d6c983d279f4bbcf8)

# rpm -qf /usr/lib64/openmpi/bin/mpirun
# rpm -qf /usr/bin/metabench

The same failure occurred while running the metabench test:

[client-13:9663] *** An error occurred in MPI_Gather
[client-13:9663] *** on communicator MPI COMMUNICATOR 10 CREATE FROM 0
[client-13:9663] *** MPI_ERR_TRUNCATE: message truncated
[client-13:9663] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9663 on node client-13 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Maloo report: https://maloo.whamcloud.com/test_sets/4673d0c8-6cb3-11e0-b32b-52540025f9af |
| Comment by Michael MacDonald (Inactive) [ 13/May/11 ] |
|
The real problem is that the receive buffer was defined as MPI_INT while the send buffer was defined as MPI_UNSIGNED_LONG. When compiled with gcc on x86_64, longs (8 bytes) don't fit into ints (4 bytes), hence the MPI_ERR_TRUNCATE error. I've committed a small patch which corrects this, and I'm waiting for RPMs to build across all platforms. I've already verified this on EL6/x86_64; please resolve the ticket when the other platforms are verified in the normal course of testing. I'm confident that this issue is fixed, though, as it was a simple problem with a simple solution, once I understood the problem!

For reference, here is an excerpt from the patch:

  MPI_SAFE(MPI_Gather(&count,1,MPI_UNSIGNED_LONG,
-                     count_buf,1,MPI_INT,proc0,*my_comm));
+                     count_buf,1,MPI_UNSIGNED_LONG,proc0,*my_comm));

I've applied this fix to all MPI_Gather instances with mismatched send/receive datatypes.
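
For illustration, here is a minimal standalone C sketch of the corrected call pattern. It is not the metabench source; the names count and count_buf simply mirror the excerpt above. It gathers one unsigned long per rank onto rank 0 with matching send and receive datatypes, which is the property the patch restores.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    unsigned long count;
    unsigned long *count_buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Per-rank value to gather (placeholder; names mirror the patch). */
    count = (unsigned long)rank;

    if (rank == 0)
        count_buf = malloc(size * sizeof(unsigned long));

    /* Send and receive datatypes must match: both MPI_UNSIGNED_LONG.
     * Mixing MPI_UNSIGNED_LONG (8 bytes on x86_64) with MPI_INT (4 bytes)
     * is what produced the MPI_ERR_TRUNCATE failure reported above. */
    MPI_Gather(&count, 1, MPI_UNSIGNED_LONG,
               count_buf, 1, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int i;
        for (i = 0; i < size; i++)
            printf("rank %d: count = %lu\n", i, count_buf[i]);
        free(count_buf);
    }

    MPI_Finalize();
    return 0;
}

This can be built and run with, e.g., "mpicc gather_test.c -o gather_test" followed by "mpirun -np 4 ./gather_test" (file name hypothetical). |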
| Comment by Sarah Liu [ 15/May/11 ] |
|
Hi Mike, I ran cascading_rw on build lustre-master/rhel6-x86_64/#118; it has been running for almost two days and hasn't finished yet. Does this build contain the latest openmpi? |
| Comment by Sarah Liu [ 18/May/11 ] |
|
I verified this bug on the latest lustre-master for RHEL5/x86_64: metabench passes on NFS3 but fails on NFS4. I think that failure is not related to MPI, so I will open a new ticket to track it. Here are both results: |
| Comment by Michael MacDonald (Inactive) [ 20/May/11 ] |
|
I'm going to resolve this, as the original issue with bad code in metabench has been fixed. Please open new tickets for the other problems (e.g. cascading_rw). |