[LU-157] metabench failed on parallel-scale test Created: 24/Mar/11  Updated: 20/May/11  Resolved: 20/May/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
Fix Version/s: Lustre 2.1.0, Lustre 1.8.6

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Michael MacDonald (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

separated MDS and OSS, 3 clients


Severity: 3
Rank (Obsolete): 10083

 Description   

The metabench test failed on the Lustre client; it can be reproduced.

test log
-----------
[03/23/2011 23:15:53] Leaving time_file_creation with proc_id = 11
[03/23/2011 23:15:53] Entering par_create_multidir to create 910 files in 1 dirs
Removed 10000 files in 8.325 seconds
[client-5.lab.whamcloud.com:6909] *** An error occurred in MPI_Gather
[client-5.lab.whamcloud.com:6909] *** on communicator MPI COMMUNICATOR 14 CREATE FROM 0
[client-5.lab.whamcloud.com:6909] *** MPI_ERR_TRUNCATE: message truncated
[client-5.lab.whamcloud.com:6909] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6909 on
node client-5.lab.whamcloud.com exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------



 Comments   
Comment by Peter Jones [ 25/Mar/11 ]

Ah, sorry Yu Jian. This was the ticket that I meant to assign to you when I assigned LU-158 to you. Could you please see what you can uncover about the failure? Thanks.

Comment by Peter Jones [ 25/Mar/11 ]

Apparently Fan Yong is working on this one

Comment by Sarah Liu [ 25/Mar/11 ]

This seems to be a duplicate of LU-142, so closing it.

Comment by Sarah Liu [ 25/Mar/11 ]

Closed by mistake, reopening. This is not a duplicate of LU-161/LU-142.

Comment by Peter Jones [ 26/Mar/11 ]

Fan Yong

Are you able to look into this one or should I reassign it?

Peter

Comment by nasf (Inactive) [ 27/Mar/11 ]

I will investigate it.

Comment by nasf (Inactive) [ 29/Mar/11 ]

After some painful debugging (I am not familiar with MPI), I eventually found a useful clue. The failure is caused by an incompatibility between the openmpi and metabench packages installed on the Toro nodes. With openmpi-devel installed, I compiled metabench from source (with a tiny fix to the metabench code because of an unknown parameter passed in by the MPI library at startup; not sure why), and then the metabench test ran successfully. So this is not related to Lustre. We need a new MPI library or a new metabench when deploying test nodes on Toro, but that is out of my control.

Thanks to yujian for helping to build the test environment.

Comment by Peter Jones [ 29/Mar/11 ]

thanks Fan Yong. Let's reassign this ticket to mjmac to sort out the Toro config

Comment by Michael MacDonald (Inactive) [ 29/Mar/11 ]

fanyong, can you please provide more detail as to what the problem was, and how you fixed it? I would like to update the toolkit build so that future test installs work correctly.

Comment by nasf (Inactive) [ 29/Mar/11 ]

Currently, the openmpi installed on the Toro nodes is openmpi-1.4-4.el5 and the metabench is metabench-1.0-1.wc1, and there is some incompatibility between them. I do not know the detailed reason, but you can try the following:

1) install openmpi-devel on your test node
2) compile metabench from source code
3) run parallel-scale with the new metabench; it will report "Invalid Arg ?"
4) fix metabench.c to ignore the unknown parameter "?" and recompile (a sketch of this kind of change is below)
5) run parallel-scale again; it should pass

I have put the working metabench under /tmp/metabench on the Brent node; it runs on 2.6.18-194.17.1.el5. I am not sure how to fix this cleanly; maybe use MPICH or fix metabench.
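For reference, here is a hypothetical illustration of the kind of change described in step 4, assuming metabench parses its command line with getopt() (the option letters and function names are illustrative, not the actual metabench source). The idea is simply to stop treating getopt()'s '?' return, produced for options injected by the MPI launcher, as a fatal "Invalid Arg" error:

/* Hypothetical sketch, not the actual metabench patch: ignore unknown
 * options instead of aborting with "Invalid Arg ?". */
#include <stdio.h>
#include <unistd.h>

static void parse_options(int argc, char **argv)
{
    int c;

    /* Option letters here are placeholders for metabench's real options. */
    while ((c = getopt(argc, argv, "d:n:")) != -1) {
        switch (c) {
        case 'd':
            printf("target directory: %s\n", optarg);
            break;
        case 'n':
            printf("file count: %s\n", optarg);
            break;
        case '?':
            /* Unknown option injected by the MPI runtime: skip it
             * rather than treating it as a fatal error. */
            break;
        }
    }
}

int main(int argc, char **argv)
{
    parse_options(argc, argv);
    return 0;
}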

Comment by Jian Yu [ 25/Apr/11 ]

Branch: b1_8 (Revision: c5c2986be490b2fbceb4b38d6c983d279f4bbcf8)
Distro/Arch: RHEL6/x86_64 (patchless client), RHEL5/x86_64 (server)
Network: tcp

# rpm -qf /usr/lib64/openmpi/bin/mpirun
openmpi-1.4.1-4.3.el6.x86_64

# rpm -qf /usr/bin/metabench
metabench-1.0-1.wc1.x86_64

The same failure occurred while running metabench test:

[client-13:9663] *** An error occurred in MPI_Gather
[client-13:9663] *** on communicator MPI COMMUNICATOR 10 CREATE FROM 0
[client-13:9663] *** MPI_ERR_TRUNCATE: message truncated
[client-13:9663] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9663 on
node client-13 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Maloo report: https://maloo.whamcloud.com/test_sets/4673d0c8-6cb3-11e0-b32b-52540025f9af

Comment by Michael MacDonald (Inactive) [ 13/May/11 ]

The real problem is that the receive buffer was defined as MPI_INT but the send buffer was defined as MPI_UNSIGNED_LONG. When compiled with gcc on x86_64, longs (8 bytes) don't fit into ints (4 bytes), hence the MPI_ERR_TRUNCATE error.

I've committed a small patch which corrects this, and I'm waiting for RPMs to build across all platforms. I've already verified this on EL6/x86_64; please resolve the ticket when other platforms are verified in the normal course of testing. I'm confident that this issue is fixed, though, as it was a simple problem with a simple solution, once I understood the problem!

For reference, here is an excerpt from the patch:

     MPI_SAFE(MPI_Gather(&count,1,MPI_UNSIGNED_LONG,
-           count_buf,1,MPI_INT,proc0,*my_comm));
+           count_buf,1,MPI_UNSIGNED_LONG,proc0,*my_comm));

I've applied this fix to all MPI_Gather instances with mismatched send/receive datatypes.
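For anyone hitting the same class of error elsewhere, below is a minimal standalone sketch (not metabench source; the count/count_buf names merely mirror the excerpt above) of the corrected pattern, with both the send and receive sides of MPI_Gather using MPI_UNSIGNED_LONG. On x86_64 an unsigned long is 8 bytes while an int is 4, so a mismatched receive type makes Open MPI raise MPI_ERR_TRUNCATE exactly as in the logs above:

/* Minimal sketch: gather one unsigned long from every rank to rank 0.
 * The send and receive datatypes must describe the same amount of data
 * per rank; using MPI_INT on the receive side would truncate on x86_64. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    unsigned long count = (unsigned long)rank * 1000;   /* per-rank value */
    unsigned long *count_buf = NULL;

    if (rank == 0)
        count_buf = malloc(nprocs * sizeof(*count_buf));

    /* Correct call: matching datatypes on both sides. */
    MPI_Gather(&count, 1, MPI_UNSIGNED_LONG,
               count_buf, 1, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < nprocs; i++)
            printf("rank %d reported count %lu\n", i, count_buf[i]);
        free(count_buf);
    }

    MPI_Finalize();
    return 0;
}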

Comment by Sarah Liu [ 15/May/11 ]

Hi Mike, I ran cascading_rw on build lustre-master/rhel6-x86_64/#118; it has been running for almost two days and hasn't finished yet. Does this build contain the latest openmpi?

Comment by Sarah Liu [ 18/May/11 ]

I verified this bug on the latest lustre-master for RHEL5/x86_64: metabench passes on NFS3 but fails on NFS4. I think the failure is not related to MPI, so I opened a new ticket to track it: LU-344.

Here are both results:
https://maloo.whamcloud.com/test_sets/0df23b78-81de-11e0-b4df-52540025f9af
https://maloo.whamcloud.com/test_sets/3c243272-81e2-11e0-b4df-52540025f9af

Comment by Michael MacDonald (Inactive) [ 20/May/11 ]

I'm going to resolve this, as the original issue with bad code in metabench has been fixed. Please open new tickets for the other problems (e.g. cascading_rw).
