[LU-157] metabench failed on parallel-scale test

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Fix Version/s: Lustre 2.1.0, Lustre 1.8.6
    • Labels: None
    • Environment: separated MDS and OSS, 3 clients
    • Severity: 3
    • Rank (Obsolete): 10083

    Description

      The metabench test failed on the Lustre client; the failure is reproducible.

      test log
      -----------
      [03/23/2011 23:15:53] Leaving time_file_creation with proc_id = 11
      [03/23/2011 23:15:53] Entering par_create_multidir to create 910 files in 1 dirs
      Removed 10000 files in 8.325 seconds
      [client-5.lab.whamcloud.com:6909] *** An error occurred in MPI_Gather
      [client-5.lab.whamcloud.com:6909] *** on communicator MPI COMMUNICATOR 14 CREATE FROM 0
      [client-5.lab.whamcloud.com:6909] *** MPI_ERR_TRUNCATE: message truncated
      [client-5.lab.whamcloud.com:6909] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
      --------------------------------------------------------------------------
      mpirun has exited due to process rank 0 with PID 6909 on
      node client-5.lab.whamcloud.com exiting without calling "finalize". This may
      have caused other processes in the application to be
      terminated by signals sent by mpirun (as reported here).
      --------------------------------------------------------------------------

          Activity


            mjmac Michael MacDonald (Inactive) added a comment -

            I'm going to resolve this, as the original issue with bad code in metabench has been fixed. Please open new tickets for the other problems (e.g. cascading_rw).
            sarah Sarah Liu added a comment -

            I verified this bug on the latest lustre-master build for RHEL5/x86_64; metabench passes on NFSv3 but fails on NFSv4. I think the failure is not related to MPI, so I opened a new ticket for tracking, LU-344.

            Here are both results:
            https://maloo.whamcloud.com/test_sets/0df23b78-81de-11e0-b4df-52540025f9af
            https://maloo.whamcloud.com/test_sets/3c243272-81e2-11e0-b4df-52540025f9af

            sarah Sarah Liu added a comment - - edited

            Hi Mike, I ran cascading_rw on build lustre-master/rhel6-x86_64/#118; it has been running for almost two days and hasn't finished yet. Does this build contain the latest openmpi?


            mjmac Michael MacDonald (Inactive) added a comment -

            The real problem is that the receive buffer was defined as MPI_INT but the send buffer was defined as MPI_UNSIGNED_LONG. When compiled with gcc on x86_64, longs (8 bytes) don't fit into ints (4 bytes), hence the MPI_ERR_TRUNCATE error.

            I've committed a small patch which corrects this, and I'm waiting for RPMs to build across all platforms. I've already verified this on EL6/x86_64; please resolve the ticket when other platforms are verified in the normal course of testing. I'm confident that this issue is fixed, though, as it was a simple problem with a simple solution, once I understood the problem!

            For reference, here is an excerpt from the patch:

                 MPI_SAFE(MPI_Gather(&count,1,MPI_UNSIGNED_LONG,
            -           count_buf,1,MPI_INT,proc0,*my_comm));
            +           count_buf,1,MPI_UNSIGNED_LONG,proc0,*my_comm));
            

            I've applied this fix to all MPI_Gather instances with mismatched send/receive datatypes.
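
            For context, here is a minimal standalone sketch of the same mismatch and its fix. It is not taken from the metabench source; the variable names and values below are illustrative (compile with mpicc, run under mpirun):

                /* Each rank sends an unsigned long (8 bytes on x86_64) and rank 0
                 * gathers them. If the receive datatype were MPI_INT (4 bytes),
                 * as in the old metabench code, Open MPI would abort with
                 * MPI_ERR_TRUNCATE. Matching the datatypes avoids that. */
                #include <mpi.h>
                #include <stdio.h>
                #include <stdlib.h>

                int main(int argc, char **argv)
                {
                    int rank, size, i;
                    unsigned long count;
                    unsigned long *count_buf = NULL;

                    MPI_Init(&argc, &argv);
                    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                    MPI_Comm_size(MPI_COMM_WORLD, &size);

                    count = 10000UL * (rank + 1);   /* per-rank value to gather */
                    if (rank == 0)
                        count_buf = malloc(size * sizeof(unsigned long));

                    /* Broken form (old metabench code):
                     *   MPI_Gather(&count, 1, MPI_UNSIGNED_LONG,
                     *              count_buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
                     * Fixed form: send and receive datatypes match. */
                    MPI_Gather(&count, 1, MPI_UNSIGNED_LONG,
                               count_buf, 1, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

                    if (rank == 0) {
                        for (i = 0; i < size; i++)
                            printf("rank %d reported count %lu\n", i, count_buf[i]);
                        free(count_buf);
                    }
                    MPI_Finalize();
                    return 0;
                }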

            yujian Jian Yu added a comment -

            Branch: b1_8 (Revision: c5c2986be490b2fbceb4b38d6c983d279f4bbcf8)
            Distro/Arch: RHEL6/x86_64 (patchless client), RHEL5/x86_64 (server)
            Network: tcp

            # rpm -qf /usr/lib64/openmpi/bin/mpirun
            openmpi-1.4.1-4.3.el6.x86_64

            # rpm -qf /usr/bin/metabench
            metabench-1.0-1.wc1.x86_64

            The same failure occurred while running the metabench test:

            [client-13:9663] *** An error occurred in MPI_Gather
            [client-13:9663] *** on communicator MPI COMMUNICATOR 10 CREATE FROM 0
            [client-13:9663] *** MPI_ERR_TRUNCATE: message truncated
            [client-13:9663] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
            --------------------------------------------------------------------------
            mpirun has exited due to process rank 0 with PID 9663 on
            node client-13 exiting without calling "finalize". This may
            have caused other processes in the application to be
            terminated by signals sent by mpirun (as reported here).
            --------------------------------------------------------------------------
            

            Maloo report: https://maloo.whamcloud.com/test_sets/4673d0c8-6cb3-11e0-b32b-52540025f9af


            yong.fan nasf (Inactive) added a comment -

            Currently, the openmpi installed on the Toro nodes is openmpi-1.4-4.el5 and the metabench is metabench-1.0-1.wc1; there is some incompatibility between them. I do not know the detailed reason, but you can try the following:

            1) Install openmpi-devel on your test node.
            2) Compile metabench from source code.
            3) Run parallel-scale with the new metabench; it will report "Invalid Arg ?".
            4) Fix metabench.c to ignore the unknown parameter "?" and recompile (see the sketch below).
            5) Run parallel-scale again; it should pass.

            I have put the workable metabench under /tmp/metabench on the Brent node, which can run on 2.6.18-194.17.1.el5. I am not sure how to fix this easily; maybe use MPICH or fix metabench.
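
            To illustrate step 4, here is a minimal, self-contained sketch of an option parser that warns about and skips unrecognized options such as "?" instead of aborting. It is not the actual metabench.c; the "-n" option is a placeholder:

                #include <stdio.h>
                #include <stdlib.h>
                #include <unistd.h>

                int main(int argc, char **argv)
                {
                    int opt;
                    long nfiles = 0;

                    opterr = 0;   /* suppress getopt's own error message */
                    while ((opt = getopt(argc, argv, "n:")) != -1) {
                        switch (opt) {
                        case 'n':   /* placeholder option: number of files */
                            nfiles = strtol(optarg, NULL, 10);
                            break;
                        case '?':   /* unknown option: warn and keep going */
                            fprintf(stderr, "ignoring unknown option '-%c'\n", optopt);
                            break;
                        }
                    }

                    printf("would create %ld files\n", nfiles);
                    return 0;
                }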


            People

              Assignee: mjmac Michael MacDonald (Inactive)
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 6
