Details
- Type: Bug
- Resolution: Duplicate
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.1.0, Lustre 2.4.0, Lustre 1.8.7, Lustre 2.5.0
- Labels: None
- Environment:
Lustre Clients:
Tag: 1.8.6-wc1
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32_131.2.1.el6)
Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/
Network: TCP
ENABLE_QUOTA=yes
Lustre Servers:
Tag: v2_1_0_0_RC2
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.6.1.el6_lustre.g65156ed.x86_64)
Build: http://newbuild.whamcloud.com/job/lustre-master/228/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/
Network: TCP
- Severity: 3
- Rank (Obsolete): 5514
Description
During v2_1_0_0_RC2 testing, write_disjoint aborted with MPI_ABORT for an unknown reason. There are no console or syslog logs in the report at all (Maloo bug?).
Report: https://maloo.whamcloud.com/test_sets/44dc4934-e440-11e0-9909-52540025f9af
== parallel-scale test write_disjoint: write_disjoint == 14:43:05 (1316554985)
OPTIONS:
WRITE_DISJOINT=/usr/lib64/lustre/tests/write_disjoint
clients=fat-intel-1vm1,fat-intel-1vm2
wdisjoint_THREADS=4
wdisjoint_REP=10000
MACHINEFILE=/tmp/parallel-scale.machines
fat-intel-1vm1
fat-intel-1vm2
+ /usr/lib64/lustre/tests/write_disjoint -f /mnt/lustre/d0.write_disjoint/file -n 10000
UUID Inodes IUsed IFree IUse% Mounted on
lustre-MDT0000_UUID 5000040 86 4999954 0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 167552 10974 156578 7% /mnt/lustre[OST:0]
lustre-OST0001_UUID 167552 11326 156226 7% /mnt/lustre[OST:1]
lustre-OST0002_UUID 167552 3807 163745 2% /mnt/lustre[OST:2]
lustre-OST0003_UUID 167552 4830 162722 3% /mnt/lustre[OST:3]
lustre-OST0004_UUID 167552 3806 163746 2% /mnt/lustre[OST:4]
lustre-OST0005_UUID 167552 3646 163906 2% /mnt/lustre[OST:5]
lustre-OST0006_UUID 167552 3806 163746 2% /mnt/lustre[OST:6]
filesystem summary: 5000040 86 4999954 0% /mnt/lustre
+ chmod 0777 /mnt/lustre
drwxrwxrwx 7 root root 4096 Sep 20 14:43 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh -mca btl tcp,self -np 8 -machinefile /tmp/parallel-scale.machines /usr/lib64/lustre/tests/write_disjoint -f /mnt/lustre/d0.write_disjoint/file -n 10000 "
loop 0: chunk_size 103399
loop 1000: chunk_size 69125
loop 2000: chunk_size 104360
loop 3000: chunk_size 11295
loop 4000: chunk_size 77918
loop 5000: chunk_size 27295
loop 6000: chunk_size 42065
loop 7000: chunk_size 82749
loop 8000: chunk_size 94370
loop 9000: chunk_size 107226
loop 9371: chunk_size 25301, file size was 202408
rank 5, loop 9372: invalid file size 801136 instead of 915584 = 114448 * 8
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 5 with PID 30944 on
node fat-intel-1vm2 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[fat-intel-1vm2.lab.whamcloud.com][[61908,1],7][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] [fat-intel-1vm1.lab.whamcloud.com][[61908,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
UUID Inodes IUsed IFree IUse% Mounted on
lustre-MDT0000_UUID 5000040 87 4999953 0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 167552 10974 156578 7% /mnt/lustre[OST:0]
lustre-OST0001_UUID 167552 11326 156226 7% /mnt/lustre[OST:1]
lustre-OST0002_UUID 167552 3806 163746 2% /mnt/lustre[OST:2]
lustre-OST0003_UUID 167552 4830 162722 3% /mnt/lustre[OST:3]
lustre-OST0004_UUID 167552 3806 163746 2% /mnt/lustre[OST:4]
lustre-OST0005_UUID 167552 3646 163906 2% /mnt/lustre[OST:5]
lustre-OST0006_UUID 167552 3806 163746 2% /mnt/lustre[OST:6]
filesystem summary: 5000040 87 4999953 0% /mnt/lustre
parallel-scale test_write_disjoint: @@@@@@ FAIL: write_disjoint failed! 1
Dumping lctl log to /logdir/test_logs/2011-09-19/lustre-mixed-el6-x86_64_283_-7f6a2ad2c9e0/parallel-scale.test_write_disjoint.*.1316557553.log
Resetting fail_loc on all nodes...done.
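For reference, write_disjoint has each MPI rank repeatedly write a chunk of a randomly chosen size at its own disjoint offset in a single shared file, then verify that the resulting file size equals chunk_size * nranks; the "invalid file size 801136 instead of 915584 = 114448 * 8" line above is that verification failing on rank 5. The following is a minimal sketch of that style of check only, not the actual lustre/tests/write_disjoint.c source; the file path and chunk size are taken from the log above purely for illustration.

/* size_check.c - minimal sketch of a write_disjoint-style consistency check.
 * NOT the real lustre/tests/write_disjoint.c; illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <mpi.h>

int main(int argc, char **argv)
{
        /* path from the test log above; the directory is assumed to exist */
        const char *fname = "/mnt/lustre/d0.write_disjoint/file";
        off_t chunk_size = 114448;      /* per-rank chunk, as in the failing loop */
        int rank, nproc, fd;
        struct stat st;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        buf = malloc(chunk_size);
        memset(buf, 'A' + rank, chunk_size);

        /* rank 0 truncates the shared file before the round starts */
        if (rank == 0) {
                fd = open(fname, O_CREAT | O_TRUNC | O_WRONLY, 0666);
                close(fd);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        /* every rank writes its own chunk at a disjoint offset */
        fd = open(fname, O_WRONLY);
        pwrite(fd, buf, chunk_size, (off_t)rank * chunk_size);
        close(fd);
        MPI_Barrier(MPI_COMM_WORLD);

        /* after the barrier the file must be exactly nproc * chunk_size;
         * this is the check that reported "invalid file size ..." above */
        stat(fname, &st);
        if (st.st_size != (off_t)nproc * chunk_size) {
                fprintf(stderr, "rank %d: invalid file size %lld instead of %lld = %lld * %d\n",
                        rank, (long long)st.st_size,
                        (long long)((off_t)nproc * chunk_size),
                        (long long)chunk_size, nproc);
                MPI_Abort(MPI_COMM_WORLD, -1);
        }

        free(buf);
        MPI_Finalize();
        return 0;
}

Under the same Open MPI setup as in the log, something like "mpirun -np 8 -machinefile /tmp/parallel-scale.machines ./size_check" would exercise this check (the size_check binary name is hypothetical).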
Attachments
Issue Links
- duplicates
  - LU-3027 Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8 (Resolved)
- Trackbacks
  - Lustre 2.1.0 release testing tracker: Lustre 2.1.0 RC2, Tag: v2100RC2, Build: