[LU-2453] parallel-scale test_write_disjoint: invalid file size 723793 instead of 827192 = 103399 * 8 Created: 10/Dec/12  Updated: 06/May/13  Resolved: 06/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4
Fix Version/s: Lustre 2.1.4

Type: Bug
Priority: Blocker
Reporter: Jian Yu
Assignee: Lai Siyao
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/148
Distro/Arch: RHEL5.8/x86_64 (kernel version: 2.6.18-308.20.1.el5)
Network: TCP (1GigE)


Issue Links:
Duplicate
is duplicated by LU-2452 parallel-scale test_write_append_trun... Closed
Severity: 3
Bugzilla ID: 2,304
Rank (Obsolete): 5793

Description

The parallel-scale test write_disjoint failed as follows:

== parallel-scale test write_disjoint: write_disjoint ================================================ 14:32:34 (1355005954)
OPTIONS:
WRITE_DISJOINT=/usr/lib64/lustre/tests/write_disjoint
clients=fat-intel-3vm5,fat-intel-3vm6.lab.whamcloud.com 
wdisjoint_THREADS=4
wdisjoint_REP=10000
MACHINEFILE=/tmp/parallel-scale.machines
fat-intel-3vm5
fat-intel-3vm6.lab.whamcloud.com
+ /usr/lib64/lustre/tests/write_disjoint -f /mnt/lustre/d0.write_disjoint/file -n 10000
+ chmod 0777 /mnt/lustre
drwxrwxrwx 5 root root 4096 Dec  8 14:32 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/openmpi/1.4-gcc/bin/mpirun -mca boot ssh -np 8 -machinefile /tmp/parallel-scale.machines /usr/lib64/lustre/tests/write_disjoint -f /mnt/lustre/d0.write_disjoint/file -n 10000 "
--------------------------------------------------------------------------
[[22376,1],3]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: fat-intel-3vm6.lab.whamcloud.com

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
loop 0: chunk_size 103399
rank 3, loop 0: invalid file size 723793 instead of 827192 = 103399 * 8
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD 
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
rank 4, loop 0: invalid file size 723793 instead of 827192 = 103399 * 8
rank 7, loop 0: invalid file size 723793 instead of 827192 = 103399 * 8
--------------------------------------------------------------------------

Maloo report: https://maloo.whamcloud.com/test_sets/bfc081dc-41bf-11e2-a653-52540035b04c
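
A quick arithmetic check on the numbers in the log above, as a minimal sketch rather than part of the test suite. The values are copied from the failing log line and the mpirun command (-np 8). The check shows the reported size is an exact multiple of chunk_size and falls short of the expected size by exactly one chunk. Reading this as "one rank's region was not reflected in the file size when it was checked" is an assumption; the log alone does not confirm it.

```python
# Values copied from the log; variable names are illustrative only.
chunk_size = 103399      # "loop 0: chunk_size 103399"
nprocs = 8               # "mpirun ... -np 8"
observed_size = 723793   # "invalid file size 723793"

expected_size = chunk_size * nprocs
shortfall = expected_size - observed_size
whole_chunks, remainder = divmod(observed_size, chunk_size)

print(f"expected size: {expected_size}")
print(f"shortfall:     {shortfall}")
print(f"whole chunks:  {whole_chunks}, remainder: {remainder}")
```

A remainder of zero with seven whole chunks suggests the size mismatch is aligned to chunk boundaries rather than being a partial write within a chunk.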



Comments
Comment by Peter Jones [ 11/Dec/12 ]

Lai is looking into this one

Comment by Lai Siyao [ 11/Dec/12 ]

Strange, I can't reproduce it in my setup (RHEL 6). I'll test on RHEL 5 tomorrow.

Comment by Lai Siyao [ 12/Dec/12 ]

I reproduced it on RHEL 5. The change http://review.whamcloud.com/#change,4482 for LU-2170 looks to be the culprit: after I revert that commit, the test passes. Jinshan, could you take a look?

Comment by Jinshan Xiong (Inactive) [ 12/Dec/12 ]

I applied the LU-2304 patch to the b2_1 branch and no longer see this issue.

Comment by Jian Yu [ 12/Dec/12 ]

Please backport the LU-2304 patch to the b2_1 branch. Thanks.

Comment by Peter Jones [ 12/Dec/12 ]

Yujian

The port seems to be here: http://review.whamcloud.com/#change,4818

Peter

Comment by Lai Siyao [ 06/May/13 ]

Landed.
