[LU-1360] Test failure on test suite parallel-scale-nfsv3, subtest test_metabench Created: 02/May/12  Updated: 14/Aug/16  Resolved: 14/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 2.1.5, Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Bob Glossman (Inactive)
Resolution: Won't Fix Votes: 0
Labels: mq213

Attachments: File 1360.tar.gz    
Severity: 3
Rank (Obsolete): 4036

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b019eb0a-929d-11e1-9e8b-525400d2bfa6.

The sub-test test_metabench failed with the following error:

metabench failed! 1

== parallel-scale-nfsv3 test metabench: metabench ==================================================== 18:11:14 (1335748274)
OPTIONS:
METABENCH=/usr/bin/metabench
clients=iu-3vm1.lab.whamcloud.com,iu-3vm2
mbench_NFILES=30400
mbench_THREADS=4
iu-3vm1.lab.whamcloud.com
iu-3vm2
+ /usr/bin/metabench -w /mnt/lustre/d0.metabench -c 30400 -C -S -k
+ chmod 0777 /mnt/lustre
drwxrwxrwx 4 root root 4096 Apr 29 18:11 /mnt/lustre
+ su mpiuser sh -c "/usr/lib/openmpi/1.4-gcc/bin/mpirun -mca boot ssh -mca btl tcp,self -np 8 -machinefile /tmp/parallel-scale-nfsv3.machines /usr/bin/metabench -w /mnt/lustre/d0.metabench -c 30400 -C -S -k "
Metadata Test <no-name> on 04/29/2012 at 18:11:19

Rank 0 process on node iu-3vm1.lab.whamcloud.com
Rank 1 process on node iu-3vm2.lab.whamcloud.com
Rank 2 process on node iu-3vm1.lab.whamcloud.com
Rank 3 process on node iu-3vm2.lab.whamcloud.com
Rank 4 process on node iu-3vm1.lab.whamcloud.com
Rank 5 process on node iu-3vm2.lab.whamcloud.com
Rank 6 process on node iu-3vm1.lab.whamcloud.com
Rank 7 process on node iu-3vm2.lab.whamcloud.com

[04/29/2012 18:11:19] FATAL error on process 0
Proc 0: Cant stat [d0.metabench]: Value too large for defined data type
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 2161 on
node iu-3vm1.lab.whamcloud.com exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[iu-3vm2.lab.whamcloud.com][[3254,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] [iu-3vm1.lab.whamcloud.com][[3254,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[iu-3vm1.lab.whamcloud.com][[3254,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
parallel-scale-nfsv3 test_metabench: @@@@@@ FAIL: metabench failed! 1
Dumping lctl log to /logdir/test_logs/2012-04-28/lustre-b2_1-el5-x86_64-el5-i686_51_-7ff324267018/parallel-scale-nfsv3.test_metabench.*.1335748279.log



 Comments   
Comment by Peter Jones [ 07/May/12 ]

Assign to Bob

Comment by Sarah Liu [ 08/May/12 ]

Attached debug and dmesg logs from both server and client.

Comment by Jian Yu [ 24/May/12 ]

This is a regression on Lustre b2_1 branch.

Another instance:
https://maloo.whamcloud.com/test_sets/2496ae14-a4b9-11e1-adce-52540035b04c

Comment by Jian Yu [ 31/May/12 ]

Lustre Tag: v2_1_2_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/86/
Distro/Arch: RHEL6.2/x86_64(server), RHEL6.2/i686(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes

The same failure occurred: https://maloo.whamcloud.com/test_sets/f13b8f5a-aac9-11e1-bd84-52540035b04c

Comment by Jian Yu [ 21/Aug/12 ]

The same issue occurred on Lustre 2.1.3 RC2:
https://maloo.whamcloud.com/test_sets/a5d75d36-eb34-11e1-ba73-52540035b04c

Comment by Jian Yu [ 20/Dec/12 ]

The same issue occurred on Lustre 2.1.4 RC1:
https://maloo.whamcloud.com/test_sets/9e85f1a2-4ad7-11e2-b87e-52540035b04c

Comment by Jian Yu [ 24/Mar/13 ]

The same issue occurred on Lustre 2.1.5 RC1:
https://maloo.whamcloud.com/test_sets/01308e2e-93a9-11e2-89cc-52540035b04c

Comment by Jian Yu [ 26/Mar/13 ]

Another instance on Lustre 2.1.5 RC1:
https://maloo.whamcloud.com/test_sets/2d8c4fee-95bb-11e2-bc9e-52540035b04c

Comment by Jian Yu [ 27/May/13 ]

Lustre b2_1 build: http://build.whamcloud.com/job/lustre-b2_1/204
https://maloo.whamcloud.com/test_sets/776715d0-c63b-11e2-ad5d-52540035b04c

Comment by Jian Yu [ 05/Jun/13 ]

The same issue occurred on Lustre 2.1.6 RC1:
https://maloo.whamcloud.com/test_sets/bae7a9ee-cd4a-11e2-a1e0-52540035b04c

Comment by Jian Yu [ 27/Jun/13 ]

The same issue occurred on Lustre 2.1.6 RC2:
https://maloo.whamcloud.com/test_sets/217ee754-dd7b-11e2-85a3-52540035b04c
https://maloo.whamcloud.com/test_sets/d54dfb12-dd7b-11e2-85a3-52540035b04c

Comment by Andreas Dilger [ 23/Nov/13 ]

This test is still failing on a regular basis in full testing on b2_4 and master:

https://maloo.whamcloud.com/test_sets/bd013d78-543e-11e3-9029-52540035b04c
https://maloo.whamcloud.com/test_sets/5751a248-5168-11e3-8300-52540035b04c
https://maloo.whamcloud.com/test_sets/795032bc-5148-11e3-9ca9-52540035b04c

Bob, can you please at least make an initial investigation into what the problem is? The test appears to pass about half of the time, so if the failure can be isolated to a specific Lustre version or interop configuration, perhaps we can fix the problem or skip testing it.

Comment by James A Simmons [ 14/Aug/16 ]

Really old blocker for an unsupported version.

Generated at Sat Feb 10 01:15:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.