[LU-873] IOR single shared file test fails Created: 22/Nov/11  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 1.8.7
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Cliff White (Inactive)
Assignee: Zhenyu Xu
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

Hyperion/LLNL


Severity: 3
Rank (Obsolete): 10220

 Description   

IOR fails when run with more than 10 clients.

0289: ERROR in aiori-POSIX.c (line 256): transfer failed.
0289: ERROR: Input/output error
0289: ** exiting **
0289: [289] [MPI Abort by user] Aborting Program!
0289: [289:hyperion355] Abort: MPI_Abort() code: -1, rank 289, MPI Abort by user Aborting program ! at line 99 in file mpid_init.c
https://maloo.whamcloud.com/test_sessions/53fb3ae2-155a-11e1-b669-52540025f9af



 Comments   
Comment by Peter Jones [ 22/Nov/11 ]

Bobi

Could you please look into this one?

Thanks

Peter

Comment by Cliff White (Inactive) [ 23/Nov/11 ]

This may be related/identical to LU-873, filed by LLNL.

Comment by Jinshan Xiong (Inactive) [ 23/Nov/11 ]

I took a look at the log. It looks like the client hyperion360 was requesting an RW lock with local cookie 0x389db5cc182f83cd from OST0000. However, the completion RPC was dropped (the client never received it), so the lock is granted on the server while the client keeps waiting for completion. The lock on the client will then never be revoked, because a process is waiting for it. This is just my guess, because the log was truncated.

Please also note that the completion AST is not resendable. There are also several bulk transfer errors on the console; I don't know whether the network ran into problems at that time.
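
The sequence guessed at in this comment can be illustrated with a toy model. This is a minimal Python sketch, not Lustre code: the `Server`, `Client`, and `run` names are hypothetical, the message handling is greatly simplified, and the only detail taken from the log is the lock cookie. It assumes, as stated above, that the completion message is sent once and never resent.

```python
from dataclasses import dataclass, field
from enum import Enum


class LockState(Enum):
    WAITING = "waiting"   # requested, completion not yet seen
    GRANTED = "granted"


@dataclass
class Server:
    # cookie -> lock state as seen by the server (hypothetical model)
    locks: dict = field(default_factory=dict)

    def enqueue(self, cookie):
        """Grant the lock and return the completion message to send."""
        self.locks[cookie] = LockState.GRANTED
        return ("completion", cookie)


@dataclass
class Client:
    # cookie -> lock state as seen by the client (hypothetical model)
    locks: dict = field(default_factory=dict)

    def request(self, cookie):
        self.locks[cookie] = LockState.WAITING

    def on_completion(self, cookie):
        self.locks[cookie] = LockState.GRANTED


def run(drop_completion):
    """Return (server_state, client_state) after one lock request."""
    server, client = Server(), Client()
    cookie = 0x389db5cc182f83cd  # local cookie quoted from the log
    client.request(cookie)
    message = server.enqueue(cookie)
    # Assumption: the completion is sent exactly once. If it is lost,
    # nothing in this model ever delivers it again.
    if not drop_completion:
        client.on_completion(message[1])
    return server.locks[cookie], client.locks[cookie]
```

Calling `run(drop_completion=True)` leaves the two sides disagreeing: the server records the lock as granted while the client is still waiting, which is the stuck state described in the comment.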

Comment by Christopher Morrone [ 01/Dec/11 ]

LLNL's ticket is LU-874.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
