[LU-1324] expected application behavior for timed out read operations Created: 13/Apr/12  Updated: 04/Jun/12  Resolved: 04/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1, Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Ned Bass
Assignee: Zhenyu Xu
Resolution: Not a Bug
Votes: 0
Labels: None
Environment:

https://github.com/chaos/lustre
Client: Lustre 1.8 BGP
Server: 2.1.1-3chaos


Severity: 3
Rank (Obsolete): 6415

 Description   

A user application on our classified BGP system running a Lustre 1.8 client is having problems reading from 2.1 servers. We are still light on details about what exact errors the application is getting back from reads, if any. But on the client side we see reads timing out, lost connections, and EBUSY errors while reconnecting:

Request ost_read sent 675s ago to 172.18.102.48@tcp1 has timed out (limit 675s)
Connection to ls2-OST029f (at 172.18.102.48@tcp1) was lost; in progress operations using the service will wait for recovery to complete
An error occurred while communicating with 172.18.102.48@tcp1; the ost_connect operation failed with -16
(repeats several times)
Connection restored to ls2-OST029f (at 172.18.102.48@tcp1)

While on the server we get many of these corresponding events:

Lustre: ls2-OST029f: Client <uuid> reconnecting
Lustre: ls2-OST029f: Client <uuid> refused reconnection, still busy with 2 active RPCs
LustreError: ldlm_lib.c:2614:target_bulk_io()) @@@ build PUT failed: rc -107 ... rc 0/-1
Lustre: ls2-OST029f: Build IO read error with <uuid> ... client will retry: -107
Lustre: ldlm_lib.c:913:target_handle_connect()) ls2-OST-29f: connection from <uuid> ...
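For readers decoding the return codes in the logs above: the negative values are negated Linux errno numbers. The sketch below maps the two codes that appear in the logs (-16 and -107), plus -EIO, to their symbolic names. The table and the helper name decode_rc are illustrative and not part of Lustre; the numeric values are those used on Linux and may differ on other platforms.

```python
# Linux errno values for the codes relevant to this ticket.
# Assumption: Linux numbering; other platforms may assign different numbers.
LINUX_ERRNO_NAMES = {
    5: "EIO",         # I/O error
    16: "EBUSY",      # device or resource busy
    107: "ENOTCONN",  # transport endpoint is not connected
}


def decode_rc(rc: int) -> str:
    """Return the symbolic errno name for a negative kernel-style return code."""
    if rc >= 0:
        return "OK"
    # Kernel-style codes are negated errno values, so flip the sign to look up.
    return LINUX_ERRNO_NAMES.get(-rc, "UNKNOWN(%d)" % -rc)


if __name__ == "__main__":
    for rc in (-16, -107):
        print(rc, decode_rc(rc))
```

Read this way, the client-side ost_connect failure with -16 is EBUSY, which matches the server-side "refused reconnection, still busy" message, and the server-side rc -107 is ENOTCONN.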

My understanding is that all of this should be transparent to the application and no error should propagate to user space unless the client is evicted. Is this correct?



 Comments   
Comment by Peter Jones [ 15/Apr/12 ]

Bobi

Could you please comment on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 16/Apr/12 ]

Yes, and the client application's I/O will wait until the client is evicted.

Comment by Ned Bass [ 16/Apr/12 ]

Thanks. Also, could these errors result in fewer than the requested number of bytes being read (i.e. short reads)?

Comment by Zhenyu Xu [ 16/Apr/12 ]

The I/O RPC won't be returned to the client if a network issue occurs, and the client application will get -EIO if the client fails to reconnect to the OST from which it is trying to read data.
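From the application side, a defensive read loop might look like the sketch below. It is not taken from the ticket. It assumes only general POSIX semantics, under which read() may return fewer bytes than requested, and it lets any error (such as EIO) propagate as an exception. The ticket does not confirm whether Lustre returns short reads in this situation, so the loop is a precaution. The helper name read_fully is hypothetical.

```python
import os


def read_fully(fd: int, nbytes: int) -> bytes:
    """Read up to nbytes from fd, retrying after short reads until EOF.

    Any OSError raised by os.read (for example errno.EIO) is left to
    propagate so the caller can decide how to handle a failed read.
    """
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:
            # An empty result means end of file; stop rather than spin.
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A caller would typically wrap read_fully in a try/except OSError block and compare the exception's errno attribute against errno.EIO to tell a failed read apart from a result that is shorter only because the file ended.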

Comment by Zhenyu Xu [ 03/Jun/12 ]

Ned,

Any further question about this ticket?

Comment by Ned Bass [ 04/Jun/12 ]

Hi,

No further questions, feel free to close the ticket.

Thanks

Comment by Peter Jones [ 04/Jun/12 ]

Thanks Prakash.
