Details
- Type: Bug
- Resolution: Not a Bug
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.1.1, Lustre 1.8.x (1.8.0 - 1.8.5)
- Labels: None
- Severity: 3
- Rank (Obsolete): 6415
Description
A user application on our classified BGP system running a Lustre 1.8 client is having problems reading from 2.1 servers. We are still light on details about what exact errors the application is getting back from reads, if any. But on the client side we see reads timing out, lost connections, and EBUSY errors while reconnecting:
Request ost_read sent 675s ago to 172.18.102.48@tcp1 has timed out (limit 675s)
Connection to ls2-OST029f (at 172.18.102.48@tcp1) was lost; in progress operations using the service will wait for recovery to complete
An error occurred while communicating with 172.18.102.48@tcp1; the ost_connect operation failed with -16
(repeats several times)
Connection restored to ls2-OST029f (at 172.18.102.48@tcp1)
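When sifting through many client messages of this kind, it can help to pull the operation, target NID, and timeout out of each line. The following is only a rough sketch: the regular expression is inferred from the single timeout line quoted above, and the names `TIMEOUT_RE` and `parse_timeout` are made up for illustration, not part of any Lustre tool.

```python
import re

# Pattern inferred from the one client message quoted in this ticket;
# other Lustre versions may word the message differently.
TIMEOUT_RE = re.compile(
    r"Request (?P<op>\w+) sent (?P<age>\d+)s ago to (?P<nid>\S+) "
    r"has timed out \(limit (?P<limit>\d+)s\)"
)


def parse_timeout(line):
    """Return the fields of a request-timeout message, or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "op": m["op"],
        "age_s": int(m["age"]),
        "nid": m["nid"],
        "limit_s": int(m["limit"]),
    }


sample = (
    "Request ost_read sent 675s ago to 172.18.102.48@tcp1 "
    "has timed out (limit 675s)"
)
info = parse_timeout(sample)
print(info)
```

Lines that do not match, such as the "Connection restored" message, simply yield `None`, so the function can be applied to a whole log without pre-filtering.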
While on the server we get many of these corresponding events:
Lustre: ls2-OST029f: Client <uuid> reconnecting
Lustre: ls2-OST029f: Client <uuid> refused reconnection, still busy with 2 active RPCs
LustreError: ldlm_lib.c:2614:target_bulk_io()) @@@ build PUT failed: rc -107 ... rc 0/-1
Lustre: ls2-OST029f: Build IO read error with <uuid> ... client will retry: -107
Lustre: ldlm_lib.c:913:target_handle_connect()) ls2-OST029f: connection from <uuid> ...
My understanding is that all of this should be transparent to the application and no error should propagate to user space unless the client is evicted. Is this correct?
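For reference, the numeric codes in the logs above are negated Linux errno values: -16 is EBUSY (as noted for the reconnect attempts) and -107 is ENOTCONN. A small sketch for decoding them follows; the lookup table is hand-written for just these two Linux values, and `LINUX_ERRNO` and `decode_rc` are hypothetical names used only for this example.

```python
# Linux errno numbers for the two return codes seen in this ticket.
# Hand-written on purpose: errno numbering differs between platforms.
LINUX_ERRNO = {
    16: "EBUSY",      # ost_connect refused while the server is still busy
    107: "ENOTCONN",  # bulk transfer attempted on a dropped connection
}


def decode_rc(rc):
    """Map a negative kernel-style return code to its errno name."""
    return LINUX_ERRNO.get(abs(rc), "unknown errno %d" % abs(rc))


for rc in (-16, -107):
    print(rc, decode_rc(rc))
```

Read this way, the server refuses the reconnect with EBUSY because it is still handling RPCs from the previous connection, while its own bulk reply fails with ENOTCONN because that connection is already gone.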
Issue Links
- Trackbacks
Lustre 1.8.x known issues tracker
While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA