  1. Lustre
  2. LU-1324

expected application behavior for timed out read operations

Details

    • Bug
    • Resolution: Not a Bug
    • Minor
    • None
    • Lustre 2.1.1, Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • https://github.com/chaos/lustre
      Client: Lustre 1.8 BGP
      Server: 2.1.1-3chaos
    • 3
    • 6415

    Description

      A user application on our classified BGP system, running a Lustre 1.8 client, is having problems reading from 2.1 servers. We do not yet know exactly which errors, if any, the application gets back from its reads. On the client side, however, we see reads timing out, lost connections, and EBUSY errors while reconnecting:

      Request ost_read sent 675s ago to 172.18.102.48@tcp1 has timed out (limit 675s)
      Connection to ls2-OST029f (at 172.18.102.48@tcp1) was lost; in progress operations using the service will wait for recovery to complete
      An error occurred while communicating with 172.18.102.48@tcp1; the ost_connect operation failed with -16
      (repeats several times)
      Connection restored to ls2-OST029f (at 172.18.102.48@tcp1)

      Meanwhile, on the server we see many corresponding events:

      Lustre: ls2-OST029f: Client <uuid> reconnecting
      Lustre: ls2-OST029f: Client <uuid> refused reconnection, still busy with 2 active RPCs
      LustreError: ldlm_lib.c:2614:target_bulk_io()) @@@ build PUT failed: rc -107 ... rc 0/-1
      Lustre: ls2-OST029f: Build IO read error with <uuid> ... client will retry: -107
      Lustre: ldlm_lib.c:913:target_handle_connect()) ls2-OST-29f: connection from <uuid> ...
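
The negative return codes in the messages above are negated Linux errno values: -16 is EBUSY (as noted in the description) and -107 is ENOTCONN. The following is a minimal sketch for pulling those codes out of log lines when triaging. The helper name `decode_rc`, the small lookup table, and the regular expression are illustrative assumptions, not part of Lustre or any Lustre tool, and the pattern only covers the phrasings quoted in this report.

```python
import re

# Linux errno numbers for the two codes that appear in this report.
# This table is hand-written for illustration; extend it as needed.
LINUX_ERRNO = {16: "EBUSY", 107: "ENOTCONN"}

# Matches "failed with -16", "rc -107", and "retry: -107".
RC_PATTERN = re.compile(r"(?:failed with|\brc|retry:)\s+(-\d+)")


def decode_rc(line):
    """Return symbolic names for the negative return codes found in a log line."""
    names = []
    for match in RC_PATTERN.finditer(line):
        code = -int(match.group(1))  # logs show negated errno values
        names.append(LINUX_ERRNO.get(code, f"errno {code}"))
    return names


if __name__ == "__main__":
    print(decode_rc("the ost_connect operation failed with -16"))
    print(decode_rc("Build IO read error with <uuid> ... client will retry: -107"))
```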

      My understanding is that all of this should be transparent to the application and no error should propagate to user space unless the client is evicted. Is this correct?
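
Since it is not yet known what the application actually receives from its reads, one way to gather that detail is to wrap the read call and record any errno that reaches user space. The sketch below is a generic POSIX-style illustration in Python; the function name `read_with_diagnostics` is hypothetical, and nothing here asserts which errors, if any, a Lustre client returns in this situation.

```python
import errno
import os


def read_with_diagnostics(fd, nbytes):
    """Read up to nbytes from fd and return (data, error_name).

    error_name is None on success; on failure it is the symbolic
    errno name (for example "EIO") so it can be logged or reported.
    """
    try:
        return os.read(fd, nbytes), None
    except OSError as exc:
        # Map the numeric errno to its symbolic name when one is known.
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return b"", name
```

Logging the returned name alongside a timestamp would make it possible to line up application-visible failures with the client and server messages quoted above.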

            People

              bobijam Zhenyu Xu
              nedbass Ned Bass