  1. Lustre
  2. LU-816

Possible bug/deadlock in the Lustre lock algorithm/protocol may cause multiple Clients/processes to be blocked forever

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • None
    • None
    • 3
    • 8545

    Description

      Hi,

      Several Bull customers (CEA, TGCC, ...) are reporting error messages exactly as described in LU-142, except that they occur on connections between clients and OSS instead of between clients and MDS.
      These customers run Lustre 2.0.0.1 Bull, which does not include the LU-142 patch.
      Do you think this is the same problem as described in LU-142, so that we only have to include the corresponding patch in our delivery, or is it a similar problem in another part of the code that needs an additional patch?

      Here are traces collected by our on site support on a customer site:

      Users reported hung applications/jobs, mainly in Slurm's "Completing" state.
      
      Logs on affected Clients/nodes contain many messages of the type:
      "LustreError: 11-0: an error occurred while communicating with <OSS_nid>. The ost_connect operation failed with -16"
      
      To find the details of the failing connection on the Client side, we use:
      # grep current /proc/fs/lustre/osc/*/state | grep -v FULL
      -->> one OST connection will show a "CONNECTING" state.
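
The check above can be wrapped in a small helper when many Clients have to be inspected. This is only a sketch: the function name `list_non_full_osc` is hypothetical, and it assumes each `state` file contains a line of the form `current_state: <STATE>`, which may differ between Lustre versions.

```shell
#!/bin/sh
# Print "<osc_device> <state>" for every OSC whose current state is not FULL.
# Assumption: <osc_dir>/<device>/state holds a line "current_state: <STATE>".
# The directory argument is optional and defaults to the path used in the
# grep command above.
list_non_full_osc() {
    osc_dir="${1:-/proc/fs/lustre/osc}"
    for state_file in "$osc_dir"/*/state; do
        # Skip the unexpanded glob pattern and unreadable files.
        [ -r "$state_file" ] || continue
        # Take the last field of the first line mentioning "current".
        state=$(awk '/current/ { print $NF; exit }' "$state_file")
        if [ -n "$state" ] && [ "$state" != "FULL" ]; then
            printf '%s %s\n' "$(basename "$(dirname "$state_file")")" "$state"
        fi
    done
}
```

Calling `list_non_full_osc` with no argument reads the live `/proc` tree; passing a directory makes it usable on a copy of the files collected from a node.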
      
      Then, on the identified OSS/Server, we find many of the following messages for the original Client, and sometimes also for others:
      "Lustre: <pid:0>:(ldlm_lib.c:841:target_handle_connect()) <OST-name>: refuse reconnection from <Client_nid>@<portal> to 0x..."
      "LustreError: <pid:0>:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-16) ...."
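
To see which Clients are being refused most often, the OSS log can be summarized per Client. The sketch below is illustrative only: `count_refused_reconnects` is a hypothetical helper, and it assumes the Client identifier is the whitespace-delimited token that follows "refuse reconnection from", as in the message quoted above.

```shell
#!/bin/sh
# Count "refuse reconnection" messages per Client in an OSS log file.
# Assumption: the Client identifier is the token right after
# "refuse reconnection from" and contains no spaces.
count_refused_reconnects() {
    log_file="$1"
    # Extract the Client token from matching lines only, then count
    # occurrences and print "<count> <client>", most frequent first.
    sed -n 's/.*refuse reconnection from \([^ ]*\).*/\1/p' "$log_file" |
        awk '{ n[$0]++ } END { for (c in n) print n[c], c }' |
        sort -rn
}
```

For example, `count_refused_reconnects /var/log/messages` would be run on the OSS; the log path is an assumption and depends on the site's syslog configuration.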
      
      In the same OSS log there are also messages of the type: "Lustre: <pid:0>:(client.c:1763:ptlrpc_expire_one_request()) @@@ Request ... sent from <OST_name> to NID <other_Client_nid>@<portal> has timed out for slow reply ...".
      
      On this other, newly identified Client, the logs contain repeating messages of the type:
      "Lustre: <pid:0>:(service.c:1040:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150) , not sending early reply"
      
      #consequences:
      The only way to unblock the situation is to crash/dump the other, newly identified Client!
       
      #details:
      To follow in further comments/add-ons.
      
      

          People

            jay Jinshan Xiong (Inactive)
            lustre-bull Lustre Bull
            Votes: 0
            Watchers: 6