  Lustre / LU-816

Possible bug/deadlock in the Lustre lock algorithm/protocol may leave multiple clients/processes blocked forever

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major

    Description

      Hi,

      Several Bull customers (CEA, TGCC, ...) are reporting error messages exactly as described in LU-142, except that they occur on connections between clients and OSSs, instead of between clients and the MDS.
      These customers run Lustre 2.0.0.1 Bull, which does not include the LU-142 patch.
      Do you think this is the same problem as described in LU-142, so that we only need to include the corresponding patch in our delivery, or is it a similar problem in another part of the code, needing an additional patch?

      Here are traces collected by our on-site support at a customer site:

      Users reported hung applications/jobs, mainly in Slurm's "Completing" state.
      
      Logs on affected clients/nodes contain plenty of
      "LustreError: 11-0: an error occurred while communicating with <OSS_nid>. The ost_connect operation failed with -16" messages.
      
      To find the details of the failing connection on the client side we use:
      # grep current /proc/fs/lustre/osc/*/state | grep -v FULL
      -->> one OST connection will show a "CONNECTING" state.
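
      For repeated checks, a small standalone helper equivalent to the grep above can be used. This is only a sketch: it assumes the same /proc/fs/lustre/osc/<target>/state layout quoted here and simply prints every OSC import whose "current" state line is not FULL.

      #include <glob.h>
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          glob_t g;
          size_t i;
          char line[512];

          /* same files the grep above reads; the layout is assumed, not verified */
          if (glob("/proc/fs/lustre/osc/*/state", 0, NULL, &g) != 0) {
              fprintf(stderr, "no OSC state files found\n");
              return 1;
          }
          for (i = 0; i < g.gl_pathc; i++) {
              FILE *f = fopen(g.gl_pathv[i], "r");
              if (f == NULL)
                  continue;
              while (fgets(line, sizeof(line), f) != NULL) {
                  /* report any "current" state line that is not FULL,
                   * e.g. CONNECTING for the failing OST connection */
                  if (strstr(line, "current") != NULL &&
                      strstr(line, "FULL") == NULL)
                      printf("%s: %s", g.gl_pathv[i], line);
              }
              fclose(f);
          }
          globfree(&g);
          return 0;
      }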
      
      Then, on the identified OSS/server, we find a lot of the following messages for the original client, and sometimes also for others:
      "Lustre: <pid:0>:(ldlm_lib.c:841:target_handle_connect()) <OST-name>: refuse reconnection from <Client_nid>@<portal> to 0x..."
      "LustreError: <pid:0>:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-16) ...."
      
      In the same OSS log there are also messages of the type: "Lustre: <pid:0>:(client.c:1763:ptlrpc_expire_one_request()) @@@ Request ... sent from <OST_name> to NID <other_Client_nid>@<portal> has timed out for slow reply ...".
      
      On the other, newly identified client, logs contain repeating messages of the type:
      "Lustre: <pid:0>:(service.c:1040:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply"
      
      #consequences:
      There is no way to unblock the situation other than to crash/dump the other, newly identified client!
       
      #details:
      To come in further comments/add-ons.
      
      

      Attachments

        Activity


          jfc John Fuchs-Chesney (Inactive) added a comment -
          Last comment is that a patch was being tested.
          pjones Peter Jones added a comment -

          Bull now believe this to be a duplicate of LU-948 and are testing out the patch


          lustre-bull Lustre Bull (Inactive) added a comment -
          The Lustre 2.1.1 Bull release containing the LU-1274 patch has been installed on several customer sites.
          The AWE customer reports that the problem described in LU-1274 has not occurred since the efix was installed a few weeks ago.
          But the CEA customer, which is deploying the same efix, reports that the problem initially described in LU-816 and declared a duplicate of LU-1274 has re-occurred over the last few days. I have therefore transferred the latest syslog file they provided from one of the OSS servers (uploads/LU-816/cartan.log2). As this syslog is rather old, I have asked them to provide a new copy of the syslog on both the client and OSS side, plus all the thread stacks on the OSS side.

          pjones Peter Jones added a comment -

          ok thanks Patrick


          patrick.valentin Patrick Valentin (Inactive) added a comment -
          On-site support reports that the problem has not occurred again since the installation of the efix containing the LU-1274 patch, one month ago.

          jay Jinshan Xiong (Inactive) added a comment -
          For some unknown reason, the client had difficulty grabbing the page lock while it was cancelling a lock. How often do you see this problem? If possible, I'd like to take a look at the kernel log on the OSS side, especially to see whether there are any eviction messages.

          Thanks.

          patrick.valentin Patrick Valentin (Inactive) added a comment -
          Here is the list of patches that were present in the customer Lustre release.
          This corresponds to the Bull delivery identified as "T-2_0_0-lustrebull-EFIX7_AE1_1", produced on 4 October 2011.

          bz16919
          bz20687
          bz21732
          bz21122
          bz21804
          bz22078
          bz22360
          bz22375
          bz22421
          bz22683
          bz23035
          bz23120
          bz23123
          bz23289
          bz23298
          bz23357
          bz23399
          bz23460
          bz24010
          bz24291
          bz24420
          LU-81
          LU-91
          LU-122
          LU-128
          LU-130
          LU-148
          LU-185
          LU-190
          LU-255
          LU-275
          LU-300
          LU-328
          LU-361
          LU-369
          LU-394
          LU-416
          LU-418
          LU-435
          LU-437
          LU-442
          LU-484
          LU-542
          LU-585
          LU-601 patch_set_7
          LU-613
          LU-651
          LU-685

          The JIRA tickets integrated into the subsequent Bull efix deliveries since October 4, 2011 are the following:
          LU-234
          LU-333
          LU-399
          LU-481
          LU-543
          LU-601 patch_set_13
          LU-687
          LU-815
          LU-857


          jay Jinshan Xiong (Inactive) added a comment -
          It looks like the lock is being cancelled, but the cancellation was blocked while locking a page. There are several CLIO issues fixed in the 2.1 release. Can you please tell us which patches you have applied for this customer?
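
          Purely as an illustration of that kind of ordering hazard (generic pthreads, not Lustre code, and not a claim about the actual root cause), here is a minimal sketch of a cancel path taking a cl_lock-style mutex and then a page lock while an I/O path takes them in the opposite order:

          #include <pthread.h>
          #include <stdio.h>
          #include <unistd.h>

          static pthread_mutex_t cl_lock_mutex = PTHREAD_MUTEX_INITIALIZER;
          static pthread_mutex_t page_lock     = PTHREAD_MUTEX_INITIALIZER;

          /* cancel path: cl_lock mutex first, then the page lock */
          static void *cancel_path(void *arg)
          {
              (void)arg;
              pthread_mutex_lock(&cl_lock_mutex);
              sleep(1);                           /* let the other path run */
              pthread_mutex_lock(&page_lock);     /* blocks forever */
              pthread_mutex_unlock(&page_lock);
              pthread_mutex_unlock(&cl_lock_mutex);
              return NULL;
          }

          /* I/O path: page lock first, then the cl_lock mutex */
          static void *io_path(void *arg)
          {
              (void)arg;
              pthread_mutex_lock(&page_lock);
              sleep(1);
              pthread_mutex_lock(&cl_lock_mutex); /* blocks forever */
              pthread_mutex_unlock(&cl_lock_mutex);
              pthread_mutex_unlock(&page_lock);
              return NULL;
          }

          int main(void)
          {
              pthread_t t1, t2;

              pthread_create(&t1, NULL, cancel_path, NULL);
              pthread_create(&t2, NULL, io_path, NULL);
              sleep(3);
              printf("cancel_path and io_path are deadlocked: each holds the lock the other needs\n");
              return 0;
          }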


          yong.fan nasf (Inactive) added a comment -
          From your log, it is obvious that all the hung "ldlm_cb_xxx" threads are blocked because "osc_ldlm_glimpse_ast()" is waiting in "cl_lock_mutex_get()" on the cl_lock. That mutex is held by "poncetr_%%A78_1", which is trying to cancel the cl_lock with the mutex held. But for some unknown reason, the cl_lock cancel cannot finish.

          I have a concern about a possible deadlock: if all the service threads on the OST are processing glimpse_ast(), and glimpse_ast() is blocked by the client-side mutex get described above, then what happens when a lock cancel RPC arrives at the OST? If it has to wait, we have a deadlock.

          Jay, I am not quite sure about that, please comment. I also question whether glimpse_ast should be blocked on the client side at all; that is not the case in b1_8.
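
          To make that scenario concrete, here is a minimal standalone sketch of the pattern (plain pthreads, not Lustre code; the queue, counters and pool size are illustrative assumptions, and the client and server are collapsed into one process for brevity): a fixed pool of service threads all pick up glimpse ASTs and block on a client-held lock mutex, so the cancel RPC that would let the client release that mutex is never processed.

          #include <pthread.h>
          #include <stdio.h>
          #include <unistd.h>

          #define POOL_SIZE 2                 /* all of the OST service threads */

          static pthread_mutex_t cl_lock_mutex = PTHREAD_MUTEX_INITIALIZER;
          static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
          static int glimpse_pending;         /* queued glimpse ASTs */
          static int cancel_pending;          /* queued cancel RPC */
          static int cancel_done;

          /* A service thread serves whatever RPC is queued.  A glimpse AST
           * blocks on the client-held lock mutex; the cancel RPC would let
           * the client make progress, but no thread is ever free to take it. */
          static void *service_thread(void *arg)
          {
              (void)arg;
              for (;;) {
                  pthread_mutex_lock(&q_lock);
                  while (glimpse_pending == 0 && cancel_pending == 0)
                      pthread_cond_wait(&q_cond, &q_lock);
                  if (glimpse_pending > 0) {
                      glimpse_pending--;
                      pthread_mutex_unlock(&q_lock);
                      pthread_mutex_lock(&cl_lock_mutex);   /* blocks forever */
                      pthread_mutex_unlock(&cl_lock_mutex);
                  } else {
                      cancel_pending--;
                      cancel_done = 1;                      /* would unblock the client */
                      pthread_mutex_unlock(&q_lock);
                  }
              }
              return NULL;
          }

          int main(void)
          {
              pthread_t pool[POOL_SIZE];
              int i, done;

              /* the cancelling client holds the lock mutex for the whole cancel */
              pthread_mutex_lock(&cl_lock_mutex);

              for (i = 0; i < POOL_SIZE; i++)
                  pthread_create(&pool[i], NULL, service_thread, NULL);

              /* other clients send glimpse ASTs: every service thread takes
               * one and blocks on the client-held mutex */
              pthread_mutex_lock(&q_lock);
              glimpse_pending = POOL_SIZE;
              pthread_cond_broadcast(&q_cond);
              pthread_mutex_unlock(&q_lock);
              sleep(1);

              /* now the cancelling client sends its cancel RPC ... */
              pthread_mutex_lock(&q_lock);
              cancel_pending = 1;
              pthread_cond_broadcast(&q_cond);
              pthread_mutex_unlock(&q_lock);

              /* ... but no service thread is free to process it, so the cancel
               * never completes and the lock mutex is never released: deadlock */
              sleep(2);
              pthread_mutex_lock(&q_lock);
              done = cancel_done;
              pthread_mutex_unlock(&q_lock);
              printf(done ? "cancel processed\n"
                          : "deadlock: cancel RPC never processed\n");
              return 0;
          }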


          patrick.valentin Patrick Valentin (Inactive) added a comment -
          Hi,
          below is the answer provided by on-site support. I have also attached the file (crash trace) they provided.

          Quotas are neither used nor active on Tera-100.
          You will find attached the "foreach_bt_cartan1121" file, containing all the client thread stacks (via "bt -t") captured when the problem occurred.


          People

            jay Jinshan Xiong (Inactive)
            lustre-bull Lustre Bull (Inactive)
            Votes: 0
            Watchers: 6
