[LU-816] Possible bug/dead-lock in Lustre-Lock algorithm/protocol may lead to multiple Clients/processes to blocked for ever - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels: None

Severity: 3
Rank (Obsolete): 8545

Description

Hi,

Several Bull customers (CEA, TGCC, ...) are reporting error messages exactly as described in LU-142, except that they occur on connections between clients and OSS instead of between clients and MDS.
These customers run Lustre 2.0.0.1 Bull, which does not include the LU-142 patch.
Do you think this is the same problem as described in LU-142, so that we only have to include the corresponding patch in our delivery, or is it a similar problem in another part of the code that needs an additional patch?

Here are traces collected by our on-site support at a customer site:

Users reported hung applications/jobs, mainly in Slurm's "Completing" state.

Logs on affected clients/nodes contain many messages of the form:
"LustreError: 11-0: an error occurred while communicating with <OSS_nid>. The ost_connect operation failed with -16"

To find the details of the failing connection on the client side, we use:
# grep current /proc/fs/lustre/osc/*/state | grep -v FULL
-->> one OST connection will show a "CONNECTING" state.
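
The same check can be scripted when many clients have to be inspected. The following is only a sketch: it assumes each state file contains a line of the form "current_state: <STATE>" (which is what the grep on "current" relies on), and the function name and the paths in the usage note are hypothetical.

```python
import glob
import re

# Assumption: each state file has a line such as "current_state: FULL".
_STATE_RE = re.compile(r"^\s*current_state:\s*(\S+)", re.MULTILINE)


def non_full_imports(state_files):
    """Return {path: state} for every import whose current state is not FULL.

    state_files maps a file path to the text read from that file.
    """
    result = {}
    for path, text in state_files.items():
        match = _STATE_RE.search(text)
        if match and match.group(1) != "FULL":
            result[path] = match.group(1)
    return result


if __name__ == "__main__":
    # On a Lustre client this reads the same files as the grep above;
    # elsewhere the glob matches nothing and nothing is printed.
    files = {}
    for path in glob.glob("/proc/fs/lustre/osc/*/state"):
        with open(path) as handle:
            files[path] = handle.read()
    for path, state in sorted(non_full_imports(files).items()):
        print(f"{path}: {state}")
```

Unlike `grep -v FULL`, which drops any line containing the string FULL, this compares the state value itself, so it would also report an import whose path happens to contain "FULL".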

Then, on the identified OSS, we find many of the following messages for the original client, and sometimes for other clients as well:
"Lustre: <pid:0>:(ldlm_lib.c:841:target_handle_connect()) <OST-name>: refuse reconnection from <Client_nid>@<portal> to 0x..."
"LustreError: <pid:0>:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-16) ...."

The same OSS log also contains messages of the type: "Lustre: <pid:0>:(client.c:1763:ptlrpc_expire_one_request()) @@@ Request ... sent from <OST_name> to NID <other_Client_nid>@<portal> has timed out for slow reply ...".

On the other, newly identified client, the logs contain repeated messages of the type:
"Lustre: <pid:0>:(service.c:1040:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150) , not sending early reply"
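
To see which of these symptoms a given syslog contains, the messages can be tallied by signature. This is a rough sketch, not a Lustre tool: the substrings are copied from the messages quoted above, while the category labels and the function name are made up for illustration. Note that -16 is -EBUSY on Linux.

```python
from collections import Counter

# Substrings taken from the messages quoted in this ticket.
# The keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "connect_failed": "The ost_connect operation failed with -16",
    "reconnect_refused": "refuse reconnection from",
    "reply_error": "processing error (-16)",
    "request_timed_out": "has timed out for slow reply",
    "no_early_reply": "not sending early reply",
}


def tally_symptoms(lines):
    """Count how many log lines match each known message signature."""
    counts = Counter()
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts
```

Running `tally_symptoms(open(path))` on the client and OSS syslogs separately should show the connect failures on the stuck client, the refusals and timeouts on the OSS, and the early-reply messages on the second client, if the pattern matches the one described here.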

#consequences:
The only way to unblock the situation is to crash/dump the other, newly identified client.

#details:
To follow in further comments.

Attachments

foreach_bt_cartan1121
309 kB
28/Nov/11 1:02 PM

Activity

John Fuchs-Chesney (Inactive) added a comment - 08/Mar/14 12:08 AM

Last comment is that a patch was being tested.

Peter Jones added a comment - 14/Jun/12 10:16 AM

Bull now believes this to be a duplicate of LU-948 and is testing the patch.

Lustre Bull (Inactive) added a comment - 12/Jun/12 1:21 PM

The Lustre 2.1.1 Bull release containing the LU-1274 patch has been installed at several customer sites.
The AWE customer reports that the problem described in LU-1274 has not occurred since the efix was installed a few weeks ago.
But the CEA customer, which is deploying the same efix, reports that the problem initially described in LU-816 and declared a duplicate of LU-1274 has recurred in the last few days. I have therefore transferred the latest syslog file they provided from one of the OSS servers (uploads/LU-816/cartan.log2). As this syslog is rather old, I have asked them to provide a new copy of the syslog from both the client and OSS sides, and all the thread stacks from the OSS side.

Peter Jones added a comment - 29/May/12 9:05 AM

ok thanks Patrick

Patrick Valentin (Inactive) added a comment - 29/May/12 6:49 AM

On-site support reports that the problem has not recurred since the efix containing the LU-1274 patch was installed one month ago.

Jinshan Xiong (Inactive) added a comment - 13/Jan/12 7:31 PM

For an unknown reason, the client had difficulty grabbing the page lock while it was canceling a lock. How often do you see this problem? If possible, I'd like to take a look at the kernel log on the OSS side, especially to see whether there are any eviction messages.

Thanks.

Patrick Valentin (Inactive) added a comment - 12/Jan/12 8:22 AM

Here is the list of patches that were present in the customer's Lustre release.
This corresponds to the Bull delivery identified as "T-2_0_0-lustrebull-EFIX7_AE1_1" and produced on 4 October 2011.

bz16919
bz20687
bz21732
bz21122
bz21804
bz22078
bz22360
bz22375
bz22421
bz22683
bz23035
bz23120
bz23123
bz23289
bz23298
bz23357
bz23399
bz23460
bz24010
bz24291
bz24420
LU-81
LU-91
LU-122
LU-128
LU-130
LU-148
LU-185
LU-190
LU-255
LU-275
LU-300
LU-328
LU-361
LU-369
LU-394
LU-416
LU-418
LU-435
LU-437
LU-442
LU-484
LU_542
LU_585
LU-601 patch_set_7
LU-613
LU_651
LU-685

The Jira tickets integrated into subsequent Bull efix deliveries since 4 October 2011 are the following:
LU-234
LU-333
LU-399
LU-481
LU-543
LU-601 patch_set_13
LU-687
LU-815
LU-857

Jinshan Xiong (Inactive) added a comment - 02/Dec/11 1:36 AM

It looks like the lock is being canceled, but the cancel is blocked on locking a page. Several clio issues were fixed in the 2.1 release. Can you please tell us which patches you have applied for this customer?

nasf (Inactive) added a comment - 01/Dec/11 11:48 AM

From your log, it is obvious that all the hung "ldlm_cb_xxx" threads are stuck because "osc_ldlm_glimpse_ast()" is blocked in "cl_lock_mutex_get()" on the cl_lock. That mutex is held by "poncetr_%%A78_1", which is trying to cancel the cl_lock with the mutex held. But for some unknown reason, the cl_lock cancel cannot finish.

I have some concern about a possible deadlock: suppose all the service threads on the OST are processing glimpse_ast(), and glimpse_ast() is blocked on the client-side mutex as described above. What happens when the lock cancel RPC then arrives at the OST? If it has to wait, we have a deadlock.

Jay, I am not quite sure about that, please comment. I also doubt whether glimpse_ast should block on the client side at all; it does not in b1_8.
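
The concern above can be pictured as a wait-for graph, where a cycle means deadlock. The sketch below only models that hypothesis and says nothing about what the Lustre code actually does: the node names and the `find_cycle` helper are invented for illustration, and the edge from the mutex holder back to the OST service threads is exactly the unverified "if it has to wait" assumption.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    waits_for maps each waiter to the single party it is blocked on.
    """
    for start in waits_for:
        seen = []
        node = start
        # Follow the chain of waiters until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return None


# Hypothetical encoding of the scenario described in this comment.
scenario = {
    # All OST service threads are busy waiting for glimpse AST replies.
    "OST service threads": "client glimpse callback",
    # osc_ldlm_glimpse_ast() is blocked in cl_lock_mutex_get().
    "client glimpse callback": "cl_lock mutex holder",
    # Assumption: the cancel RPC needs a free OST service thread.
    "cl_lock mutex holder": "OST service threads",
}
```

With all three edges present, `find_cycle(scenario)` returns the three nodes as a cycle. If the cancel RPC does not have to wait for a service thread (remove the last edge), the function returns None, and the hang would need another explanation.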

Patrick Valentin (Inactive) added a comment - 28/Nov/11 1:07 PM

Hi,
below is the answer provided by on-site support. I have also attached the file (crash trace) they provided.

Quotas are neither used nor active at Tera-100.
You will find attached the "foreach_bt_cartan1121" file containing all client thread stacks (via "bt -t") at the time the problem occurred.

People

Assignee: Jinshan Xiong (Inactive)

Reporter: Lustre Bull (Inactive)

Votes: 0

Watchers: 6

Dates

Created: 02/Nov/11 2:38 PM

Updated: 08/Mar/14 12:08 AM

Resolved: 08/Mar/14 12:08 AM