  Lustre / LU-816

Possible bug/deadlock in the Lustre lock algorithm/protocol may leave multiple clients/processes blocked forever

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major

    Description

      Hi,

      Several Bull customers (CEA, TGCC, ...) are reporting error messages exactly as described in LU-142, except that they occur on connections between clients and OSSs, instead of between clients and the MDS.
      These customers run Lustre 2.0.0.1 Bull, which does not include the LU-142 patch.
      Do you think this is the same problem as described in LU-142, so that we only need to include the corresponding patch in our delivery, or is it a similar problem in another part of the code, needing an additional patch?

      Here are traces collected by our on-site support at a customer site:

      Users reported hung applications/jobs, mainly in Slurm's "Completing" state.
      
      Logs on affected clients/nodes contain plenty of
      "LustreError: 11-0: an error occurred while communicating with <OSS_nid>. The ost_connect operation failed with -16" messages.
      
      To find the details of the failing connection on the client side we use:
      # grep current /proc/fs/lustre/osc/*/state | grep -v FULL
      -->> one OST connection will show a "CONNECTING" state.
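
      For repeated checks, a small standalone helper equivalent to the grep above can be used. This is only a sketch: it assumes the same /proc/fs/lustre/osc/<target>/state layout quoted here and simply prints every OSC import whose "current" state line is not FULL.

      #include <glob.h>
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          glob_t g;
          size_t i;
          char line[512];

          /* same files the grep above reads; the layout is assumed, not verified */
          if (glob("/proc/fs/lustre/osc/*/state", 0, NULL, &g) != 0) {
              fprintf(stderr, "no OSC state files found\n");
              return 1;
          }
          for (i = 0; i < g.gl_pathc; i++) {
              FILE *f = fopen(g.gl_pathv[i], "r");
              if (f == NULL)
                  continue;
              while (fgets(line, sizeof(line), f) != NULL) {
                  /* report any "current" state line that is not FULL,
                   * e.g. CONNECTING for the failing OST connection */
                  if (strstr(line, "current") != NULL &&
                      strstr(line, "FULL") == NULL)
                      printf("%s: %s", g.gl_pathv[i], line);
              }
              fclose(f);
          }
          globfree(&g);
          return 0;
      }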
      
      Then, on the identified OSS/server, we find a lot of the following messages for the original client, and sometimes also for others:
      "Lustre: <pid:0>:(ldlm_lib.c:841:target_handle_connect()) <OST-name>: refuse reconnection from <Client_nid>@<portal> to 0x..."
      "LustreError: <pid:0>:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-16) ...."
      
      In the same OSS log there are also messages of the type: "Lustre: <pid:0>:(client.c:1763:ptlrpc_expire_one_request()) @@@ Request ... sent from <OST_name> to NID <other_Client_nid>@<portal> has timed out for slow reply ...".
      
      On the other, newly identified client, logs contain repeating messages of the type:
      "Lustre: <pid:0>:(service.c:1040:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply"
      
      #consequences:
      There is no way to unblock the situation other than to crash/dump the other, newly identified client!
       
      #details:
      To come in further comments/add-ons.
      
      

      Attachments

        Activity


          jfc John Fuchs-Chesney (Inactive) added a comment -
          Last comment is that a patch was being tested.
          pjones Peter Jones added a comment -

          Bull now believe this to be a duplicate of LU-948 and are testing out the patch


          lustre-bull Lustre Bull (Inactive) added a comment -
          The Lustre 2.1.1 Bull release containing the LU-1274 patch has been installed on several customer sites.
          The AWE customer reports that the problem described in LU-1274 has not occurred since the efix was installed a few weeks ago.
          But the CEA customer, which is deploying the same efix, reports that the problem initially described in LU-816 and declared a duplicate of LU-1274 has re-occurred over the last few days. I have therefore transferred the latest syslog file they provided from one of the OSS servers (uploads/LU-816/cartan.log2). As this syslog is rather old, I have asked them to provide a new copy of the syslog on both the client and OSS side, plus all the thread stacks on the OSS side.

          pjones Peter Jones added a comment -

          ok thanks Patrick


          patrick.valentin Patrick Valentin (Inactive) added a comment -
          On-site support reports that the problem has not occurred again since the installation of the efix containing the LU-1274 patch, one month ago.

          jay Jinshan Xiong (Inactive) added a comment -
          For some unknown reason, the client had difficulty grabbing the page lock while it was cancelling a lock. How often do you see this problem? If possible, I'd like to take a look at the kernel log on the OSS side, especially to see whether there are any eviction messages.

          Thanks.

          patrick.valentin Patrick Valentin (Inactive) added a comment -
          Here is the list of patches that were present in the customer Lustre release.
          This corresponds to the Bull delivery identified as "T-2_0_0-lustrebull-EFIX7_AE1_1", produced on 4 October 2011.

          bz16919
          bz20687
          bz21732
          bz21122
          bz21804
          bz22078
          bz22360
          bz22375
          bz22421
          bz22683
          bz23035
          bz23120
          bz23123
          bz23289
          bz23298
          bz23357
          bz23399
          bz23460
          bz24010
          bz24291
          bz24420
          LU-81
          LU-91
          LU-122
          LU-128
          LU-130
          LU-148
          LU-185
          LU-190
          LU-255
          LU-275
          LU-300
          LU-328
          LU-361
          LU-369
          LU-394
          LU-416
          LU-418
          LU-435
          LU-437
          LU-442
          LU-484
          LU-542
          LU-585
          LU-601 patch_set_7
          LU-613
          LU-651
          LU-685

          The JIRA tickets integrated into the subsequent Bull efix deliveries since October 4, 2011 are the following:
          LU-234
          LU-333
          LU-399
          LU-481
          LU-543
          LU-601 patch_set_13
          LU-687
          LU-815
          LU-857


          jay Jinshan Xiong (Inactive) added a comment -
          It looks like the lock is being cancelled, but the cancellation was blocked while locking a page. There are several CLIO issues fixed in the 2.1 release. Can you please tell us which patches you have applied for this customer?
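
          Purely as an illustration of that kind of ordering hazard (generic pthreads, not Lustre code, and not a claim about the actual root cause), here is a minimal sketch of a cancel path taking a cl_lock-style mutex and then a page lock while an I/O path takes them in the opposite order:

          #include <pthread.h>
          #include <stdio.h>
          #include <unistd.h>

          static pthread_mutex_t cl_lock_mutex = PTHREAD_MUTEX_INITIALIZER;
          static pthread_mutex_t page_lock     = PTHREAD_MUTEX_INITIALIZER;

          /* cancel path: cl_lock mutex first, then the page lock */
          static void *cancel_path(void *arg)
          {
              (void)arg;
              pthread_mutex_lock(&cl_lock_mutex);
              sleep(1);                           /* let the other path run */
              pthread_mutex_lock(&page_lock);     /* blocks forever */
              pthread_mutex_unlock(&page_lock);
              pthread_mutex_unlock(&cl_lock_mutex);
              return NULL;
          }

          /* I/O path: page lock first, then the cl_lock mutex */
          static void *io_path(void *arg)
          {
              (void)arg;
              pthread_mutex_lock(&page_lock);
              sleep(1);
              pthread_mutex_lock(&cl_lock_mutex); /* blocks forever */
              pthread_mutex_unlock(&cl_lock_mutex);
              pthread_mutex_unlock(&page_lock);
              return NULL;
          }

          int main(void)
          {
              pthread_t t1, t2;

              pthread_create(&t1, NULL, cancel_path, NULL);
              pthread_create(&t2, NULL, io_path, NULL);
              sleep(3);
              printf("cancel_path and io_path are deadlocked: each holds the lock the other needs\n");
              return 0;
          }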


          yong.fan nasf (Inactive) added a comment -
          From your log, it is obvious that all the hung "ldlm_cb_xxx" threads are blocked because "osc_ldlm_glimpse_ast()" is waiting in "cl_lock_mutex_get()" on the cl_lock. That mutex is held by "poncetr_%%A78_1", which is trying to cancel the cl_lock with the mutex held. But for some unknown reason, the cl_lock cancel cannot finish.

          I have a concern about a possible deadlock: if all the service threads on the OST are processing glimpse_ast(), and glimpse_ast() is blocked by the client-side mutex get described above, then what happens when a lock cancel RPC arrives at the OST? If it has to wait, we have a deadlock.

          Jay, I am not quite sure about that, please comment. I also question whether glimpse_ast should be blocked on the client side at all; that is not the case in b1_8.
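
          To make that scenario concrete, here is a minimal standalone sketch of the pattern (plain pthreads, not Lustre code; the queue, counters and pool size are illustrative assumptions, and the client and server are collapsed into one process for brevity): a fixed pool of service threads all pick up glimpse ASTs and block on a client-held lock mutex, so the cancel RPC that would let the client release that mutex is never processed.

          #include <pthread.h>
          #include <stdio.h>
          #include <unistd.h>

          #define POOL_SIZE 2                 /* all of the OST service threads */

          static pthread_mutex_t cl_lock_mutex = PTHREAD_MUTEX_INITIALIZER;
          static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
          static int glimpse_pending;         /* queued glimpse ASTs */
          static int cancel_pending;          /* queued cancel RPC */
          static int cancel_done;

          /* A service thread serves whatever RPC is queued.  A glimpse AST
           * blocks on the client-held lock mutex; the cancel RPC would let
           * the client make progress, but no thread is ever free to take it. */
          static void *service_thread(void *arg)
          {
              (void)arg;
              for (;;) {
                  pthread_mutex_lock(&q_lock);
                  while (glimpse_pending == 0 && cancel_pending == 0)
                      pthread_cond_wait(&q_cond, &q_lock);
                  if (glimpse_pending > 0) {
                      glimpse_pending--;
                      pthread_mutex_unlock(&q_lock);
                      pthread_mutex_lock(&cl_lock_mutex);   /* blocks forever */
                      pthread_mutex_unlock(&cl_lock_mutex);
                  } else {
                      cancel_pending--;
                      cancel_done = 1;                      /* would unblock the client */
                      pthread_mutex_unlock(&q_lock);
                  }
              }
              return NULL;
          }

          int main(void)
          {
              pthread_t pool[POOL_SIZE];
              int i, done;

              /* the cancelling client holds the lock mutex for the whole cancel */
              pthread_mutex_lock(&cl_lock_mutex);

              for (i = 0; i < POOL_SIZE; i++)
                  pthread_create(&pool[i], NULL, service_thread, NULL);

              /* other clients send glimpse ASTs: every service thread takes
               * one and blocks on the client-held mutex */
              pthread_mutex_lock(&q_lock);
              glimpse_pending = POOL_SIZE;
              pthread_cond_broadcast(&q_cond);
              pthread_mutex_unlock(&q_lock);
              sleep(1);

              /* now the cancelling client sends its cancel RPC ... */
              pthread_mutex_lock(&q_lock);
              cancel_pending = 1;
              pthread_cond_broadcast(&q_cond);
              pthread_mutex_unlock(&q_lock);

              /* ... but no service thread is free to process it, so the cancel
               * never completes and the lock mutex is never released: deadlock */
              sleep(2);
              pthread_mutex_lock(&q_lock);
              done = cancel_done;
              pthread_mutex_unlock(&q_lock);
              printf(done ? "cancel processed\n"
                          : "deadlock: cancel RPC never processed\n");
              return 0;
          }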


          patrick.valentin Patrick Valentin (Inactive) added a comment -
          Hi,
          below is the answer provided by on-site support. I have also attached the file (crash trace) they provided.

          Quotas are neither used nor active on Tera-100.
          You will find attached the "foreach_bt_cartan1121" file, containing all the client thread stacks (via "bt -t") captured when the problem occurred.


          People

            jay Jinshan Xiong (Inactive)
            lustre-bull Lustre Bull (Inactive)
            Votes: 0
            Watchers: 6
