Details
Bug
Resolution: Duplicate
Major
Description
Hi,
Several Bull customers (CEA, TGCC,...) are reporting error messages exactly as described in LU-142, except that it is on connections between clients and OSS, instead of clients and MDS.
These customers are installed with Lustre 2.0.0.1 Bull, which does not include the LU-142 patch.
Do you think this is the same problem as described in LU-142, so that we only have to include the corresponding patch in our delivery, or is it a similar problem in another part of the code that needs an additional patch?
Here are the traces collected by our on-site support at a customer site:
Users reported hung applications/jobs, mainly in Slurm's "Completing" state. Logs on the affected clients/nodes contain many messages of the form:
"LustreError: 11-0: an error occurred while communicating with <OSS_nid>. The ost_connect operation failed with -16"

To find the details of the failing connection on the client side we use:
# grep current /proc/fs/lustre/osc/*/state | grep -v FULL
One OST connection will show a "CONNECTING" state.

Then, on the identified OSS/server, we find many of the following messages for the original client, and sometimes for other clients as well:
"Lustre: <pid:0>:(ldlm_lib.c:841:target_handle_connect()) <OST-name>: refuse reconnection from <Client_nid>@<portal> to 0x..."
"LustreError: <pid:0>:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-16) ...."

In the same OSS log there are also messages of the type:
"Lustre: <pid:0>:(client.c:1763:ptlrpc_expire_one_request()) @@@ Request ... sent from <OST_name> to NID <other_Client_nid>@<portal> has timed out for slow reply ..."

On this other, newly identified client, the logs contain repeating messages of the type:
"Lustre: <pid:0>:(service.c:1040:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150) , not sending early reply"

Consequences: the only way we have found to unblock the situation is to crash/dump the other, newly identified client.

Details: to follow in further comments/add-ons.
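
For sites that want to automate the client-side check described above (the grep for a current state other than FULL), here is a minimal Python sketch. It is not part of Lustre. The function name `non_full_imports` is hypothetical, and the sketch assumes each state file contains a line with the word "current" followed by a colon and the state name, which is what the grep relies on.

```python
import glob
import os


def non_full_imports(pattern="/proc/fs/lustre/osc/*/state"):
    """Return {osc_device: state} for each OSC whose current state is not FULL.

    Mirrors: grep current /proc/fs/lustre/osc/*/state | grep -v FULL
    The file layout ("current...: <STATE>") is an assumption of this sketch.
    """
    stuck = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                lines = f.readlines()
        except OSError:
            # The device may disappear between glob() and open(); skip it.
            continue
        for line in lines:
            if "current" in line and "FULL" not in line:
                # The OSC device name is the directory holding the state file.
                device = os.path.basename(os.path.dirname(path))
                stuck[device] = line.split(":", 1)[-1].strip()
    return stuck


if __name__ == "__main__":
    for device, state in non_full_imports().items():
        print(f"{device}: {state}")
```

Run on a client, it prints one line per OSC device that is not in the FULL state. The `pattern` argument exists only so the function can be pointed at a different directory tree.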