[LU-816] Possible bug/deadlock in the Lustre lock algorithm/protocol may leave multiple clients/processes blocked forever Created: 02/Nov/11  Updated: 08/Mar/14  Resolved: 08/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Lustre Bull Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File foreach_bt_cartan1121    
Severity: 3
Rank (Obsolete): 8545

 Description   

Hi,

Several Bull customers (CEA, TGCC,...) are reporting error messages exactly as described in LU-142, except that it is on connections between clients and OSS, instead of clients and MDS.
These customers run Lustre 2.0.0.1 Bull, which does not include the LU-142 patch.
Do you think it is the same problem as described in LU-142, so that we only have to include the corresponding patch in our delivery, or is it a similar problem in another part of the code, needing an additional patch?

Here are traces collected by our on-site support at a customer site:

Users reported hung applications/jobs, mainly in Slurm's "Completing" state.
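(For reference, a quick way to list the jobs stuck in the COMPLETING state and the nodes they are on, using standard Slurm options; this is a sketch, not a command taken from this ticket:)

# squeue -t CG -o "%i %N"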

Logs on the affected clients/nodes contain many messages of the form:
"LustreError: 11-0: an error occurred while communicating with <OSS_nid>. The ost_connect operation failed with -16"

To find the details of the failing connection on the client side, we use:
# grep current /proc/fs/lustre/osc/*/state | grep -v FULL
-->> one OST connection will show a "CONNECTING" state.
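(For reference, an equivalent one-liner that prints each OSC device path next to its import state; this is only a sketch built on the same /proc files used above:)

# for f in /proc/fs/lustre/osc/*/state; do printf '%s: %s\n' "${f%/state}" "$(grep current "$f")"; done | grep -v FULL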

Then, on the identified OSS/server, we find a lot of the following messages for the original client, and sometimes also for other clients:
"Lustre: <pid:0>:(ldlm_lib.c:841:target_handle_connect()) <OST-name>: refuse reconnection from <Client_nid>@<portal> to 0x..."
"LustreError: <pid:0>:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-16) ...."

In the same OSS log there are also messages of the type: "Lustre: <pid:0>:(client.c:1763:ptlrpc_expire_one_request()) @@@ Request ... sent from <OST_name> to NID <other_Client_nid>@<portal> has timed out for slow reply ...".

On the other, newly identified client, the logs contain repeating messages of the type:
"Lustre: <pid:0>:(service.c:1040:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150) , not sending early reply"

#consequences:
There is no other way to unblock the situation than to crash/dump the other, newly identified client!

#details:
To come in further comments/add-ons.



 Comments   
Comment by Peter Jones [ 02/Nov/11 ]

Fan Yong

Could you please advise whether this is the same issue as previously observed?

Thanks

Peter

Comment by Aurelien Degremont (Inactive) [ 03/Nov/11 ]

As the patch in LU-142 only touches mdt_getattr_name_lock(), it is very unlikely that it will fix the OST-related issue reported here.

My understanding is that the same kind of LDLM deadlock appeared here. That is why the same kind of messages are shown, but the underlying problem seems different.

Fan Yong, please tell me if I misunderstood the problem.

Comment by nasf (Inactive) [ 03/Nov/11 ]

I am sure it is not the same issue as LU-142. It would be better to dump the OSS debug log and stacks for further investigation; otherwise it is difficult to say what happened.

Comment by Lustre Bull [ 08/Nov/11 ]

Below is the analysis done by the on-line support team:

These are the significant stacks I found on the client that owns the Lustre lock.

All the "callback" threads ("ldlm_cb_<id>") were stuck since a long time with the same kind of following stack :
===================================================
schedule()
__mutex_lock_slowpath()
mutex_lock()
cl_lock_mutex_get()
osc_ldlm_glimpse_ast()
ldlm_callback_handler()
ptlrpc_server_handle_request()
ptlrpc_main()
kernel_thread()
===================================================

while a single "ldlm_bl_<id>" thread was stuck with the following stack:
===================================================
schedule()
io_schedule()
sync_page()
__wait_on_bit_lock()
__lock_page()
vvp_page_own()
cl_page_own0()
cl_page_own()
cl_page_gang_lookup()
cl_lock_page_out()
osc_lock_flush()
osc_lock_cancel()
cl_lock_cancel0()
cl_lock_cancel()
osc_ldlm_blocking()
ldlm_handle_bl_callback()
ldlm_bl_thread_main()
kernel_thread()
===================================================

So it seems that this "ldlm_bl_<id>" thread started to cancel the Lustre lock and flush its associated pages just as the server/OSS decided to reclaim the lock (hence the "ldlm_cb_<id>" threads), and we ended up in some kind of remote deadlock that also impacts/blocks the other clients waiting to be granted this same lock.
Comment by nasf (Inactive) [ 08/Nov/11 ]

===================================================
schedule()
__mutex_lock_slowpath()
mutex_lock()
cl_lock_mutex_get()
osc_ldlm_glimpse_ast()
ldlm_callback_handler()
ptlrpc_server_handle_request()
ptlrpc_main()
kernel_thread()
===================================================

The above stack means that all the glimpse callbacks (ldlm_cb_xx) were blocked trying to acquire the mutex on some cl_lock on the client, because that mutex was held by the ldlm_bl_xx thread that was canceling the cl_lock. The ldlm_bl_xx thread, in turn, was trying to flush dirty pages to the OSS before canceling the cl_lock, but it hung with the mutex held for some unknown reason. As a result, all the client-side threads appeared to be hung.

Because we do not have Lustre debug logs at the moment, it is not easy to say why sync_page() is blocked. But we have found some similar bugs before; one of them is LU-337 (http://jira.whamcloud.com/browse/LU-337). It may be helpful for you, so please try the patch:
http://review.whamcloud.com/#change,880

Comment by Sebastien Buisson (Inactive) [ 10/Nov/11 ]

Hi,

As the system is in production, we would like to be sure of the fix before we propose it to the customer.
Which Lustre debug level do you need to carry out your analysis?

Moreover, do you think it is OK if debug logs are activated only after the problem has occurred?

Cheers,
Sebastien.

Comment by nasf (Inactive) [ 16/Nov/11 ]

Firstly, please check whether quota is activated on the system. If it is not, then this is not LU-337.

Anyway, "-1" level debug is better for both client and OSS during the system run. Activating such debug after the problem can help very little.

Currently, I suspect it is LU-337. So if they think that collecting such debug logs may affect system performance too much, please collect client-side debug logs first, to reduce the impact on the whole system.

On the other hand, can you give me the stacks of all client-side threads at the time the problem occurs? You can get them with "echo t > /proc/sysrq-trigger". Thanks!
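(For reference, one common way to collect what is being asked for here. This is only a sketch: the debug buffer size and output paths are arbitrary choices, not values from this ticket. The first two commands enable full debugging and enlarge the in-memory debug buffer on both client and OSS; "lctl dk" then dumps that buffer to a file once the hang is observed, and the sysrq trigger dumps all thread stacks into the kernel log:)

# lctl set_param debug=-1
# lctl set_param debug_mb=512
  (... reproduce or wait for the hang ...)
# lctl dk /tmp/lustre-debug-$(hostname).log
# echo t > /proc/sysrq-trigger
# dmesg > /tmp/sysrq-t-$(hostname).log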

Comment by Patrick Valentin (Inactive) [ 28/Nov/11 ]

Hi,
below is the answer provided by on-site support. I have also attached the file (crash trace) they provided.

Quotas are neither used nor active at Tera-100.
You will find attached the "foreach_bt_cartan1121" file, containing the stacks of all client threads (via "bt -t") at the time the problem occurred.
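(For reference, the kind of crash(8) session that produces such a file from a client dump. The vmlinux and vmcore paths below are placeholders, not paths from this site; "foreach bt -t" dumps the stack of every task to the redirected file, and "ps -m | grep UN" lists the tasks blocked in uninterruptible sleep together with how long they have been off the CPU:)

# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/vmcore
crash> foreach bt -t > foreach_bt_cartan1121
crash> ps -m | grep UN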

Comment by nasf (Inactive) [ 01/Dec/11 ]

From your log, it is obvious that all the hung "ldlm_cb_xxx" threads are stuck because "osc_ldlm_glimpse_ast()" is blocked in "cl_lock_mutex_get()" on the cl_lock. That mutex is held by "poncetr_%%A78_1", which is trying to cancel the cl_lock with the mutex held. But for some unknown reason, the cl_lock cancellation cannot finish.

I have some concern about a possible deadlock: if all the service threads on the OST are busy processing glimpse_ast(), and glimpse_ast() is blocked by the client-side mutex as described above, then what happens when a lock cancel RPC arrives at the OST? If it has to wait, we have a deadlock.

Jay, I am not quite sure about that, please comment. I also wonder whether glimpse_ast should be blocked on the client side at all; it is not for b1_8.

Comment by Jinshan Xiong (Inactive) [ 02/Dec/11 ]

It looks like the lock is being canceled, but the cancellation was blocked on locking a page. There are several clio issues fixed in the 2.1 release. Can you please tell us which patches you have applied for this customer?

Comment by Patrick Valentin (Inactive) [ 12/Jan/12 ]

Here is the list of patches that were present in the customer's Lustre release.
This corresponds to the Bull delivery identified as "T-2_0_0-lustrebull-EFIX7_AE1_1", produced on 4 October 2011.

bz16919
bz20687
bz21732
bz21122
bz21804
bz22078
bz22360
bz22375
bz22421
bz22683
bz23035
bz23120
bz23123
bz23289
bz23298
bz23357
bz23399
bz23460
bz24010
bz24291
bz24420
LU-81
LU-91
LU-122
LU-128
LU-130
LU-148
LU-185
LU-190
LU-255
LU-275
LU-300
LU-328
LU-361
LU-369
LU-394
LU-416
LU-418
LU-435
LU-437
LU-442
LU-484
LU-542
LU-585
LU-601 patch_set_7
LU-613
LU-651
LU-685

The Jira tickets integrated in the subsequent Bull efix deliveries since October 4, 2011 are the following:
LU-234
LU-333
LU-399
LU-481
LU-543
LU-601 patch_set_13
LU-687
LU-815
LU-857

Comment by Jinshan Xiong (Inactive) [ 13/Jan/12 ]

For some unknown reason, the client had difficulty grabbing a page lock while it was canceling a lock. How often do you see this problem? If possible, I'd like to take a look at the kernel log on the OSS side, especially to see whether there are eviction messages.

Thanks.
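(For reference, a quick way to pull the relevant lines out of the OSS logs; /var/log/messages is an assumption about the site's syslog configuration, and the patterns simply match the eviction, reconnection-refusal and timeout messages quoted earlier in this ticket:)

# grep -iE 'evict|refuse reconnection|timed out' /var/log/messages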

Comment by Patrick Valentin (Inactive) [ 29/May/12 ]

On-site support reports that the problem has not occurred again since the installation, one month ago, of the efix containing the LU-1274 patch.

Comment by Peter Jones [ 29/May/12 ]

ok thanks Patrick

Comment by Lustre Bull [ 12/Jun/12 ]

The Lustre 2.1.1 Bull release containing the LU-1274 patch has been installed on several customer sites.
The AWE customer reports that the problem described in LU-1274 has not occurred since the efix was installed a few weeks ago.
But the CEA customer, which is deploying the same efix, reports that the problem initially described in LU-816 and declared a duplicate of LU-1274 has re-occurred over the last few days. I have therefore transferred the latest syslog file they provided from one of the OSS servers (uploads/LU-816/cartan.log2). As this syslog is rather old, I have asked them to provide a fresh copy of the syslog on both the client and OSS side, as well as all the thread stacks on the OSS side.

Comment by Peter Jones [ 14/Jun/12 ]

Bull now believe this to be a duplicate of LU-948 and are testing out the patch.

Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ]

Last comment is that a patch was being tested.
