[LU-816] Possible bug/deadlock in the Lustre lock algorithm/protocol may leave multiple Clients/processes blocked forever Created: 02/Nov/11 Updated: 08/Mar/14 Resolved: 08/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Lustre Bull | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8545 |
| Description |
|
Hi,

Several Bull customers (CEA, TGCC, ...) are reporting error messages exactly as described in a previously reported issue. Here are traces collected by our on-site support at a customer site.

Users reported hung applications/jobs, mainly in Slurm's "Completing" state. Logs on the affected Clients/nodes contain plenty of messages like:
"LustreError: 11-0: an error occurred while communicating with <OSS_nid>. The ost_connect operation failed with -16"

To find the details of the failing connection on the Client side we use:
# grep current /proc/fs/lustre/osc/*/state | grep -v FULL
One OST connection will show a "CONNECTING" state.

Then on the identified OSS/Server, we find a lot of the following messages for the original Client, and sometimes also for others:
"Lustre: <pid:0>:(ldlm_lib.c:841:target_handle_connect()) <OST-name>: refuse reconnection from <Client_nid>@<portal> to 0x..."
"LustreError: <pid:0>:(ldlm_lib.c:2123:target_send_reply_msg()) @@@ processing error (-16) ...."
In the same OSS log there are also messages of the type:
"Lustre: <pid:0>:(client.c:1763:ptlrpc_expire_one_request()) @@@ Request ... sent from <OST_name> to NID <other_Client_nid>@<portal> has timed out for slow reply ...".

On the other, newly identified Client, logs contain repeating messages of the type:
"Lustre: <pid:0>:(service.c:1040:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply"

Consequences: there is no way to unblock the situation other than to crash/dump the newly identified Client!

Details: to come in further comments/add-ons. |
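For convenience, here is a minimal userspace sketch (not part of the original report) that does the same job as the grep pipeline above: it walks the /proc/fs/lustre/osc/*/state files and prints every "current" line whose state is not FULL, which is how the stuck "CONNECTING" import is spotted. The exact layout of the state file is assumed from the grep command rather than verified.
===================================================
/*
 * check_osc_state.c -- hypothetical helper, equivalent to:
 *   grep current /proc/fs/lustre/osc/*/state | grep -v FULL
 */
#include <glob.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        glob_t g;

        if (glob("/proc/fs/lustre/osc/*/state", 0, NULL, &g) != 0) {
                fprintf(stderr, "no osc state files found\n");
                return 1;
        }
        for (size_t i = 0; i < g.gl_pathc; i++) {
                FILE *f = fopen(g.gl_pathv[i], "r");
                char line[256];

                if (!f)
                        continue;
                while (fgets(line, sizeof(line), f)) {
                        /* a healthy import reports a "current" state of FULL */
                        if (strstr(line, "current") && !strstr(line, "FULL"))
                                printf("%s: %s", g.gl_pathv[i], line);
                }
                fclose(f);
        }
        globfree(&g);
        return 0;
}
===================================================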
| Comments |
| Comment by Peter Jones [ 02/Nov/11 ] |
|
Fan Yong, could you please advise whether this is the same issue as previously observed? Thanks, Peter |
| Comment by Aurelien Degremont (Inactive) [ 03/Nov/11 ] |
|
My understanding is that the same kind of LDLM deadlock appeared here. That is why the same kind of messages are shown, but the problem seems different. Fan Yong, please tell me if I misunderstood the problem. |
| Comment by nasf (Inactive) [ 03/Nov/11 ] |
|
I am sure it is not the same issue as the one previously observed. |
| Comment by Lustre Bull [ 08/Nov/11 ] |
|
Below is the analysis done by our online support:
Here are the significant stacks I found on the Client that owns the Lustre lock.
All the "callback" threads ("ldlm_cb_<id>") had been stuck for a long time with the same kind of stack:
===================================================
schedule()
__mutex_lock_slowpath()
mutex_lock()
cl_lock_mutex_get()
osc_ldlm_glimpse_ast()
ldlm_callback_handler()
ptlrpc_server_handle_request()
ptlrpc_main()
kernel_thread()
===================================================
while a single "ldlm_bl_<id>" thread was stuck with the following stack:
===================================================
schedule()
io_schedule()
sync_page()
__wait_on_bit_lock()
__lock_page()
vvp_page_own()
cl_page_own0()
cl_page_own()
cl_page_gang_lookup()
cl_lock_page_out()
osc_lock_flush()
osc_lock_cancel()
cl_lock_cancel0()
cl_lock_cancel()
osc_ldlm_blocking()
ldlm_handle_bl_callback()
ldlm_bl_thread_main()
kernel_thread()
===================================================
So it seems that this "ldlm_bl_<id>" thread started to cancel the Lustre lock and flush its associated pages when the Server/OSS decided to reclaim the lock (hence the "ldlm_cb_<id>" threads), and we ran into some kind of remote deadlock that also blocks other Clients waiting to be granted the same lock.
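To make the dependency concrete, below is a tiny, purely illustrative pthread program (hypothetical names, not Lustre code): one thread plays the "ldlm_bl_<id>" role and takes the lock mutex before waiting on a page lock it can never get, while a second thread plays an "ldlm_cb_<id>" glimpse callback queuing up behind the same mutex. Running it reproduces the shape of the two stacks above: both threads hang forever.
===================================================
/* Illustration only -- not Lustre code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t cl_lock_mutex = PTHREAD_MUTEX_INITIALIZER; /* the cl_lock mutex */
static pthread_mutex_t page_lock     = PTHREAD_MUTEX_INITIALIZER; /* a dirty page      */

static void *bl_thread(void *arg)          /* ldlm_bl_<id>: cancel + flush */
{
        (void)arg;
        pthread_mutex_lock(&cl_lock_mutex);          /* cl_lock_mutex_get()  */
        printf("bl: holds cl_lock mutex, flushing pages...\n");
        pthread_mutex_lock(&page_lock);              /* __lock_page(): stuck */
        printf("bl: never reached\n");
        return NULL;
}

static void *cb_thread(void *arg)          /* ldlm_cb_<id>: glimpse AST */
{
        (void)arg;
        sleep(1);                                    /* let bl_thread win    */
        printf("cb: glimpse AST wants the cl_lock mutex...\n");
        pthread_mutex_lock(&cl_lock_mutex);          /* blocks behind bl     */
        printf("cb: never reached\n");
        return NULL;
}

int main(void)
{
        pthread_t bl, cb;

        pthread_mutex_lock(&page_lock);  /* someone else owns the page lock */
        pthread_create(&bl, NULL, bl_thread, NULL);
        pthread_create(&cb, NULL, cb_thread, NULL);
        pthread_join(bl, NULL);          /* never returns: the hang         */
        pthread_join(cb, NULL);
        return 0;
}
===================================================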
|
| Comment by nasf (Inactive) [ 08/Nov/11 ] |
|
The above stack means that all the glimpse callbacks (ldlm_cb_xx) were blocked when trying to acquire the mutex on some cl_lock on the client, because that mutex was held by the ldlm_bl_xx thread which was canceling that cl_lock. On the other hand, the ldlm_bl_xx thread was trying to flush dirty pages to the OSS before canceling the cl_lock, but it hung with the mutex held for some unknown reason. It therefore looked as if all the threads on the client side were hung. Since we do not have Lustre debug logs, it is not easy to say why sync_page blocked, but we have seen some similar bugs before. |
| Comment by Sebastien Buisson (Inactive) [ 10/Nov/11 ] |
|
Hi, As the system is in production, we would like to be sure of the fix before we propose it to the customer. Moreover, do you think it is OK if debug logs are activated only after the problem has occurred? Cheers, |
| Comment by nasf (Inactive) [ 16/Nov/11 ] |
|
Firstly, please check whether quota is activated on the system; the bug I currently suspect only occurs with quota enabled, so if quota is off it is not that bug. In any case, "-1" level debug is better for both client and OSS while the system is running; activating such debug only after the problem occurs helps very little. On the other hand, can you give me the stacks of all client-side threads from when the problem occurred? You can get them with "echo t > /proc/sysrq-trigger". Thanks! |
| Comment by Patrick Valentin (Inactive) [ 28/Nov/11 ] |
|
Hi,
|
| Comment by nasf (Inactive) [ 01/Dec/11 ] |
|
From your log, it is obvious that all the hung "ldlm_cb_xxx" threads are blocked in "osc_ldlm_glimpse_ast()" by "cl_lock_mutex_get()" on the cl_lock. That mutex is held by "poncetr_%%A78_1", which is trying to cancel the cl_lock with the mutex held, but for some unknown reason the cl_lock cancel cannot finish. I have some concern about a possible deadlock: if all the service threads on the OST are busy processing glimpse_ast(), and glimpse_ast() is blocked by the client-side mutex get as described above, then what happens when the lock cancel RPC arrives at the OST? If it has to wait, then we have a deadlock. Jay, I am not quite sure about that, please comment. I also wonder whether glimpse_ast should be allowed to block on the client side at all; it does not in b1_8. |
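As an illustration of that concern (hypothetical names and pool size, not Lustre code), the sketch below models a fixed pool of OST service threads that are all parked inside glimpse processing waiting for a client reply, while the cancel request that would release them sits in the queue with no free thread left to serve it; the program deliberately hangs.
===================================================
/* Illustration only -- not Lustre code. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2                       /* all OST service threads       */

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static int queued_cancels = 0;           /* cancel RPCs waiting in queue  */
static int cancel_done    = 0;           /* set once a cancel is serviced */

static void *service_thread(void *arg)
{
        long id = (long)arg;

        /* Every worker picks up a glimpse AST first... */
        printf("thread %ld: processing glimpse, waiting for client\n", id);
        pthread_mutex_lock(&q_lock);
        while (!cancel_done)             /* ...and waits for the cancel   */
                pthread_cond_wait(&q_cond, &q_lock);
        pthread_mutex_unlock(&q_lock);
        printf("thread %ld: glimpse completed\n", id);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, service_thread, (void *)i);

        /* The cancel RPC arrives, but every thread is stuck in a glimpse:
         * nobody ever dequeues it, so cancel_done stays 0 -- deadlock.   */
        pthread_mutex_lock(&q_lock);
        queued_cancels++;
        pthread_mutex_unlock(&q_lock);
        printf("cancel RPC queued (%d pending), no free service thread\n",
               queued_cancels);

        for (long i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);   /* never returns */
        return 0;
}
===================================================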
| Comment by Jinshan Xiong (Inactive) [ 02/Dec/11 ] |
|
It looks like the lock is being canceled but the cancel was blocked while locking a page. Several clio issues were fixed in the 2.1 release. Can you please tell us what patches you have applied for this customer? |
| Comment by Patrick Valentin (Inactive) [ 12/Jan/12 ] |
|
Here is the list of patches that were present in the customer's Lustre release: bz16919. The Jira tickets integrated in the next Bull efix deliveries since October 4, 2011 are the following: |
| Comment by Jinshan Xiong (Inactive) [ 13/Jan/12 ] |
|
For some unknown reason, the client had difficulty grabbing the page lock while it was canceling a lock. How often do you see this problem? If possible, I'd like to take a look at the kernel log on the OSS side, especially to check whether there are eviction messages. Thanks. |
| Comment by Patrick Valentin (Inactive) [ 29/May/12 ] |
|
On-site support reports that the problem has not occurred again since the installation of the efix containing the corresponding fix. |
| Comment by Peter Jones [ 29/May/12 ] |
|
ok thanks Patrick |
| Comment by Lustre Bull [ 12/Jun/12 ] |
|
The Lustre 2.1.1 Bull release containing |
| Comment by Peter Jones [ 14/Jun/12 ] |
|
Bull now believe this to be a duplicate of |
| Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ] |
|
The last comment is that a patch was being tested. |