[LU-8450] replay-single test 70c: mount MDS hung Created: 29/Jul/16 Updated: 22/Oct/16 Resolved: 22/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jian Yu | Assignee: | Kit Westneat |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-single test 70c hung while mounting MDS:

Starting mds1: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: onyx-33vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1

Console log on MDS:

Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.127@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 683 previous similar messages
Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1469762588/real 1469762588] req@ffff880051ebaa00 x1541153266605312/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1469762613 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
Lustre: 29062:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply req@ffff88004a8faa00 x1541153729978528/t0(0) o101->6a772ed4-43ff-dc51-4d04-2c0278989dc2@10.2.4.120@tcp:-1/-1 lens 872/3512 e 24 to 0 dl 1469763017 ref 2 fl Interpret:/0/0 rc 0/0
Lustre: lustre-MDT0002: Client 6a772ed4-43ff-dc51-4d04-2c0278989dc2 (at 10.2.4.120@tcp) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: Skipped 3 previous similar messages
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: Skipped 6 previous similar messages
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: Skipped 12 previous similar messages
LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.127@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 1909 previous similar messages
Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1469763188/real 1469763188] req@ffff8800546a1200 x1541153266628608/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1469763213 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
INFO: task mdt00_002:29063 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdt00_002       D ffffffffa0b1d108     0 29063      2 0x00000080
 ffff88004f7b3aa0 0000000000000046 ffff88004bc15c00 ffff88004f7b3fd8
 ffff88004f7b3fd8 ffff88004f7b3fd8 ffff88004bc15c00 ffffffffa0b1d100
 ffffffffa0b1d104 ffff88004bc15c00 00000000ffffffff ffffffffa0b1d108
Call Trace:
 [<ffffffff8163cb09>] schedule_preempt_disabled+0x29/0x70
 [<ffffffff8163a805>] __mutex_lock_slowpath+0xc5/0x1c0
 [<ffffffff81639c6f>] mutex_lock+0x1f/0x2f
 [<ffffffffa0a8e024>] nodemap_add_member+0x34/0x1b0 [ptlrpc]
 [<ffffffffa0dbf161>] mdt_obd_reconnect+0x81/0x1d0 [mdt]
 [<ffffffffa09d1e6f>] target_handle_connect+0x1c4f/0x2e30 [ptlrpc]
 [<ffffffffa0a6f5f2>] tgt_request_handle+0x3f2/0x1320 [ptlrpc]
 [<ffffffffa0a1bccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
 [<ffffffffa0a19888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
 [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
 [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
 [<ffffffffa0a1fd80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
 [<ffffffffa0a1f2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
 [<ffffffff810a5aef>] kthread+0xcf/0xe0
 [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff816469d8>] ret_from_fork+0x58/0x90
 [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140

Maloo reports: |
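The call trace shows the mdt00_002 service thread stuck in mutex_lock() inside nodemap_add_member() on the reconnect path; the comments below attribute the stall to ldlm_revoke_export_locks looping while trying to empty the export lock hash, with the nodemap path waiting on it. The following is a minimal userspace model of that shape only, not Lustre code; the names revoke_thread, reconnect_thread and config_lock are invented for illustration.

	/*
	 * Minimal userspace model of the hang (illustration only, not Lustre
	 * code; all names here are invented).  Thread A models the path that
	 * holds a mutex while trying to "empty" a structure that never
	 * empties; thread B models the reconnect path blocking on that mutex,
	 * which is what the hung-task warning above reports.
	 */
	#include <pthread.h>
	#include <stdio.h>
	#include <unistd.h>

	static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER;
	static volatile int entries_left = 1;	/* an entry that is never removed */

	static void *revoke_thread(void *arg)
	{
		pthread_mutex_lock(&config_lock);
		while (entries_left != 0)	/* "loop until empty" never terminates */
			usleep(1000);
		pthread_mutex_unlock(&config_lock);
		return NULL;
	}

	static void *reconnect_thread(void *arg)
	{
		pthread_mutex_lock(&config_lock);	/* blocks forever */
		puts("reconnect finished");		/* never reached */
		pthread_mutex_unlock(&config_lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, revoke_thread, NULL);
		sleep(1);			/* let thread A take the mutex first */
		pthread_create(&b, NULL, reconnect_thread, NULL);
		pthread_join(b, NULL);		/* hangs, like the MDS mount */
		return 0;
	}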
| Comments |
| Comment by Jian Yu [ 29/Jul/16 ] |
|
This is affecting patch review testing on the master branch. |
| Comment by Oleg Drokin [ 01/Aug/16 ] |
|
Seems to be some sort of deadlock in the nodemap code. |
| Comment by Joseph Gmitter (Inactive) [ 01/Aug/16 ] |
|
Hi Kit, could you please have a look at this issue? It appears to be occurring in the nodemap area. Thanks. |
| Comment by Kit Westneat [ 01/Aug/16 ] |
|
It looks like ldlm_revoke_export_locks is getting hung trying to empty the lock hash (exp->exp_lock_hash), which causes the nodemap functions to block waiting for that to finish. I was looking at the original code for ldlm_revoke_export_locks, and I noticed that it does not use cfs_hash_for_each_empty; it just sends the ASTs and finishes. There was a change in 2008 to switch the lock data structure from a list to a hash, and it looks like the change to loop until the hash is empty (cfs_hash_for_each_empty) was made then. The commit message doesn't discuss this change at all, so I wonder if it was unintentional. Is there anyone who would know what the correct behavior is? Thanks, |
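To make the difference concrete, below is a small self-contained model of the two iteration styles; it is not the libcfs implementation, and the names iterate_once and iterate_until_empty are invented for illustration. A single pass visits each entry once and returns even if the callback declines some entries, while a loop-until-empty keeps retrying as long as any entry survives the callback, so one persistently skipped entry makes it spin forever.

	/* Standalone illustration of the two iteration styles discussed above;
	 * not libcfs code, just a model of the semantics. */
	#include <stdbool.h>
	#include <stdio.h>

	#define NENTRIES 4

	/* cb() returns true if it "removed" the entry, false if it skipped it,
	 * mirroring the early-return checks in the revoke callback. */
	static bool cb(bool *present, int i)
	{
		if (i == 2)		/* one entry never qualifies for removal */
			return false;
		present[i] = false;
		return true;
	}

	/* Pre-2008 style: walk the collection once and stop. */
	static void iterate_once(bool *present)
	{
		for (int i = 0; i < NENTRIES; i++)
			if (present[i])
				cb(present, i);
	}

	/* Loop-until-empty style: keep walking until nothing is left.
	 * With an entry the callback refuses to remove, this never returns. */
	static void iterate_until_empty(bool *present)
	{
		int remaining;

		do {
			remaining = 0;
			for (int i = 0; i < NENTRIES; i++)
				if (present[i] && !cb(present, i))
					remaining++;
			printf("Try to empty hash, remaining: %d\n", remaining);
		} while (remaining != 0);
	}

	int main(void)
	{
		bool present[NENTRIES] = { true, true, true, true };

		iterate_once(present);		/* terminates; entry 2 still present */
		iterate_until_empty(present);	/* spins forever on entry 2 */
		return 0;
	}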
| Comment by Oleg Drokin [ 01/Sep/16 ] |
|
Do you mean commit 8073f0e4bef2db551a4b4bcaeb72a9986571f1bd? I do not see a big logic change. Now we use cfs_hash_for_each_empty() to collect all entries in the hash, fill the list in the callback, and then send everything from ldlm_run_ast_work() as before. |
| Comment by Kit Westneat [ 01/Sep/16 ] |
|
Hi Oleg, Yes, that's the commit I'm talking about. In the older version, the list was only traversed once, right? In the newer version, it loops until the hash is empty, even if there are some locks that are not processed due to the three checks at the beginning of the callback. This is the change I don't understand; I don't know enough about the locking system to know what those checks do with regard to lock eviction. When I've run into similar deadlocks with D_INFO enabled, I see "Try to empty hash:" repeated essentially forever. It's possible that it's a different deadlock here, but it seems similar. Thanks, |
| Comment by Oleg Drokin [ 01/Sep/16 ] |
|
Well, the loop is there, but for it to keep going around all the time, the conditions need to be met:
So we need a constant stream of activity on the bucket then; where does that come from? If it's expected to be there, then possibly cfs_hash_for_each_empty is the wrong API for you. Or you need to somehow fence the activity in the hash while you are working on it. |
| Comment by Kit Westneat [ 06/Sep/16 ] |
|
Hi Oleg, I think we may be talking about different things. I believe you are talking about the loops in cfs_hash_for_each_relax, but the loop I'm talking about is in cfs_hash_for_each_empty:

	while (cfs_hash_for_each_relax(hs, func, data, 0)) {
		CDEBUG(D_INFO, "Try to empty hash: %s, loop: %u\n",
		       hs->hs_name, i++);
	}

cfs_hash_for_each_relax returns the number of iterations it's done in its loops, essentially the number of items remaining in the hash. So as you can see, cfs_hash_for_each_empty loops until the hash is empty. The conditions I'm talking about are in the callback for ldlm_revoke_export_locks:

	if (lock->l_req_mode != lock->l_granted_mode) {
		unlock_res_and_lock(lock);
		return 0;
	}

	if (lock->l_resource->lr_type != LDLM_IBITS &&
	    lock->l_resource->lr_type != LDLM_PLAIN) {
		unlock_res_and_lock(lock);
		return 0;
	}

	if (ldlm_is_ast_sent(lock)) {
		unlock_res_and_lock(lock);
		return 0;
	}

So if there is a lock in the exp_lock_hash that meets one of those three conditions, cfs_hash_for_each_empty will loop until some external process modifies that lock. My theory is that in these deadlock cases, nothing modifies the lock and the hash is never emptied, so cfs_hash_for_each_empty loops forever. In the original code, it iterated the lock list once and did not require that the list be emptied. I was wondering if there is a reason for using cfs_hash_for_each_empty in this case, or if it's possible to replace it with cfs_hash_for_each_nolock, which seems to be closer to what the original code does. Thanks, |
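For reference, a rough sketch of the replacement being proposed is shown below. This is an illustration only, not the actual patch: the surrounding function body is reconstructed from memory and the three-argument cfs_hash_for_each_nolock() form is an assumption, so the in-tree signatures may differ.

	/* sketch only: ldlm_revoke_export_locks with a single, non-emptying pass */
	int ldlm_revoke_export_locks(struct obd_export *exp)
	{
		struct list_head rpc_list;
		ENTRY;

		INIT_LIST_HEAD(&rpc_list);
		/* iterate exp_lock_hash once instead of looping until it is empty */
		cfs_hash_for_each_nolock(exp->exp_lock_hash,
					 ldlm_revoke_lock_cb, &rpc_list);
		ldlm_run_ast_work(exp->exp_obd->obd_namespace, &rpc_list,
				  LDLM_WORK_REVOKE_AST);

		RETURN(0);
	}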
| Comment by Oleg Drokin [ 06/Sep/16 ] |
|
We are talking about the same loop. This code was used in the (now removed) remote client functionality, and you are the only remaining user, btw, so you can mold this function to your taste. The conditions are a bit strange, but it was only used for the MDS anyway, so those could only be ibits (or flock) locks. If you can figure out which lock is failing those checks, the conditions could be amended, I imagine. |
| Comment by Kit Westneat [ 11/Oct/16 ] |
|
> The loop is external to cfs_hash_for_each_relax(), that stops on change (as is mentioned in the comment) and outside code needs to loop (the loop we are talking about).

Hmm, but cfs_hash_for_each_relax also returns non-zero if it is successful, no? It returns the number of items iterated over (count), so cfs_hash_for_each_empty loops until the hash is empty. As far as I can tell, neither cfs_hash_for_each_nolock nor cfs_hash_for_each_empty actually checks the return value of cfs_hash_for_each_relax for the -EINTR condition described in the comment. Using cfs_hash_for_each_nolock in ldlm_revoke_export_locks makes more sense to me; I'll upload a patch with that change. It seems like cfs_hash_for_each_nolock should also be modified to check for -EINTR, though I'm not sure what the proper behavior is. |
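On the -EINTR point, the sketch below is purely hypothetical: the helper name is invented, the locking that the real cfs_hash_for_each_nolock performs is omitted, and cfs_hash_for_each_relax is internal to the hash code, so this only illustrates the idea of not discarding a negative return.

	/* hypothetical sketch: propagate a negative return (e.g. -EINTR) from
	 * cfs_hash_for_each_relax instead of silently discarding it */
	static int cfs_hash_for_each_nolock_checked(struct cfs_hash *hs,
						    cfs_hash_for_each_cb_t func,
						    void *data)
	{
		int rc = cfs_hash_for_each_relax(hs, func, data, 0);

		return rc < 0 ? rc : 0;
	}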
| Comment by Gerrit Updater [ 12/Oct/16 ] |
|
Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23120 |
| Comment by Gerrit Updater [ 21/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23120/ |
| Comment by Peter Jones [ 22/Oct/16 ] |
|
Landed for 2.9 |