[LU-8450] replay-single test 70c: mount MDS hung Created: 29/Jul/16  Updated: 22/Oct/16  Resolved: 22/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Jian Yu Assignee: Kit Westneat
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3291 IU UID/GID Mapping Feature Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test 70c hung while mounting MDS:

Starting mds1:   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: onyx-33vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1

Console log on MDS:

Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre                                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.127@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 683 previous similar messages
Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1469762588/real 1469762588]  req@ffff880051ebaa00 x1541153266605312/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1469762613 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
Lustre: 29062:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply 
  req@ffff88004a8faa00 x1541153729978528/t0(0) o101->6a772ed4-43ff-dc51-4d04-2c0278989dc2@10.2.4.120@tcp:-1/-1 lens 872/3512 e 24 to 0 dl 1469763017 ref 2 fl Interpret:/0/0 rc 0/0
Lustre: lustre-MDT0002: Client 6a772ed4-43ff-dc51-4d04-2c0278989dc2 (at 10.2.4.120@tcp) reconnecting
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: Skipped 3 previous similar messages
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: Skipped 6 previous similar messages
Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
Lustre: Skipped 12 previous similar messages
LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.127@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 1909 previous similar messages
Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1469763188/real 1469763188]  req@ffff8800546a1200 x1541153266628608/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1469763213 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
INFO: task mdt00_002:29063 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdt00_002       D ffffffffa0b1d108     0 29063      2 0x00000080 
 ffff88004f7b3aa0 0000000000000046 ffff88004bc15c00 ffff88004f7b3fd8
 ffff88004f7b3fd8 ffff88004f7b3fd8 ffff88004bc15c00 ffffffffa0b1d100
 ffffffffa0b1d104 ffff88004bc15c00 00000000ffffffff ffffffffa0b1d108
Call Trace:
 [<ffffffff8163cb09>] schedule_preempt_disabled+0x29/0x70
 [<ffffffff8163a805>] __mutex_lock_slowpath+0xc5/0x1c0
 [<ffffffff81639c6f>] mutex_lock+0x1f/0x2f
 [<ffffffffa0a8e024>] nodemap_add_member+0x34/0x1b0 [ptlrpc]
 [<ffffffffa0dbf161>] mdt_obd_reconnect+0x81/0x1d0 [mdt]
 [<ffffffffa09d1e6f>] target_handle_connect+0x1c4f/0x2e30 [ptlrpc]
 [<ffffffffa0a6f5f2>] tgt_request_handle+0x3f2/0x1320 [ptlrpc]
 [<ffffffffa0a1bccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
 [<ffffffffa0a19888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
 [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
 [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
 [<ffffffffa0a1fd80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
 [<ffffffffa0a1f2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
 [<ffffffff810a5aef>] kthread+0xcf/0xe0
 [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff816469d8>] ret_from_fork+0x58/0x90 
 [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140

Maloo reports:
https://testing.hpdd.intel.com/test_sets/3f6a9a0e-557a-11e6-906c-5254006e85c2
https://testing.hpdd.intel.com/test_sets/cecb3c06-54af-11e6-a39e-5254006e85c2



 Comments   
Comment by Jian Yu [ 29/Jul/16 ]

This is affecting patch review testing on master branch.

Comment by Oleg Drokin [ 01/Aug/16 ]

Seems to be some sort of deadlock in the nodemap code.

Comment by Joseph Gmitter (Inactive) [ 01/Aug/16 ]

Hi Kit,

Could you please have a look at this issue? It appears to be occurring in the nodemap area.

Thanks.
Joe

Comment by Kit Westneat [ 01/Aug/16 ]

It looks like ldlm_revoke_export_locks is getting hung trying to empty the lock hash (exp->exp_lock_hash), which causes the nodemap functions to block waiting for that to finish.
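
A runnable stand-alone illustration of the suspected deadlock shape, pieced together from the stack trace above (this is not Lustre code; the mutex name active_config_lock is an assumption based on the nodemap code, since the trace only shows an anonymous mutex):

        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        /* Stand-in for the nodemap config mutex (name is an assumption). */
        static pthread_mutex_t active_config_lock = PTHREAD_MUTEX_INITIALIZER;
        static volatile int hash_empty;   /* never becomes true in the bug */

        /* Thread A: nodemap config change holding the mutex while revoking. */
        static void *config_thread(void *arg)
        {
                pthread_mutex_lock(&active_config_lock);
                /* ldlm_revoke_export_locks() -> cfs_hash_for_each_empty()
                 * analogue: spins because the "hash" never empties. */
                while (!hash_empty)
                        sleep(1);
                pthread_mutex_unlock(&active_config_lock);
                return NULL;
        }

        /* Thread B: mdt_obd_reconnect() -> nodemap_add_member() analogue. */
        static void *reconnect_thread(void *arg)
        {
                pthread_mutex_lock(&active_config_lock); /* blocks forever */
                puts("reconnect proceeded");             /* never reached */
                pthread_mutex_unlock(&active_config_lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_create(&a, NULL, config_thread, NULL);
                sleep(1);              /* let thread A grab the mutex */
                pthread_create(&b, NULL, reconnect_thread, NULL);
                pthread_join(b, NULL); /* hangs, like mdt00_002 above */
                return 0;
        }

Compile with cc -pthread; thread B never acquires the mutex, mirroring the blocked mdt00_002 thread in the hung-task report.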

I was looking at the original code for ldlm_revoke_export_locks, and I noticed that it does not use cfs_hash_for_each_empty; it just sends the ASTs and finishes. There was a change in 2008 that switched the lock data structure from a list to a hash, and it looks like the loop-until-empty behavior (cfs_hash_for_each_empty) was introduced then. The commit message doesn't discuss this change at all, so I wonder if it was unintentional.

Is there anyone who would know what the correct behavior is?

Thanks,
Kit

Comment by Oleg Drokin [ 01/Sep/16 ]

Do you mean commit 8073f0e4bef2db551a4b4bcaeb72a9986571f1bd?

I do not see a big logic change.
Before, we traversed the list, collected entries, accumulated them into an AST list and then sent them from ldlm_run_ast_work().

Now we use cfs_hash_for_each_empty() to collect all entries in the hash, fill the list in the callback and then send everything from ldlm_run_ast_work() as before.
I am not sure which loop you mean here. If you mean the one inside cfs_hash_for_each_empty, that one would leave very visible traces in the logs if it was looping, and it would show up in the stack trace too, but that's not happening?
Also, the leading comment in cfs_hash_for_each_relax() describes two cases in which it needs to loop, and I don't think they are happening in this case?
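
For reference, the post-2008 flow being discussed corresponds roughly to this call site (a sketch from memory of the 2.9-era tree; details such as argument lists may differ):

        void ldlm_revoke_export_locks(struct obd_export *exp)
        {
                struct list_head rpc_list;

                INIT_LIST_HEAD(&rpc_list);
                /* Collect granted locks from the export's lock hash;
                 * ldlm_revoke_lock_cb fills rpc_list, and the wrapper
                 * keeps re-scanning until the hash is empty. */
                cfs_hash_for_each_empty(exp->exp_lock_hash,
                                        ldlm_revoke_lock_cb, &rpc_list);
                /* Send all of the collected blocking ASTs in one batch. */
                ldlm_run_ast_work(exp->exp_obd->obd_namespace, &rpc_list,
                                  LDLM_WORK_REVOKE_AST);
        }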

Comment by Kit Westneat [ 01/Sep/16 ]

Hi Oleg,

Yes, that's the commit I'm talking about.

In the older version, the list was only traversed once, right? In the newer version, it loops until the hash is empty, even if some locks are not processed because of the three checks at the beginning of the callback. This is the change I don't understand; I don't know enough about the locking system to know what those checks do with regard to lock eviction.

When I've run into similar deadlocks with D_INFO enabled, I see "Try to empty hash:" repeated essentially forever. It's possible that this is a different deadlock, but it seems similar.

Thanks,
Kit

Comment by Oleg Drokin [ 01/Sep/16 ]

Well, the loop is there, but for it to keep going around, these conditions need to be met:

  • a. if rehash_key is enabled, an item can be moved from one bucket to another bucket
  • b. user can remove non-zero-ref item from hash-table, so the item can be removed from hash-table; even worse, it's possible that user changed key and insert to another hash bucket

So we need a constant stream of activity on the bucket then, and where does that come from? If it's expected to be there, then possibly cfs_hash_for_each_empty is the wrong API for you? Or you need to somehow fence off the activity in the hash while you are working on it.

Comment by Kit Westneat [ 06/Sep/16 ]

Hi Oleg,

I think we may be talking about different things. I believe you are talking about the loops in cfs_hash_for_each_relax, but the loop I'm talking about is in cfs_hash_for_each_empty:

        while (cfs_hash_for_each_relax(hs, func, data, 0)) {
                CDEBUG(D_INFO, "Try to empty hash: %s, loop: %u\n",
                       hs->hs_name, i++);
        }

cfs_hash_for_each_relax returns the number of iterations it's done in its loops, essentially the number of items remaining in the hash. So as you can see, cfs_hash_for_each_empty loops until the hash is empty.

The conditions I'm talking about are in the callback for ldlm_revoke_export_locks:

        if (lock->l_req_mode != lock->l_granted_mode) {
                unlock_res_and_lock(lock);
                return 0;
        }
        if (lock->l_resource->lr_type != LDLM_IBITS &&
            lock->l_resource->lr_type != LDLM_PLAIN) {
                unlock_res_and_lock(lock);
                return 0;
        }
        if (ldlm_is_ast_sent(lock)) {
                unlock_res_and_lock(lock);
                return 0;
        }

So if there is a lock in the exp_lock_hash that meets one of those three conditions, cfs_hash_for_each_empty will loop until some external process modifies that lock. My theory is that in these deadlock cases, nothing modifies the lock and the hash is never emptied, so cfs_hash_for_each_empty loops forever.
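
As a toy illustration of that theory (stand-alone C, not Lustre code; the names are made up for the demo), one item that is always skipped keeps the empty-loop spinning forever:

        #include <stdio.h>

        /* One "lock" whose AST-sent flag never clears, so the callback
         * always skips it and it is never removed from the "hash". */
        static int ast_sent = 1;

        /* Mimics cfs_hash_for_each_relax(): one pass over the items,
         * returning how many are still present afterwards. */
        static int for_each_relax_once(void)
        {
                return ast_sent ? 1 : 0;
        }

        int main(void)
        {
                unsigned int i = 0;

                /* Mimics cfs_hash_for_each_empty(): re-scan until a pass
                 * finds nothing left.  With the permanently skipped item,
                 * this loop never terminates. */
                while (for_each_relax_once()) {
                        if (++i % 100000000 == 0)
                                printf("Try to empty hash: loop %u\n", i);
                }
                return 0;
        }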

In the original code, it iterated the lock list once and did not require that the list be emptied. I was wondering if there is a reason for using cfs_hash_for_each_empty in this case, or if it's possible to replace it with cfs_hash_for_each_nolock, which seems to be closer to what the original code does.

Thanks,
Kit

Comment by Oleg Drokin [ 06/Sep/16 ]

We are talking about the same loop.
The loop is external to cfs_hash_for_each_relax(), which stops on change (as mentioned in the comment), and the outside code needs to loop (the loop we are talking about).

This code was used by the (now removed) remote client functionality, and you are the only remaining user, btw, so you can mold this function to your taste.

The conditions are a bit strange, but this was only used on the MDS anyway, so the locks could only be ibits (or flock) locks.
If your locks are on the MGS, then you are using plain locks, which is again fine for the conditions.
Ungranted locks are supposed to become granted soon, so they should not be a big concern.
A lock with the AST-sent flag is one for which there is a conflict, so it should disappear soon.

If you can figure out which lock is failing those checks, the conditions could be amended, I imagine.

Comment by Kit Westneat [ 11/Oct/16 ]

> The loop is external to cfs_hash_for_each_relax(), which stops on change (as mentioned in the comment), and the outside code needs to loop (the loop we are talking about).

Hmm, but cfs_hash_for_each_relax also returns non-zero if it is successful, no? It returns the number of items iterated over (count), so cfs_hash_for_each_empty loops until the hash is empty. As far as I can tell, neither cfs_hash_for_each_nolock nor cfs_hash_for_each_empty actually checks the return value of cfs_hash_for_each_relax for the -EINTR condition described in the comment.

Using cfs_hash_for_each_nolock in ldlm_revoke_export_locks makes more sense to me; I'll upload a patch with that change. It seems like cfs_hash_for_each_nolock should also be modified to check for -EINTR, though I'm not sure what the proper behavior is.
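
At the call site, the change described above would look roughly like this (a sketch only; the exact argument list may differ, and the actual patch is in the Gerrit link below):

        -        cfs_hash_for_each_empty(exp->exp_lock_hash,
        -                                ldlm_revoke_lock_cb, &rpc_list);
        +        /* Single pass over the hash; no requirement that it empties. */
        +        cfs_hash_for_each_nolock(exp->exp_lock_hash,
        +                                 ldlm_revoke_lock_cb, &rpc_list);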

Comment by Gerrit Updater [ 12/Oct/16 ]

Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23120
Subject: LU-8450 nodemap: modify ldlm_revoke_export_locks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a6884dde68e959915dcfac6197641aa3cc27aed6

Comment by Gerrit Updater [ 21/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23120/
Subject: LU-8450 nodemap: modify ldlm_revoke_export_locks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 449fe54db666a3ad29503a55aa0f048b5f1d6543

Comment by Peter Jones [ 22/Oct/16 ]

Landed for 2.9
