
[LU-8450] replay-single test 70c: mount MDS hung

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Severity: 3

    Description

      replay-single test 70c hung while mounting MDS:

      Starting mds1:   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      CMD: onyx-33vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      

      Console log on MDS:

      Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre                                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
      LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.127@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 683 previous similar messages
      Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1469762588/real 1469762588]  req@ffff880051ebaa00 x1541153266605312/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1469762613 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
      Lustre: 29062:0:(service.c:1335:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply 
        req@ffff88004a8faa00 x1541153729978528/t0(0) o101->6a772ed4-43ff-dc51-4d04-2c0278989dc2@10.2.4.120@tcp:-1/-1 lens 872/3512 e 24 to 0 dl 1469763017 ref 2 fl Interpret:/0/0 rc 0/0
      Lustre: lustre-MDT0002: Client 6a772ed4-43ff-dc51-4d04-2c0278989dc2 (at 10.2.4.120@tcp) reconnecting
      Lustre: Skipped 1 previous similar message
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: Skipped 1 previous similar message
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: Skipped 3 previous similar messages
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: Skipped 6 previous similar messages
      Lustre: lustre-MDT0002: Export ffff880057b24400 already connecting from 10.2.4.120@tcp
      Lustre: Skipped 12 previous similar messages
      LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.127@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 1909 previous similar messages
      Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1469763188/real 1469763188]  req@ffff8800546a1200 x1541153266628608/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1469763213 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 6963:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
      INFO: task mdt00_002:29063 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdt00_002       D ffffffffa0b1d108     0 29063      2 0x00000080 
       ffff88004f7b3aa0 0000000000000046 ffff88004bc15c00 ffff88004f7b3fd8
       ffff88004f7b3fd8 ffff88004f7b3fd8 ffff88004bc15c00 ffffffffa0b1d100
       ffffffffa0b1d104 ffff88004bc15c00 00000000ffffffff ffffffffa0b1d108
      Call Trace:
       [<ffffffff8163cb09>] schedule_preempt_disabled+0x29/0x70
       [<ffffffff8163a805>] __mutex_lock_slowpath+0xc5/0x1c0
       [<ffffffff81639c6f>] mutex_lock+0x1f/0x2f
       [<ffffffffa0a8e024>] nodemap_add_member+0x34/0x1b0 [ptlrpc]
       [<ffffffffa0dbf161>] mdt_obd_reconnect+0x81/0x1d0 [mdt]
       [<ffffffffa09d1e6f>] target_handle_connect+0x1c4f/0x2e30 [ptlrpc]
       [<ffffffffa0a6f5f2>] tgt_request_handle+0x3f2/0x1320 [ptlrpc]
       [<ffffffffa0a1bccb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
       [<ffffffffa0a19888>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
       [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
       [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
       [<ffffffffa0a1fd80>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
       [<ffffffffa0a1f2e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
       [<ffffffff810a5aef>] kthread+0xcf/0xe0
       [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
       [<ffffffff816469d8>] ret_from_fork+0x58/0x90 
       [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
      

      Maloo reports:
      https://testing.hpdd.intel.com/test_sets/3f6a9a0e-557a-11e6-906c-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/cecb3c06-54af-11e6-a39e-5254006e85c2
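
      The call trace above shows an MDT service thread stuck in mutex_lock() inside nodemap_add_member() on the reconnect path; per the analysis in the comments below, the mutex appears to be held by a thread that never finishes revoking export locks. Below is a minimal user-space model of that blocking pattern; all names are illustrative and this is not Lustre code.

    /* hang_model.c: user-space model of the hang (illustrative only).
     * Thread A holds a mutex standing in for the nodemap config lock while
     * it spins in a revocation loop that never finishes; thread B models
     * mdt_obd_reconnect() -> nodemap_add_member() and blocks on the same
     * mutex, which is what the hung-task trace above shows.
     * Build: cc -pthread hang_model.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t nodemap_config_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *revoke_thread(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&nodemap_config_lock);
            for (;;)                /* "empty the lock hash" never completes */
                    sleep(1);
            return NULL;            /* never reached; mutex never released */
    }

    static void *reconnect_thread(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&nodemap_config_lock);   /* blocks forever */
            puts("reconnect proceeded");                /* never printed */
            pthread_mutex_unlock(&nodemap_config_lock);
            return NULL;
    }

    int main(void)
    {
            pthread_t revoker, reconnector;

            pthread_create(&revoker, NULL, revoke_thread, NULL);
            sleep(1);               /* let the revoker take the mutex first */
            pthread_create(&reconnector, NULL, reconnect_thread, NULL);
            pthread_join(reconnector, NULL);  /* hangs, like mdt00_002 above */
            return 0;
    }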

      Attachments

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23120/
            Subject: LU-8450 nodemap: modify ldlm_revoke_export_locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 449fe54db666a3ad29503a55aa0f048b5f1d6543


            gerrit Gerrit Updater added a comment -

            Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/23120
            Subject: LU-8450 nodemap: modify ldlm_revoke_export_locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a6884dde68e959915dcfac6197641aa3cc27aed6


            kit.westneat Kit Westneat (Inactive) added a comment -

            > The loop is external to cfs_hash_for_each_relax(), which stops on change (as is mentioned in the comment), and the outside code needs to loop (the loop we are talking about).

            Hmm, but cfs_hash_for_each_relax also returns non-zero if it is successful, no? It returns the number of items iterated over (count), so cfs_hash_for_each_empty loops until the hash is empty. As far as I can tell, neither cfs_hash_for_each_nolock nor cfs_hash_for_each_empty actually checks the return value of cfs_hash_for_each_relax for the -EINTR condition described in the comment.

            Using cfs_hash_for_each_nolock in ldlm_revoke_export_locks makes more sense to me; I'll upload a patch with that change. It seems like cfs_hash_for_each_nolock should also be modified to check for -EINTR, though I'm not sure what the proper behavior is.
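
            For reference, the shape of that proposed change might look roughly like the sketch below: a single pass over the export lock hash instead of draining it until empty, with the existing revoke callback and AST dispatch left as they are. This is only a sketch against the current ldlm_revoke_export_locks(), not the patch as landed, and the exact cfs_hash_for_each_nolock() prototype varies between Lustre versions (some trees take a trailing start argument).

    /* Sketch only (not the landed patch): revoke by visiting every lock in
     * the export hash once, then sending the collected ASTs, instead of
     * looping until the hash is empty. */
    void ldlm_revoke_export_locks(struct obd_export *exp)
    {
            struct list_head rpc_list;

            INIT_LIST_HEAD(&rpc_list);
            /* single pass; some trees add a trailing "start" argument here */
            cfs_hash_for_each_nolock(exp->exp_lock_hash,
                                     ldlm_revoke_lock_cb, &rpc_list);
            ldlm_run_ast_work(exp->exp_obd->obd_namespace, &rpc_list,
                              LDLM_WORK_REVOKE_AST);
    }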

            green Oleg Drokin added a comment -

            We are talking about the same loop.
            The loop is external to cfs_hash_for_each_relax(), which stops on change (as is mentioned in the comment), and the outside code needs to loop (the loop we are talking about).

            This code was used by the (now removed) remote client functionality, and you are the only remaining user, btw, so you can mold this function to your taste.

            The conditions are a bit strange, but it was only used on the MDS anyway, so those could only be ibits (or flock) locks.
            If your locks are on the MGS, then you are using plain locks, which is again OK for the conditions.
            The ungranted locks are supposed to become granted soon, so they should not be a big concern.
            A lock for which the AST has been sent is a lock for which there is a conflict, so it should disappear soon.

            If you can figure out which lock is failing those conditions, they could be amended, I imagine.


            kit.westneat Kit Westneat (Inactive) added a comment -

            Hi Oleg,

            I think we may be talking about different things. I believe you are talking about the loops in cfs_hash_for_each_relax, but the loop I'm talking about is in cfs_hash_for_each_empty:

                    while (cfs_hash_for_each_relax(hs, func, data, 0)) {
                            CDEBUG(D_INFO, "Try to empty hash: %s, loop: %u\n",
                                   hs->hs_name, i++);
                    }
            

            cfs_hash_for_each_relax returns the number of iterations it's done in its loops, essentially the number of items remaining in the hash. So as you can see, cfs_hash_for_each_empty loops until the hash is empty.

            The conditions I'm talking about are in the callback for ldlm_revoke_export_locks:

                    if (lock->l_req_mode != lock->l_granted_mode) {
                            unlock_res_and_lock(lock);
                            return 0;
                    }
                    if (lock->l_resource->lr_type != LDLM_IBITS &&
                        lock->l_resource->lr_type != LDLM_PLAIN) {
                            unlock_res_and_lock(lock);
                            return 0;
                    }
                    if (ldlm_is_ast_sent(lock)) {
                            unlock_res_and_lock(lock);
                            return 0;
                    }
            

            So if there is a lock in the exp_lock_hash that meets one of those three conditions, cfs_hash_for_each_empty will loop until some external process modifies that lock. My theory is that in these deadlock cases, nothing modifies the lock and the hash is never emptied, so cfs_hash_for_each_empty loops forever.

            In the original code, it iterated the lock list once and did not require that the list be emptied. I was wondering if there is a reason for using cfs_hash_for_each_empty in this case, or if it's possible to replace it with cfs_hash_for_each_nolock, which seems to be closer to what the original code does.

            Thanks,
            Kit
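
            A tiny self-contained model of the difference described above is sketched below; the table, callback and function names are made up for illustration and are not the libcfs APIs. The drain-until-empty traversal never returns once the callback declines to remove an entry, while the single-pass traversal always terminates.

    /* Hypothetical model of the two traversal styles; compile with c99. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NENTRIES 4

    static bool present[NENTRIES] = { true, true, true, true };

    /* Callback: returns true if it removed the entry.  Entry 2 stands in
     * for a lock that always hits one of the three early-return checks
     * above (ungranted, wrong type, or AST already sent). */
    static bool revoke_cb(int i)
    {
            if (i == 2)
                    return false;           /* skipped, stays in the table */
            present[i] = false;
            return true;
    }

    /* Single pass, in the spirit of cfs_hash_for_each_nolock(): visit each
     * entry once and return. */
    static void traverse_once(void)
    {
            for (int i = 0; i < NENTRIES; i++)
                    if (present[i])
                            revoke_cb(i);
    }

    /* Drain until empty, in the spirit of cfs_hash_for_each_empty(): repeat
     * whole passes until nothing is left.  With entry 2 never removed, this
     * never returns. */
    static void traverse_until_empty(void)
    {
            int remaining;

            do {
                    remaining = 0;
                    for (int i = 0; i < NENTRIES; i++)
                            if (present[i] && !revoke_cb(i))
                                    remaining++;
                    printf("Try to empty hash, %d left\n", remaining);
            } while (remaining);
    }

    int main(void)
    {
            traverse_once();            /* terminates, entry 2 untouched */
            traverse_until_empty();     /* loops forever printing "1 left" */
            return 0;
    }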

            green Oleg Drokin added a comment -

            Well, the loop is there, but for it to keep going around all the time, these conditions need to be met:

            • a. if rehash_key is enabled, an item can be moved from one bucket to another bucket
            • b. user can remove non-zero-ref item from hash-table, so the item can be removed from hash-table; even worse, it's possible that user changed key and insert to another hash bucket.

            So we need a constant stream of activity on the bucket then; where does that come from? If it's expected to be there, then possibly cfs_hash_for_each_empty is the wrong API for you? Or you need to somehow fence the activity in the hash while you are working on it.


            kit.westneat Kit Westneat (Inactive) added a comment -

            Hi Oleg,

            Yes, that's the commit I'm talking about.

            In the older version, the list was only traversed once, right? In the newer version, it loops until the hash is empty, even if there are some locks that are not processed due to the three checks at the beginning of the callback. This is the change I don't understand; I don't know enough about the locking system to know what those checks do with regard to lock eviction.

            When I've run into similar deadlocks with D_INFO enabled, I see "Try to empty hash:" repeated essentially forever. It's possible that this is a different deadlock, but it seems similar.

            Thanks,
            Kit

            green Oleg Drokin added a comment -

            Do you mean commit 8073f0e4bef2db551a4b4bcaeb72a9986571f1bd?

            I do not see a big logic change.
            Before, we traversed the list, collected entries, accumulated them into an AST list and then sent them from ldlm_run_ast_work().

            Now we use cfs_hash_for_each_empty() to collect all the entries in the hash, fill the list in the callback and then send everything from ldlm_run_ast_work() as before.
            I am not sure which loop you mean here. If you mean the one inside cfs_hash_for_each_empty, that one would leave very visible traces in the logs if it were looping, and it would be seen in the stack trace too, but that's not happening?
            Also, the leading comment in cfs_hash_for_each_relax() describes two cases where it needs to loop, and I don't think they are happening in this case?


            kit.westneat Kit Westneat (Inactive) added a comment -

            It looks like ldlm_revoke_export_locks is getting hung trying to empty the lock hash (exp->exp_lock_hash), which causes the nodemap functions to block waiting for that to finish.

            I was looking at the original code for ldlm_revoke_export_locks, and I noticed that that code does not use hash_for_each_empty; it just sends the ASTs and finishes. There was a change in 2008 to switch the lock data structure from a list to a hash. It looks like the change to loop until the hash is empty (hash_for_each_empty) was made then. The commit message doesn't discuss this change at all, so I wonder if it was unintentional.

            Is there anyone who would know what the correct behavior is?

            Thanks,
            Kit


            People

              Assignee: kit.westneat Kit Westneat (Inactive)
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 6
