[LU-2807] lockup in server completion ast -> lu_object_find_at

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.4.0
    • 3
    • 6794

    Description

      While running racer, I repeatedly hit a problem where the completion AST callback gets stuck looking up an object.
      Alex thinks it is a race against object deletion that was never fully fixed.
      The stack trace looks like this:

      [175924.328073] INFO: task ptlrpc_hr01_003:16414 blocked for more than 120 seconds.
      [175924.328610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [175924.329108] ptlrpc_hr01_0 D 0000000000000006  3952 16414      2 0x00000000
      [175924.329432]  ffff880076a19920 0000000000000046 0000000000000040 0000000000000286
      [175924.329950]  ffff880076a198a0 0000000000000286 0000000000000286 ffffc9000376b040
      [175924.330457]  ffff8800573a67b8 ffff880076a19fd8 000000000000fba8 ffff8800573a67b8
      [175924.330950] Call Trace:
      [175924.331191]  [<ffffffffa0743c36>] ? htable_lookup+0x1a6/0x1c0 [obdclass]
      [175924.331505]  [<ffffffffa041e79e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [175924.331807]  [<ffffffffa0744243>] lu_object_find_at+0xb3/0x360 [obdclass]
      [175924.332104]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [175924.332403]  [<ffffffffa07413df>] ? keys_fill+0x6f/0x190 [obdclass]
      [175924.332746]  [<ffffffffa0744506>] lu_object_find+0x16/0x20 [obdclass]
      [175924.333035]  [<ffffffffa0549ea6>] mdt_object_find+0x56/0x170 [mdt]
      [175924.333398]  [<ffffffffa0586e63>] mdt_lvbo_fill+0x2f3/0x800 [mdt]
      [175924.333715]  [<ffffffffa0845c1a>] ldlm_server_completion_ast+0x18a/0x640 [ptlrpc]
      [175924.334204]  [<ffffffffa0845a90>] ? ldlm_server_completion_ast+0x0/0x640 [ptlrpc]
      [175924.334655]  [<ffffffffa081bbdc>] ldlm_work_cp_ast_lock+0xcc/0x200 [ptlrpc]
      [175924.334976]  [<ffffffffa085c18f>] ptlrpc_set_wait+0x6f/0x880 [ptlrpc]
      [175924.335264]  [<ffffffff81090154>] ? __init_waitqueue_head+0x24/0x40
      [175924.335559]  [<ffffffffa041e8a5>] ? cfs_waitq_init+0x15/0x20 [libcfs]
      [175924.335867]  [<ffffffffa085876e>] ? ptlrpc_prep_set+0x11e/0x300 [ptlrpc]
      [175924.336134]  [<ffffffffa081bb10>] ? ldlm_work_cp_ast_lock+0x0/0x200 [ptlrpc]
      [175924.336444]  [<ffffffffa081e19b>] ldlm_run_ast_work+0x1db/0x460 [ptlrpc]
      [175924.336767]  [<ffffffffa081eda4>] ldlm_reprocess_all+0x114/0x300 [ptlrpc]
      [175924.337067]  [<ffffffffa08372e3>] ldlm_cli_cancel_local+0x2b3/0x470 [ptlrpc]
      [175924.337445]  [<ffffffffa083bbab>] ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
      [175924.337719]  [<ffffffffa083bf42>] ldlm_blocking_ast_nocheck+0x92/0x320 [ptlrpc]
      [175924.338177]  [<ffffffffa0819070>] ? lock_res_and_lock+0x30/0x50 [ptlrpc]
      [175924.338464]  [<ffffffffa0549d40>] mdt_blocking_ast+0x190/0x2a0 [mdt]
      [175924.338759]  [<ffffffffa042e401>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      [175924.339051]  [<ffffffff814faf3e>] ? _spin_unlock+0xe/0x10
      [175924.339339]  [<ffffffffa083f950>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
      [175924.339814]  [<ffffffffa0820cc6>] ldlm_lock_decref_internal+0x426/0xc80 [ptlrpc]
      [175924.340282]  [<ffffffff814faf3e>] ? _spin_unlock+0xe/0x10
      [175924.340614]  [<ffffffffa0712217>] ? class_handle2object+0x97/0x170 [obdclass]
      [175924.341175]  [<ffffffffa0821f49>] ldlm_lock_decref+0x39/0x90 [ptlrpc]
      [175924.341527]  [<ffffffffa087112b>] ptlrpc_hr_main+0x39b/0x760 [ptlrpc]
      [175924.341824]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [175924.342141]  [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
      [175924.342444]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [175924.342734]  [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
      [175924.343068]  [<ffffffffa0870d90>] ? ptlrpc_hr_main+0x0/0x760 [ptlrpc]
      [175924.343376]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

          Activity

            jay Jinshan Xiong (Inactive) added a comment - Will be fixed in LU-3124 with patch http://review.whamcloud.com/6042

            bfaccini Bruno Faccini (Inactive) added a comment - No, I did not mean that the problem is in layout-swap/LVB, only that layout-swap/LVB brought it back to the forefront. We already agreed that it is an old, known race between unlink and getattr. I just wanted to point out that you now think the best place to handle and fix this is mdt_lvbo_fill(), which is where I was saying the extra lookup that causes the hang is.

            jay Jinshan Xiong (Inactive) added a comment - "So finally, you changed your mind and will fix it on the LVB/layout-swap side as we were discussing before ?"

            This is not a layout-swap issue. Maybe I missed something in our previous conversation.

            bzzz Alex Zhuravlev added a comment - Frankly, I can't say this is a very nice solution, and I don't think one more RPC to fetch the LOV after data restore is such a big problem.

            bfaccini Bruno Faccini (Inactive) added a comment - So finally, you changed your mind and will fix it on the LVB/layout-swap side, as we were discussing before?

            jay Jinshan Xiong (Inactive) added a comment - I'm going to fix this issue by finding a field in ldlm_lock, say l_tree_node, to store the mdt_object when the request is an intent operation, which finds the object first and then requests the DLM lock. mdt_lvbo_fill() then calls mdt_object_find() only if that field is NULL.
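
            A minimal sketch of the cached-object idea in the comment above, using toy types rather than the real Lustre structures. Every lvb_* name is hypothetical: lvb_lock.cached stands in for the spare ldlm_lock field, lvb_object_find() for mdt_object_find(), and lvb_fill() for mdt_lvbo_fill(). This is not the code from any patch linked in this ticket.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins only; these are NOT the real Lustre structures. */
struct lvb_object {
    int fid;                    /* identifies the object */
};

struct lvb_lock {
    int fid;                    /* resource the lock protects */
    struct lvb_object *cached;  /* plays the role of the spare ldlm_lock field */
};

static int lvb_find_calls;      /* counts how often the slow lookup runs */

/* Stand-in for mdt_object_find(): the lookup that can block in the real code. */
static struct lvb_object *lvb_object_find(struct lvb_object *table, size_t n,
                                          int fid)
{
    lvb_find_calls++;
    for (size_t i = 0; i < n; i++)
        if (table[i].fid == fid)
            return &table[i];
    return NULL;
}

/* Intent path: find the object first, then remember it in the lock. */
static void lvb_intent_enqueue(struct lvb_lock *lock,
                               struct lvb_object *table, size_t n)
{
    assert(lock != NULL);
    lock->cached = lvb_object_find(table, n, lock->fid);
}

/* Stand-in for mdt_lvbo_fill(): look the object up only if none was cached. */
static struct lvb_object *lvb_fill(struct lvb_lock *lock,
                                   struct lvb_object *table, size_t n)
{
    assert(lock != NULL);
    if (lock->cached == NULL)
        lock->cached = lvb_object_find(table, n, lock->fid);
    return lock->cached;
}
```

            With a lock that went through lvb_intent_enqueue(), lvb_fill() returns the cached pointer without a second lookup; a lock enqueued any other way still falls back to the lookup. Real code would presumably also have to hold a reference on the cached object for as long as the lock keeps the pointer; the sketch ignores object lifetime entirely.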
            jay Jinshan Xiong (Inactive) added a comment - patch is at: http://review.whamcloud.com/5911

            jay Jinshan Xiong (Inactive) added a comment - I mean we can declare a new function, say mdt_object_lookup(), which looks up the hash table and makes sure the object exists in the cache. mdt_object_lookup() also calls lu_object_find(), but with a new flag in lu_object_conf, say LOC_F_LOOKUP. With this flag, lu_object_find() looks up the hash table only and, of course, returns -ENOENT if the object is dying.

            This assumes that the object must already be referenced by someone. For a getattr intent request this is true, but we need to check the other code paths to make sure.
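
            The lookup-only behaviour proposed above can be sketched with a toy cache. All lk_* names are hypothetical: LK_F_LOOKUP models the suggested LOC_F_LOOKUP flag, lk_find() models lu_object_find(), and lk_lookup() models the suggested mdt_object_lookup(). Returning -EAGAIN for a dying object when the flag is absent is an assumption of this sketch; it marks the spot where, per the stack trace in the description, the real code waits.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum lk_flags {
    LK_F_LOOKUP = 1 << 0,   /* models the proposed LOC_F_LOOKUP */
};

struct lk_object {
    int fid;
    int in_use;             /* slot holds a cached object */
    int dying;              /* object is being destroyed */
};

#define LK_SLOTS 8

struct lk_cache {
    struct lk_object slot[LK_SLOTS];
};

/*
 * Find fid in the cache; return 0 and set *out on success.
 * With LK_F_LOOKUP: never allocate, return -ENOENT on a miss or a dying object.
 * Without it: allocate on a miss; a dying object yields -EAGAIN here, which
 * marks where a caller would otherwise have to wait (toy assumption).
 */
static int lk_find(struct lk_cache *c, int fid, unsigned int flags,
                   struct lk_object **out)
{
    struct lk_object *free_slot = NULL;

    assert(c != NULL && out != NULL);
    *out = NULL;
    for (size_t i = 0; i < LK_SLOTS; i++) {
        struct lk_object *o = &c->slot[i];

        if (!o->in_use) {
            if (free_slot == NULL)
                free_slot = o;
            continue;
        }
        if (o->fid != fid)
            continue;
        if (o->dying)
            return (flags & LK_F_LOOKUP) ? -ENOENT : -EAGAIN;
        *out = o;
        return 0;
    }
    if (flags & LK_F_LOOKUP)
        return -ENOENT;     /* lookup only: never create */
    if (free_slot == NULL)
        return -ENOMEM;
    free_slot->fid = fid;
    free_slot->in_use = 1;
    free_slot->dying = 0;
    *out = free_slot;
    return 0;
}

/* Models the suggested mdt_object_lookup(): cache lookup only. */
static int lk_lookup(struct lk_cache *c, int fid, struct lk_object **out)
{
    return lk_find(c, fid, LK_F_LOOKUP, out);
}
```

            The design point is that a lookup-only caller never creates an object and never waits for a dying one, so it can only succeed for an object that some other path already holds in the cache. That is the assumption the comment says still needs checking on the other code paths.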

            bfaccini Bruno Faccini (Inactive) added a comment - Thanks, Jinshan. So in your view the problem was not introduced by the LVB/layout-swap changes, only highlighted by them.

            And the fix you suggest is to give getattr the means to detect that an unlink has occurred and the object is dying, via a new lu_object_lookup() method called just after it has acquired the "inodebit dlm lock", returning ENOENT if the object is dying?

            jay Jinshan Xiong (Inactive) added a comment - I think this is a race between unlink and getattr. Let's make up a test case for this race, say:
            1. client1's unlink reaches the MDT;
            2. before unlink enqueues its lock, client2 sends a getattr intent request;
            3. unlink acquires the inodebits DLM lock;
            4. before unlink releases the lock, getattr tries to acquire it and is blocked;
            5. unlink finishes and releases the lock, and getattr's completion_ast is invoked;
            6. the problem should be reproduced.

            If this is the case, we can work out a lu_object_lookup() that returns -ENOENT if the object has already been killed or does not exist; -ENOENT should then be returned to the getattr intent request as well.
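
            The ordering in the six steps above can be replayed in a single thread to show where the two behaviours diverge. This is a toy model, not a reproducer for a real MDT: the race_* names are hypothetical, the inodebits lock is reduced to two booleans, and RACE_WOULD_HANG is a made-up marker for the point where the completion AST would sit waiting on a dying object.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct race_object {
    bool dying;             /* set once unlink has killed the object */
};

struct race_lock {
    bool held;              /* the toy inodebits lock is granted */
    bool waiter;            /* a getattr enqueue is blocked on it */
};

/* Made-up marker: the callback would block waiting for the dying object. */
enum { RACE_WOULD_HANG = 1 };

/* Toy completion AST: runs for getattr once the lock is granted to it. */
static int race_completion_ast(const struct race_object *obj, bool lookup_only)
{
    if (!obj->dying)
        return 0;
    return lookup_only ? -ENOENT : RACE_WOULD_HANG;
}

/* Replay steps 1-6 in order and return what the completion AST sees. */
static int race_replay(bool lookup_only)
{
    struct race_object obj = { .dying = false };
    struct race_lock lock = { .held = false, .waiter = false };

    /* Steps 1-2: unlink and getattr have both arrived; nobody holds the lock. */

    /* Step 3: unlink acquires the lock. */
    lock.held = true;

    /* Step 4: getattr tries to acquire it and is blocked. */
    assert(lock.held);
    lock.waiter = true;

    /* Step 5: unlink kills the object and releases the lock ... */
    obj.dying = true;
    lock.held = false;

    /* ... so the lock is granted to the waiter and its completion AST runs. */
    if (!lock.waiter)
        return 0;
    lock.waiter = false;
    lock.held = true;

    /* Step 6: the outcome depends on how the callback looks the object up. */
    return race_completion_ast(&obj, lookup_only);
}
```

            In this model race_replay(false) ends in RACE_WOULD_HANG, matching the blocked thread in the description, while race_replay(true) ends in -ENOENT, which is the value the comment proposes returning to the getattr intent request.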

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: green Oleg Drokin