Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8936

Client LBUGs with cl_object.c:735:cl_env_attach() ASSERTION( rc == 0 ) in process ldlm_bl_02

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • None
    • None
    • lustre-2.8.0_5.chaos-2.ch6.x86_64
      kernel 3.10.0-514.0.0.1chaos.ch6.x86_64
    • 3
    • 9223372036854775807

    Description

      Console reports first this:

      LustreError: 7526:0:(cl_object.c:735:cl_env_attach()) ASSERTION( rc == 0 ) failed:
      LustreError: 7526:0:(cl_object.c:735:cl_env_attach()) LBUG
      Pid: 7526, comm: ldlm_bl_02
      
      Call Trace:
       libcfs_debug_dumpstack+0x53/0x80 [libcfs]
       lbug_with_loc+0x45/0xc0 [libcfs]
       cl_env_percpu_get+0xc2/0xd0 [obdclass]
       ll_invalidatepage+0x41/0x170 [lustre]
       vvp_page_discard+0xbd/0x160 [lustre]
       cl_page_invoid+0x68/0x170 [obdclass]
       cl_page_discard+0x13/0x20 [obdclass]
       discard_cb+0x67/0x190 [osc]
       osc_page_gang_lookup+0x1e0/0x320 [osc]
       ? discard_cb+0x0/0x190 [osc]
       osc_lock_discard_pages+0x119/0x22d [osc]
       ? discard_cb+0x0/0x190 [osc]
       osc_lock_flush+0x89/0x280 [osc]
       osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
       ldlm_cancel_callback+0x8a/0x2e0 [ptlrpc]
       ? dequeue_entity+0x11c/0x5d0
       ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
       ldlm_cli_cancel+0xab/0x3d0 [ptlrpc]
       osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
       ? __schedule+0x3b8/0x9c0
       ldlm_handle_bl_callback+0xcf/0x410 [ptlrpc]
       ldlm_bl_thread_main+0x531/0x700 [ptlrpc]
       ? default_wake_function+0x0/0x20
       ? ldlm_bl_thread_main+0x0/0x700 [ptlrpc]
       kthread+0xcf/0xe0
       ? kthread+0x0/0xe0
       ret_from_fork+0x58/0x90
       ? kthread+0x0/0xe0
      

      is followed by

      BUG: sleeping function called from invalid context at mm/slub.c:941
      in_atomic(): 1, irqs_disabled(): 0, pid: 7526, name: ldlm_bl_02
      CPU: 23 PID: 7526 Comm: ldlm_bl_02 Tainted: G           OE  ------------   3.10.0-514.0.0.1chaos.ch6.x86_64 #1
      Call Trace:
       dump_stack+0x19/0x1b
       __might_sleep+0xd9/0x100
       kmem_cache_alloc_trace+0x4a/0x250
       ? call_usermodehelper_setup+0x3f/0xa0
       call_usermodehelper_setup+0x3f/0xa0
       call_usermodehelper+0x31/0x60
       libcfs_run_upcall+0x9e/0x3b0 [libcfs]
       ? snprintf+0x49/0x70
       libcfs_run_lbug_upcall+0x7d/0x100 [libcfs]
       lbug_with_loc+0x57/0xc0 [libcfs]
       cl_env_percpu_get+0xc2/0xd0 [obdclass]
       ll_invalidatepage+0x41/0x170 [lustre]
       vvp_page_discard+0xbd/0x160 [lustre]
       cl_page_invoid+0x68/0x170 [obdclass]
       cl_page_discard+0x13/0x20 [obdclass]
       discard_cb+0x67/0x190 [osc]
       osc_page_gang_lookup+0x1e0/0x320 [osc]
       ? check_and_discard_cb+0x150/0x150 [osc]
       osc_lock_discard_pages+0x119/0x22d [osc]
       ? check_and_discard_cb+0x150/0x150 [osc]
       osc_lock_flush+0x89/0x280 [osc]
       osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
       ldlm_cancel_callback+0x8a/0x2e0 [ptlrpc]
       ? dequeue_entity+0x11c/0x5d0
       ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
       ldlm_cli_cancel+0xab/0x3d0 [ptlrpc]
       osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
       ? __schedule+0x3b8/0x9c0
       ldlm_handle_bl_callback+0xcf/0x410 [ptlrpc]
       ldlm_bl_thread_main+0x531/0x700 [ptlrpc]
       ? wake_up_state+0x20/0x20
       ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
       kthread+0xcf/0xe0
       ? kthread_create_on_node+0x140/0x140
       ret_from_fork+0x58/0x90
       ? kthread_create_on_node+0x140/0x140
      

      I'm not certain whether there is a particular workload that triggers this. We've been running concurrent mdtest and ior jobs, using remote directories but not striped directories.

      The frequency is high; running on 300 clients for about 2 hours triggered this bug in 1/3 of the nodes.

      Attachments

        Issue Links

          Activity

            [LU-8936] Client LBUGs with cl_object.c:735:cl_env_attach() ASSERTION( rc == 0 ) in process ldlm_bl_02

            Pulling in something like LU-4257 is not the sort of thing we should generally be doing on our stable branch.

            morrone Christopher Morrone (Inactive) added a comment - Pulling in something like LU-4257 is not the sort of thing we should generally be doing on our stable branch.
            ofaaland Olaf Faaland added a comment -

            Jinshan,

            If we chose instead to pull in LU-4257 and keep LU-8509, are there any other changes you are aware of that we would need?

            ofaaland Olaf Faaland added a comment - Jinshan, If we chose instead to pull in LU-4257 and keep LU-8509 , are there any other changes you are aware of that we would need?
            ofaaland Olaf Faaland added a comment -

            Jinshan,

            I saw that change but it looks to me like the only way to get to out, with page non-NULL, is if page already tested for IS_ERR(). But even if I'm correct a future change could alter that, so I guess the code makes sense either way.

            And I now see that the initialization of page does matter in some cases, like ll_cl_find() returns NULL.

            So OK, this seems right. Thank you.

            ofaaland Olaf Faaland added a comment - Jinshan, I saw that change but it looks to me like the only way to get to out, with page non-NULL, is if page already tested for IS_ERR(). But even if I'm correct a future change could alter that, so I guess the code makes sense either way. And I now see that the initialization of page does matter in some cases, like ll_cl_find() returns NULL. So OK, this seems right. Thank you.

            The reset of LU-8509 fixed an actual problem:

            @@ -721,6 +721,10 @@ out:
                                    unlock_page(vmpage);
                                    put_page(vmpage);
                            }
            +               if (!IS_ERR_OR_NULL(page)) {
            +                       lu_ref_del(&page->cp_reference, "cl_io", io);
            +                       cl_page_put(env, page);
            +               }
            

            Where a cl_page was leaked in some scenarios.

            Yes, the major benefit of LU-4257 is the support of fast read, which can improve small I/O read performance significantly.

            jay Jinshan Xiong (Inactive) added a comment - The reset of LU-8509 fixed an actual problem: @@ -721,6 +721,10 @@ out: unlock_page(vmpage); put_page(vmpage); } + if (!IS_ERR_OR_NULL(page)) { + lu_ref_del(&page->cp_reference, "cl_io" , io); + cl_page_put(env, page); + } Where a cl_page was leaked in some scenarios. Yes, the major benefit of LU-4257 is the support of fast read, which can improve small I/O read performance significantly.
            ofaaland Olaf Faaland added a comment -

            Jinshan,
            Thanks for the explanation. We landed patch 24351 to our patch stack and the high frequency LBUGs have stopped.

            Is there any reason for us not to just revert the change from LU-8509 in its entirety? It looks to me like there were a few bits not reverted by your patch, but that they have no effect - things like initializing a variable that gets set before it is read anyway.

            I looked at LU-4257. For the benefit of anyone else watching this ticket, LU-4257 has 5 changes, collectively alter several hundred lines of code, but produced very large performance improvements, particularly for small file IO.

            ofaaland Olaf Faaland added a comment - Jinshan, Thanks for the explanation. We landed patch 24351 to our patch stack and the high frequency LBUGs have stopped. Is there any reason for us not to just revert the change from LU-8509 in its entirety? It looks to me like there were a few bits not reverted by your patch, but that they have no effect - things like initializing a variable that gets set before it is read anyway. I looked at LU-4257 . For the benefit of anyone else watching this ticket, LU-4257 has 5 changes, collectively alter several hundred lines of code, but produced very large performance improvements, particularly for small file IO.

            In LU-4257, I made a patch(commit 4533271) to get rid of cl_env hash table on the client side. From now on, cl_env won't be attached to processes. This made it possible to use cl_env_percpu_get() in ll_invalidatepage(). However, LLNL FE branch it doesn't include the fix for LU-4257 but does include LU-8509, which caused the problem on this ticket because an env is already attached to the process when it calls ll_invalidatepage().

            To make it work, either we can land the corresponding page from LU-4257 to LLNL FE branch, or we can revert part of the change introduced by LU-8509. For simplicity, I chose to revert it.

            Please push patch 24351 to LLNL FE branch.

            jay Jinshan Xiong (Inactive) added a comment - In LU-4257 , I made a patch(commit 4533271) to get rid of cl_env hash table on the client side. From now on, cl_env won't be attached to processes. This made it possible to use cl_env_percpu_get() in ll_invalidatepage(). However, LLNL FE branch it doesn't include the fix for LU-4257 but does include LU-8509 , which caused the problem on this ticket because an env is already attached to the process when it calls ll_invalidatepage(). To make it work, either we can land the corresponding page from LU-4257 to LLNL FE branch, or we can revert part of the change introduced by LU-8509 . For simplicity, I chose to revert it. Please push patch 24351 to LLNL FE branch.

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/24351
            Subject: LU-8936 llite: use percpu env correctly in ll_invalidatepage
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 607744846d42d4f1199aa5172d7bd4ee3f8f044f

            gerrit Gerrit Updater added a comment - Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/24351 Subject: LU-8936 llite: use percpu env correctly in ll_invalidatepage Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 607744846d42d4f1199aa5172d7bd4ee3f8f044f

            This is a problem introduced by LU-8509. We can fix it by reverting part of the patch.

            jay Jinshan Xiong (Inactive) added a comment - This is a problem introduced by LU-8509 . We can fix it by reverting part of the patch.
            ofaaland Olaf Faaland added a comment -

            Jinshan,
            See the lustre-release-fe-llnl repo on your gerritt server.

            ofaaland Olaf Faaland added a comment - Jinshan, See the lustre-release-fe-llnl repo on your gerritt server.

            Ned, can you please point the code repository you're using?

            jay Jinshan Xiong (Inactive) added a comment - Ned, can you please point the code repository you're using?

            Unfortunately we don't have kdump working on this cluster which has made it hard to debug this problem.

            (gdb) l *(ll_invalidatepage+0x41)
            0x53121 is in ll_invalidatepage (/usr/src/debug/lustre-2.8.0_5.chaos/lustre/llite/rw26.c:106).
            101     #else
            102             if (offset == 0) {
            103     #endif
            104                     /* See the comment in ll_releasepage() */
            105                     env = cl_env_percpu_get();
            106                     LASSERT(!IS_ERR(env));
            107
            108                     inode = vmpage->mapping->host;
            109                     obj = ll_i2info(inode)->lli_clob;
            110                     if (obj != NULL) {
            (gdb) 
            
            nedbass Ned Bass (Inactive) added a comment - Unfortunately we don't have kdump working on this cluster which has made it hard to debug this problem. (gdb) l *(ll_invalidatepage+0x41) 0x53121 is in ll_invalidatepage (/usr/src/debug/lustre-2.8.0_5.chaos/lustre/llite/rw26.c:106). 101 #else 102 if (offset == 0) { 103 #endif 104 /* See the comment in ll_releasepage() */ 105 env = cl_env_percpu_get(); 106 LASSERT(!IS_ERR(env)); 107 108 inode = vmpage->mapping->host; 109 obj = ll_i2info(inode)->lli_clob; 110 if (obj != NULL) { (gdb)

            People

              jay Jinshan Xiong (Inactive)
              ofaaland Olaf Faaland
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: