Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17793

warning: objtool: __cfs_fail_check_set() falls through to next function __cfs_fail_timeout_set()

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      While building master branch on Ubuntu 24.04 with kernel 6.8.0-31-generic, there are lots of objtool warnings as follows:

        CC [M]  /root/lustre-release/libcfs/libcfs/fail.o
      /root/lustre-release/libcfs/libcfs/fail.o: warning: objtool: __cfs_fail_check_set() falls through to next function __cfs_fail_timeout_set()
        CC [M]  /root/lustre-release/libcfs/libcfs/module.o
      /root/lustre-release/libcfs/libcfs/module.o: warning: objtool: libcfs_force_lbug() falls through to next function debugfs_doint()
        CC [M]  /root/lustre-release/libcfs/libcfs/tracefile.o
        CC [M]  /root/lustre-release/libcfs/libcfs/libcfs_string.o
        CC [M]  /root/lustre-release/libcfs/libcfs/hash.o
      /root/lustre-release/libcfs/libcfs/hash.o: warning: objtool: cfs_hash_bd_del_locked() falls through to next function cfs_hash_cond_del_locked()
      /root/lustre-release/libcfs/libcfs/hash.o: warning: objtool: cfs_hash_bd_from_key() falls through to next function cfs_hash_buckets_free()
      /root/lustre-release/libcfs/libcfs/hash.o: warning: objtool: cfs_hash_bd_get() falls through to next function cfs_hash_multi_bd_findadd_locked.constprop.0()
      /root/lustre-release/libcfs/libcfs/hash.o: warning: objtool: cfs_hash_create() falls through to next function cfs_hash_bd_move_locked()
      /root/lustre-release/libcfs/libcfs/hash.o: warning: objtool: cfs_hash_bd_move_locked() falls through to next function cfs_hash_rehash_worker()
      /root/lustre-release/libcfs/libcfs/hash.o: warning: objtool: cfs_hash_dual_bd_get() falls through to next function cfs_hash_rehash_key()
      /root/lustre-release/libcfs/libcfs/hash.o: warning: objtool: cfs_hash_rehash_cancel() falls through to next function cfs_hash_putref()
      /root/lustre-release/libcfs/libcfs/hash.o: warning: objtool: cfs_hash_for_each_enter() falls through to next function cfs_hash_rehash()
      

      A full list is in the attached make_debs.log.
      According to https://github.com/torvalds/linux/blob/master/tools/objtool/Documentation/objtool.txt :

      8. file.o: warning: objtool: funcA() falls through to next function funcB()
      
         This means that funcA() doesn't end with a return instruction or an
         unconditional jump, and that objtool has determined that the function
         can fall through into the next function.  There could be different
         reasons for this:
      
         1) funcA()'s last instruction is a call to a "noreturn" function like
            panic().  In this case the noreturn function needs to be added to
            objtool's hard-coded global_noreturns array.  Feel free to bug the
            objtool maintainer, or you can submit a patch.
      
         2) funcA() uses the unreachable() annotation in a section of code
            that is actually reachable.
      
         3) If funcA() calls an inline function, the object code for funcA()
            might be corrupt due to a gcc bug.  For more details, see:
            https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
      

      Attachments

        Issue Links

          Activity

            [LU-17793] warning: objtool: __cfs_fail_check_set() falls through to next function __cfs_fail_timeout_set()

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55505
            Subject: LU-17793 libcfs: fix objtool warning in lbug_with_loc()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d54fa6b7853e943d8f35ab1136c9c524ef1331e0

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55505 Subject: LU-17793 libcfs: fix objtool warning in lbug_with_loc() Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: d54fa6b7853e943d8f35ab1136c9c524ef1331e0

            Quoting the relevant section of that document:

            If you really need objtool to ignore something, and are 100% sure that it won't affect kernel stack traces, you can tell objtool to ignore it:

            • To skip validation of a function, use the STACK_FRAME_NON_STANDARD macro.
            • To skip validation of a file, add "OBJECT_FILES_NON_STANDARD_filename.o := y" to the Makefile.
            • To skip validation of a directory, add "OBJECT_FILES_NON_STANDARD := y" to the Makefile.

            so it seems possible that we could use the STACK_FRAME_NON_STANDARD(lbug_with_loc) so it would be skipped by all of the callers? I don't want to annotate every function that calls LASSERT().

            Failing that, we could potentially fix the code in lbug_with_loc() so that it calls panic() after the while(1) loop, so that all of the call paths in that function "terminate" and hopefully this is enough to convince the static checkers to ignore this:

                    :
                    set_current_state(TASK_UNINTERRUPTIBLE);               
                    while (1)                             
                            schedule();
                    /* not reached */
                    panic();
            } 
            

            We might need to remove the __noreturn annotation from lbug_with_loc() in that case, since the panic() call should already be properly annotated and we would call it on all code paths in that case, even if the one at the end of the function is never actually reached.

            On a related note, I've been thinking it would be useful to add e.g. msleep(cfs_time_seconds(6)) before the panic() calls so that there is some chance that the dump_stack() and other error messages can be saved to the syslog rather than crashing the node instantaneously on error. I don't think this makes a huge difference in the grand scheme of failover/recovery time, but would make debugging easier. If someone was worried about the delay duing failover, this could be a tunable, either overloading libcfs_panic_on_lbug to be the number of seconds to delay (or msec, if a minimum 1 second delay is considered too long), or adding yet another tunable like libcfs_panic_delay_sec to control this.

            adilger Andreas Dilger added a comment - Quoting the relevant section of that document: If you really need objtool to ignore something, and are 100% sure that it won't affect kernel stack traces, you can tell objtool to ignore it: To skip validation of a function, use the STACK_FRAME_NON_STANDARD macro. To skip validation of a file, add " OBJECT_FILES_NON_STANDARD_filename.o := y " to the Makefile. To skip validation of a directory, add " OBJECT_FILES_NON_STANDARD := y " to the Makefile. so it seems possible that we could use the STACK_FRAME_NON_STANDARD(lbug_with_loc) so it would be skipped by all of the callers? I don't want to annotate every function that calls LASSERT() . Failing that, we could potentially fix the code in lbug_with_loc() so that it calls panic() after the while(1) loop, so that all of the call paths in that function "terminate" and hopefully this is enough to convince the static checkers to ignore this: : set_current_state(TASK_UNINTERRUPTIBLE); while (1) schedule(); /* not reached */ panic(); } We might need to remove the __noreturn annotation from lbug_with_loc() in that case, since the panic() call should already be properly annotated and we would call it on all code paths in that case, even if the one at the end of the function is never actually reached. On a related note, I've been thinking it would be useful to add e.g. msleep(cfs_time_seconds(6)) before the panic() calls so that there is some chance that the dump_stack() and other error messages can be saved to the syslog rather than crashing the node instantaneously on error. I don't think this makes a huge difference in the grand scheme of failover/recovery time, but would make debugging easier. If someone was worried about the delay duing failover, this could be a tunable, either overloading libcfs_panic_on_lbug to be the number of seconds to delay (or msec, if a minimum 1 second delay is considered too long), or adding yet another tunable like libcfs_panic_delay_sec to control this.

            This is a userspace tool, included in Linux kernel tree. It is run on each .o file at build time. As it provided by the kernel, this is hard to patch unfortunately.

            I just found there is a macro to skip some functions (instead of the whole file or a directory). To be tested, but that could be a bit better than the actual workaround.

            https://github.com/torvalds/linux/blob/master/tools/objtool/Documentation/objtool.txt#L443

            adegremont_nvda Aurelien Degremont added a comment - This is a userspace tool, included in Linux kernel tree. It is run on each .o file at build time. As it provided by the kernel, this is hard to patch unfortunately. I just found there is a macro to skip some functions (instead of the whole file or a directory). To be tested, but that could be a bit better than the actual workaround. https://github.com/torvalds/linux/blob/master/tools/objtool/Documentation/objtool.txt#L443

            I'm not familiar with objtool, is this a tool that runs as part of the build in userspace, or is it part of the kernel? If in userspace, could we supply our own (fixed) objtool as part of the Lustre sources that excludes this function as it did before?

            adilger Andreas Dilger added a comment - I'm not familiar with objtool, is this a tool that runs as part of the build in userspace, or is it part of the kernel? If in userspace, could we supply our own (fixed) objtool as part of the Lustre sources that excludes this function as it did before?
            yujian Jian Yu added a comment -

            Hi adilger,
            While building Lustre client on Ubuntu, we specify "--with-linux" with linux-headers path instead of kernel source path. So, it seems there is no way to patch tools/objtool/check.c in the kernel source.

            yujian Jian Yu added a comment - Hi adilger , While building Lustre client on Ubuntu, we specify "--with-linux" with linux-headers path instead of kernel source path. So, it seems there is no way to patch tools/objtool/check.c in the kernel source.

            Is there some way we can patch objtool during the Lustre module build process so that this warning no longer fails?

            adilger Andreas Dilger added a comment - Is there some way we can patch objtool during the Lustre module build process so that this warning no longer fails?
            yujian Jian Yu added a comment -

            The issue doesn't exist on kernel v6.4, but on v6.5. I narrowed down the commit to the following one:

            commit 34245659debd194cbd4148d2ee5176306bdf8899
            Author:     Josh Poimboeuf <jpoimboe@kernel.org>
            AuthorDate: Tue Apr 18 14:27:52 2023 -0700
            Commit:     Josh Poimboeuf <jpoimboe@kernel.org>
            CommitDate: Tue May 16 06:31:54 2023 -0700
            
                objtool: Remove superfluous global_noreturns entries
                
                lbug_with_loc() no longer exists, and resume_play_dead() is static
                (objtool only checks globals and weaks).
                
                Reviewed-by: Miroslav Benes <mbenes@suse.cz>
                Link: https://lore.kernel.org/r/2725d7f2ccc2361c6903de9ebaa2b5bb304f7ac2.1681853186.git.jpoimboe@kernel.org
                Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
            
            diff --git a/tools/objtool/check.c b/tools/objtool/check.c
            index 8dac1e35264e..8c2762de7ae3 100644
            --- a/tools/objtool/check.c
            +++ b/tools/objtool/check.c
            @@ -217,7 +217,6 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
                            "kthread_complete_and_exit",
                            "kthread_exit",
                            "kunit_try_catch_throw",
            -               "lbug_with_loc",
                            "machine_real_restart",
                            "make_task_dead",
                            "mpt_halt_firmware",
            @@ -225,7 +224,6 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
                            "panic",
                            "panic_smp_self_stop",
                            "rest_init",
            -               "resume_play_dead",
                            "rewind_stack_and_make_dead",
                            "sev_es_terminate",
                            "snp_abort",
            
            yujian Jian Yu added a comment - The issue doesn't exist on kernel v6.4, but on v6.5. I narrowed down the commit to the following one: commit 34245659debd194cbd4148d2ee5176306bdf8899 Author: Josh Poimboeuf <jpoimboe@kernel.org> AuthorDate: Tue Apr 18 14:27:52 2023 -0700 Commit: Josh Poimboeuf <jpoimboe@kernel.org> CommitDate: Tue May 16 06:31:54 2023 -0700 objtool: Remove superfluous global_noreturns entries lbug_with_loc() no longer exists, and resume_play_dead() is static (objtool only checks globals and weaks). Reviewed-by: Miroslav Benes <mbenes@suse.cz> Link: https: //lore.kernel.org/r/2725d7f2ccc2361c6903de9ebaa2b5bb304f7ac2.1681853186.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 8dac1e35264e..8c2762de7ae3 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -217,7 +217,6 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func, "kthread_complete_and_exit" , "kthread_exit" , "kunit_try_catch_throw" , - "lbug_with_loc" , "machine_real_restart" , "make_task_dead" , "mpt_halt_firmware" , @@ -225,7 +224,6 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func, "panic" , "panic_smp_self_stop" , "rest_init" , - "resume_play_dead" , "rewind_stack_and_make_dead" , "sev_es_terminate" , "snp_abort" ,

            This seems a bit more complex, as objtool was actually run before, under the config option CONFIG_STACK_VALIDATION, for Ubuntu kernel 5.15 :/ So not sure why this problem appears now.

            But it seems objtool is confused by the __noreturn functions and does not detect that properly for Lustre...

            adegremont_nvda Aurelien Degremont added a comment - This seems a bit more complex, as objtool was actually run before, under the config option CONFIG_STACK_VALIDATION, for Ubuntu kernel 5.15 :/ So not sure why this problem appears now. But it seems objtool is confused by the __noreturn functions and does not detect that properly for Lustre...
            bfaccini-nvda Bruno Faccini added a comment - - edited

            I had a look to the object and assembly code, and this definitely appears as false positive from "objtool".
            If we use the "__cfs_fail_check_set() falls through to next function __cfs_fail_timeout_set()" as an example, end of __cfs_fail_check_set() object code (from "objdump") looks like :

            .....................
             1f6:   83 fb 03                cmp    $0x3,%ebx
             1f9:   0f 84 25 ff ff ff       je     124 <__cfs_fail_check_set+0x114>
             1ff:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
             206:   89 d9                   mov    %ebx,%ecx
             208:   48 c7 c2 00 00 00 00    mov    $0x0,%rdx
             20f:   48 c7 c6 00 00 00 00    mov    $0x0,%rsi
             216:   c7 05 00 00 00 00 00    movl   $0x40000,0x0(%rip)        # 220 <__cfs_fail_check_set+0x210>
             21d:   00 04 00 
             220:   e8 00 00 00 00          call   225 <__cfs_fail_check_set+0x215>
             225:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
             22c:   e8 00 00 00 00          call   231 <__cfs_fail_check_set+0x221>
             231:   66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
             238:   00 00 00 00 
             23c:   0f 1f 40 00             nopl   0x0(%rax)
            
            
            0000000000000240 <__pfx___cfs_fail_timeout_set>:
             240:   90                      nop
             241:   90                      nop
             242:   90                      nop
             243:   90                      nop
             244:   90                      nop
             245:   90                      nop
             246:   90                      nop
             247:   90                      nop
             248:   90                      nop
             249:   90                      nop
             24a:   90                      nop
             24b:   90                      nop
             24c:   90                      nop
             24d:   90                      nop
             24e:   90                      nop
             24f:   90                      nop
            
            
            0000000000000250 <__cfs_fail_timeout_set>:
             250:   e8 00 00 00 00          call   255 <__cfs_fail_timeout_set+0x5>
             255:   55                      push   %rbp
             256:   48 89 e5                mov    %rsp,%rbp
             259:   41 57                   push   %r15
             25b:   41 89 d7                mov    %edx,%r15d
             25e:   41 56                   push   %r14
             260:   41 55                   push   %r13
            ...................

            where it looks like there is no return instructions sequence after the "call   231" instruction...

            But if we look at the corresponding relocated code in memory :

               0xffffffffc2404236 <+486>:   cmp    $0x3,%ebx
               0xffffffffc2404239 <+489>:   je     0xffffffffc2404164 <__cfs_fail_check_set+276>
               0xffffffffc240423f <+495>:   mov    $0xffffffffc24260e0,%rdi
               0xffffffffc2404246 <+502>:   mov    %ebx,%ecx
               0xffffffffc2404248 <+504>:   mov    $0xffffffffc247b478,%rdx
               0xffffffffc240424f <+511>:   mov    $0xffffffffc247c330,%rsi
               0xffffffffc2404256 <+518>:   movl   $0x40000,0x21e98(%rip)        # 0xffffffffc24260f8 <__msg_data.8+24>
               0xffffffffc2404260 <+528>:   call   0xffffffffc2407fa0 <libcfs_debug_msg>
               0xffffffffc2404265 <+533>:   mov    $0xffffffffc24260e0,%rdi
               0xffffffffc240426c <+540>:   call   0xffffffffc2403790 <lbug_with_loc>
            End of assembler dump. 

            we can see that the last call is to lbug_with_loc() which is declared with the "__noreturn" pragma.

            The fact that these "objtool" warnings only appear with recent Kernels, may simply come from the fact older Kernels had not been compiled with "CONFIG_OBJTOOL=y" in their config file...

            So we may only face with "objtool" limitation to handle "__noreturn" pragma for/in functions/modules outside Kernel core code (ie, not provided in "tools/objtool/noreturns.h") ??...

            And removing usage of "__noreturn" pragma for lbug_with_loc() does not help because then we get at least a compiler "error: this statement may fall through [-Werror=implicit-fallthrough=]" when lbug_with_loc()/LBUG() is being called in a switch/case path without a break/fallthrough statement following ...

             

            bfaccini-nvda Bruno Faccini added a comment - - edited I had a look to the object and assembly code, and this definitely appears as false positive from "objtool". If we use the " __cfs_fail_check_set() falls through to next function __cfs_fail_timeout_set() " as an example, end of __cfs_fail_check_set() object code (from "objdump") looks like : ..................... 1f6:   83 fb 03                cmp    $0x3,%ebx 1f9:   0f 84 25 ff ff ff       je     124 <__cfs_fail_check_set+0x114> 1ff:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi 206:   89 d9                   mov    %ebx,%ecx 208:   48 c7 c2 00 00 00 00    mov    $0x0,%rdx 20f:   48 c7 c6 00 00 00 00    mov    $0x0,%rsi 216:   c7 05 00 00 00 00 00    movl   $0x40000,0x0(%rip)        # 220 <__cfs_fail_check_set+0x210> 21d:   00 04 00 220:   e8 00 00 00 00          call   225 <__cfs_fail_check_set+0x215> 225:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi 22c:   e8 00 00 00 00          call   231 <__cfs_fail_check_set+0x221> 231:   66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1) 238:   00 00 00 00 23c:   0f 1f 40 00             nopl   0x0(%rax) 0000000000000240 <__pfx___cfs_fail_timeout_set>: 240:   90                      nop 241:   90                      nop 242:   90                      nop 243:   90                      nop 244:   90                      nop 245:   90                      nop 246:   90                      nop 247:   90                      nop 248:   90                      nop 249:   90                      nop 24a:   90                      nop 24b:   90                      nop 24c:   90                      nop 24d:   90                      nop 24e:   90                      nop 24f:   90                      nop 0000000000000250 <__cfs_fail_timeout_set>: 250:   e8 00 00 00 00          call   255 <__cfs_fail_timeout_set+0x5> 255:   55                      push   %rbp 256:   48 89 e5                mov    %rsp,%rbp 259:   41 57                   push   %r15 25b:   41 89 d7                mov    %edx,%r15d 25e:   41 56                   push   %r14 260:   41 55                   push   %r13 ................... where it looks like there is no return instructions sequence after the "call   231" instruction... But if we look at the corresponding relocated code in memory :    0xffffffffc2404236 <+486>:   cmp    $0x3,%ebx    0xffffffffc2404239 <+489>:   je     0xffffffffc2404164 <__cfs_fail_check_set+276>    0xffffffffc240423f <+495>:   mov    $0xffffffffc24260e0,%rdi    0xffffffffc2404246 <+502>:   mov    %ebx,%ecx    0xffffffffc2404248 <+504>:   mov    $0xffffffffc247b478,%rdx    0xffffffffc240424f <+511>:   mov    $0xffffffffc247c330,%rsi    0xffffffffc2404256 <+518>:   movl   $0x40000,0x21e98(%rip)        # 0xffffffffc24260f8 <__msg_data.8+24>    0xffffffffc2404260 <+528>:   call   0xffffffffc2407fa0 <libcfs_debug_msg>    0xffffffffc2404265 <+533>:   mov    $0xffffffffc24260e0,%rdi    0xffffffffc240426c <+540>:   call   0xffffffffc2403790 <lbug_with_loc> End of assembler dump. we can see that the last call is to lbug_with_loc() which is declared with the "__noreturn" pragma. The fact that these "objtool" warnings only appear with recent Kernels, may simply come from the fact older Kernels had not been compiled with " CONFIG_OBJTOOL =y" in their config file... So we may only face with "objtool" limitation to handle "__noreturn" pragma for/in functions/modules outside Kernel core code (ie, not provided in "tools/objtool/noreturns.h") ??... And removing usage of "__noreturn" pragma for lbug_with_loc() does not help because then we get at least a compiler "error: this statement may fall through [-Werror=implicit-fallthrough=] " when lbug_with_loc()/LBUG() is being called in a switch/case path without a break/fallthrough statement following ...  

            Sorry, I was focussing on the "_cfs_fail_check_set() falls through to next function" and message, and did not notice the later cfs_hash* lines (which is indeed related to LU-8130).

            That said, even if LU-8130 was finished and all the cfs_hash_* code is gone, I think the __cfs_fail_check_set() case still needs to be investigated and fixed, if I could understand what is actually wrong in this case.

            adilger Andreas Dilger added a comment - Sorry, I was focussing on the " _ cfs_fail_check_set() falls through to next function" and message, and did not notice the later cfs_hash * lines (which is indeed related to LU-8130 ). That said, even if LU-8130 was finished and all the cfs_hash_* code is gone, I think the __cfs_fail_check_set() case still needs to be investigated and fixed, if I could understand what is actually wrong in this case.
            adegremont_nvda Aurelien Degremont added a comment - - edited

            adegremont_nvda, I don't think this ticket has anything to do with LU-8130, since these functions are all about fault injection for testing and unrelated to rhashtable?

            Andreas, actually you made the reference to LU-8130 in the first place, in your first comment . That's why I referenced it. I trusted you when you said this will be dropped, but I agree with your last comment, this is unrelated to LU-8130. Do you mean another bug ID ?

            adegremont_nvda Aurelien Degremont added a comment - - edited adegremont_nvda , I don't think this ticket has anything to do with LU-8130 , since these functions are all about fault injection for testing and unrelated to rhashtable? Andreas, actually you made the reference to LU-8130 in the first place, in your first comment . That's why I referenced it. I trusted you when you said this will be dropped, but I agree with your last comment, this is unrelated to LU-8130 . Do you mean another bug ID ?

            People

              yujian Jian Yu
              yujian Jian Yu
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: