Backtrace stack printing is broken in RHEL 7.5

    Description

      It looks like struct stacktrace_ops is no longer present in RHEL 7.5, so our detection of whether dump_stack would work no longer works.

      We get this now:

      [29185.979935] LNet: Service thread pid 9373 was inactive for 40.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [29185.981691] LNet: 31237:0:(linux-debug.c:185:libcfs_call_trace()) can't show stack: kernel doesn't export show_task
      [29185.982826] LustreError: dumping log to /tmp/lustre-log.1527605717.9373
      

      Given that struct stacktrace_ops is still there in mainline kernels, this seems to be something specific to RHEL 7.5, and I guess we need to find another way of detecting this.

          Activity

            [LU-11062] Backtrace stack printing is broken in RHEL 7.5

            John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32972/
            Subject: LU-11062 libcfs: use save_stack_trace for stack dump
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: a2af371bd8a79e293a9ba95b8016de92040101a6

            pjones Peter Jones added a comment -

            Landed for 2.12


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32952/
            Subject: LU-11062 libcfs: use save_stack_trace for stack dump
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: afedf9343686504c89f2e28cf6133540166f2347


            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32972
            Subject: LU-11062 libcfs: use save_stack_trace for stack dump
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 7232c445fe30f6500f6f731ef8ffad617490eb68

            ys Yang Sheng added a comment -

            I have submitted a patch for this ticket. It simply uses save_stack_trace_tsk for the backtrace dump. The obvious problem is that we are unable to judge whether an address is reliable or not, which can make the output confusing in some cases.
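The kernel call cannot be exercised outside a kernel build, so the following is only a userspace analogy, not the patch itself. It uses glibc's backtrace() to show the same save-then-print pattern: addresses are first saved into a fixed-size array and then formatted one by one. The helper names capture_stack and print_stack and the MAX_ENTRIES limit are hypothetical. Note that the saved array holds bare addresses only, which is the limitation described above: nothing in it marks an entry as reliable or not.

```c
#include <assert.h>
#include <execinfo.h>
#include <stdio.h>

#define MAX_ENTRIES 32	/* hypothetical cap on saved frames */

/*
 * Hypothetical helper: save up to 'max' return addresses of the
 * current call stack into 'entries' and return how many were saved.
 * Userspace stand-in for the save step; not a kernel API.
 */
static int capture_stack(void **entries, int max)
{
	assert(entries != NULL && max > 0);
	return backtrace(entries, max);
}

/*
 * Hypothetical helper: print the saved addresses, one per line.
 * Each entry is just an address; no reliability flag is available.
 */
static void print_stack(void *const *entries, int nr)
{
	for (int i = 0; i < nr; i++)
		printf(" [<%p>]\n", entries[i]);
}
```

A caller would declare `void *entries[MAX_ENTRIES];`, call `capture_stack(entries, MAX_ENTRIES)`, and pass the returned count to `print_stack`.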


            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32952
            Subject: LU-11062 libcfs: use save_stack_trace for stack dump
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bc344d10030aa60385f0577a855993733b38c916

            scadmin SC Admin added a comment -

            FWIW we got hung tasks in LU-11082 and got this message and no stack traces. In retrospect I guess I should have tried sysrq to get stack traces before rebooting. I've got used to Lustre doing it for me.

            cheers,
            robin

            simmonsja James A Simmons added a comment - - edited

            I was talking to Neil and he recommended we use save_stack_trace_tsk(), since it has been around since 2.6.29.

            Also of note, Shadow pointed out that the kernel does have a soft lockup thread (CONFIG_LOCKUP_DETECTOR). It appears to be enabled for RHEL kernels. There is also hung_task detection functionality.

            green Oleg Drokin added a comment -

            The hangcheck timer reports on idle threads that have not scheduled in a while or some such (it needs to be compiled in and enabled), but it's not quite what we need, since our timeouts are different and we have a somewhat different model of when to start the countdown and when to stop it.


            adilger Andreas Dilger added a comment -

            James, do you have any pointers to the kernel mechanism? We already get NMI watchdogs from the kernel, but those are when the thread doesn't schedule for a long time. Our current code can dump the stack on another service thread that is not making progress, even though it isn't totally dead (e.g. in a sleep/check loop). The thread can "ping" the watchdog to tell it that it is alive, and lack of pings == lack of progress. If the kernel can do something similar, especially if it's been around a while, then I'd be happy to see it.
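The ping model described in this comment can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not Lustre's watchdog code: the struct svc_watchdog type, the wd_* function names, and the use of a logical clock in seconds are all hypothetical. The countdown runs only while the watchdog is armed, each ping restarts it, and a stall is reported once when no ping has arrived within the timeout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread watchdog state; not Lustre's actual structure. */
struct svc_watchdog {
	long last_ping;	/* logical time (seconds) of the most recent ping */
	long timeout;	/* seconds without a ping before reporting */
	bool armed;	/* countdown runs only while armed */
	bool reported;	/* stall already reported since the last ping */
};

/* Start the countdown, e.g. when a thread begins handling a request. */
static void wd_arm(struct svc_watchdog *wd, long now, long timeout)
{
	assert(timeout > 0);
	wd->last_ping = now;
	wd->timeout = timeout;
	wd->armed = true;
	wd->reported = false;
}

/* The thread signals progress; this restarts the countdown. */
static void wd_ping(struct svc_watchdog *wd, long now)
{
	wd->last_ping = now;
	wd->reported = false;
}

/* Stop the countdown, e.g. when the thread goes idle. */
static void wd_disarm(struct svc_watchdog *wd)
{
	wd->armed = false;
}

/*
 * Called periodically from a separate checker. Returns true once per
 * stall, at which point the caller would dump the stalled thread's stack.
 */
static bool wd_check(struct svc_watchdog *wd, long now)
{
	if (!wd->armed || wd->reported)
		return false;
	if (now - wd->last_ping < wd->timeout)
		return false;
	wd->reported = true;
	return true;
}
```

With a 40-second timeout, as in the log message in the description, a thread that last pinged at t=39 would be reported by the first check at or after t=79, whether or not it has been scheduled in the meantime.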

            People

              ys Yang Sheng
              green Oleg Drokin