[LU-3686] Save the cpu regs at lbug error time. Created: 02/Aug/13  Updated: 13/Feb/19  Resolved: 13/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Keith Mannthey (Inactive) Assignee: Keith Mannthey (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None

Rank (Obsolete): 9518

 Description   

I have recently noticed that we don't seem to have error-time CPU registers from LBUG panics. We get a nice stack trace, but that is about where the debugging goodness ends. We don't get the error-time regs saved anywhere at all, and most of the error-time context is lost. This is because we don't take an exception to sort out the error; rather, we detect certain conditions and decide what to do within the process context itself. On x86, anyway, most Linux errors are detected in exceptions and have the error-time context saved on the stack as part of that process.

Presently gcc on x86_64 keeps the first 6 function args in registers, so very little data is actually on the stack. Even with a cool stack-unwinder crash extension, I only have access to 1 of about 10 data points I would like to be able to investigate.

So how do we do this?

I think the best plan is to save the regs onto the stack in the initial lbug macro context for x86. I don't know as much about how Linux PPC or other non-Linux clients will work.

In Linux arch/x86/include/asm/calling.h there is an asm macro called SAVE_ARGS. I think SAVE_ARGS will pack the register values onto the stack in a standard, pt_regs sort of way.
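
To make the idea concrete, here is a minimal userspace sketch of "save the argument registers at the point the macro expands". It does not use SAVE_ARGS and is not Lustre or kernel code: struct lbug_regs, LBUG_SAVE_REGS, LBUG_HAVE_REGS, lbug_capture_here and lbug_print_regs are hypothetical names made up for illustration, and the layout is not the kernel's struct pt_regs. It assumes gcc or clang on x86_64 and falls back to a no-op elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical layout for illustration; NOT the kernel's struct pt_regs. */
struct lbug_regs {
    uint64_t rdi, rsi, rdx, rcx, r8, r9; /* x86_64 SysV argument registers */
    int valid;                            /* 1 if the registers were captured */
};

#if defined(__x86_64__) && defined(__GNUC__)
#define LBUG_HAVE_REGS 1
/* Copy the six argument registers into *(r) at the point of expansion. */
#define LBUG_SAVE_REGS(r)                                         \
    do {                                                          \
        __asm__ volatile(                                         \
            "movq %%rdi, %0\n\t"                                  \
            "movq %%rsi, %1\n\t"                                  \
            "movq %%rdx, %2\n\t"                                  \
            "movq %%rcx, %3\n\t"                                  \
            "movq %%r8,  %4\n\t"                                  \
            "movq %%r9,  %5\n\t"                                  \
            : "=m"((r)->rdi), "=m"((r)->rsi), "=m"((r)->rdx),     \
              "=m"((r)->rcx), "=m"((r)->r8),  "=m"((r)->r9));     \
        (r)->valid = 1;                                           \
    } while (0)
#else
#define LBUG_HAVE_REGS 0
/* Other architectures: record that nothing was captured. */
#define LBUG_SAVE_REGS(r)                                         \
    do {                                                          \
        memset((r), 0, sizeof(*(r)));                             \
        (r)->valid = 0;                                           \
    } while (0)
#endif

/* Hypothetical stand-in for an LBUG site: capture first, report later. */
int lbug_capture_here(struct lbug_regs *out)
{
    assert(out != NULL);
    LBUG_SAVE_REGS(out);
    return out->valid;
}

/* Print whatever was captured, in a crash-like "name=value" form. */
void lbug_print_regs(const struct lbug_regs *r)
{
    if (!r->valid) {
        printf("registers not captured on this architecture\n");
        return;
    }
    printf("rdi=%016llx rsi=%016llx rdx=%016llx\n",
           (unsigned long long)r->rdi, (unsigned long long)r->rsi,
           (unsigned long long)r->rdx);
    printf("rcx=%016llx r8=%016llx r9=%016llx\n",
           (unsigned long long)r->rcx, (unsigned long long)r->r8,
           (unsigned long long)r->r9);
}
```

The captured values are only whatever is still in the registers when the macro expands; the compiler may already have reused them by then, which is why the capture would have to happen as early as possible in the lbug path.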

There is a little bit of development to enable it in a friendly way.

I have opened this LU to track this idea. I don't see myself having the time to work it all out for a while, but I don't want to lose sight of it.

Not having the error-time regs makes a crashdump less useful than one would hope when it comes from an lbug error.



 Comments   
Comment by Keith Mannthey (Inactive) [ 14/Aug/13 ]

So some feel xbt or a dwarf2 unwinder should be used to unwind the data. I recognize there might be an individual toolchain that I can use, but I don't really see how it helps the next person. The dwarf2 unwinder is going to require individual development to get working, and xbt is beta and didn't work very well. I am sure others in the community would not mind a simpler way to triage crash dumps without messing with unwind solutions.

Andi Kleen even mentioned that unwinding can get out "most" of the data that just having the error-time regs would get you.

I really don't think this is very complicated, and I have started working on a proof of concept that saves the regs to the stack for preservation.

Comment by John Hammond [ 14/Aug/13 ]

How about if you send me a bug report? Seems more productive than just going on Jira and saying it "didn't work very well."

> Andi Kleen even mentioned that unwinding can get out "most" of the data that just having the error-time regs would get you.

This is backwards. For most of our use cases symbolic unwinding gets you to the answer much faster than having a save of the registers.

Comment by Keith Mannthey (Inactive) [ 14/Aug/13 ]

John, I just sent you an email about my xbt experience; sorry for not doing it sooner.

This LU is not just about xbt not working very well for me.

This is a custom tool chain vs. general tool chain issue as well. What if someone was looking at a crashdump on-site or in an isolated environment, not on their base setup? Crash is everywhere on x86.
