[LU-17540] sync and delay before LBUG() calls panic()

Details

Type: Improvement
Resolution: Fixed
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels:
- easy
- lug24dd

Severity:
3

Description

It would be useful to have a pause of a few seconds between when LBUG() is called and when panic() is triggered, so that the stack trace can be written to the serial console, and ideally also to /var/log/messages if no serial console is available.

The code currently calls panic() immediately after dumping the stack:

void lbug_with_loc(struct libcfs_debug_msg_data *msgdata)
{
        libcfs_catastrophe = 1;
        libcfs_debug_msg(msgdata, "LBUG\n");

        if (in_interrupt()) {
                panic("LBUG in interrupt.\n");
                /* not reached */
        }

        libcfs_debug_dumpstack(NULL);
        if (libcfs_panic_on_lbug)
                panic("LBUG");
        else
                libcfs_debug_dumplog();
        set_current_state(TASK_UNINTERRUPTIBLE);
        while (1)
                schedule();
}

It would be reasonable to allow the libcfs_panic_on_lbug parameter to store the number of seconds (or milliseconds?) to delay before calling panic(), probably using mdelay() to busy-wait rather than msleep(), which sleeps and lets the task be scheduled out. In the meantime, a task could be dispatched to a work queue to attempt a sync-and-flush of whatever can be written during this delay (if the system is not locked up), equivalent to "sysrq-w" and "sysrq-s".
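The proposed ordering (start a sync, wait, then panic) can be modelled in userspace as a rough sketch. Everything here is an assumption made for illustration: the `lbug_hooks` structure, the `lbug_sync_delay_panic()` helper, and the encoding where a non-zero libcfs_panic_on_lbug value is read as a delay in seconds are all hypothetical and not taken from the Lustre tree. In the kernel the hooks would correspond roughly to queuing a sync work item, a busy-wait such as mdelay(), and panic().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical hooks standing in for kernel facilities (not Lustre APIs). */
struct lbug_hooks {
	void (*start_sync)(void);           /* kick off a sysrq-s style flush */
	void (*busy_wait_ms)(unsigned int); /* stand-in for a busy-wait delay */
	void (*do_panic)(const char *msg);  /* stand-in for panic() */
};

/*
 * Assumed encoding: panic_on_lbug == 0 means "do not panic";
 * any value N > 0 means "start a sync, wait N seconds, then panic".
 * Returns the delay in milliseconds, or -1 if no panic was requested.
 */
static long lbug_sync_delay_panic(unsigned int panic_on_lbug,
				  const struct lbug_hooks *hooks)
{
	unsigned int delay_ms;

	assert(hooks != NULL);
	if (panic_on_lbug == 0)
		return -1;

	delay_ms = panic_on_lbug * 1000U;
	hooks->start_sync();            /* start the flush before waiting */
	hooks->busy_wait_ms(delay_ms);  /* give console and logs time to drain */
	hooks->do_panic("LBUG");
	return (long)delay_ms;
}

/* Recording hooks so the order of steps can be inspected in userspace. */
static char lbug_trace[64];

static void trace_append(const char *s)
{
	/* Bounded append: never write past the end of lbug_trace. */
	strncat(lbug_trace, s, sizeof(lbug_trace) - strlen(lbug_trace) - 1);
}

static void trace_sync(void)
{
	trace_append("sync,");
}

static void trace_wait(unsigned int ms)
{
	char buf[24];

	snprintf(buf, sizeof(buf), "wait:%u,", ms);
	trace_append(buf);
}

static void trace_panic(const char *msg)
{
	trace_append("panic:");
	trace_append(msg);
}

static const struct lbug_hooks trace_hooks = {
	.start_sync = trace_sync,
	.busy_wait_ms = trace_wait,
	.do_panic = trace_panic,
};
```

Calling `lbug_sync_delay_panic(3, &trace_hooks)` on an empty trace records the sync first, then a 3000 ms wait, then the panic. Starting the sync before the wait matters because the flush needs the delay window to make progress.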

Issue Links

is related to

- LU-16297 ptl_send_rpc() ASSERTION ( (at_max == 0) || imp->imp_state != LUSTRE_IMP_FULL || (imp->imp_msghdr_flags & MSGHDR_AT_SUPPORT) || !(imp->imp_connect_data.ocd_connect_flags & 0x1000000ULL) ) (Resolved)
- LU-17793 warning: objtool: __cfs_fail_check_set() falls through to next function __cfs_fail_timeout_set() (Resolved)
- LU-16375 dump more information for threads blocked on local DLM locks (Open)
- LU-16625 improved Lustre thread debugging (In Progress)

Activity

Andreas Dilger added a comment - 11/Jul/24 6:55 PM

A delay between LBUG() being called and it calling panic() was added in patch https://review.whamcloud.com/55505 "LU-17793 libcfs: fix objtool warning in lbug_with_loc()", so this may allow the stack trace to be saved before the node is rebooted.

Otherwise, we might need to add a sync before the sleep to start the write.

Tim Day added a comment - 14/Feb/24 5:55 AM

We could use BUG() and BUG_ON() within the LBUG() definition. Those macros dump a stack trace and panic. In addition, the traces should reliably get flushed out to the console without us having to add delays.

People

Assignee: Jian Yu

Reporter: Andreas Dilger

Votes: 0

Watchers: 2

Dates

Created: 14/Feb/24 1:49 AM

Updated: 11/Jul/24 6:56 PM

Resolved: 11/Jul/24 6:55 PM