[LU-16379] sanity/any crash on aarch64: SError Interrupt on CPU0 CPU1 Created: 10/Dec/22 Updated: 20/Nov/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1c3cc10c-d43b-4a76-a3fd-2968b785a11a Two VMs crashed at the same time with the same stack. [ 1858.957465] Lustre: DEBUG MARKER: == sanity test 27q: append to truncated file with all OSTs full (should error) ========================================================== 22:12:42 (1670537562) [ 1869.105310] SError Interrupt on CPU1, code 0xbe000000 -- SError [ 1869.105317] CPU: 1 PID: 19 Comm: ksoftirqd/1 Kdump: loaded Tainted: G OE --------- - - 4.18.0-372.9.1.el8.aarch64 #1 [ 1869.105320] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 1869.105322] pstate: 80000085 (Nzcv daIf -PAN -UAO) [ 1869.105324] pc : post_ttbr_update_workaround+0x1c/0x28 [ 1869.105326] lr : check_and_switch_context+0x250/0x668 [ 1869.105398] Kernel panic - not syncing: Asynchronous SError Interrupt [ 1869.105401] CPU: 1 PID: 19 Comm: ksoftirqd/1 Kdump: loaded Tainted: G OE --------- - - 4.18.0-372.9.1.el8.aarch64 #1 [ 1869.105403] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 1869.105405] Call trace: [ 1869.105407] dump_backtrace+0x0/0x160 [ 1869.105409] show_stack+0x28/0x38 [ 1869.105410] dump_stack+0x5c/0x74 [ 1869.105412] panic+0x140/0x30c [ 1869.105413] nmi_panic+0x84/0x88 [ 1869.105415] arm64_serror_panic+0x84/0x90 [ 1869.105417] is_valid_bugaddr+0x0/0x20 [ 1869.105418] el1_error+0x84/0x100 [ 1869.105420] post_ttbr_update_workaround+0x1c/0x28 [ 1869.105422] check_and_switch_context+0x250/0x668 [ 1869.105424] __schedule+0x794/0x9c8 [ 1869.105425] schedule+0x48/0xc8 [ 1869.105427] smpboot_thread_fn+0x198/0x1a0 [ 1869.105428] kthread+0x128/0x138 |
| Comments |
| Comment by Etienne Aujames [ 14/Dec/22 ] |
|
Same type of calltrace from sanity-sec test_41: [20523.561245] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock,test_dummy_encryption onyx-71vm3@tcp:/lustre /mnt/lustre [20523.751090] Lustre: Test dummy encryption mode enabled [20523.754839] Lustre: Skipped 1 previous similar message [20535.335539] SError Interrupt on CPU1, code 0xbe000000 -- SError [20535.335545] CPU: 1 PID: 1207142 Comm: ldlm_bl_01 Kdump: loaded Tainted: G OE --------- - - 4.18.0-372.9.1.el8.aarch64 #1 [20535.335548] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [20535.335550] pstate: 80000085 (Nzcv daIf -PAN -UAO) [20535.335552] pc : post_ttbr_update_workaround+0x1c/0x28 [20535.335554] lr : check_and_switch_context+0x250/0x668 [20535.335556] sp : ffff8000183efc50 [20535.335558] x29: ffff8000183efc50 x28: ffff800012270c50 [20535.335564] x27: 0000000000000001 x26: 0000000000000000 [20535.335568] x25: ffff800011c7d5e0 x24: ffff800011c78788 [20535.335573] x23: ffff800012270000 x22: 0000000000000001 [20535.335577] x21: ffff800011c7d5e0 x20: ffff00004b4eb000 [20535.335582] x19: ffff800011811900 x18: 0000000000000000 [20535.335587] x17: 0000000000000000 x16: 000000000032058c [20535.335591] x15: 0000000000000001 x14: c3e0000000000000 [20535.335595] x13: 0000000000000000 x12: 0000000000000000 [20535.335600] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f [20535.335605] x9 : 0000000000000001 x8 : ffff00004b4eb3e8 [20535.335609] x7 : 000000000000023a x6 : ffff800011c7d000 [20535.335614] x5 : ffff8000ae670000 x4 : 000000004b4d4400 [20535.335618] x3 : 000000008b4d4400 x2 : 052a000041680000 [20535.335623] x1 : 000000000032052a x0 : ffff8000100a3018 [20535.335627] Kernel panic - not syncing: Asynchronous SError Interrupt [20535.335630] CPU: 1 PID: 1207142 Comm: ldlm_bl_01 Kdump: loaded Tainted: G OE --------- - - 4.18.0-372.9.1.el8.aarch64 #1 [20535.335633] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [20535.335634] Call trace: [20535.335636] dump_backtrace+0x0/0x160 [20535.335638] show_stack+0x28/0x38 [20535.335639] dump_stack+0x5c/0x74 [20535.335641] panic+0x140/0x30c [20535.335642] nmi_panic+0x84/0x88 [20535.335644] arm64_serror_panic+0x84/0x90 [20535.335646] is_valid_bugaddr+0x0/0x20 [20535.335648] el1_error+0x84/0x100 [20535.335649] post_ttbr_update_workaround+0x1c/0x28 [20535.335651] check_and_switch_context+0x250/0x668 [20535.335653] __schedule+0x794/0x9c8 [20535.335654] schedule+0x48/0xc8 [20535.335656] ldlm_bl_thread_main+0x5e0/0xc60 [ptlrpc] [20535.335658] kthread+0x128/0x138 [20535.335660] ret_from_fork+0x10/0x18 |
| Comment by Jian Yu [ 13/Apr/23 ] |
|
+1 from lnet-selftest: https://testing.whamcloud.com/test_sets/3457736b-6688-4ee6-87cf-a5275e41f801 |
| Comment by Andreas Dilger [ 26/Apr/23 ] |
|
+1 sanity test_154g on two client nodes at the same time, even though only one was actually running the test: [14043.475192] Lustre: DEBUG MARKER: == sanity test 154g: various llapi FID tests ============= 06:26:50 (1682404010) [14088.427141] hrtimer: interrupt took 108464871 ns [14169.089691] SError Interrupt on CPU0, code 0xbe000000 -- SError [14169.089701] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [14169.089705] pc : post_ttbr_update_workaround+0x1c/0x28 [14169.089707] lr : cpu_do_switch_mm+0x78/0x88 [14169.089781] Kernel panic - not syncing: Asynchronous SError Interrupt [14169.089783] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G OE --------- - - 4.18.0-425.10.1.el8_7.aarch64 #1 [14169.089786] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [14169.089788] Call trace: [14169.089789] dump_backtrace+0x0/0x178 [14169.089791] show_stack+0x28/0x38 [14169.089793] dump_stack+0x5c/0x74 [14169.089794] panic+0x140/0x30c [14169.089796] nmi_panic+0x84/0x88 [14169.089798] arm64_serror_panic+0x84/0x90 [14169.089799] is_valid_bugaddr+0x0/0x20 [14169.089801] el1_error+0x84/0x104 [14169.089803] post_ttbr_update_workaround+0x1c/0x28 [14169.089804] cpu_do_switch_mm+0x78/0x88 [14169.089806] check_and_switch_context+0x254/0x670 [14169.089808] __schedule+0x814/0xac0 [14169.089810] schedule_idle+0x28/0x48 [14169.089811] do_idle+0x1a4/0x280 [14169.089813] cpu_startup_entry+0x30/0x38 [14169.089815] rest_init+0xbc/0xc8 [14169.089816] start_kernel+0x4ec/0x51c Seems like some kind of watchdog timer caused both nodes to be rebooted. |
| Comment by Andreas Dilger [ 26/Apr/23 ] |
|
kevin.zhao, xinliang, do these stack traces mean anything to you? It looks like a semi-random crash during various different tests, but happening at the same time across multiple VMs on the same host. The above two crashes were on onyx-91vm11 and onyx-91vm12 at 06:28:56, while the below was running a totally different build/test and crashed at the same time on onyx-91vm15 at 06:28:57 (assuming that the "DEBUG MARKER" is printed at the same time): [ 4237.884883] Lustre: DEBUG MARKER: == sanity test 51f: check many open files limit ====== 06:28:30 (1682404110) [ 4239.896171] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0 [ 4264.398465] SError Interrupt on CPU1, code 0xbe000000 -- SError [ 4264.398472] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G OE --------- - - 4.18.0-425.13.1.el8_7.aarch64 #1 [ 4264.398475] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 4264.398477] pstate: 80000085 (Nzcv daIf -PAN -UAO BTYPE=--) [ 4264.398479] pc : post_ttbr_update_workaround+0x1c/0x28 [ 4264.398481] lr : cpu_do_switch_mm+0x78/0x88 [ 4264.398564] Kernel panic - not syncing: Asynchronous SError Interrupt [ 4264.398567] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded 4.18.0-425.13.1.el8_7.aarch64 #1 [ 4264.398569] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 4264.398571] Call trace: [ 4264.398573] dump_backtrace+0x0/0x178 [ 4264.398574] show_stack+0x28/0x38 [ 4264.398576] dump_stack+0x5c/0x74 [ 4264.398578] panic+0x140/0x30c [ 4264.398579] nmi_panic+0x84/0x88 [ 4264.398581] arm64_serror_panic+0x84/0x90 [ 4264.398582] is_valid_bugaddr+0x0/0x20 [ 4264.398584] el1_error+0x84/0x104 [ 4264.398586] post_ttbr_update_workaround+0x1c/0x28 [ 4264.398588] cpu_do_switch_mm+0x78/0x88 [ 4264.398590] check_and_switch_context+0x254/0x670 [ 4264.398591] __schedule+0x814/0xac0 [ 4264.398593] schedule_idle+0x28/0x48 [ 4264.398594] do_idle+0x1a4/0x280 [ 4264.398596] cpu_startup_entry+0x30/0x38 [ 4264.398598] secondary_start_kernel+0x144/0x150 Is this some kind of hardware issue with the host CPU, or overloaded CPU/RAM and can't dedicate enough cycles to the VM and starve them for handling interrupts? |
| Comment by Xinliang Liu [ 04/May/23 ] |
|
Sorry for the late reply, Andreas. Will take a look at this issue. |
| Comment by Xinliang Liu [ 06/May/23 ] |
|
From the test logs, it seems that the VMs crash issue happened on host onyx-91. From the log line "SError Interrupt on CPU1, code 0xbe000000 – SError", We can only know this is a SError (System Error). Because the value of Exception Syndrome Register(ESR) is 0xbe000000. According to the Arm ESR doc (searching "SError" parts), bit 24 is 0, meaning no further RAS info only indicating it is a SError.
SError is a memory system access error, maybe not a software error. See the definition below from Arm Exception doc: " System Error (SError) is an exception type that is intended to be generated by the memory system in response to unexpected events. We do not expect these events, but need to know if they have occurred. These are reported asynchronously because the instruction that triggered the event may have already been retired. A typical example of SError is what was previously referred to as an external asynchronous abort. Examples of SError interrupts include:
" So please check if there is any kernel errors in host onyx-91.
|
| Comment by Jian Yu [ 20/Nov/23 ] |
|
+1 on master branch: https://testing.whamcloud.com/test_sets/9d9ad1be-b85e-4c00-801e-4c51d88d34af |