[LU-16379] sanity/any crash on aarch64: SError Interrupt on CPU0 CPU1 Created: 10/Dec/22  Updated: 20/Nov/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1c3cc10c-d43b-4a76-a3fd-2968b785a11a

Two VMs crashed at the same time with the same stack.
Possibly triggered by autotest itself? It doesn't appear to be in Lustre code.

[ 1858.957465] Lustre: DEBUG MARKER: == sanity test 27q: append to truncated file with all OSTs full (should error) ========================================================== 22:12:42 (1670537562)
[ 1869.105310] SError Interrupt on CPU1, code 0xbe000000 -- SError
[ 1869.105317] CPU: 1 PID: 19 Comm: ksoftirqd/1 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-372.9.1.el8.aarch64 #1
[ 1869.105320] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 1869.105322] pstate: 80000085 (Nzcv daIf -PAN -UAO)
[ 1869.105324] pc : post_ttbr_update_workaround+0x1c/0x28
[ 1869.105326] lr : check_and_switch_context+0x250/0x668

[ 1869.105398] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 1869.105401] CPU: 1 PID: 19 Comm: ksoftirqd/1 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-372.9.1.el8.aarch64 #1
[ 1869.105403] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 1869.105405] Call trace:
[ 1869.105407]  dump_backtrace+0x0/0x160
[ 1869.105409]  show_stack+0x28/0x38
[ 1869.105410]  dump_stack+0x5c/0x74
[ 1869.105412]  panic+0x140/0x30c
[ 1869.105413]  nmi_panic+0x84/0x88
[ 1869.105415]  arm64_serror_panic+0x84/0x90
[ 1869.105417]  is_valid_bugaddr+0x0/0x20
[ 1869.105418]  el1_error+0x84/0x100
[ 1869.105420]  post_ttbr_update_workaround+0x1c/0x28
[ 1869.105422]  check_and_switch_context+0x250/0x668
[ 1869.105424]  __schedule+0x794/0x9c8
[ 1869.105425]  schedule+0x48/0xc8
[ 1869.105427]  smpboot_thread_fn+0x198/0x1a0
[ 1869.105428]  kthread+0x128/0x138


 Comments   
Comment by Etienne Aujames [ 14/Dec/22 ]

Same type of calltrace from sanity-sec test_41:

[20523.561245] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock,test_dummy_encryption onyx-71vm3@tcp:/lustre /mnt/lustre
[20523.751090] Lustre: Test dummy encryption mode enabled
[20523.754839] Lustre: Skipped 1 previous similar message
[20535.335539] SError Interrupt on CPU1, code 0xbe000000 -- SError
[20535.335545] CPU: 1 PID: 1207142 Comm: ldlm_bl_01 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-372.9.1.el8.aarch64 #1
[20535.335548] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[20535.335550] pstate: 80000085 (Nzcv daIf -PAN -UAO)
[20535.335552] pc : post_ttbr_update_workaround+0x1c/0x28
[20535.335554] lr : check_and_switch_context+0x250/0x668
[20535.335556] sp : ffff8000183efc50
[20535.335558] x29: ffff8000183efc50 x28: ffff800012270c50 
[20535.335564] x27: 0000000000000001 x26: 0000000000000000 
[20535.335568] x25: ffff800011c7d5e0 x24: ffff800011c78788 
[20535.335573] x23: ffff800012270000 x22: 0000000000000001 
[20535.335577] x21: ffff800011c7d5e0 x20: ffff00004b4eb000 
[20535.335582] x19: ffff800011811900 x18: 0000000000000000 
[20535.335587] x17: 0000000000000000 x16: 000000000032058c 
[20535.335591] x15: 0000000000000001 x14: c3e0000000000000 
[20535.335595] x13: 0000000000000000 x12: 0000000000000000 
[20535.335600] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f 
[20535.335605] x9 : 0000000000000001 x8 : ffff00004b4eb3e8 
[20535.335609] x7 : 000000000000023a x6 : ffff800011c7d000 
[20535.335614] x5 : ffff8000ae670000 x4 : 000000004b4d4400 
[20535.335618] x3 : 000000008b4d4400 x2 : 052a000041680000 
[20535.335623] x1 : 000000000032052a x0 : ffff8000100a3018 
[20535.335627] Kernel panic - not syncing: Asynchronous SError Interrupt
[20535.335630] CPU: 1 PID: 1207142 Comm: ldlm_bl_01 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-372.9.1.el8.aarch64 #1
[20535.335633] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[20535.335634] Call trace:
[20535.335636]  dump_backtrace+0x0/0x160
[20535.335638]  show_stack+0x28/0x38
[20535.335639]  dump_stack+0x5c/0x74
[20535.335641]  panic+0x140/0x30c
[20535.335642]  nmi_panic+0x84/0x88
[20535.335644]  arm64_serror_panic+0x84/0x90
[20535.335646]  is_valid_bugaddr+0x0/0x20
[20535.335648]  el1_error+0x84/0x100
[20535.335649]  post_ttbr_update_workaround+0x1c/0x28
[20535.335651]  check_and_switch_context+0x250/0x668
[20535.335653]  __schedule+0x794/0x9c8
[20535.335654]  schedule+0x48/0xc8
[20535.335656]  ldlm_bl_thread_main+0x5e0/0xc60 [ptlrpc]
[20535.335658]  kthread+0x128/0x138
[20535.335660]  ret_from_fork+0x10/0x18
Comment by Jian Yu [ 13/Apr/23 ]

+1 from lnet-selftest: https://testing.whamcloud.com/test_sets/3457736b-6688-4ee6-87cf-a5275e41f801

Comment by Andreas Dilger [ 26/Apr/23 ]

+1 sanity test_154g on two client nodes at the same time, even though only one was actually running the test:
https://testing.whamcloud.com/test_sets/07b52e4c-ef0c-4d2a-81e9-d1e8740a32cf

[14043.475192] Lustre: DEBUG MARKER: == sanity test 154g: various llapi FID tests ============= 06:26:50 (1682404010)
[14088.427141] hrtimer: interrupt took 108464871 ns
[14169.089691] SError Interrupt on CPU0, code 0xbe000000 -- SError
[14169.089701] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[14169.089705] pc : post_ttbr_update_workaround+0x1c/0x28
[14169.089707] lr : cpu_do_switch_mm+0x78/0x88
[14169.089781] Kernel panic - not syncing: Asynchronous SError Interrupt
[14169.089783] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.10.1.el8_7.aarch64 #1
[14169.089786] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[14169.089788] Call trace:
[14169.089789]  dump_backtrace+0x0/0x178
[14169.089791]  show_stack+0x28/0x38
[14169.089793]  dump_stack+0x5c/0x74
[14169.089794]  panic+0x140/0x30c
[14169.089796]  nmi_panic+0x84/0x88
[14169.089798]  arm64_serror_panic+0x84/0x90
[14169.089799]  is_valid_bugaddr+0x0/0x20
[14169.089801]  el1_error+0x84/0x104
[14169.089803]  post_ttbr_update_workaround+0x1c/0x28
[14169.089804]  cpu_do_switch_mm+0x78/0x88
[14169.089806]  check_and_switch_context+0x254/0x670
[14169.089808]  __schedule+0x814/0xac0
[14169.089810]  schedule_idle+0x28/0x48
[14169.089811]  do_idle+0x1a4/0x280
[14169.089813]  cpu_startup_entry+0x30/0x38
[14169.089815]  rest_init+0xbc/0xc8
[14169.089816]  start_kernel+0x4ec/0x51c

Seems like some kind of watchdog timer caused both nodes to be rebooted.

Comment by Andreas Dilger [ 26/Apr/23 ]

kevin.zhao, xinliang, do these stack traces mean anything to you? It looks like a semi-random crash during various different tests, but happening at the same time across multiple VMs on the same host. The above two crashes were on onyx-91vm11 and onyx-91vm12 at 06:28:56, while the below was running a totally different build/test and crashed at the same time on onyx-91vm15 at 06:28:57 (assuming that the "DEBUG MARKER" is printed at the same time):
https://testing.whamcloud.com/test_sets/41648601-abfe-46f6-9f42-8fa5e08d4b98

[ 4237.884883] Lustre: DEBUG MARKER: == sanity test 51f: check many open files limit ====== 06:28:30 (1682404110)
[ 4239.896171] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0
[ 4264.398465] SError Interrupt on CPU1, code 0xbe000000 -- SError
[ 4264.398472] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.13.1.el8_7.aarch64 #1
[ 4264.398475] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 4264.398477] pstate: 80000085 (Nzcv daIf -PAN -UAO BTYPE=--)
[ 4264.398479] pc : post_ttbr_update_workaround+0x1c/0x28
[ 4264.398481] lr : cpu_do_switch_mm+0x78/0x88
[ 4264.398564] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 4264.398567] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded 4.18.0-425.13.1.el8_7.aarch64 #1
[ 4264.398569] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 4264.398571] Call trace:
[ 4264.398573]  dump_backtrace+0x0/0x178
[ 4264.398574]  show_stack+0x28/0x38
[ 4264.398576]  dump_stack+0x5c/0x74
[ 4264.398578]  panic+0x140/0x30c
[ 4264.398579]  nmi_panic+0x84/0x88
[ 4264.398581]  arm64_serror_panic+0x84/0x90
[ 4264.398582]  is_valid_bugaddr+0x0/0x20
[ 4264.398584]  el1_error+0x84/0x104
[ 4264.398586]  post_ttbr_update_workaround+0x1c/0x28
[ 4264.398588]  cpu_do_switch_mm+0x78/0x88
[ 4264.398590]  check_and_switch_context+0x254/0x670
[ 4264.398591]  __schedule+0x814/0xac0
[ 4264.398593]  schedule_idle+0x28/0x48
[ 4264.398594]  do_idle+0x1a4/0x280
[ 4264.398596]  cpu_startup_entry+0x30/0x38
[ 4264.398598]  secondary_start_kernel+0x144/0x150

Is this some kind of hardware issue with the host CPU, or overloaded CPU/RAM and can't dedicate enough cycles to the VM and starve them for handling interrupts?

Comment by Xinliang Liu [ 04/May/23 ]

Sorry for the late reply, Andreas. Will take a look at this issue.

Comment by Xinliang Liu [ 06/May/23 ]

From the test logs, it seems that the VMs crash issue happened on host onyx-91.

From the log line "SError Interrupt on CPU1, code 0xbe000000 – SError",

We can only know this is a SError (System Error). Because the value of Exception Syndrome Register(ESR) is 0xbe000000. According to the Arm ESR doc (searching "SError" parts), bit 24 is 0, meaning no further RAS info only indicating it is a SError.

 

SError is a memory system access error, maybe not a software error.  See the definition below from Arm Exception doc:

"

System Error (SError) is an exception type that is intended to be generated by the memory system in response to unexpected events. We do not expect these events, but need to know if they have occurred. These are reported asynchronously because the instruction that triggered the event may have already been retired.

A typical example of SError is what was previously referred to as an external asynchronous abort. Examples of SError interrupts include:

  • Memory access which has passed all the MMU checks but then encounters an error on the memory bus
  • Parity or Error Correction Code (ECC) checking on some RAMs, for example those in built-in caches
  • An abort triggered by write-back of dirty data from a cache line to external memory

"

So please check if there is any kernel errors in host onyx-91.

 

 

Comment by Jian Yu [ 20/Nov/23 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/9d9ad1be-b85e-4c00-801e-4c51d88d34af

Generated at Sat Feb 10 03:26:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.