[LU-16562] sanity test_408: aarch64 crash with NULL pointer dereference at virtual address 00000000000000a0 Created: 15/Feb/23  Updated: 25/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: arm

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite runs (both aarch64 clients):
https://testing.whamcloud.com/test_sets/76f58da2-38d6-4961-a077-b0c64d6674b6
https://testing.whamcloud.com/test_sets/eeff3d31-fa01-4b09-9f15-6aa7a8e5f021

test_408 failed with the following error:

onyx-91vm11 crashed during sanity test_408

[19747.669390] Lustre: DEBUG MARKER: == sanity test 408: drop_caches should not hang due to page leaks ========================================================== 21:17:51 (1676409471)
[19747.880492] Lustre: *** cfs_fail_loc=40a, val=0***
[19747.884133] LustreError: 631631:0:(osc_request.c:2756:osc_build_rpc()) prep_req failed: -22
[19747.890192] LustreError: 631631:0:(osc_cache.c:2199:osc_check_rpcs()) Read request failed with -22
[19749.278274] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a0
[19749.282908] bash (1056941): drop_caches: 2
[19749.319912] Internal error: Oops: 96000005 [#1] SMP

[19749.356011] CPU: 0 PID: 1057184 Comm: ldlm_bl_06 Kdump: loaded 4.18.0-372.32.1.el8_6.aarch64 #1
[19749.365795] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[19749.371401] pstate: 80000005 (Nzcv daif -PAN -UAO)
[19749.375217] pc : ll_lock_cancel_bits+0x7b0/0xfa0 [lustre]
[19749.379986] lr : ll_lock_cancel_bits+0x54/0xfa0 [lustre]
[19749.449339] Process ldlm_bl_06 (pid: 1057184, stack limit = 0x00000000b232b25f)
[19749.455034] Call trace:
[19749.456943]  ll_lock_cancel_bits+0x7b0/0xfa0 [lustre]
[19749.461172]  ll_md_blocking_ast+0x1d0/0x410 [lustre]
[19749.465320]  ldlm_cancel_callback+0x74/0x368 [ptlrpc]
[19749.470399]  ldlm_cli_cancel_local+0x100/0x7a8 [ptlrpc]
[19749.474833]  ldlm_cli_cancel_list_local+0x118/0x440 [ptlrpc]
[19749.479597]  ldlm_bl_thread_main+0x920/0xc60 [ptlrpc]
[19749.483892]  kthread+0x128/0x138

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/92347 - 4.18.0-372.32.1.el8_6.aarch64
servers: https://build.whamcloud.com/job/lustre-reviews/92347 - 4.18.0-348.23.1.el8_lustre.x86_64


VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_408 - onyx-91vm11 crashed during sanity test_408



 Comments   
Comment by Andreas Dilger [ 16/Feb/23 ]

It looks like the first such failure was 2023-02-14. Patches landed in that timeframe are:

$ git log --oneline --after 2023-02-13 --before 2023-02-15 master
eed4d4c752 LU-16536 osp: don't cleanup ldlm in precleanup phase
1c8b40d5e4 LU-16493 tests: recovery-small/144b to wait longer
19c38f6c94 LU-16515 clio: Remove cl_page_size()  *
1f034cf610 LU-16532 sec: session key bad keyring *
7fe7f4ca06 LU-16520 build: Move strscpy to libcfs common header
7fcef255d2 LU-16502 lutf: cleanup lutf_start.py, fix bugs
3cd0bb6968 LU-16502 lutf: fix bugs in bash scripts
9a72c073d3 LU-16494 fileset: check fileset for operations by fid *
a2de6af65d LU-16479 utils: Add option to manage degraded ZFS OST
90e1f2ee0c LU-16428 tests: cache is_project_quota_supported result
d2b633226e LU-16382 spec: use pkgconfig() as appropriate.
941d59e7b9 LU-16382 spec: Don't include Group: tags.
9cb4b10c87 LU-14224 misc: add firewalld service configuration
3c69d46e17 LU-14111 obdclass: count eviction per obd_device ?
511bf2f4cc LU-16501 tgt: skip free inodes in OST weights
51136f2dc6 LU-6142 lov: use list_for_each_entry in lov_obd.c *
c1936c9d29 LU-14918 osd: don't declare similar zfs writes twice
9e6225b2e7 LU-14918 osd: don't declare similar ldiskfs writes twice
f16c31ccd9 LU-16454 mdt: Add a per-MDT "max_mod_rpcs_in_flight"

Patches marked with '*' affect the client, and those marked with '?' might affect the client, so there are only a few candidate patches that could have triggered this.

Comment by James A Simmons [ 25/Aug/23 ]

Do you think patch https://review.whamcloud.com/c/fs/lustre-release/+/47086 resolved this?
