[LU-15920] Interop parallel-scale-nfsv4: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 Created: 07/Jun/22  Updated: 05/Apr/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.9, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15950 conf-sanity test_101:Timeout occurred... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d5c9683a-6143-4ddc-8225-daa1bb8cb3fe

Before parallel-scale-nfsv4 ran, all test nodes crashed as

[27499.656301] SysRq : Trigger a crash
[27499.657194] BUG: unable to handle kernel NULL pointer dereference at           (null)
[27499.658657] IP: [<ffffffff86a75596>] sysrq_handle_crash+0x16/0x20
[27499.660775] Oops: 0002 [#1] SMP 
27499.677098] CPU: 1 PID: 7956 Comm: bash Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.49.1.el7.x86_64 #1
[27499.678989] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[27499.681238] RIP: 0010:[<ffffffff86a75596>]  [<ffffffff86a75596>] sysrq_handle_crash+0x16/0x20
[27499.693443] Call Trace:
[27499.693909]  [<ffffffff86a75dbd>] __handle_sysrq+0x10d/0x170
[27499.694905]  [<ffffffff86a76228>] write_sysrq_trigger+0x28/0x40
[27499.695966]  [<ffffffff868c76f0>] proc_reg_write+0x40/0x80
[27499.696947]  [<ffffffff8684e590>] vfs_write+0xc0/0x1f0
[27499.699015]  [<ffffffff8684f36f>] SyS_write+0x7f/0xf0
[27499.701145]  [<ffffffff86d95f92>] system_call_fastpath+0x25/0x2a
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-1160.49.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Nov 30 15:51:32 UTC 2021
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1160.49.1.el7.x86_64 ro console=ttyS0,115200 console=tty0 console=ttyS0,115200 net.ifnames=0 LANG=en_US.UTF-8 slab_nomerge slub_nomerge irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 elfcorehdr=867700K


 Comments   
Comment by Andreas Dilger [ 07/Jun/22 ]
[27499.656301] SysRq : Trigger a crash
[27499.657194] BUG: unable to handle kernel NULL pointer dereference at           (null)
[27499.658657] IP: [<ffffffff86a75596>] sysrq_handle_crash+0x16/0x20

This looks like "sysrq-c" or "echo c > /proc/sysrq-trigger" was run on the nodes. I would suspect that autotest/ljb was doing this, maybe because of a problem in a previous test session, or similar?

Comment by Sarah Liu [ 14/Jun/22 ]

A similar error happened in 2.12.9 testing with Ubuntu client, in conf-sanity test_83
https://testing.whamcloud.com/test_sets/504767ec-1f5b-46c8-a1fa-2f39e5aad0d4

MDS console shows

[25136.581679]           pickup  4951   7538652.811805        20   120         0.000000         9.608621         0.000000 0 /
[25136.583882]             bash  4998   7538802.056886        27   120         0.000000        10.687356         0.000000 0 /
[25136.586090] R           bash  5031   7539058.583248         2   120         0.000000       239.691301         0.000000 0 /
[25136.588298] 
[25138.305731] SysRq : Trigger a crash
[25138.306658] BUG: unable to handle kernel NULL pointer dereference at           (null)
[25138.308193] IP: [<ffffffff91075746>] sysrq_handle_crash+0x16/0x20
[25138.309346] PGD 80000000b9ef1067 PUD bb79d067 PMD 0 
[25138.310370] Oops: 0002 [#1] SMP 
[25138.311052] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper pcspkr ablk_helper cryptd joydev virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix virtio_net net_failover virtio_blk failover crct10dif_pclmul crct10dif_common crc32c_intel libata serio_raw virtio_pci virtio_ring virtio floppy [last unloaded: dm_flakey]
[25138.328431] CPU: 0 PID: 5080 Comm: bash Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.49.1.el7_lustre.x86_64 #1
[25138.330535] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[25138.331592] task: ffff9966d2eea100 ti: ffff9966bda8c000 task.ti: ffff9966bda8c000
[25138.332945] RIP: 0010:[<ffffffff91075746>]  [<ffffffff91075746>] sysrq_handle_crash+0x16/0x20
[25138.334517] RSP: 0018:ffff9966bda8fe58  EFLAGS: 00010246
[25138.335489] RAX: ffffffff91075730 RBX: ffffffff918e74a0 RCX: 0000000000000000
[25138.336770] RDX: 0000000000000000 RSI: ffff9966ffc138d8 RDI: 0000000000000063
[25138.338069] RBP: ffff9966bda8fe58 R08: ffffffff91c0387c R09: ffff9966c7d60f00
[25138.339372] R10: 000000000000467c R11: 000000000000467b R12: 0000000000000063
[25138.340657] R13: 0000000000000000 R14: 0000000000000008 R15: 0000000000000000
[25138.341946] FS:  00007fb90a545740(0000) GS:ffff9966ffc00000(0000) knlGS:0000000000000000
[25138.343407] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25138.344455] CR2: 0000000000000000 CR3: 000000009a4a0000 CR4: 00000000000606f0
[25138.345742] Call Trace:
[25138.346217]  [<ffffffff91075f6d>] __handle_sysrq+0x10d/0x170
[25138.347246]  [<ffffffff910763d8>] write_sysrq_trigger+0x28/0x40
[25138.348324]  [<ffffffff90ec78a0>] proc_reg_write+0x40/0x80
[25138.349325]  [<ffffffff90e4e590>] vfs_write+0xc0/0x1f0
[25138.350267]  [<ffffffff91395ed5>] ? system_call_after_swapgs+0xa2/0x13a
[25138.351462]  [<ffffffff90e4f36f>] SyS_write+0x7f/0xf0
[25138.352385]  [<ffffffff91395ed5>] ? system_call_after_swapgs+0xa2/0x13a
[25138.353598]  [<ffffffff91395f92>] system_call_fastpath+0x25/0x2a
[25138.354682]  [<ffffffff91395ed5>] ? system_call_after_swapgs+0xa2/0x13a
[25138.355869] Code: eb 9b 45 01 f4 45 39 65 34 75 e5 4c 89 ef e8 e2 f7 ff ff eb db 66 66 66 66 90 55 48 89 e5 c7 05 91 27 7d 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 66 66 66 66 90 55 31 c0 c7 05 0e 
[25138.361612] RIP  [<ffffffff91075746>] sysrq_handle_crash+0x16/0x20
[25138.362768]  RSP <ffff9966bda8fe58>
[25138.363421] CR2: 0000000000000000
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-1160.49.1.el7_lustre.x86_64 (jenkins@onyx-203-el7-x8664-1.onyx.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jun 2 13:15:53 UTC 2022
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1160.49.1.el7_lustre.x86_64 ro console=ttyS0,115200 console=tty0 console=ttyS0,115200 net.ifnames=0 LANG=en_US.UTF-8 slab_nomerge slub_nomerge irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 elfcorehdr=867700K
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009f7ff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000021000000-0x0000000034f5cfff] usable
[    0.000000] BIOS-e820: [mem 0x0000000034fff800-0x0000000034ffffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bfffa000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] SMBIOS 2.4 present.
[    0.000000] DMI: Red Hat KVM, BIOS 0.5.1 01/01/2011
Comment by Andreas Dilger [ 15/Jun/22 ]
[25138.305731] SysRq : Trigger a crash
[25138.306658] BUG: unable to handle kernel NULL pointer dereference at           (null)
[25138.308193] IP: [<ffffffff91075746>] sysrq_handle_crash+0x16/0x20

This is again not so much a "bug" but rather something (user or test script) that is triggering a crash with "echo c > /proc/sysrq-trigger". That might be autotest timing out the test and trying to reboot the node, or the test script doing 'hard' failover, I'm not sure. It is likely to be something in the autotest area, so I've added Charlie and Lee to the CC list.

Comment by Charlie Olmstead [ 15/Jun/22 ]

Since https://jira.whamcloud.com/browse/ATM-2248 was deployed, AT forces crash dumps on all nodes in the cluster upon a timeout.

In the session referenced in the description, parallel-scale-nfsv4 actually started but timed out in the suite's startup phase. AFAIK, each suite goes through this startup phase but it is not recorded in auster's results.yml file which can make situations like this confusing. There is also a suite cleanup phase which is also not recorded.

autotest_execution.log - https://testing.whamcloud.com/test_logs/8ae1cf61-4779-449e-8431-e63e8620123e/show_text

2022-06-01T14:09:16 Suite parallel-scale-nfsv4 started
2022-06-01T14:10:15 Test running for 397 minutes
2022-06-01T14:11:15 Test running for 398 minutes
2022-06-01T14:12:15 Test running for 399 minutes
...
2022-06-01T15:06:15 Test running for 453 minutes
2022-06-01T15:07:15 Test running for 454 minutes
2022-06-01T15:08:15 Test running for 455 minutes
2022-06-01T15:09:10 Timing session out! Timeout has been exceeded but the test framework is still running
2022-06-01T15:09:10 Killing test framework...
2022-06-01T15:09:10 Test framework killed
2022-06-01T15:09:10 Suite parallel-scale-nfsv4 timed out, collecting logs...
2022-06-01T15:09:10 capturing timeout logs from trevis-42vm1
2022-06-01T15:09:15 Test running for 456 minutes
2022-06-01T15:09:15 capturing timeout logs from trevis-42vm2
2022-06-01T15:09:19 capturing timeout logs from trevis-42vm3
2022-06-01T15:09:26 capturing timeout logs from trevis-42vm4
2022-06-01T15:09:31 Forcing crash dumps on all nodes...
2022-06-01T15:09:31 trevis-42vm1 - Forcing crash dump...
2022-06-01T15:09:31 trevis-42vm2 - Forcing crash dump...
2022-06-01T15:09:31 trevis-42vm3 - Forcing crash dump...
2022-06-01T15:09:31 trevis-42vm4 - Forcing crash dump...
2022-06-01T15:10:15 Test running for 457 minutes
2022-06-01T15:10:15 Forcing crash dumps complete
2022-06-01T15:10:15 Collecting crash files...

 

Generated at Sat Feb 10 03:22:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.