[LU-15920] Interop parallel-scale-nfsv4: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 Created: 07/Jun/22 Updated: 05/Apr/23
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.9, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d5c9683a-6143-4ddc-8225-daa1bb8cb3fe

Before parallel-scale-nfsv4 ran, all test nodes crashed as follows:

[27499.656301] SysRq : Trigger a crash
[27499.657194] BUG: unable to handle kernel NULL pointer dereference at (null)
[27499.658657] IP: [<ffffffff86a75596>] sysrq_handle_crash+0x16/0x20
[27499.660775] Oops: 0002 [#1] SMP
[27499.677098] CPU: 1 PID: 7956 Comm: bash Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.49.1.el7.x86_64 #1
[27499.678989] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[27499.681238] RIP: 0010:[<ffffffff86a75596>] [<ffffffff86a75596>] sysrq_handle_crash+0x16/0x20
[27499.693443] Call Trace:
[27499.693909] [<ffffffff86a75dbd>] __handle_sysrq+0x10d/0x170
[27499.694905] [<ffffffff86a76228>] write_sysrq_trigger+0x28/0x40
[27499.695966] [<ffffffff868c76f0>] proc_reg_write+0x40/0x80
[27499.696947] [<ffffffff8684e590>] vfs_write+0xc0/0x1f0
[27499.699015] [<ffffffff8684f36f>] SyS_write+0x7f/0xf0
[27499.701145] [<ffffffff86d95f92>] system_call_fastpath+0x25/0x2a

The kdump capture kernel then boots:

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-1160.49.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Nov 30 15:51:32 UTC 2021
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1160.49.1.el7.x86_64 ro console=ttyS0,115200 console=tty0 console=ttyS0,115200 net.ifnames=0 LANG=en_US.UTF-8 slab_nomerge slub_nomerge irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 elfcorehdr=867700K
| Comments |
| Comment by Andreas Dilger [ 07/Jun/22 ] |
This looks like "sysrq-c" or "echo c > /proc/sysrq-trigger" was run on the nodes. I suspect that autotest/ljb was doing this, perhaps because of a problem in a previous test session or something similar.
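For reference, the write path in the stack traces (write_sysrq_trigger → __handle_sysrq → sysrq_handle_crash) is the standard kernel crash hook. A minimal sketch of the trigger, assuming root on a kdump-enabled node:

```sh
# Deliberately crash the kernel via sysrq (root required; kdump should be
# configured so a vmcore is captured). This produces exactly the
# "SysRq : Trigger a crash" + NULL pointer dereference oops shown above.
echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
echo c > /proc/sysrq-trigger      # sysrq_handle_crash() writes through a NULL pointer
```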
| Comment by Sarah Liu [ 14/Jun/22 ] |
A similar error happened in 2.12.9 testing with an Ubuntu client; in conf-sanity test_83 the MDS console shows:

[25136.581679] pickup 4951 7538652.811805 20 120 0.000000 9.608621 0.000000 0 /
[25136.583882] bash 4998 7538802.056886 27 120 0.000000 10.687356 0.000000 0 /
[25136.586090] R bash 5031 7539058.583248 2 120 0.000000 239.691301 0.000000 0 /
[25136.588298]
[25138.305731] SysRq : Trigger a crash
[25138.306658] BUG: unable to handle kernel NULL pointer dereference at (null)
[25138.308193] IP: [<ffffffff91075746>] sysrq_handle_crash+0x16/0x20
[25138.309346] PGD 80000000b9ef1067 PUD bb79d067 PMD 0
[25138.310370] Oops: 0002 [#1] SMP
[25138.311052] Modules linked in: dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper pcspkr ablk_helper cryptd joydev virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix virtio_net net_failover virtio_blk failover crct10dif_pclmul crct10dif_common crc32c_intel libata serio_raw virtio_pci virtio_ring virtio floppy [last unloaded: dm_flakey]
[25138.328431] CPU: 0 PID: 5080 Comm: bash Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.49.1.el7_lustre.x86_64 #1
[25138.330535] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[25138.331592] task: ffff9966d2eea100 ti: ffff9966bda8c000 task.ti: ffff9966bda8c000
[25138.332945] RIP: 0010:[<ffffffff91075746>] [<ffffffff91075746>] sysrq_handle_crash+0x16/0x20
[25138.334517] RSP: 0018:ffff9966bda8fe58 EFLAGS: 00010246
[25138.335489] RAX: ffffffff91075730 RBX: ffffffff918e74a0 RCX: 0000000000000000
[25138.336770] RDX: 0000000000000000 RSI: ffff9966ffc138d8 RDI: 0000000000000063
[25138.338069] RBP: ffff9966bda8fe58 R08: ffffffff91c0387c R09: ffff9966c7d60f00
[25138.339372] R10: 000000000000467c R11: 000000000000467b R12: 0000000000000063
[25138.340657] R13: 0000000000000000 R14: 0000000000000008 R15: 0000000000000000
[25138.341946] FS: 00007fb90a545740(0000) GS:ffff9966ffc00000(0000) knlGS:0000000000000000
[25138.343407] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25138.344455] CR2: 0000000000000000 CR3: 000000009a4a0000 CR4: 00000000000606f0
[25138.345742] Call Trace:
[25138.346217] [<ffffffff91075f6d>] __handle_sysrq+0x10d/0x170
[25138.347246] [<ffffffff910763d8>] write_sysrq_trigger+0x28/0x40
[25138.348324] [<ffffffff90ec78a0>] proc_reg_write+0x40/0x80
[25138.349325] [<ffffffff90e4e590>] vfs_write+0xc0/0x1f0
[25138.350267] [<ffffffff91395ed5>] ? system_call_after_swapgs+0xa2/0x13a
[25138.351462] [<ffffffff90e4f36f>] SyS_write+0x7f/0xf0
[25138.352385] [<ffffffff91395ed5>] ? system_call_after_swapgs+0xa2/0x13a
[25138.353598] [<ffffffff91395f92>] system_call_fastpath+0x25/0x2a
[25138.354682] [<ffffffff91395ed5>] ? system_call_after_swapgs+0xa2/0x13a
[25138.355869] Code: eb 9b 45 01 f4 45 39 65 34 75 e5 4c 89 ef e8 e2 f7 ff ff eb db 66 66 66 66 90 55 48 89 e5 c7 05 91 27 7d 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 66 66 66 66 90 55 31 c0 c7 05 0e
[25138.361612] RIP [<ffffffff91075746>] sysrq_handle_crash+0x16/0x20
[25138.362768] RSP <ffff9966bda8fe58>
[25138.363421] CR2: 0000000000000000

The kdump capture kernel then boots:

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-1160.49.1.el7_lustre.x86_64 (jenkins@onyx-203-el7-x8664-1.onyx.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jun 2 13:15:53 UTC 2022
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1160.49.1.el7_lustre.x86_64 ro console=ttyS0,115200 console=tty0 console=ttyS0,115200 net.ifnames=0 LANG=en_US.UTF-8 slab_nomerge slub_nomerge irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 elfcorehdr=867700K
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009f7ff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000021000000-0x0000000034f5cfff] usable
[    0.000000] BIOS-e820: [mem 0x0000000034fff800-0x0000000034ffffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bfffa000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] SMBIOS 2.4 present.
[    0.000000] DMI: Red Hat KVM, BIOS 0.5.1 01/01/2011
| Comment by Andreas Dilger [ 15/Jun/22 ] |
This is again not so much a "bug" as something (a user or test script) triggering a crash with "echo c > /proc/sysrq-trigger". That might be autotest timing out the test and trying to reboot the node, or the test script doing a 'hard' failover; I'm not sure. It is likely something in the autotest area, so I've added Charlie and Lee to the CC list.
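One quick way to tell a forced dump from a genuine crash in a captured console log: a deliberate sysrq dump always announces itself on the line immediately before the BUG line, as in both traces above. A sketch, where console.log is a placeholder for whatever console capture the session produced:

```sh
# A forced dump prints "SysRq : Trigger a crash" right before the BUG line;
# a real NULL dereference does not. Non-empty output means a forced dump.
grep -B1 'BUG: unable to handle kernel NULL pointer dereference' console.log |
    grep 'SysRq : Trigger a crash'
```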
| Comment by Charlie Olmstead [ 15/Jun/22 ] |
Since https://jira.whamcloud.com/browse/ATM-2248 was deployed, AT forces crash dumps on all nodes in the cluster upon a timeout. In the session referenced in the description, parallel-scale-nfsv4 actually started but timed out in the suite's startup phase. AFAIK, each suite goes through this startup phase, but it is not recorded in auster's results.yml file, which can make situations like this confusing. There is also a suite cleanup phase that is likewise not recorded.

autotest_execution.log - https://testing.whamcloud.com/test_logs/8ae1cf61-4779-449e-8431-e63e8620123e/show_text

2022-06-01T14:09:16 Suite parallel-scale-nfsv4 started
2022-06-01T14:10:15 Test running for 397 minutes
2022-06-01T14:11:15 Test running for 398 minutes
2022-06-01T14:12:15 Test running for 399 minutes
...
2022-06-01T15:06:15 Test running for 453 minutes
2022-06-01T15:07:15 Test running for 454 minutes
2022-06-01T15:08:15 Test running for 455 minutes
2022-06-01T15:09:10 Timing session out! Timeout has been exceeded but the test framework is still running
2022-06-01T15:09:10 Killing test framework...
2022-06-01T15:09:10 Test framework killed
2022-06-01T15:09:10 Suite parallel-scale-nfsv4 timed out, collecting logs...
2022-06-01T15:09:10 capturing timeout logs from trevis-42vm1
2022-06-01T15:09:15 Test running for 456 minutes
2022-06-01T15:09:15 capturing timeout logs from trevis-42vm2
2022-06-01T15:09:19 capturing timeout logs from trevis-42vm3
2022-06-01T15:09:26 capturing timeout logs from trevis-42vm4
2022-06-01T15:09:31 Forcing crash dumps on all nodes...
2022-06-01T15:09:31 trevis-42vm1 - Forcing crash dump...
2022-06-01T15:09:31 trevis-42vm2 - Forcing crash dump...
2022-06-01T15:09:31 trevis-42vm3 - Forcing crash dump...
2022-06-01T15:09:31 trevis-42vm4 - Forcing crash dump...
2022-06-01T15:10:15 Test running for 457 minutes
2022-06-01T15:10:15 Forcing crash dumps complete
2022-06-01T15:10:15 Collecting crash files...
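The "Forcing crash dump..." step above amounts to the same sysrq write described in the earlier comments, fanned out to every node. A minimal sketch of that fan-out, assuming passwordless root ssh to the nodes; the node names are taken from the log above, and the real implementation lives in autotest:

```sh
# Hypothetical sketch of the "Forcing crash dumps on all nodes" step;
# autotest's actual implementation may differ.
for node in trevis-42vm1 trevis-42vm2 trevis-42vm3 trevis-42vm4; do
    # Crash the node so kdump writes a vmcore. The ssh connection drops
    # when the kernel panics, so run each in the background.
    ssh -o ConnectTimeout=10 root@"$node" 'echo c > /proc/sysrq-trigger' &
done
wait    # reap the background ssh processes
```

This is consistent with every node in the session showing the same sysrq_handle_crash oops: the crashes were forced deliberately on timeout rather than hit by the tests themselves.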