[LU-11222] parallel-scale-nfsv3 test racer_on_nfs crashes with ‘BUG: unable to handle kernel paging request at ffffffc09d0c20ff’ Created: 06/Aug/18 Updated: 25/Jun/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5, Lustre 2.10.7, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Description |
|
parallel-scale-nfsv3 test_racer_on_nfs crashes. The MDS console log and the kernel-crash log at https://testing.whamcloud.com/test_sets/2d3ea616-98f5-11e8-b0aa-52540065bddc have the same stack trace:

[30173.104517] LustreError: 1490:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200024df2:0x17c44:0x0] get parent: rc = -2
[30173.105932] LustreError: 1490:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message
[30196.764494] LustreError: 1488:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200024df2:0x17e63:0x0] get parent: rc = -2
[30196.766036] LustreError: 1488:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message
[30207.660367] BUG: unable to handle kernel paging request at ffffffc09d0c20ff
[30207.661195] IP: [<ffffffc09d0c20ff>] 0xffffffc09d0c20ff
[30207.661746] PGD 32a12067 PUD 0
[30207.662121] Oops: 0010 [#1] SMP
[30207.665969] Modules linked in: nfsd nfs_acl osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi i2c_piix4 ppdev crc32_pclmul ghash_clmulni_intel virtio_balloon aesni_intel i2c_core lrw gf128mul joydev pcspkr glue_helper ablk_helper cryptd parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too crct10dif_pclmul crct10dif_common ata_piix crc32c_intel libata serio_raw 8139cp mii virtio_pci virtio_ring virtio floppy [last unloaded: lnet_selftest]
[30207.675994] CPU: 0 PID: 28926 Comm: mdt00_000 Kdump: loaded Tainted: G OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1
[30207.677274] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[30207.677887] task: ffff90e06779bf40 ti: ffff90e035688000 task.ti: ffff90e035688000
[30207.678612] RIP: 0010:[<ffffffc09d0c20ff>] [<ffffffc09d0c20ff>] 0xffffffc09d0c20ff
[30207.679379] RSP: 0018:ffff90e07fc03eb8 EFLAGS: 00010286
[30207.679905] RAX: ffffffc09d0c20ff RBX: ffffffff9d273000 RCX: 0000000003b861f2
[30207.680591] RDX: ffff90e05c09fc27 RSI: ffff90e05fea8200 RDI: ffff90e05c09fc27
[30207.681286] RBP: ffff90e07fc03f10 R08: 000000000001ba80 R09: ffffffff9c74b2fc
[30207.681976] R10: ffff90e07fc1ba80 R11: ffffd7c3c1a28f40 R12: 000000000000000a
[30207.682663] R13: 0000000000000005 R14: ff90e05fea9e28ff R15: ffff90e07fc14340
[30207.683355] FS: 0000000000000000(0000) GS:ffff90e07fc00000(0000) knlGS:0000000000000000
[30207.684139] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30207.684696] CR2: ffffffc09d0c20ff CR3: 0000000077eb8000 CR4: 00000000000606f0
[30207.685398] Call Trace:
[30207.685677] <IRQ>
[30207.716757] [<ffffffff9c74b2b0>] ? rcu_process_callbacks+0x1e0/0x580
[30207.718073] [<ffffffff9c69b085>] __do_softirq+0xf5/0x280
[30207.719573] [<ffffffff9cd23cec>] call_softirq+0x1c/0x30
[30207.720706] [<ffffffff9c62d625>] do_softirq+0x65/0xa0
[30207.721233] [<ffffffff9c69b405>] irq_exit+0x105/0x110
[30207.721743] [<ffffffff9cd25068>] smp_apic_timer_interrupt+0x48/0x60
[30207.722396] [<ffffffff9cd217b2>] apic_timer_interrupt+0x162/0x170
[30207.723013] <EOI>
[30207.726126] [<ffffffff9c95ae40>] ? memcpy+0x10/0x110
[30207.726693] [<ffffffff9c958194>] ? vsnprintf+0x234/0x6a0
[30207.727282] [<ffffffffc0763305>] libcfs_debug_vmsg2+0x2f5/0xb40 [libcfs]
[30207.727973] [<ffffffffc0763ba7>] libcfs_debug_msg+0x57/0x80 [libcfs]
[30207.728607] [<ffffffff9c74a39d>] ? call_rcu_sched+0x1d/0x20
[30207.757212] [<ffffffffc0bf0432>] ldlm_handle_enqueue0+0x2b2/0x16a0 [ptlrpc]
[30207.758514] [<ffffffffc0c18e00>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[30207.765482] [<ffffffffc0c76452>] tgt_enqueue+0x62/0x210 [ptlrpc]
[30207.766199] [<ffffffffc0c7a38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[30207.766932] [<ffffffffc0c22e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[30207.767745] [<ffffffff9c6c52ab>] ? __wake_up_common+0x5b/0x90
[30207.768357] [<ffffffffc0c26592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[30207.769008] [<ffffffffc0c25b00>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[30207.769743] [<ffffffff9c6bb621>] kthread+0xd1/0xe0
[30207.770236] [<ffffffff9c6bb550>] ? insert_kthread_work+0x40/0x40
[30207.770840] [<ffffffff9cd205f7>] ret_from_fork_nospec_begin+0x21/0x21
[30207.771487] [<ffffffff9c6bb550>] ? insert_kthread_work+0x40/0x40
[30207.772094] Code: Bad RIP value.
[30207.772483] RIP [<ffffffc09d0c20ff>] 0xffffffc09d0c20ff
[30207.773039] RSP <ffff90e07fc03eb8>
[30207.773390] CR2: ffffffc09d0c20ff

There are many instances of racer_on_nfs crashing recently, but I haven’t been able to find one with a matching stack trace. For this crash, we are testing RHEL 7.5 servers with ldiskfs targets and SLES 12 SP3 clients. |
| Comments |
| Comment by James Nunez (Inactive) [ 16/Aug/18 ] |
|
There are a couple of test sessions that crash with a kernel paging request error but a different stack trace:

[95227.228427] LustreError: 3086:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message
[95273.131467] BUG: unable to handle kernel paging request at ffffffc0cc6c20ff
[95273.132427] IP: [<ffffffc0cc6c20ff>] 0xffffffc0cc6c20ff
[95273.132984] PGD 48e12067 PUD 0
[95273.133341] Oops: 0010 [#1] SMP
[95273.133749] Modules linked in: nfsd nfs_acl osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel ppdev lrw gf128mul glue_helper ablk_helper cryptd i2c_piix4 pcspkr joydev parport_pc virtio_balloon i2c_core parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix
[95273.142187] virtio_blk crct10dif_pclmul crct10dif_common 8139too libata crc32c_intel 8139cp virtio_pci serio_raw virtio_ring virtio mii floppy [last unloaded: lnet_selftest]
[95273.143969] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G W OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1
[95273.145142] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[95273.145703] task: ffff94443c18af70 ti: ffff94443c198000 task.ti: ffff94443c198000
[95273.146423] RIP: 0010:[<ffffffc0cc6c20ff>] [<ffffffc0cc6c20ff>] 0xffffffc0cc6c20ff
[95273.147202] RSP: 0018:ffff94443fd03eb8 EFLAGS: 00010286
[95273.147732] RAX: ffffffc0cc6c20ff RBX: ffffffffade73000 RCX: 0000000009fd5f41
[95273.148427] RDX: ffff94442ec26027 RSI: ffffe2b301bb0980 RDI: ffff94442ec26027
[95273.149132] RBP: ffff94443fd03f10 R08: 000000000001ba80 R09: ffffffffad34b2fc
[95273.149828] R10: ffff94443fd1ba80 R11: ffffe2b301831540 R12: 000000000000000a
[95273.150520] R13: 0000000000000010 R14: ff94442ec27828ff R15: ffff94443fd14340
[95273.151208] FS: 0000000000000000(0000) GS:ffff94443fd00000(0000) knlGS:0000000000000000
[95273.152004] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95273.152568] CR2: ffffffc0cc6c20ff CR3: 0000000078f36000 CR4: 00000000000606e0
[95273.153260] Call Trace:
[95273.153524] <IRQ>
[95273.153762] [<ffffffffad34b2b0>] ? rcu_process_callbacks+0x1e0/0x580
[95273.154511] [<ffffffffad29b085>] __do_softirq+0xf5/0x280
[95273.155056] [<ffffffffad923cec>] call_softirq+0x1c/0x30
[95273.155589] [<ffffffffad22d625>] do_softirq+0x65/0xa0
[95273.156093] [<ffffffffad29b405>] irq_exit+0x105/0x110
[95273.156605] [<ffffffffad925068>] smp_apic_timer_interrupt+0x48/0x60
[95273.157224] [<ffffffffad9217b2>] apic_timer_interrupt+0x162/0x170
[95273.157834] <EOI>
[95273.158035] [<ffffffffad915e80>] ? __cpuidle_text_start+0x8/0x8
[95273.158656] [<ffffffffad916086>] ? native_safe_halt+0x6/0x10
[95273.159209] [<ffffffffad915e9e>] default_idle+0x1e/0xc0
[95273.159732] [<ffffffffad2356f3>] arch_cpu_idle+0x23/0xb0
[95273.160263] [<ffffffffad2f335a>] cpu_startup_entry+0x14a/0x1e0
[95273.160852] [<ffffffffad255f97>] start_secondary+0x1f7/0x270
[95273.161423] [<ffffffffad2000d5>] start_cpu+0x5/0x14
[95273.161907] Code: Bad RIP value.
[95273.162284] RIP [<ffffffc0cc6c20ff>] 0xffffffc0cc6c20ff
[95273.162834] RSP <ffff94443fd03eb8>
[95273.163179] CR2: ffffffc0cc6c20ff

https://testing.whamcloud.com/test_sets/9b2ef2f4-a109-11e8-8ee3-52540065bddc

Is this the same issue, or does it require a different ticket? |
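One way to approach the "same issue or not" question is to compare the call-trace frames of two oopses mechanically. The following is a minimal Python sketch, not part of the Lustre test tooling: the names `FRAME_RE`, `call_trace`, and `common_prefix` are hypothetical, and the regular expression assumes the RHEL 7 (3.10 kernel) oops format seen in the logs in this ticket.

```python
import re

# Assumed frame format (taken from the logs in this ticket), e.g.
#   [<ffffffff9c69b085>] __do_softirq+0xf5/0x280
#   [<ffffffffc0763305>] libcfs_debug_vmsg2+0x2f5/0xb40 [libcfs]
# An optional "? " marks a frame the kernel unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]{16}>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def call_trace(log_text):
    """Return the reliable (non-'?') symbols that follow 'Call Trace:'."""
    _, _, tail = log_text.partition("Call Trace:")
    return [m.group(2) for m in FRAME_RE.finditer(tail) if not m.group(1)]


def common_prefix(a, b):
    """Return the leading frames that two traces share, in order."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared
```

Calling `common_prefix(call_trace(log_a), call_trace(log_b))` on two console logs gives the frames they have in common from the top of the stack. For the two traces in this ticket, the interrupt-side frames (`__do_softirq` through `apic_timer_interrupt`) are the same and only the interrupted context differs, but a shared prefix alone is not proof of a common root cause.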
| Comment by Sarah Liu [ 25/Mar/22 ] |
|
+1 in interop testing between 2.12.7 server and master client |
| Comment by Sarah Liu [ 15/Jun/22 ] |
|
another one in interop: https://testing.whamcloud.com/test_sets/7fc6fb10-6700-42b4-a627-c965ec0c0959 |