[LU-8584] parallel-scale-nfsv3 test_racer_on_nfs: BUG: unable to handle kernel paging request Created: 06/Sep/16 Updated: 24/Apr/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0, Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.4, Lustre 2.12.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Saurabh Tandan (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Interop: 2.8.0 EL7.2 Server/EL7.2 Client |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/e91156c4-7276-11e6-8afd-5254006e85c2. The sub-test test_racer_on_nfs failed with the following error: test failed to respond and timed out test log: ./file_exec.sh: line 13: 23336 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null ./file_exec.sh: line 13: 25007 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null ./file_exec.sh: line 13: 26378 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null ./file_exec.sh: line 13: 31986 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null ./file_exec.sh: line 13: 1896 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null ./file_exec.sh: line 13: 1981 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null ./file_exec.sh: line 13: 2975 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null ./file_exec.sh: line 13: 6421 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null racer cleanup sleeping 5 sec ... file_create.sh: no process found dir_create.sh: no process found file_rm.sh: no process found file_rename.sh: no process found file_link.sh: no process found file_symlink.sh: no process found file_list.sh: no process found file_concat.sh: no process found file_exec.sh: no process found file_chown.sh: no process found file_chmod.sh: no process found file_mknod.sh: no process found file_truncate.sh: no process found file_delxattr.sh: no process found file_getxattr.sh: no process found file_setxattr.sh: no process found there should be NO racer processes: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND Client dmesg: df D ffff88005170fa80 0 6802 12502 0x00000080 [ 7070.189011] ffff88005170fa08 0000000000000082 ffff8800799c5c00 ffff88005170ffd8 [ 7070.189011] ffff88005170ffd8 ffff88005170ffd8 ffff8800799c5c00 ffff8800799c5c00 [ 7070.189011] ffff88007ff66c58 0000000000000082 ffffffffa0225d20 ffff88005170fa80 [ 7070.189011] Call Trace: [ 7070.189011] [<ffffffffa0225d20>] ? __rpc_wait_for_completion_task+0x30/0x30 [sunrpc] [ 7070.189011] [<ffffffff8163bc59>] schedule+0x29/0x70 [ 7070.189011] [<ffffffffa0225d55>] rpc_wait_bit_killable+0x35/0x80 [sunrpc] [ 7070.189011] [<ffffffff81639ad0>] __wait_on_bit+0x60/0x90 [ 7070.189011] [<ffffffffa0225d20>] ? __rpc_wait_for_completion_task+0x30/0x30 [sunrpc] [ 7070.189011] [<ffffffff81639b87>] out_of_line_wait_on_bit+0x87/0xb0 [ 7070.189011] [<ffffffff810a6ba0>] ? wake_atomic_t_function+0x40/0x40 [ 7070.189011] [<ffffffffa021cfe0>] ? call_bind_status+0x290/0x290 [sunrpc] [ 7070.189011] [<ffffffffa021cfe0>] ? call_bind_status+0x290/0x290 [sunrpc] [ 7070.189011] [<ffffffffa0226c44>] __rpc_execute+0x154/0x420 [sunrpc] [ 7070.189011] [<ffffffff810a6ae5>] ? wake_up_bit+0x25/0x30 [ 7070.189011] [<ffffffffa022877e>] rpc_execute+0x5e/0xa0 [sunrpc] [ 7070.189011] [<ffffffffa021e240>] rpc_run_task+0x70/0x90 [sunrpc] [ 7070.189011] [<ffffffffa021e2b0>] rpc_call_sync+0x50/0xc0 [sunrpc] [ 7070.189011] [<ffffffffa06bf5cb>] nfs3_rpc_wrapper.constprop.10+0x6b/0xb0 [nfsv3] [ 7070.189011] [<ffffffffa06c02a6>] nfs3_proc_getattr+0x56/0xb0 [nfsv3] [ 7070.189011] [<ffffffffa0514f7f>] __nfs_revalidate_inode+0xbf/0x310 [nfs] [ 7070.189011] [<ffffffffa0515722>] nfs_revalidate_inode+0x22/0x60 [nfs] [ 7070.189011] [<ffffffff811e9bdc>] ? lookup_fast+0xcc/0x2e0 [ 7070.189011] [<ffffffffa050cc5b>] nfs_weak_revalidate+0x4b/0xf0 [nfs] [ 7070.189011] [<ffffffff811e9257>] complete_walk+0x87/0xe0 [ 7070.189011] [<ffffffff811ec21f>] do_last+0x46f/0x1270 [ 7070.189011] [<ffffffff812f96f2>] ? radix_tree_lookup_slot+0x22/0x50 [ 7070.189011] [<ffffffff811eed02>] path_openat+0xc2/0x490 [ 7070.189011] [<ffffffff811f04cb>] do_filp_open+0x4b/0xb0 [ 7070.189011] [<ffffffff811fd057>] ? __alloc_fd+0xa7/0x130 [ 7070.189011] [<ffffffff811dde73>] do_sys_open+0xf3/0x1f0 [ 7070.189011] [<ffffffff811ddf8e>] SyS_open+0x1e/0x20 [ 7070.189011] [<ffffffff81646c49>] system_call_fastpath+0x16/0x1b MDS console: 20:40:23:[ 3465.447651] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 20:35:50 (1472960150) 20:40:23:[ 3499.433180] LustreError: 10949:0:(mdt_handler.c:893:mdt_getattr_internal()) lustre-MDT0000: getattr error for [0x200000bd2:0x1800:0x0]: rc = -2 20:40:23:[ 3499.438171] LustreError: 10949:0:(mdt_handler.c:893:mdt_getattr_internal()) Skipped 1 previous similar message 20:40:23:[ 3735.618177] BUG: unable to handle kernel paging request at ffffffa07d7040ff 20:40:23:[ 3735.619144] IP: [<ffffffa07d7040ff>] 0xffffffa07d7040fe 20:40:23:[ 3735.619144] PGD 194d067 PUD 0 20:40:23:[ 3735.619144] Oops: 0010 [#1] SMP 20:40:23:[ 3735.619144] Modules linked in: osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect sysimgblt virtio_blk drm_kms_helper ttm 8139too ata_piix libata drm serio_raw virtio_pci virtio_ring virtio 8139cp mii i2c_core floppy 20:40:23:[ 3735.619144] CPU: 0 PID: 13 Comm: rcuos/1 Tainted: G OE ------------ 3.10.0-327.3.1.el7_lustre.x86_64 #1 20:40:23:[ 3735.619144] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 20:40:23:[ 3735.619144] task: ffff88007c058b80 ti: ffff88007c060000 task.ti: ffff88007c060000 20:40:23:[ 3735.619144] RIP: 0010:[<ffffffa07d7040ff>] [<ffffffa07d7040ff>] 0xffffffa07d7040fe 20:40:23:[ 3735.619144] RSP: 0018:ffff88007c063e10 EFLAGS: 00010286 20:40:23:[ 3735.619144] RAX: ffff88007c063fd8 RBX: ffff880040551227 RCX: 0000000180400035 20:40:23:[ 3735.619144] RDX: ffffffa07d7040ff RSI: ffffea000188a4c0 RDI: ffff880040551227 20:40:23:[ 3735.619144] RBP: ffff88007c063ec0 R08: ffff880062293840 R09: 0000000180400035 20:40:23:[ 3735.619144] R10: ffffea000188a4c0 R11: ffffffff81123f70 R12: 0000000000000001 20:40:23:[ 3735.619144] R13: ff880008399070ff R14: 0000000000000039 R15: ffff880039ef1750 20:40:23:[ 3735.619144] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 20:40:23:[ 3735.619144] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 20:40:23:[ 3735.619144] CR2: ffffffa07d7040ff CR3: 00000000364cf000 CR4: 00000000000006f0 20:40:23:[ 3735.619144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 20:40:23:[ 3735.619144] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 This issue was first seen on 08/23/2016 |
| Comments |
| Comment by Sarah Liu [ 12/Sep/16 ] |
|
it also failed under regular config, RHEL7.2 server and client |
| Comment by Sarah Liu [ 09/Mar/17 ] |
|
another instance seen on master: https://testing.hpdd.intel.com/test_sets/8ba3a5f4-f882-11e6-887f-5254006e85c2 The OSS console 18:24:42:[47407.437891] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 18:24:15 (1487701455) 18:24:42: 18:24:42:[47411.114011] BUG: unable to handle kernel paging request at ffffffa0864ae0ff 18:24:42:[47411.114011] IP: [<ffffffa0864ae0ff>] 0xffffffa0864ae0fe 18:24:42:[47411.114011] PGD 19bd067 PUD 0 18:24:42:[47411.114011] Oops: 0010 [#1] SMP 18:24:42:[47411.114011] Modules linked in: osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi ppdev crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too cirrus drm_kms_helper crct10dif_pclmul crct10dif_common syscopyarea sysfillrect crc32c_intel serio_raw virtio_pci virtio_ring virtio 8139cp mii sysimgblt fb_sys_fops ttm ata_piix drm libata i2c_core floppy [last unloaded: lnet_selftest] 18:24:42:[47411.114011] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE ------------ 3.10.0-514.6.1.el7_lustre.x86_64 #1 18:24:42:[47411.114011] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 18:24:42:[47411.114011] task: ffffffff819c1460 ti: ffffffff819ac000 task.ti: ffffffff819ac000 18:24:42:[47411.114011] RIP: 0010:[<ffffffa0864ae0ff>] [<ffffffa0864ae0ff>] 0xffffffa0864ae0fe 18:24:42:[47411.114011] RSP: 0018:ffff88007fc03e78 EFLAGS: 00010286 18:24:42:[47411.114011] RAX: ffffffa0864ae0ff RBX: ffffffff81a1a580 RCX: 0000000003090de6 18:24:42:[47411.114011] RDX: ffff880057f9d227 RSI: ffffea000028bc00 RDI: ffff880057f9d227 18:24:42:[47411.114011] RBP: ffff88007fc03ed0 R08: 0000000000019a80 R09: ffffffff81139719 |
| Comment by Amir Shehata (Inactive) [ 20/Jul/17 ] |
|
can we get access to the core and vmlinux for this crash. I'll need that to investigate further. |
| Comment by Sarah Liu [ 20/Jul/17 ] |
|
Amir, we have core dump configed on the cluster, please refer to this wiki for location: |
| Comment by Sarah Liu [ 19/Sep/17 ] |
|
another one on 2.10.1 , for RHEL7.4 server/client ldiskfs crash dump of MDS node [35950.500759] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 02:46:50 \(1505702810\) [35950.698782] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 02:46:50 (1505702810) [35961.771804] LustreError: 1836:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x31fa:0x0] get parent: rc = -2 [35976.340533] LustreError: 1837:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x338d:0x0] get parent: rc = -2 [35976.344929] LustreError: 1837:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message [35980.978075] LustreError: 1839:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x32bb:0x0] get parent: rc = -2 [35992.178172] LustreError: 1839:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x34c9:0x0] get parent: rc = -2 [35992.182851] LustreError: 1839:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message [36001.007252] LustreError: 1835:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x3650:0x0] get parent: rc = -2 [36001.012130] LustreError: 1835:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 4 previous similar messages [36005.130074] BUG: unable to handle kernel paging request at ffffff810b7290ff [36005.131016] IP: [<ffffff810b7290ff>] 0xffffff810b7290fe [36005.131016] PGD 19f5067 PUD 0 [36005.131016] Oops: 0010 [#1] SMP [36005.131016] Modules linked in: osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon joydev i2c_piix4 parport_pc nfsd parport nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper [36005.131016] syscopyarea sysfillrect sysimgblt virtio_blk fb_sys_fops ata_piix ttm crct10dif_pclmul crct10dif_common drm crc32c_intel libata 8139too serio_raw virtio_pci virtio_ring floppy virtio 8139cp i2c_core mii [last unloaded: lnet_selftest] [36005.131016] CPU: 1 PID: 13 Comm: ksoftirqd/1 Tainted: G OE ------------ 3.10.0-693.2.2.el7_lustre.x86_64 #1 [36005.131016] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [36005.131016] task: ffff88007c09cf10 ti: ffff88007c0b0000 task.ti: ffff88007c0b0000 [36005.131016] RIP: 0010:[<ffffff810b7290ff>] [<ffffff810b7290ff>] 0xffffff810b7290fe [36005.131016] RSP: 0018:ffff88007c0b3db0 EFLAGS: 00010286 [36005.131016] RAX: ffffff810b7290ff RBX: ffffffff81a550c0 RCX: ffff88007fd10240 [36005.131016] RDX: ffff880051ad4ecf RSI: ffff8800198a1028 RDI: ffff880051ad4ecf [36005.131016] RBP: ffff88007c0b3e08 R08: 0000000000019bc0 R09: ffff88007fd10240 [36005.131016] R10: ffff88007fd19bc0 R11: ffffea00013ff040 R12: 000000000000000a [36005.131016] R13: 0000000000000000 R14: ff88007b9c68d0ff R15: ffff88007fd10260 [36005.131016] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 [36005.131016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [36005.131016] CR2: ffffff810b7290ff CR3: 0000000079f5d000 CR4: 00000000000406e0 [36005.131016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [36005.131016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [36005.131016] Stack: [36005.131016] ffffffff81139dad ffff8800198a1028 0000000000000000 ffff88007fd10288 [36005.131016] ffff88007c09cf10 ffff88007c0b3fd8 0000000000000009 0000000000000009 [36005.131016] ffffffff81a27440 ffffffff819eb108 0000000000000001 ffff88007c0b3e78 [36005.131016] Call Trace: [36005.131016] [<ffffffff81139dad>] ? rcu_process_callbacks+0x1dd/0x550 [36005.131016] [<ffffffff81090b3f>] __do_softirq+0xef/0x280 [36005.131016] [<ffffffff81090d08>] run_ksoftirqd+0x38/0x50 [36005.131016] [<ffffffff810b909f>] smpboot_thread_fn+0x12f/0x180 [36005.131016] [<ffffffff810b8f70>] ? lg_double_unlock+0x40/0x40 [36005.131016] [<ffffffff810b098f>] kthread+0xcf/0xe0 [36005.131016] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 [36005.131016] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90 [36005.131016] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 [36005.131016] Code: Bad RIP value. [36005.131016] RIP [<ffffff810b7290ff>] 0xffffff810b7290fe [36005.131016] RSP <ffff88007c0b3db0> [36005.131016] CR2: ffffff810b7290ff |
| Comment by Minh Diep [ 08/Jan/18 ] |
|
+1 on 2.10.2 https://testing.hpdd.intel.com/test_sets/096d7c0c-f3f1-11e7-854b-52540065bddc |
| Comment by Saurabh Tandan (Inactive) [ 11/Apr/18 ] |
|
+1 in 2.10.3 https://testing.hpdd.intel.com/test_sets/46461c1a-3d4d-11e8-8f8a-52540065bddc |