[LU-8584] parallel-scale-nfsv3 test_racer_on_nfs: BUG: unable to handle kernel paging request Created: 06/Sep/16  Updated: 24/Apr/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.4, Lustre 2.12.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Saurabh Tandan (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Interop: 2.8.0 EL7.2 Server/EL7.2 Client
Server: b2_8, build# 12
Client: master, build# 3431


Issue Links:
Related
is related to LU-11718 parallel-scale-nfsv3 test racer_on_nf... Open
is related to LU-11766 parallel-scale-nfsv3 test racer_on_nf... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/e91156c4-7276-11e6-8afd-5254006e85c2.

The sub-test test_racer_on_nfs failed with the following error:

test failed to respond and timed out

test log:

./file_exec.sh: line 13: 23336 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
./file_exec.sh: line 13: 25007 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
./file_exec.sh: line 13: 26378 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
./file_exec.sh: line 13: 31986 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
./file_exec.sh: line 13:  1896 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
./file_exec.sh: line 13:  1981 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
./file_exec.sh: line 13:  2975 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
./file_exec.sh: line 13:  6421 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
racer cleanup
sleeping 5 sec ...
file_create.sh: no process found
dir_create.sh: no process found
file_rm.sh: no process found
file_rename.sh: no process found
file_link.sh: no process found
file_symlink.sh: no process found
file_list.sh: no process found
file_concat.sh: no process found
file_exec.sh: no process found
file_chown.sh: no process found
file_chmod.sh: no process found
file_mknod.sh: no process found
file_truncate.sh: no process found
file_delxattr.sh: no process found
file_getxattr.sh: no process found
file_setxattr.sh: no process found
there should be NO racer processes:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND

Client dmesg:

df              D ffff88005170fa80     0  6802  12502 0x00000080
[ 7070.189011]  ffff88005170fa08 0000000000000082 ffff8800799c5c00 ffff88005170ffd8
[ 7070.189011]  ffff88005170ffd8 ffff88005170ffd8 ffff8800799c5c00 ffff8800799c5c00
[ 7070.189011]  ffff88007ff66c58 0000000000000082 ffffffffa0225d20 ffff88005170fa80
[ 7070.189011] Call Trace:
[ 7070.189011]  [<ffffffffa0225d20>] ? __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
[ 7070.189011]  [<ffffffff8163bc59>] schedule+0x29/0x70
[ 7070.189011]  [<ffffffffa0225d55>] rpc_wait_bit_killable+0x35/0x80 [sunrpc]
[ 7070.189011]  [<ffffffff81639ad0>] __wait_on_bit+0x60/0x90
[ 7070.189011]  [<ffffffffa0225d20>] ? __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
[ 7070.189011]  [<ffffffff81639b87>] out_of_line_wait_on_bit+0x87/0xb0
[ 7070.189011]  [<ffffffff810a6ba0>] ? wake_atomic_t_function+0x40/0x40
[ 7070.189011]  [<ffffffffa021cfe0>] ? call_bind_status+0x290/0x290 [sunrpc]
[ 7070.189011]  [<ffffffffa021cfe0>] ? call_bind_status+0x290/0x290 [sunrpc]
[ 7070.189011]  [<ffffffffa0226c44>] __rpc_execute+0x154/0x420 [sunrpc]
[ 7070.189011]  [<ffffffff810a6ae5>] ? wake_up_bit+0x25/0x30
[ 7070.189011]  [<ffffffffa022877e>] rpc_execute+0x5e/0xa0 [sunrpc]
[ 7070.189011]  [<ffffffffa021e240>] rpc_run_task+0x70/0x90 [sunrpc]
[ 7070.189011]  [<ffffffffa021e2b0>] rpc_call_sync+0x50/0xc0 [sunrpc]
[ 7070.189011]  [<ffffffffa06bf5cb>] nfs3_rpc_wrapper.constprop.10+0x6b/0xb0 [nfsv3]
[ 7070.189011]  [<ffffffffa06c02a6>] nfs3_proc_getattr+0x56/0xb0 [nfsv3]
[ 7070.189011]  [<ffffffffa0514f7f>] __nfs_revalidate_inode+0xbf/0x310 [nfs]
[ 7070.189011]  [<ffffffffa0515722>] nfs_revalidate_inode+0x22/0x60 [nfs]
[ 7070.189011]  [<ffffffff811e9bdc>] ? lookup_fast+0xcc/0x2e0
[ 7070.189011]  [<ffffffffa050cc5b>] nfs_weak_revalidate+0x4b/0xf0 [nfs]
[ 7070.189011]  [<ffffffff811e9257>] complete_walk+0x87/0xe0
[ 7070.189011]  [<ffffffff811ec21f>] do_last+0x46f/0x1270
[ 7070.189011]  [<ffffffff812f96f2>] ? radix_tree_lookup_slot+0x22/0x50
[ 7070.189011]  [<ffffffff811eed02>] path_openat+0xc2/0x490
[ 7070.189011]  [<ffffffff811f04cb>] do_filp_open+0x4b/0xb0
[ 7070.189011]  [<ffffffff811fd057>] ? __alloc_fd+0xa7/0x130
[ 7070.189011]  [<ffffffff811dde73>] do_sys_open+0xf3/0x1f0
[ 7070.189011]  [<ffffffff811ddf8e>] SyS_open+0x1e/0x20
[ 7070.189011]  [<ffffffff81646c49>] system_call_fastpath+0x16/0x1b

MDS console:

20:40:23:[ 3465.447651] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 20:35:50 (1472960150)
20:40:23:[ 3499.433180] LustreError: 10949:0:(mdt_handler.c:893:mdt_getattr_internal()) lustre-MDT0000: getattr error for [0x200000bd2:0x1800:0x0]: rc = -2
20:40:23:[ 3499.438171] LustreError: 10949:0:(mdt_handler.c:893:mdt_getattr_internal()) Skipped 1 previous similar message
20:40:23:[ 3735.618177] BUG: unable to handle kernel paging request at ffffffa07d7040ff
20:40:23:[ 3735.619144] IP: [<ffffffa07d7040ff>] 0xffffffa07d7040fe
20:40:23:[ 3735.619144] PGD 194d067 PUD 0 
20:40:23:[ 3735.619144] Oops: 0010 [#1] SMP 
20:40:23:[ 3735.619144] Modules linked in: osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect sysimgblt virtio_blk drm_kms_helper ttm 8139too ata_piix libata drm serio_raw virtio_pci virtio_ring virtio 8139cp mii i2c_core floppy
20:40:23:[ 3735.619144] CPU: 0 PID: 13 Comm: rcuos/1 Tainted: G           OE  ------------   3.10.0-327.3.1.el7_lustre.x86_64 #1
20:40:23:[ 3735.619144] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
20:40:23:[ 3735.619144] task: ffff88007c058b80 ti: ffff88007c060000 task.ti: ffff88007c060000
20:40:23:[ 3735.619144] RIP: 0010:[<ffffffa07d7040ff>]  [<ffffffa07d7040ff>] 0xffffffa07d7040fe
20:40:23:[ 3735.619144] RSP: 0018:ffff88007c063e10  EFLAGS: 00010286
20:40:23:[ 3735.619144] RAX: ffff88007c063fd8 RBX: ffff880040551227 RCX: 0000000180400035
20:40:23:[ 3735.619144] RDX: ffffffa07d7040ff RSI: ffffea000188a4c0 RDI: ffff880040551227
20:40:23:[ 3735.619144] RBP: ffff88007c063ec0 R08: ffff880062293840 R09: 0000000180400035
20:40:23:[ 3735.619144] R10: ffffea000188a4c0 R11: ffffffff81123f70 R12: 0000000000000001
20:40:23:[ 3735.619144] R13: ff880008399070ff R14: 0000000000000039 R15: ffff880039ef1750
20:40:23:[ 3735.619144] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
20:40:23:[ 3735.619144] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
20:40:23:[ 3735.619144] CR2: ffffffa07d7040ff CR3: 00000000364cf000 CR4: 00000000000006f0
20:40:23:[ 3735.619144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
20:40:23:[ 3735.619144] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

This issue was first seen on 08/23/2016



 Comments   
Comment by Sarah Liu [ 12/Sep/16 ]

it also failed under regular config, RHEL7.2 server and client
https://testing.hpdd.intel.com/test_sessions/8003512e-724d-11e6-8a8c-5254006e85c2

Comment by Sarah Liu [ 09/Mar/17 ]

another instance seen on master: https://testing.hpdd.intel.com/test_sets/8ba3a5f4-f882-11e6-887f-5254006e85c2

The OSS console

18:24:42:[47407.437891] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 18:24:15 (1487701455)
18:24:42:
18:24:42:[47411.114011] BUG: unable to handle kernel paging request at ffffffa0864ae0ff
18:24:42:[47411.114011] IP: [<ffffffa0864ae0ff>] 0xffffffa0864ae0fe
18:24:42:[47411.114011] PGD 19bd067 PUD 0 
18:24:42:[47411.114011] Oops: 0010 [#1] SMP 
18:24:42:[47411.114011] Modules linked in: osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi ppdev crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too cirrus drm_kms_helper crct10dif_pclmul crct10dif_common syscopyarea sysfillrect crc32c_intel serio_raw virtio_pci virtio_ring virtio 8139cp mii sysimgblt fb_sys_fops ttm ata_piix drm libata i2c_core floppy [last unloaded: lnet_selftest]
18:24:42:[47411.114011] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE  ------------   3.10.0-514.6.1.el7_lustre.x86_64 #1
18:24:42:[47411.114011] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
18:24:42:[47411.114011] task: ffffffff819c1460 ti: ffffffff819ac000 task.ti: ffffffff819ac000
18:24:42:[47411.114011] RIP: 0010:[<ffffffa0864ae0ff>]  [<ffffffa0864ae0ff>] 0xffffffa0864ae0fe
18:24:42:[47411.114011] RSP: 0018:ffff88007fc03e78  EFLAGS: 00010286
18:24:42:[47411.114011] RAX: ffffffa0864ae0ff RBX: ffffffff81a1a580 RCX: 0000000003090de6
18:24:42:[47411.114011] RDX: ffff880057f9d227 RSI: ffffea000028bc00 RDI: ffff880057f9d227
18:24:42:[47411.114011] RBP: ffff88007fc03ed0 R08: 0000000000019a80 R09: ffffffff81139719
Comment by Amir Shehata (Inactive) [ 20/Jul/17 ]

can we get access to the core and vmlinux for this crash. I'll need that to investigate further.

Comment by Sarah Liu [ 20/Jul/17 ]

Amir, we have core dump configed on the cluster, please refer to this wiki for location:
https://wiki.hpdd.intel.com/display/TEI/Core+dump+location+for+autotest+nodes

Comment by Sarah Liu [ 19/Sep/17 ]

another one on 2.10.1 , for RHEL7.4 server/client ldiskfs
https://testing.hpdd.intel.com/test_sets/bfce3ad8-9c2e-11e7-b778-5254006e85c2

crash dump of MDS node
/scratch/dumps/trevis-46vm3.trevis.hpdd.intel.com/10.9.6.30-2017-09-18-02:47:50

[35950.500759] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 02:46:50 \(1505702810\)
[35950.698782] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 02:46:50 (1505702810)
[35961.771804] LustreError: 1836:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x31fa:0x0] get parent: rc = -2
[35976.340533] LustreError: 1837:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x338d:0x0] get parent: rc = -2
[35976.344929] LustreError: 1837:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message
[35980.978075] LustreError: 1839:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x32bb:0x0] get parent: rc = -2
[35992.178172] LustreError: 1839:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x34c9:0x0] get parent: rc = -2
[35992.182851] LustreError: 1839:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message
[36001.007252] LustreError: 1835:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x2000255c2:0x3650:0x0] get parent: rc = -2
[36001.012130] LustreError: 1835:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 4 previous similar messages
[36005.130074] BUG: unable to handle kernel paging request at ffffff810b7290ff
[36005.131016] IP: [<ffffff810b7290ff>] 0xffffff810b7290fe
[36005.131016] PGD 19f5067 PUD 0 
[36005.131016] Oops: 0010 [#1] SMP 
[36005.131016] Modules linked in: osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr virtio_balloon joydev i2c_piix4 parport_pc nfsd parport nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper
[36005.131016]  syscopyarea sysfillrect sysimgblt virtio_blk fb_sys_fops ata_piix ttm crct10dif_pclmul crct10dif_common drm crc32c_intel libata 8139too serio_raw virtio_pci virtio_ring floppy virtio 8139cp i2c_core mii [last unloaded: lnet_selftest]
[36005.131016] CPU: 1 PID: 13 Comm: ksoftirqd/1 Tainted: G           OE  ------------   3.10.0-693.2.2.el7_lustre.x86_64 #1
[36005.131016] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[36005.131016] task: ffff88007c09cf10 ti: ffff88007c0b0000 task.ti: ffff88007c0b0000
[36005.131016] RIP: 0010:[<ffffff810b7290ff>]  [<ffffff810b7290ff>] 0xffffff810b7290fe
[36005.131016] RSP: 0018:ffff88007c0b3db0  EFLAGS: 00010286
[36005.131016] RAX: ffffff810b7290ff RBX: ffffffff81a550c0 RCX: ffff88007fd10240
[36005.131016] RDX: ffff880051ad4ecf RSI: ffff8800198a1028 RDI: ffff880051ad4ecf
[36005.131016] RBP: ffff88007c0b3e08 R08: 0000000000019bc0 R09: ffff88007fd10240
[36005.131016] R10: ffff88007fd19bc0 R11: ffffea00013ff040 R12: 000000000000000a
[36005.131016] R13: 0000000000000000 R14: ff88007b9c68d0ff R15: ffff88007fd10260
[36005.131016] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[36005.131016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[36005.131016] CR2: ffffff810b7290ff CR3: 0000000079f5d000 CR4: 00000000000406e0
[36005.131016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[36005.131016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[36005.131016] Stack:
[36005.131016]  ffffffff81139dad ffff8800198a1028 0000000000000000 ffff88007fd10288
[36005.131016]  ffff88007c09cf10 ffff88007c0b3fd8 0000000000000009 0000000000000009
[36005.131016]  ffffffff81a27440 ffffffff819eb108 0000000000000001 ffff88007c0b3e78
[36005.131016] Call Trace:
[36005.131016]  [<ffffffff81139dad>] ? rcu_process_callbacks+0x1dd/0x550
[36005.131016]  [<ffffffff81090b3f>] __do_softirq+0xef/0x280
[36005.131016]  [<ffffffff81090d08>] run_ksoftirqd+0x38/0x50
[36005.131016]  [<ffffffff810b909f>] smpboot_thread_fn+0x12f/0x180
[36005.131016]  [<ffffffff810b8f70>] ? lg_double_unlock+0x40/0x40
[36005.131016]  [<ffffffff810b098f>] kthread+0xcf/0xe0
[36005.131016]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[36005.131016]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[36005.131016]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[36005.131016] Code:  Bad RIP value.
[36005.131016] RIP  [<ffffff810b7290ff>] 0xffffff810b7290fe
[36005.131016]  RSP <ffff88007c0b3db0>
[36005.131016] CR2: ffffff810b7290ff

Comment by Minh Diep [ 08/Jan/18 ]

+1 on 2.10.2

https://testing.hpdd.intel.com/test_sets/096d7c0c-f3f1-11e7-854b-52540065bddc

Comment by Saurabh Tandan (Inactive) [ 11/Apr/18 ]

+1 in 2.10.3

https://testing.hpdd.intel.com/test_sets/46461c1a-3d4d-11e8-8f8a-52540065bddc

Generated at Sat Feb 10 02:18:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.