[LU-2940] Oops in vvp_pgcache_show() Created: 09/Mar/13  Updated: 17/Jan/14  Resolved: 10/Aug/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Major
Reporter: John Hammond Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: llite, patch, procfs

Severity: 3
Rank (Obsolete): 7058

 Description   

To reproduce start racer and run

while true; do cat /proc/fs/lustre/llite/*/dump_page_cache > /dev/null; done
crash> bt -l
PID: 8892   TASK: ffff880118dc8080  CPU: 1   COMMAND: "cat"
 #0 [ffff88013a59b920] machine_kexec at ffffffff81031f7b
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
 #1 [ffff88013a59b980] crash_kexec at ffffffff810b8c22
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/kernel/kexec.c: 1106
 #2 [ffff88013a59ba50] oops_end at ffffffff814eef80
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/arch/x86/kernel/dumpstack.c: 242
 #3 [ffff88013a59ba80] no_context at ffffffff81042a0b
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/arch/x86/mm/fault.c: 698
 #4 [ffff88013a59bad0] __bad_area_nosemaphore at ffffffff81042c95
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/arch/x86/mm/fault.c: 764
 #5 [ffff88013a59bb20] bad_area at ffffffff81042dbe
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/arch/x86/mm/fault.c: 793
 #6 [ffff88013a59bb50] __do_page_fault at ffffffff81043570
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/arch/x86/mm/fault.c: 1140
 #7 [ffff88013a59bc70] do_page_fault at ffffffff814f0f5e
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/arch/x86/mm/fault.c: 1208
 #8 [ffff88013a59bca0] page_fault at ffffffff814ee315
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/arch/x86_64/kernel/entry.S
    [exception RIP: vvp_pgcache_show+263]
    RIP: ffffffffa0df5ab7  RSP: ffff88013a59bd58  RFLAGS: 00010246
    RAX: ffff88010f064ac0  RBX: ffffea00039cdd28  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffffffffa0e02b12  RDI: ffff88010f064a00
    RBP: ffff88013a59be18   R8: 00000000fffffffb   R9: 00000000000035bb
    R10: 0000000000000002  R11: 000000000000000f  R12: ffff880132b22440
    R13: ffff88015b589db8  R14: ffff88014ff48948  R15: ffff88010f064a00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88013a59be20] seq_read at ffffffff8119a0a9
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/fs/seq_file.c: 235
#10 [ffff88013a59bea0] proc_reg_read at ffffffff811dcbde
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/fs/proc/inode.c: 202
#11 [ffff88013a59bef0] vfs_read at ffffffff811782b5
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/fs/read_write.c: 295
#12 [ffff88013a59bf30] sys_read at ffffffff811783f1
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/fs/read_write.c: 369
#13 [ffff88013a59bf80] system_call_fastpath at ffffffff8100b072
    /usr/src/debug///////kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_64/arch/x86/kernel/entry_64.S: 489
    RIP: 0000003282eda360  RSP: 00007fffaf2f4050  RFLAGS: 00010206
    RAX: 0000000000000000  RBX: ffffffff8100b072  RCX: 0000000000b68030
    RDX: 0000000000008000  RSI: 0000000000b60000  RDI: 0000000000000003
    RBP: 0000000000b60000   R8: 000000328318dee8   R9: 0000000000000001
    R10: 0000000000008fff  R11: 0000000000000246  R12: ffffffffffff8000
    R13: 0000000000000003  R14: 0000000000008000  R15: 0000000000000003
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
crash> dis -l vvp_pgcache_show+263
/root/lustre-release/lustre/llite/vvp_dev.c: 420
0xffffffffa0df5ab7 <vvp_pgcache_show+263>:      mov    (%rdx),%rcx


 Comments   
Comment by Gaurav Mahajan (Inactive) [ 29/Jul/13 ]

Hi John,

I tried reproducing the bug but could not reproduce it.

Tried following:

while true; do cat /proc/fs/lustre/llite/*/dump_page_cache > /dev/null; done &

untar kernel tarball on mount point.

Can you please add more information about how to reproduce this bug?

Comment by John Hammond [ 29/Jul/13 ]

I think you need to create a workload with more page teardown:

while true; do cat /proc/fs/lustre/llite/*/dump_page_cache &> /dev/null; done &
MOUNT_2=y ./lustre/tests/llmount.sh
sh ./lustre/tests/racer.sh
Comment by Peter Jones [ 31/Jul/13 ]

I just noticed this patch in gerrit - http://review.whamcloud.com/#/c/7192/

Comment by John Hammond [ 10/Aug/13 ]

Patch landed to master.

Comment by Jian Yu [ 17/Jan/14 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b2_4/70/ (2.4.2)
Lustre server build: http://build.whamcloud.com/job/lustre-b2_5/13/

sanity test 118a hung:

== sanity test 118a: verify O_SYNC works ============ 01:54:01 (1389779641)
7+0 records in
7+0 records out
458752 bytes (459 kB) copied, 0.00423288 s, 108 MB/s

Console log on client:

01:54:20:Lustre: DEBUG MARKER: == sanity test 118a: verify O_SYNC works ============ 01:54:01 (1389779641)
01:54:21:BUG: unable to handle kernel NULL pointer dereference at (null)
01:54:21:IP: [<ffffffffa0ac68d7>] vvp_pgcache_show+0x107/0x370 [lustre]
01:54:21:PGD 66733067 PUD 19fc1067 PMD 0 
01:54:21:Oops: 0000 [#1] SMP 
01:54:22:last sysfs file: /sys/devices/system/cpu/possible
01:54:22:CPU 1 
01:54:22:Modules linked in: ext2 lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon i2c_piix4 i2c_core 8139too 8139cp mii ext3 jbd mbcache virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
01:54:23:
01:54:23:Pid: 28122, comm: lctl Not tainted 2.6.32-358.23.2.el6.x86_64 #1 Red Hat KVM
01:54:23:RIP: 0010:[<ffffffffa0ac68d7>]  [<ffffffffa0ac68d7>] vvp_pgcache_show+0x107/0x370 [lustre]
01:54:24:RSP: 0018:ffff88007a96fd58  EFLAGS: 00010246
01:54:24:RAX: ffff88007a9960c0 RBX: ffffea0000eac3d0 RCX: 0000000000000000
01:54:24:RDX: 0000000000000000 RSI: ffffffffa0ad3c90 RDI: ffff88007a996000
01:54:24:RBP: ffff88007a96fe18 R08: 00000000ffffff20 R09: 0000000000000121
01:54:24:R10: 0000000000000002 R11: 0000000000000008 R12: ffff88004b06b940
01:54:25:R13: ffff880064057a70 R14: ffff88006665f840 R15: ffff88007a996000
01:54:25:FS:  00007fe431f94700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
01:54:25:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
01:54:26:CR2: 0000000000000000 CR3: 000000003b217000 CR4: 00000000000006e0
01:54:26:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
01:54:26:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
01:54:27:Process lctl (pid: 28122, threadinfo ffff88007a96e000, task ffff880079e26080)
01:54:27:Stack:
01:54:27: ffff88007a96fd68 ffffffffa03d8b2e ffff88007a96fd88 ffff880064057a68
01:54:27:<d> ffff880064057aa8 ffff880064057a70 ffff88004b06b940 ffff88007a96fe58
01:54:28:<d> ffff88007a96fda8 ffffffffa03d8a4e ffff88007a96fdd8 ffffffffa058d03f
01:54:28:Call Trace:
01:54:28: [<ffffffffa03d8b2e>] ? cfs_free+0xe/0x10 [libcfs]
01:54:29: [<ffffffffa03d8a4e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
01:54:29: [<ffffffffa058d03f>] ? cl_env_put+0x20f/0x370 [obdclass]
01:54:29: [<ffffffff811a54e9>] seq_read+0x289/0x400
01:54:29: [<ffffffff81057520>] ? __dequeue_entity+0x30/0x50
01:54:30: [<ffffffff811e99ae>] proc_reg_read+0x7e/0xc0
01:54:30: [<ffffffff81181ac5>] vfs_read+0xb5/0x1a0
01:54:30: [<ffffffff81181c01>] sys_read+0x51/0x90
01:54:30: [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
01:54:30: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Maloo report: https://maloo.whamcloud.com/test_sets/039a8440-7e7b-11e3-925a-52540035b04c

The patch http://review.whamcloud.com/7192 needs to be back-ported to Lustre b2_4 branch.

Generated at Sat Feb 10 01:29:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.