[LU-2400] BUG: unable to handle kernel paging request in lprocfs_srch+0x48/0x80 Created: 28/Nov/12  Updated: 23/Jul/18  Resolved: 23/Jul/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Prakash Surya (Inactive) Assignee: Alex Zhuravlev
Resolution: Won't Fix Votes: 0
Labels: llnl

Severity: 3
Rank (Obsolete): 5695

 Description   

I was running the replay-single test in a loop and eventually hit this:

BUG: unable to handle kernel paging request at ffffffffffffffde
IP: [<ffffffffa12d4f28>] lprocfs_srch+0x48/0x80 [obdclass]
PGD 1a87067 PUD 1a88067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/module/ptlrpc/initstate
CPU 0 
Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) osd_zfs(U) lquota(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) jbd sha512_generic sha256_generic ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc fuse autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate vhost_net macvtap macvlan tun kvm virtio_balloon snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk pata_acpi ata_generic ata_piix virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]

Pid: 2886, comm: mount.lustre Tainted: P           ---------------    2.6.32-279.9.1.1chaos.ch5.1.x86_64 #1 Bochs Bochs
RIP: 0010:[<ffffffffa12d4f28>]  [<ffffffffa12d4f28>] lprocfs_srch+0x48/0x80 [obdclass]
RSP: 0018:ffff88001cc599a8  EFLAGS: 00010286
RAX: ffffffffa135ccc0 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88001cc59ab8 RDI: ffffffffa135ccc0
RBP: ffff88001cc599c8 R08: 00000000fffffffe R09: 0000000000000000
R10: 000000000000000f R11: 000000000000000f R12: ffff88001cc59ab8
R13: ffffffffffffff8e R14: 0000000000000000 R15: ffff88001cc59ab8
FS:  00007fa9dee12700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffffffffde CR3: 000000001bf30000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount.lustre (pid: 2886, threadinfo ffff88001cc58000, task ffff880002053540)
Stack:
 ffffffffffffff8e ffff88001cc59ab8 ffffffffffffff8e 0000000000000000
<d> ffff88001cc59a18 ffffffffa12d7084 ffff88001cc599e8 ffff8800195f6000
<d> ffff88001cc59a88 ffffffff8127d0a6 ffff88001c874000 0000000000000006
Call Trace:
 [<ffffffffa12d7084>] lprocfs_register+0x34/0x100 [obdclass]
 [<ffffffff8127d0a6>] ? vsnprintf+0x2b6/0x5f0
 [<ffffffffa109e3ce>] lprocfs_register_mountpoint+0x12e/0xb80 [lustre]
 [<ffffffffa108f226>] client_common_fill_super+0x1a6/0x4e70 [lustre]
 [<ffffffff8127d420>] ? sprintf+0x40/0x50
 [<ffffffffa1094c64>] ll_fill_super+0xd74/0x1500 [lustre]
 [<ffffffffa130fb6c>] lustre_fill_super+0x12c/0x1ae0 [obdclass]
 [<ffffffff8117cb16>] ? set_anon_super+0x56/0x100
 [<ffffffff8116a03c>] ? pcpu_alloc+0x3ac/0xa50
 [<ffffffff8127a24a>] ? strlcpy+0x4a/0x60
 [<ffffffff8117d9e3>] ? sget+0x3e3/0x480
 [<ffffffff8117cac0>] ? set_anon_super+0x0/0x100
 [<ffffffffa130fa40>] ? lustre_fill_super+0x0/0x1ae0 [obdclass]
 [<ffffffff8117df2f>] get_sb_nodev+0x5f/0xa0
 [<ffffffffa12fb185>] lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8117db8b>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8117dd32>] do_kern_mount+0x52/0x130
 [<ffffffff8119c442>] do_mount+0x2d2/0x8d0
 [<ffffffff8119cad0>] sys_mount+0x90/0xe0
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Code: a1 49 89 f4 e8 ca b0 22 e0 4d 85 ed 75 1d 48 c7 c7 c0 cc 35 a1 e8 69 2b dc df 48 83 c4 08 48 89 d8 5b 41 5c 41 5d c9 c3 0f 1f 00 <49> 8b 5d 50 48 85 db 75 10 eb d8 0f 1f 44 00 00 48 8b 5b 40 48 
RIP  [<ffffffffa12d4f28>] lprocfs_srch+0x48/0x80 [obdclass]
 RSP <ffff88001cc599a8>
CR2: ffffffffffffffde
---[ end trace 0ce3c640bfa74514 ]---
Kernel panic - not syncing: Fatal exception
Pid: 2886, comm: mount.lustre Tainted: P      D    ---------------    2.6.32-279.9.1.1chaos.ch5.1.x86_64 #1
Call Trace:
 [<ffffffff814fdceb>] ? panic+0xa0/0x168
 [<ffffffff81501e84>] ? oops_end+0xe4/0x100
 [<ffffffff81043bfb>] ? no_context+0xfb/0x260
 [<ffffffff81043e85>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff81043f53>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff810446b1>] ? __do_page_fault+0x321/0x480
 [<ffffffff8127c23d>] ? pointer+0x8d/0x830
 [<ffffffffa06ff799>] ? _ldlm_lock_debug+0x189/0x5d0 [ptlrpc]
 [<ffffffff81503e3e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff815011f5>] ? page_fault+0x25/0x30
 [<ffffffffa12d4f28>] ? lprocfs_srch+0x48/0x80 [obdclass]
 [<ffffffffa12d4f06>] ? lprocfs_srch+0x26/0x80 [obdclass]
 [<ffffffffa12d7084>] ? lprocfs_register+0x34/0x100 [obdclass]
 [<ffffffff8127d0a6>] ? vsnprintf+0x2b6/0x5f0
 [<ffffffffa109e3ce>] ? lprocfs_register_mountpoint+0x12e/0xb80 [lustre]
 [<ffffffffa108f226>] ? client_common_fill_super+0x1a6/0x4e70 [lustre]
 [<ffffffff8127d420>] ? sprintf+0x40/0x50
 [<ffffffffa1094c64>] ? ll_fill_super+0xd74/0x1500 [lustre]
 [<ffffffffa130fb6c>] ? lustre_fill_super+0x12c/0x1ae0 [obdclass]
 [<ffffffff8117cb16>] ? set_anon_super+0x56/0x100
 [<ffffffff8116a03c>] ? pcpu_alloc+0x3ac/0xa50
 [<ffffffff8127a24a>] ? strlcpy+0x4a/0x60
 [<ffffffff8117d9e3>] ? sget+0x3e3/0x480
 [<ffffffff8117cac0>] ? set_anon_super+0x0/0x100
 [<ffffffffa130fa40>] ? lustre_fill_super+0x0/0x1ae0 [obdclass]
 [<ffffffff8117df2f>] ? get_sb_nodev+0x5f/0xa0
 [<ffffffffa12fb185>] ? lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8117db8b>] ? vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8117dd32>] ? do_kern_mount+0x52/0x130
 [<ffffffff8119c442>] ? do_mount+0x2d2/0x8d0
 [<ffffffff8119cad0>] ? sys_mount+0x90/0xe0
 [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b

The reproducer:

$ while true; do sudo FSTYPE=zfs ONLY="59 60" sh ./lustre/tests/replay-single.sh; done

It took about 6 hours to hit in my single-node VM.

There are also some other messages on the console just prior to the crash which I don't fully understand, but I'll post them here for completeness:

SLAB: cache with size 256 has lost its name
... snip ...
SLAB: cache with size 1152 has lost its name
LustreError: 958:0:(lprocfs_status.c:492:lprocfs_register())  Lproc: Attempting to register llite more than once 
SLAB: cache with size 256 has lost its name
... snip ...
SLAB: cache with size 1152 has lost its name
kmem_cache_create: duplicate cache lustre_inode_cache
SLAB: cache with size 256 has lost its name
... snip ...
SLAB: cache with size 1152 has lost its name
LustreError: 2522:0:(lprocfs_status.c:492:lprocfs_register())  Lproc: Attempting to register llite more than once 
SLAB: cache with size 256 has lost its name
... snip ...
SLAB: cache with size 1152 has lost its name
LustreError: 11-0: lustre-MDT0000-osp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11
LustreError: 11-0: lustre-MDT0000-osp-OST0000: Communicating with 0@lo, operation mds_connect failed with -11
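
For what it's worth, decoding the oops a bit: the trapping bytes in the "Code:" line, 49 8b 5d 50, disassemble to "mov 0x50(%r13),%rbx", and R13 holds ffffffffffffff8e, i.e. -114 (-EALREADY); R13 + 0x50 is exactly the faulting address reported in CR2 (ffffffffffffffde). That would be consistent with the "Attempting to register llite more than once" failure above handing back an error pointer that later gets walked as if it were a real proc_dir_entry parent, though I haven't confirmed that in the source. Below is a toy userspace sketch (not Lustre code; the ERR_PTR/IS_ERR macros are simplified copies of the kernel's include/linux/err.h, and it assumes x86_64) just to show the address arithmetic:

/*
 * Toy illustration only: dereferencing a member at offset 0x50 of an
 * ERR_PTR(-EALREADY) value lands on 0xffffffffffffffde, the CR2 above.
 */
#include <stdio.h>
#include <stdint.h>
#include <errno.h>

#define MAX_ERRNO	4095
#define ERR_PTR(err)	((void *)(long)(err))
#define IS_ERR(ptr)	((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)

int main(void)
{
	void *parent = ERR_PTR(-EALREADY);           /* -114 -> 0xffffffffffffff8e, the value in R13 */
	uintptr_t fault = (uintptr_t)parent + 0x50;  /* the "mov 0x50(%r13),%rbx" load that trapped */

	printf("parent         = %p\n", parent);
	printf("parent + 0x50  = 0x%lx\n", (unsigned long)fault);  /* 0xffffffffffffffde == CR2 */
	printf("IS_ERR(parent) = %d\n", (int)IS_ERR(parent));      /* 1: an IS_ERR() check would catch it */
	return 0;
}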


 Comments   
Comment by Peter Jones [ 29/Nov/12 ]

Alex, what do you think about this one?

Comment by Andreas Dilger [ 30/Nov/12 ]

Looks like there was a leak in some of the slabs, and then remounting caused the client to be unhappy. I don't think this really has anything to do with Orion code changes.
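
For reference, on 2.6.32 the "SLAB: cache with size N has lost its name" message is printed by kmem_cache_create() when it finds an existing cache whose name string lived in memory belonging to a since-unloaded module, i.e. some module exited without destroying a cache it created; "kmem_cache_create: duplicate cache lustre_inode_cache" is it refusing to create a second cache under a name that is still registered. A minimal, hypothetical module sketch of that leak pattern (made-up names, not Lustre code):

/*
 * Hypothetical example module: create a kmem_cache at init and skip
 * kmem_cache_destroy() at exit. The stale cache left behind is what
 * later triggers the "lost its name" / "duplicate cache" messages
 * from subsequent kmem_cache_create() calls.
 */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/errno.h>

static struct kmem_cache *leaky_cache;

static int __init leaky_init(void)
{
	/* 1152 matches one of the sizes in the console messages above. */
	leaky_cache = kmem_cache_create("leaky_inode_cache", 1152, 0, 0, NULL);
	return leaky_cache ? 0 : -ENOMEM;
}

static void __exit leaky_exit(void)
{
	/*
	 * Deliberately missing: kmem_cache_destroy(leaky_cache);
	 * after rmmod the cache remains registered with a name pointer
	 * into freed module memory.
	 */
}

module_init(leaky_init);
module_exit(leaky_exit);
MODULE_LICENSE("GPL");

Unloading and reloading such a module produces both kinds of message, which matches the remount churn mentioned above.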

Comment by Alex Zhuravlev [ 17/Jan/13 ]

Oleg fixed a leak (see http://review.whamcloud.com/#change,4150), but that was back on Oct 6.
Any new instances?

Comment by Prakash Surya (Inactive) [ 17/Jan/13 ]

No, I only hit this the one time, but I haven't been trying to hit it either. I could let the reproducer run overnight in a VM, if you'd like, to see if it triggers. Would that be useful?

Comment by Alex Zhuravlev [ 17/Jan/13 ]

Yes, please do if you have a chance.

Comment by Alex Zhuravlev [ 23/Jul/18 ]

No more reports.
