LU-3331: Crash during mount/umount during stress testing

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.4.0
    • Labels: None
    • Environment: Single 2 GB VM on my local machine
    • Severity: 3
    • Rank (Obsolete): 8227

    Description

      I am doing some testing for Racer in a 2 GB VM. I am actually hitting a non-Racer issue related to mount/umount.

      I run mem-hog just so that it stresses the system without OOMing it. Every 300 seconds or so autotest unmounts and remounts the Lustre filesystem as part of its teardown/setup phase for running the tests.
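
      For reference, here is a minimal sketch of a driver that approximates that cycle. The device string, mount point, and plain mount/umount commands are assumptions on my part; autotest's real teardown/setup phase does more than this.

       /* repro sketch: remount the client roughly every 300 s while mem-hog
        * keeps memory pressure up. Device and mount point are placeholders. */
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>

       int main(void)
       {
               const char *dev = "mgs@tcp:/lustre";   /* assumed MGS/fsname */
               const char *mnt = "/mnt/lustre";       /* assumed mount point */
               char cmd[256];

               for (;;) {
                       snprintf(cmd, sizeof(cmd), "mount -t lustre %s %s", dev, mnt);
                       if (system(cmd) != 0)
                               fprintf(stderr, "mount failed\n");
                       sleep(300);                    /* tests + mem-hog run here */
                       snprintf(cmd, sizeof(cmd), "umount %s", mnt);
                       if (system(cmd) != 0)
                               fprintf(stderr, "umount failed\n");
               }
               return 0;
       }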

      I am hitting this:

      Pid: 11143, comm: mount.lustre Not tainted 2.6.32.may03master #1 innotek GmbH VirtualBox
      RIP: 0010:[<ffffffffa07c5498>]  [<ffffffffa07c5498>] lprocfs_srch+0x48/0x80 [obdclass]
      RSP: 0018:ffff88006c837a58  EFLAGS: 00010286
      RAX: ffffffffa084dac0 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88006c837b68 RDI: ffffffffa084dac0
      RBP: ffff88006c837a78 R08: 00000000fffffffe R09: 0000000000000000
      R10: 000000000000000f R11: 000000000000000f R12: ffff88006c837b68
      R13: ffffffffffffff8e R14: 0000000000000000 R15: ffff88006c837b68
      FS:  00007f27ac959700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: ffffffffffffffde CR3: 0000000013efd000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process mount.lustre (pid: 11143, threadinfo ffff88006c836000, task ffff88006c835500)
      Stack:
       ffffffffffffff8e ffff88006c837b68 ffffffffffffff8e 0000000000000000
      <d> ffff88006c837ac8 ffffffffa07c73e4 ffff88006c837a98 ffffffff81281356
      <d> 0000000000000004 0000000affffffff ffff880012804000 0000000000000006
      Call Trace:
       [<ffffffffa07c73e4>] lprocfs_register+0x34/0x100 [obdclass]
       [<ffffffff81281356>] ? vsnprintf+0x336/0x5e0
       [<ffffffffa0f0db7e>] lprocfs_register_mountpoint+0x12e/0xb20 [lustre]
       [<ffffffffa0efd3b6>] client_common_fill_super+0x1a6/0x5280 [lustre]
       [<ffffffff81281640>] ? sprintf+0x40/0x50
       [<ffffffffa0f03204>] ll_fill_super+0xd74/0x1500 [lustre]
       [<ffffffffa07f257d>] lustre_fill_super+0x3dd/0x530 [obdclass]
       [<ffffffffa07f21a0>] ? lustre_fill_super+0x0/0x530 [obdclass]
       [<ffffffff811841df>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa07e9de5>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff8118381b>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff811839c2>] do_kern_mount+0x52/0x130
       [<ffffffff811a3c72>] do_mount+0x2d2/0x8d0
       [<ffffffff811a4300>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      There are a few other messages in the logs right before or during the event:

      SLAB: cache with size 40 has lost its name
      SLAB: cache with size 192 has lost its name
      SLAB: cache with size 1216 has lost its name
      LustreError: 10029:0:(lprocfs_status.c:489:lprocfs_register())  Lproc: Attempting to register llite more than once 
      SLAB: cache with size 256 has lost its name
      SLAB: cache with size 40 has lost its name
      SLAB: cache with size 192 has lost its name
      SLAB: cache with size 1216 has lost its name
      SLAB: cache with size 256 has lost its name
      

      The SLAB messages indicate that a module unload did not go as planned.

      The "Attempting to register llite more than once" speaks for itself.

      I am actively working on this issue; I just wanted to document it here.

    People

      Assignee: WC Triage
      Reporter: Keith Mannthey (Inactive)
