[LU-4678] obdfilter-survey repeated run causes crash on OST Created: 27/Feb/14  Updated: 03/Mar/14  Resolved: 03/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Nathaniel Clark Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

lustre-master #1909


Issue Links:
Related
is related to LU-3319 Adapt to 3.10 upstream kernel proc_di... Resolved
Severity: 3
Rank (Obsolete): 12853

 Description   

Rerunning obdfilter-survey.sh on ldiskfs without rebooting causes the following BUG on the OST:

BUG: unable to handle kernel paging request at 00000000deadbeef
IP: [<ffffffff8128a002>] strlen+0x2/0x30
PGD 12017067 PUD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/module/lquota/initstate
CPU 0 
Modules linked in: osd_ldiskfs(+)(U) ldiskfs(U) lquota(U) lfsck(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) ptlrpc_gss(U) ost(U) obdecho(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) exportfs jbd sha512_generic sha256_generic crc32c_intel nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipv6 ppdev parport_pc parport zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate btusb bluetooth rfkill snd_ens1371 snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 vmware_balloon sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptspi mptscsih mptbase scsi_transport_spi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ptlrpc_gss]

Pid: 21719, comm: modprobe Tainted: P           ---------------    2.6.32-431.5.1.el6_lustre.gb1c0d36.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffff8128a002>]  [<ffffffff8128a002>] strlen+0x2/0x30
RSP: 0018:ffff880023f37e70  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000deadbeef RCX: 0000000000000000
RDX: 000000000000000d RSI: ffff8800089ef5c0 RDI: 00000000deadbeef
RBP: ffff880023f37ea8 R08: 0000000000000002 R09: 0000000000000000
R10: ffff88002c163920 R11: 000000000000000c R12: ffffffffa225ab09
R13: ffff8800089ef5c0 R14: 00000000deadbeef R15: 00000000deadbeef
FS:  00007f3759d5e700(0000) GS:ffff880003200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000deadbeef CR3: 00000000270d0000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 21719, threadinfo ffff880023f36000, task ffff88002b427540)
Stack:
 ffffffffa13fc5fe ffff880023f37ea8 0000000000000000 ffffffffa225ab09
<d> ffff880011b8da40 ffffffffa225ab09 00000000ffffff8e ffff880023f37ef8
<d> ffffffffa13f503b 0000000000000000 ffffffffa22633e0 ffff880023f37ef8
Call Trace:
 [<ffffffffa13fc5fe>] ? lprocfs_try_remove_proc_entry+0x2e/0x130 [obdclass]
 [<ffffffffa13f503b>] class_register_type+0x5cb/0xe10 [obdclass]
 [<ffffffffa228a000>] ? osd_mod_init+0x0/0x5a [osd_ldiskfs]
 [<ffffffffa228a042>] osd_mod_init+0x42/0x5a [osd_ldiskfs]
 [<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0
 [<ffffffff810bc521>] sys_init_module+0xe1/0x250
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: 01 00 0f b6 10 f6 82 a0 29 af 81 20 74 13 0f 1f 00 48 83 c0 01 0f b6 10 f6 82 a0 29 af 81 20 75 f0 c9 c3 66 0f 1f 44 00 00 31 c0 <80> 3f 00 55 48 89 fa 48 89 e5 74 11 66 90 48 83 c2 01 80 3a 00 
RIP  [<ffffffff8128a002>] strlen+0x2/0x30
 RSP <ffff880023f37e70>
CR2: 00000000deadbeef


 Comments   
Comment by James A Simmons [ 03/Mar/14 ]

I see the error. It appears that class_register_type is failing to load osd-ldiskfs. Because it fails, the code jumps to the failed: label, where the first thing it does is free obd_type->typ_name, which is still needed for the later call to lprocfs_try_remove_proc_entry. I will submit a patch soon.
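
The ordering problem described above can be sketched in plain userspace C. This is a minimal illustration, not the Lustre code or the eventual patch: struct demo_type, demo_register_type, demo_remove_proc_entry, and demo_last_removed_len are hypothetical stand-ins, and the registration is hard-wired to fail so that the cleanup path always runs. It shows the cleanup order that avoids the fault: read the name while it is still valid, then free it. The faulting address 00000000deadbeef in the strlen frame appears consistent with the name pointer having been invalidated before it was read.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct obd_type; only the name field matters here. */
struct demo_type {
	char *typ_name;
};

/* Length of the last name passed to demo_remove_proc_entry(). */
size_t demo_last_removed_len;

/*
 * Hypothetical stand-in for lprocfs_try_remove_proc_entry().
 * Like the real function in the trace, it calls strlen() on the name,
 * so the name must still be valid memory when this runs.
 */
void demo_remove_proc_entry(const char *name)
{
	demo_last_removed_len = strlen(name);
}

/*
 * Simulated registration that always fails after the name has been
 * allocated, so the cleanup path is exercised. Returns -1 on failure.
 */
int demo_register_type(const char *name)
{
	struct demo_type *type = calloc(1, sizeof(*type));

	if (type == NULL)
		return -1;

	type->typ_name = malloc(strlen(name) + 1);
	if (type->typ_name == NULL) {
		free(type);
		return -1;
	}
	strcpy(type->typ_name, name);

	/* Assume a later setup step failed here; clean up in a safe order. */

	/* 1. Use the name while it is still allocated. */
	demo_remove_proc_entry(type->typ_name);

	/* 2. Only now release the name and the type itself. */
	free(type->typ_name);
	type->typ_name = NULL;
	free(type);

	return -1;
}
```

Swapping steps 1 and 2 would reproduce the class of bug reported here: the removal helper would call strlen on a pointer that has already been freed.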

Comment by James A Simmons [ 03/Mar/14 ]

I integrated this fix into http://review.whamcloud.com/#/c/9038. That patch was already fixing other problems with class_register_type. Can someone link this to LU-3319?

Comment by Peter Jones [ 03/Mar/14 ]

OK, James. I have marked this as a duplicate of LU-3319 to reflect that the fix is tracked there.
