Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Fix Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Labels: None
    • Environment: CONFIG_DEBUG_SLAB=y
    • Severity: 3
    • Rank (Obsolete): 4237

    Description

      Lustre: DEBUG MARKER: == sanity test 103: acl test ========================================================================= 19:57:07 (1346774227)
      /work/lustre/head/clean/lustre/utils/l_getidentity
      Slab corruption (Tainted: P --------------- ): size-2048 start=dac6c470, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      310: 02 00 00 00 01 00 07 00 ff ff ff ff 02 00 05 00
      320: 01 00 00 00 02 00 07 00 02 00 00 00 04 00 07 00
      330: ff ff ff ff 10 00 07 00 ff ff ff ff 20 00 05 00
      340: ff ff ff ff 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      Next obj: start=dac6cc88, len=2048
      Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
      Last user: [<dff39e58>](cfs_free+0x8/0x10 [libcfs])
      000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
      010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

      02000000:00000010:1.0:1346774231.327841:1804:3373:0:(sec_null.c:217:null_alloc_repbuf()) kmalloced 'req->rq_repbuf': 2048 at dac6c470.
      ...

      02000000:00000010:1.0:1346774231.328361:836:3373:0:(sec_null.c:231:null_free_repbuf()) kfreed 'req->rq_repbuf': 2048 at dac6c470.
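
      For readers unfamiliar with the CONFIG_DEBUG_SLAB output above: the allocator fills freed objects with the poison byte 0x6b and brackets them with redzones, so any later write through a stale pointer leaves non-poison bytes that are reported the next time the object is allocated or freed. The log shows req->rq_repbuf kmalloced and then kfreed at dac6c470, the same address as the corrupted object. A minimal, hypothetical kernel-module sketch of that access pattern (illustration only, not code from this ticket):

      /* Hypothetical sketch of a use-after-free write that produces
       * "Slab corruption ... Redzone ... Last user" under CONFIG_DEBUG_SLAB=y. */
      #include <linux/module.h>
      #include <linux/slab.h>
      #include <linux/string.h>

      static int __init uaf_demo_init(void)
      {
              char *buf = kmalloc(2048, GFP_KERNEL);  /* cf. req->rq_repbuf */

              if (buf == NULL)
                      return -ENOMEM;
              kfree(buf);             /* object is now poisoned with 0x6b bytes */
              memset(buf, 0xff, 16);  /* late write through the stale pointer */
              /* The next kmalloc()/kfree() cycle on this object finds the
               * non-0x6b bytes and reports the last legitimate freer,
               * here cfs_free(), exactly as in the log above. */
              return 0;
      }
      module_init(uaf_demo_init);
      MODULE_LICENSE("GPL");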

    Attachments

    Issue Links

    Activity

            [LU-1823] sanity/103: slab corruption
            adilger Andreas Dilger added a comment - - edited

             Keith, can you please fix Yu Jian's patches that hit build failures?

            The 2.3.50 patch failed to build due to built-in version checks, so it needs to be rebased one patch later (git hash 388111848489ef99b1fa31ce8fef255ab9c08e84). I haven't investigated the other failure, but hopefully it is similarly trivial. Please get to this ASAP so that the testing can be started on these patches, and hopefully we can isolate this serious defect more quickly.

            yujian Jian Yu added a comment -

            Hi Keith,

             I created several test patches per the following comment from Andreas:

            If there are no obvious sources of this corruption, it probably makes sense to submit this test patch as several separate changes, each based on one of the recent 2.2.* tags, to see if we can isolate when this corruption started.

            Patch on tag 2.2.94: http://review.whamcloud.com/#change,3921
            Patch on tag 2.3.50: http://review.whamcloud.com/#change,3918
            Patch on tag 2.2.93: http://review.whamcloud.com/#change,3919
            Patch on tag 2.2.92: http://review.whamcloud.com/#change,3920

            Hope we can isolate the issue.


             keith Keith Mannthey (Inactive) added a comment -

             Keith local vm MDS panic -v1 dmesg
            keith Keith Mannthey (Inactive) added a comment - - edited

             I acquired some Toro nodes today and am starting to set up. My MDS VM crashed while running "REFORMAT=y ONLY=103 sh sanity.sh" (it took about 30 hours to trigger). This could be the bad cfs_free path that is corrupting the slab.

            I will try and attach the whole dmesg.

            This was master + kernel-2.6.32-279 on the MDS vm node.

             
             Lustre: DEBUG MARKER: == sanity test 103: acl test =========================================== 06:06:43 (1347109603)
            kfree_debugcheck: out of range ptr 6000100000002h.
            ------------[ cut here ]------------
            kernel BUG at mm/slab.c:2911!
            invalid opcode: 0000 [#1] SMP
            last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/PNP0C0A:00/power_supply/BAT0/energy_full
            CPU 0
            Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) autofs4 sunrpc ipv6 ppdev parport_pc parport microcode i2c_piix4 i2c_core snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ahci pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
            
            Pid: 24218, comm: jbd2/dm-2-8 Not tainted 2.6.32.masterDEBUG11A #1 innotek GmbH VirtualBox
            RIP: 0010:[<ffffffff81162530>]  [<ffffffff81162530>] kfree_debugcheck+0x30/0x40
            RSP: 0018:ffff88002733dba0  EFLAGS: 00010082
            RAX: 0000000000000039 RBX: 0006000100000002 RCX: 0000000000007a74
            RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
            RBP: ffff88002733dbb0 R08: 0000000000000000 R09: ffffffff8163acc0
            R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000202
            R13: 0006000100000002 R14: ffff880024d9d298 R15: ffff880024d9d298
            FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
            CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
            CR2: 0000003ac2ef5170 CR3: 000000003d0e0000 CR4: 00000000000006f0
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Process jbd2/dm-2-8 (pid: 24218, threadinfo ffff88002733c000, task ffff88003d640ae0)
            Stack:
             ffff880000000020 ffffffffa035ebae ffff88002733dc00 ffffffff8116594b
            <d> ffff88002851f720 ffff88003f810080 ffff88002733dc20 0006000100000002
            <d> ffff880024d9d240 0000000000000000 ffff880024d9d298 ffff880024d9d298
            Call Trace:
             [<ffffffffa035ebae>] ? cfs_free+0xe/0x10 [libcfs]
             [<ffffffff8116594b>] kfree+0x5b/0x2a0
             [<ffffffffa035ebae>] cfs_free+0xe/0x10 [libcfs]
             [<ffffffffa04ceb73>] lu_global_key_fini+0xa3/0xf0 [obdclass]
             [<ffffffffa04cf380>] key_fini+0x60/0x190 [obdclass]
             [<ffffffffa04cf4df>] keys_fini+0x2f/0x120 [obdclass]
             [<ffffffffa04cf5fd>] lu_context_fini+0x2d/0xc0 [obdclass]
             [<ffffffffa0b86aa2>] osd_trans_commit_cb+0xe2/0x2b0 [osd_ldiskfs]
             [<ffffffffa0a3f21a>] ldiskfs_journal_commit_callback+0x8a/0xc0 [ldiskfs]
             [<ffffffffa00a18af>] jbd2_journal_commit_transaction+0x110f/0x1530 [jbd2]
             [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
             [<ffffffff8107eabb>] ? try_to_del_timer_sync+0x7b/0xe0
             [<ffffffffa00a7128>] kjournald2+0xb8/0x220 [jbd2]
             [<ffffffff81091d66>] kthread+0x96/0xa0
             [<ffffffff8100c14a>] child_rip+0xa/0x20
             [<ffffffff81091cd0>] ? kthread+0x0/0xa0
             [<ffffffff8100c140>] ? child_rip+0x0/0x20
            Code: 48 83 ec 08 0f 1f 44 00 00 48 89 fb e8 7a 67 ee ff 84 c0 74 07 48 83 c4 08 5b c9 c3 48 89 de 48 c7 c7 c8 0b 7a 81 e8 ed cc 39 00 <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41
            RIP  [<ffffffff81162530>] kfree_debugcheck+0x30/0x40
             RSP <ffff88002733dba0>
            ---[ end trace ff4011ce2a20c79c ]---
            Kernel panic - not syncing: Fatal exception
            Pid: 24218, comm: jbd2/dm-2-8 Tainted: G      D    ---------------    2.6.32.masterDEBUG11A #1
            Call Trace:
             [<ffffffff814ff155>] ? panic+0xa0/0x168
             [<ffffffff815032e4>] ? oops_end+0xe4/0x100
             [<ffffffff8100f26b>] ? die+0x5b/0x90
             [<ffffffff81502bb4>] ? do_trap+0xc4/0x160
             [<ffffffff8100ce35>] ? do_invalid_op+0x95/0xb0
             [<ffffffff81162530>] ? kfree_debugcheck+0x30/0x40
             [<ffffffffa036def3>] ? libcfs_debug_vmsg2+0x4e3/0xb60 [libcfs]
             [<ffffffff8100bedb>] ? invalid_op+0x1b/0x20
             [<ffffffff81162530>] ? kfree_debugcheck+0x30/0x40
             [<ffffffffa035ebae>] ? cfs_free+0xe/0x10 [libcfs]
             [<ffffffff8116594b>] ? kfree+0x5b/0x2a0
             [<ffffffffa035ebae>] ? cfs_free+0xe/0x10 [libcfs]
             [<ffffffffa04ceb73>] ? lu_global_key_fini+0xa3/0xf0 [obdclass]
             [<ffffffffa04cf380>] ? key_fini+0x60/0x190 [obdclass]
             [<ffffffffa04cf4df>] ? keys_fini+0x2f/0x120 [obdclass]
             [<ffffffffa04cf5fd>] ? lu_context_fini+0x2d/0xc0 [obdclass]
             [<ffffffffa0b86aa2>] ? osd_trans_commit_cb+0xe2/0x2b0 [osd_ldiskfs]
             [<ffffffffa0a3f21a>] ? ldiskfs_journal_commit_callback+0x8a/0xc0 [ldiskfs]
             [<ffffffffa00a18af>] ? jbd2_journal_commit_transaction+0x110f/0x1530 [jbd2]
             [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
             [<ffffffff8107eabb>] ? try_to_del_timer_sync+0x7b/0xe0
             [<ffffffffa00a7128>] ? kjournald2+0xb8/0x220 [jbd2]
             [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
             [<ffffffffa00a7070>] ? kjournald2+0x0/0x220 [jbd2]
             [<ffffffff81091d66>] ? kthread+0x96/0xa0
             [<ffffffff8100c14a>] ? child_rip+0xa/0x20
             [<ffffffff81091cd0>] ? kthread+0x0/0xa0
             [<ffffffff8100c140>] ? child_rip+0x0/0x20
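
             For context, the "kfree_debugcheck: out of range ptr" message and the BUG at mm/slab.c:2911 come from the slab debug check that validates every pointer handed to kfree(). A paraphrased, simplified sketch of the 2.6.32-era check in mm/slab.c:

             /* Paraphrased from mm/slab.c (2.6.32, CONFIG_DEBUG_SLAB); simplified. */
             static void kfree_debugcheck(const void *objp)
             {
                     if (!virt_addr_valid(objp)) {
                             printk(KERN_ERR "kfree_debugcheck: out of range ptr %lxh.\n",
                                    (unsigned long)objp);
                             BUG();  /* -> "kernel BUG at mm/slab.c" as in the trace */
                     }
             }

             Note that the bogus pointer 0x6000100000002 is built from small structured values rather than resembling a kernel address, which would be consistent with the lu_context key pointer itself having been overwritten before lu_global_key_fini() freed it.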
            

             keith Keith Mannthey (Inactive) added a comment -

             Moving kernels does not seem to reproduce the issue, so it is not a lead. I am going to try some client nodes tomorrow. I saw the error on the MDS as well on my initial master run but have not seen it since.
            yujian Jian Yu added a comment -

            Per the above test report, the slab corruption issue occurred only on the MDS (fat-intel-2):

            fat-intel-2: Slab corruption (Not tainted): size-2048 start=ffff8802e1b534f8, len=2048
            fat-intel-2: Slab corruption (Not tainted): size-2048 start=ffff8802e1d776f8, len=2048
            fat-intel-2: Slab corruption (Not tainted): size-2048 start=ffff8802e13ca4c8, len=2048
             sanity test_103: @@@@@@ FAIL: slab corruption detected 
            

             keith Keith Mannthey (Inactive) added a comment -

             I have started a git bisect to narrow down the code change, but I fear the data is not reliable. I am not sure what has happened on my local VMs (I shuffled some VMs around yesterday), but I am no longer able to reproduce the core issue. I am running Lustre 2.3.50 (from master) with kernel-2.6.32-279.5.2 and not triggering the issue. I am moving back to kernel-2.6.32-279.1.1 (confirmed failed with Yu's test run) to see if the issue reappears.

             I will update when I know more.

             adilger Andreas Dilger added a comment -

             If there are no obvious sources of this corruption, it probably makes sense to submit this test patch as several separate changes, each based on one of the recent 2.2.* tags, to see if we can isolate when this corruption started. After that, it is hopefully possible to do a (manual?) git bisect to find which patch is the culprit, or at least narrow down the range of patches that need to be examined manually. It is also important to check in each of the failure cases which node type the corruption is seen on (MDS, OSS, client), since that will also reduce the number of changes which might have introduced the problem.

            It would make sense to include a check for the LU-1844 list_add/list_del corruption messages as well, since I suspect that is also a sign of random memory corruption.
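
             For reference, the LU-1844 messages come from CONFIG_DEBUG_LIST, which verifies neighbour linkage on every list insert, so a list_head sitting in scribbled-on memory fails the check. A paraphrased, simplified sketch of that check from lib/list_debug.c:

             /* Paraphrased from lib/list_debug.c (CONFIG_DEBUG_LIST); simplified. */
             void __list_add(struct list_head *new,
                             struct list_head *prev, struct list_head *next)
             {
                     WARN(next->prev != prev,
                          "list_add corruption. next->prev should be prev (%p), "
                          "but was %p. (next=%p).\n", prev, next->prev, next);
                     WARN(prev->next != next,
                          "list_add corruption. prev->next should be next (%p), "
                          "but was %p. (prev=%p).\n", next, prev->next, prev);
                     next->prev = new;
                     new->next = next;
                     new->prev = prev;
                     prev->next = new;
             }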

            yujian Jian Yu added a comment -

            Hi Keith,

            FYI, with the build for patch set 5 of http://review.whamcloud.com/#change,3876, I reproduced the issue with PTLDEBUG=-1 manually:
            https://maloo.whamcloud.com/test_sets/59a5ca46-f832-11e1-b114-52540035b04c

            yujian Jian Yu added a comment -

            Hi Keith,

             By using the build http://build.whamcloud.com/job/lustre-reviews/8904/ in http://review.whamcloud.com/#change,3876, I can manually reproduce the slab corruption issue on the RHEL6 distro by running only sanity test 103:
            https://maloo.whamcloud.com/test_sets/2c479ade-f7d3-11e1-8b95-52540035b04c

             The autotest run for the above build skipped sanity test 103 because it's in the EXCEPT_SLOW list. I'm updating the commit message to add SLOW=yes to the test parameters.


             keith Keith Mannthey (Inactive) added a comment -

             My config test didn't make it through the build on the first pass, but Yu has a very nice patch/test here that I am watching: http://review.whamcloud.com/#change,3876

            People

              Assignee: green Oleg Drokin
              Reporter: bzzz Alex Zhuravlev
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: