[LU-457] MDS crash under DDN file stress tests Created: 23/Jun/11 Updated: 13/Apr/12 Resolved: 13/Apr/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Hyperion - LLNL Running chaos-4 , 2.6.18-238.12.1.el5_lustre.g266a955 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6594 |
| Description |
|
Running the DDN file stress tests with tests spread across 60 clients, MDS panic'd, |
| Comments |
| Comment by Johann Lombardi (Inactive) [ 23/Jun/11 ] |
|
2011-06-23 14:55:52 Kernel BUG at mm/slab.c:3482
3478 void kmem_cache_free(struct kmem_cache *cachep, void *objp)
It looks like a slab corruption. Cliff, is this easy to reproduce? |
| Comment by Cliff White (Inactive) [ 23/Jun/11 ] |
|
And we have the second crash, but in a different location. |
| Comment by Cliff White (Inactive) [ 23/Jun/11 ] |
|
It seems I can make it crash again, but not the same location. |
| Comment by Peter Jones [ 23/Jun/11 ] |
|
HongChao,
Please can you look into this bug as your top priority today and make sure to provide a status update as to your findings (or just what you have looked into) before you leave for the day.
Thanks
Peter |
| Comment by Andreas Dilger [ 24/Jun/11 ] |
|
A second crash in a different part of the code makes me think that there is memory corruption happening. Do we have any ability to enable slab debug, list debug, etc. to try and isolate this problem earlier? On very tricky cases, Ricardo had also written a patch that made all allocations "use once", so that any thread that accessed old memory would Oops and dump a stack. |
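To illustrate why slab debugging helps catch this kind of corruption earlier, here is a minimal user-space sketch of poison-on-free checking. It is not the kernel's implementation and not Ricardo's patch: `toy_cache`, `toy_alloc`, `toy_free`, `OBJ_SIZE` and `NUM_OBJS` are hypothetical names chosen for illustration. The only detail taken from Linux is the poison byte 0x6b, which the kernel's slab debugging writes into freed objects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_FREE 0x6b /* byte value Linux slab debugging writes into freed objects */
#define OBJ_SIZE    128  /* hypothetical object size, for illustration only */
#define NUM_OBJS    4    /* hypothetical number of objects in the toy cache */

/* A toy fixed-size object cache (assumed names, not a kernel structure). */
struct toy_cache {
    unsigned char objs[NUM_OBJS][OBJ_SIZE];
    int in_use[NUM_OBJS];
};

/* Mark every slot free and fill it with the poison pattern. */
static void toy_cache_init(struct toy_cache *c)
{
    memset(c->objs, POISON_FREE, sizeof(c->objs));
    memset(c->in_use, 0, sizeof(c->in_use));
}

/* Return 1 if every byte of slot idx still holds the poison pattern. */
static int toy_poison_intact(const struct toy_cache *c, int idx)
{
    for (size_t i = 0; i < OBJ_SIZE; i++)
        if (c->objs[idx][i] != POISON_FREE)
            return 0;
    return 1;
}

/*
 * Hand out the first free slot. *corrupted is set to 1 if something
 * wrote to the slot while it was free (a use-after-free write).
 * Returns NULL when the cache is full.
 */
static void *toy_alloc(struct toy_cache *c, int *corrupted)
{
    *corrupted = 0;
    for (int i = 0; i < NUM_OBJS; i++) {
        if (!c->in_use[i]) {
            *corrupted = !toy_poison_intact(c, i);
            c->in_use[i] = 1;
            memset(c->objs[i], 0, OBJ_SIZE);
            return c->objs[i];
        }
    }
    return NULL;
}

/* Return a slot to the cache and re-poison it. */
static void toy_free(struct toy_cache *c, void *p)
{
    ptrdiff_t idx = ((unsigned char *)p - &c->objs[0][0]) / OBJ_SIZE;

    assert(idx >= 0 && idx < NUM_OBJS);
    c->in_use[idx] = 0;
    memset(c->objs[idx], POISON_FREE, OBJ_SIZE);
}
```

In this sketch a stray write to a freed object is reported at the next allocation of that slot, which is closer to the faulty code than a crash in an unrelated subsystem much later. A "use once" scheme goes further by never reusing freed memory, so the offending access itself faults.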
| Comment by Johann Lombardi (Inactive) [ 24/Jun/11 ] |
|
> A second crash in a different part of the code makes me think that there is memory corruption happening.
Yes, that's what I initially suspected.
> Do we have any ability to enable slab debug, list debug, etc to try and isolate this problem earlier?
I think the chaos kernel has list debug but no slab debug. |
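For readers unfamiliar with list debugging, the sketch below shows the kind of consistency check it performs on a doubly linked list before modifying it. Mainline Linux's CONFIG_DEBUG_LIST option performs checks of this kind; whether the chaos kernel's variant behaves identically is not confirmed here. `list_node`, `checked_list_add` and `checked_list_del` are hypothetical names, and returning -1 stands in for the kernel's warning or BUG.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list node (assumed name, modelled on list_head). */
struct list_node {
    struct list_node *next, *prev;
};

/* Make head an empty list that points at itself. */
static void list_init(struct list_node *head)
{
    assert(head != NULL);
    head->next = head;
    head->prev = head;
}

/*
 * Insert node right after head. Refuses (returns -1) if the neighbour's
 * back pointer does not point at head, which means the list was corrupted.
 */
static int checked_list_add(struct list_node *node, struct list_node *head)
{
    struct list_node *next = head->next;

    if (next->prev != head)
        return -1;
    node->next = next;
    node->prev = head;
    next->prev = node;
    head->next = node;
    return 0;
}

/*
 * Unlink entry. Refuses (returns -1) if either neighbour no longer
 * points back at entry, which means the list was corrupted.
 */
static int checked_list_del(struct list_node *entry)
{
    if (entry->prev->next != entry || entry->next->prev != entry)
        return -1;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;
    entry->prev = NULL;
    return 0;
}
```

Checks like these only cover list pointers. Corruption inside an object that is not on a list goes unnoticed, which is why slab debug would be the more useful option for a suspected slab corruption.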
| Comment by Cliff White (Inactive) [ 24/Jun/11 ] |
|
Reverted to 1.8.5 GA, ext4 version:
lustre-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
lustre-tests-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
kernel-2.6.18-194.17.1.el5_lustre.1.8.5
lustre-tools-llnl-1.2-6.ch4.4
lustre-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
lustre-ldiskfs-3.1.4-2.6.18_194.17.1.el5_lustre.1.8.5
Have run fsstress for > 45 minutes, no crash, few errors - RAID is still rebuilding. I am attaching the full console log from the last crash, from start of MDT to crash. |
| Comment by Cliff White (Inactive) [ 24/Jun/11 ] |
|
Spoke (literally) about two minutes too soon, MDT just panic'd - Log attached, previous mds log not attached |
| Comment by Cliff White (Inactive) [ 24/Jun/11 ] |
|
Crash running 1.8.5 GA -ext4 |
| Comment by Johann Lombardi (Inactive) [ 24/Jun/11 ] |
|
All crashes are in different areas of the code. These kinds of problems are hard to diagnose. |
| Comment by Hongchao Zhang [ 24/Jun/11 ] |
|
it's hard to diagnose the problem. in 1.8.5 GA+ext4, the NULL pointer reference in RCU callback context should be a very rare |
| Comment by Hongchao Zhang [ 24/Jun/11 ] |
|
btw, I have discussed with Yujian, and she said the fsstress was only tested against 2.X before, and the scale was about |
| Comment by Cliff White (Inactive) [ 24/Jun/11 ] |
|
This is the MDS console log from the second 1.8.6, showing everything from mount to crash |
| Comment by Cliff White (Inactive) [ 24/Jun/11 ] |
|
Second 1.8.6 crash |
| Comment by Cliff White (Inactive) [ 24/Jun/11 ] |
|
Replicate 1.8.5 GA crash w/10 clients |
| Comment by Jian Yu [ 25/Jun/11 ] |
After doing a search on Bugzilla, I found the latest run of this test on Lustre master branch was against 2.0.0 Build 41 (v1_10_0_41):
Branch: v1_10_0_41 (client and servers) - ext3
Distro/Arch: RHEL5/x86_64 (client and servers)
Cluster: Hyperion
MDS Node: hyperion-mds7
OSS Node: hyperion-ost[5,6,7]
Client Nodes: 224 (patchless)
Network Type: o2ib
Test: client mount 4 mounts but run fs_test only on 1
Result: bug 22802
It was also run on the Lustre b1_8 branch several times; the latest run was against 1.8.2 and passed, please refer to bug 21292. |
| Comment by Cliff White (Inactive) [ 25/Jun/11 ] |
|
System crashed after 2 hours, but immediately rebooted - no crash dump. Log attached |
| Comment by Cliff White (Inactive) [ 25/Jun/11 ] |
|
Friday night fsstress crash |
| Comment by Hongchao Zhang [ 27/Jun/11 ] |
|
in the corrupted slab cache, the block size is 128, and both the previous and next block of the corrupted one are used |
| Comment by Build Master (Inactive) [ 27/Jun/11 ] |
|
Integrated in Johann Lombardi : 529529a3dfbb8849f479c15f75e8f875ba5427b4
|
| Comment by Peter Jones [ 28/Jun/11 ] |
|
This is suspected to be unrelated to Lustre. The Hyperion configuration has been fixed and Cliff will re-test next week to confirm |
| Comment by Build Master (Inactive) [ 28/Jun/11 ] |
|
Integrated in Johann Lombardi : a0375b393ef34c457bec68835ed6ac2f37b5a537
|
| Comment by Peter Jones [ 13/Apr/12 ] |
|
Has not happened for a long time so let's close and reopen if it reoccurs |