[LU-5771] OSSs crash when unmounting an OST without properly cleaning up orphan inodes Created: 20/Oct/14 Updated: 14/Jun/15 Resolved: 31/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Wang Shilong (Inactive) | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 16196 | ||||
| Description |
|
During tests, we hit an error like the following:
<2>LDISKFS-fs error (device dm-2): __ldiskfs_ext_check_block: bad header/extent in inode #659: invalid magic - magic e000, entries 456, max 0(0), depth 51424(0)
An error like this forces the filesystem read-only while there are still in-memory orphan inodes that have not been cleared, which causes the following problem at unmount:
<4>Pid: 45622, comm: umount Not tainted 2.6.32-431.17.1.el6_lustre.2.5.18.ddn2.x86_64 #1 Dell Inc. PowerEdge R620/01W23F
This may be a use-after-free: from the code, the inode memory is freed, and ext4_put_super then accesses it, which may cause the crash (I am not sure about this part of the analysis). But even if that is not the case, we can still hit:
J_ASSERT(list_empty(&sbi->s_orphan))
which will crash the kernel, so we need to fix this problem. |
| Comments |
| Comment by Wang Shilong (Inactive) [ 20/Oct/14 ] |
|
This is the patch I wrote to try to fix this problem: |
| Comment by Peter Jones [ 20/Oct/14 ] |
|
Yang Sheng, could you please advise on this issue and the proposed patch? Thanks, Peter |
| Comment by Andreas Dilger [ 21/Oct/14 ] |
|
It looks like this was fixed by an upstream kernel commit in 2.6.35:
commit 4538821993f4486c76090dfb377c60c0a0e71ba3
Author: Theodore Ts'o <tytso@mit.edu>
Date: Thu Jul 29 15:06:10 2010 -0400
ext4: drop inode from orphan list if ext4_delete_inode() fails
There were some error paths in ext4_delete_inode() which was not
dropping the inode from the orphan list. This could lead to a BUG_ON
on umount when the orphan list is discovered to be non-empty.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a52d5af..533b607 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -221,6 +221,7 @@ void ext4_delete_inode(struct inode *inode)
 				     "couldn't extend journal (err %d)", err);
 		stop_handle:
 			ext4_journal_stop(handle);
+			ext4_orphan_del(NULL, inode);
 			goto no_delete;
 		}
 	}
|
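The effect of the one-line fix can be illustrated with a small userland model. This is only a sketch under stated assumptions, not the kernel code: the `toy_*` names and the `journal_fails` / `with_fix` flags are hypothetical, and the list helpers merely imitate the kernel's `list_head`. It shows that when the delete path bails out early without unlinking the inode, the superblock's orphan list is still non-empty at unmount time, which is the condition that `J_ASSERT(list_empty(&sbi->s_orphan))` checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, imitating the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *item, struct list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

/* Unlink and re-initialise, so unlinking an unlisted item is harmless. */
static void list_del_init(struct list_head *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
    list_init(item);
}

/* Hypothetical stand-ins for the superblock info and the in-memory inode. */
struct toy_sb    { struct list_head s_orphan; };
struct toy_inode { struct list_head i_orphan; };

static void toy_sb_init(struct toy_sb *sb) { list_init(&sb->s_orphan); }
static void toy_inode_init(struct toy_inode *inode) { list_init(&inode->i_orphan); }

/* Put an inode on the orphan list (in this model, done by the caller). */
static void toy_orphan_add(struct toy_sb *sb, struct toy_inode *inode)
{
    assert(list_empty(&inode->i_orphan)); /* must not already be listed */
    list_add(&inode->i_orphan, &sb->s_orphan);
}

static void toy_orphan_del(struct toy_inode *inode)
{
    list_del_init(&inode->i_orphan);
}

/*
 * Toy model of the delete path for an inode already on the orphan list.
 * If a journal step fails, the unpatched path returns without unlinking
 * the inode. With with_fix set, the error path unlinks it first, which
 * mirrors the ext4_orphan_del() call added by the upstream commit.
 */
static void toy_delete_inode(struct toy_inode *inode,
                             bool journal_fails, bool with_fix)
{
    if (journal_fails) {
        if (with_fix)
            toy_orphan_del(inode);
        return;                 /* corresponds to "goto no_delete" */
    }
    toy_orphan_del(inode);      /* normal path: deletion completed */
}

/* The condition the unmount-time assertion checks. */
static bool toy_umount_is_safe(const struct toy_sb *sb)
{
    return list_empty(&sb->s_orphan);
}
```

In this model, adding an inode with `toy_orphan_add()` and then calling `toy_delete_inode()` with `journal_fails` set and `with_fix` clear leaves `toy_umount_is_safe()` false; with `with_fix` set it returns true, as it does on the normal path.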
| Comment by Wang Shilong (Inactive) [ 22/Oct/14 ] |
|
Hello Andreas Dilger, Thanks for confirming. I missed that commit, so I will keep the original commit message and resend my patch. Best Regards, |
| Comment by Andreas Dilger [ 25/Oct/14 ] |
|
This patch will also be needed for RHEL6.6. Wang Shilong, is it possible for you to submit a bug upstream to RH asking them to merge this patch into their RHEL6 kernel patches? Please include the reference to the upstream kernel patch. If not, Yang Sheng, can you do this? |
| Comment by Wang Shilong (Inactive) [ 25/Oct/14 ] |
|
Hello Andreas Dilger, I am glad to do this! Best Regards, |
| Comment by Peter Jones [ 27/Oct/14 ] |
|
Thanks Wang Shilong! I have passed along to Red Hat that we are also interested in seeing this fix land. |
| Comment by Wang Shilong (Inactive) [ 31/Oct/14 ] |
|
Hello, It seems this patch applies only to RHEL6.5; with previous versions there are conflicts. Best Regards, |
| Comment by Wang Shilong (Inactive) [ 31/Oct/14 ] |
|
One more question: I noticed that the latest Lustre master does not seem to apply cleanly to RHEL6.4; see the following messages:
[root@localhost linux-2.6.32-358.el6.x86_64]# quilt push -av
Applying patch patches/raid5-mmp-unplug-dev-rhel6.patch
So the latest Lustre patches did not apply cleanly for RHEL6.4, but the series I use is
So my question is: can master not guarantee that the patches apply cleanly for all RHEL6 series? |
| Comment by James A Simmons [ 31/Oct/14 ] |
|
We should see if this fix is needed for SLES11SP3. |
| Comment by Andreas Dilger [ 02/Dec/14 ] |
|
James, this shouldn't be needed for SLES11 since that is based on at least 3.0 kernels, and the bug was fixed in the upstream kernel in 2.6.35. Only the RHEL6 kernels are originally based on 2.6.32 (with a large number of other ext4 patches, but strangely not this one). |
| Comment by Gerrit Updater [ 17/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12349/ |
| Comment by Yang Sheng [ 31/Dec/14 ] |
|
Patch landed. Close this ticket. |
| Comment by Gerrit Updater [ 27/Jan/15 ] |
|
Shilong Wang (wshilong@ddn.com) uploaded a new patch: http://review.whamcloud.com/13533 |