[LU-486] ldiskfs_valid_block_bitmap: Invalid block bitmap Created: 05/Jul/11 Updated: 17/Dec/14 Resolved: 02/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | David Vasil (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
Lustre 1.8.4ddn3.1 |
||
| Attachments: |
|
| Severity: | 2 |
| Bugzilla ID: | 23959 |
| Rank (Obsolete): | 6112 |
| Description |
|
The OSS throws an LDISKFS-fs error stating that it encountered an invalid block bitmap. This results in the OST being remounted read-only and requires a reboot of the OSS to recover. A subsequent 'e2fsck -fp <dev>' replays the journal and finds no errors on the OST. This issue has been seen sporadically during internal stress testing by Bernd and by some customers in the field. It has also been seen by other Lustre users and reported on the lustre-discuss list. There is a bugzilla ticket open, but it has not had any support activity since November 2010; I'm opening a Jira bug so this can be worked on. https://bugzilla.lustre.org/show_bug.cgi?id=23959

Logs from the start of the invalid block bitmap: |
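For illustration, the recovery path described above amounts to roughly the following once the OSS has been rebooted (a sketch only; the device and mount point names are placeholders):

    # Sketch: force a check; -p (preen) replays the journal and fixes anything
    # safe to fix automatically. In the cases above it finds no errors.
    e2fsck -fp /dev/dm-21
    # Remount the OST.
    mount -t lustre /dev/dm-21 /mnt/lustre/ost0018 |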
| Comments |
| Comment by Peter Jones [ 05/Jul/11 ] |
|
Hongchao, can you please look into this one? Thanks, Peter |
| Comment by Johann Lombardi (Inactive) [ 05/Jul/11 ] |
|
Is the problem persistent across reboot? Also, would you have an e2image of the corrupted filesystem? |
| Comment by David Vasil (Inactive) [ 05/Jul/11 ] |
|
Johann, |
| Comment by Hongchao Zhang [ 06/Jul/11 ] |
|
There are only two kinds of corruption of the block bitmap: the inode bitmap block or the block bitmap block. But in this case, more debug info is needed to make clear where the problem is, and a debug patch is underway to collect more information. |
| Comment by David Vasil (Inactive) [ 06/Jul/11 ] |
|
I am currently gathering an e2image of the LUN that hit this issue; e2fsck has not been run against it yet. I will provide the e2image when it completes. Please let me know what debug patch you would like to try, and we will work on getting it onto the system. |
| Comment by Nathan Dauchy (Inactive) [ 07/Jul/11 ] |
|
From linux/fs/ext4/balloc.c, ext4_valid_block_bitmap():

    if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
            /* with FLEX_BG, the inode/block bitmaps and itable
             * blocks may not be in the group at all
             * so the bitmap validation will be skipped for those groups
             * or it has to also read the block group where the bitmaps
             * are located to verify they are set.
             */
            return 1;
    }

So this may explain why we have hit these errors on our older file system, which I think was originally formatted with an ext3-based ldiskfs, but not on the more recent ones. How do I verify whether FLEX_BG is enabled or not? Given that the check is skipped for many file systems altogether anyway (apparently without much damage), would it make sense to just put in a short-term patch to always "return 1", rather than wait for a new check/repair feature to be added to fsck.ext3? Thanks, |
| Comment by Peter Jones [ 07/Jul/11 ] |
|
Johann, could you please comment? Thanks, Peter |
| Comment by Johann Lombardi (Inactive) [ 07/Jul/11 ] |
|
> So this may explain why we have hit these errors on our older file system, that I think was originally formatted

You can see this with dumpe2fs -h, e.g.:

> Given that the check is skipped for many file systems altogether anyway (apparently without much damage),

Well, it is only skipped for flex_bg, which can use a different layout. Unfortunately, the error message is not really helpful, and Hongchao's debug patch might help us understand what is going on.

> would it make sense to just put in a short term patch to always "return 1", rather than wait for a new check/repair feature to be added to fsck.ext3?

Well, you can disable the checks if this really hurts production, but it would be great to understand what is going on, since it might be a real corruption which could spread into the rest of the filesystem w/o the sanity check. |
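For illustration, a minimal way to check for flex_bg (the device path below is just a placeholder for the OST device):

    # Illustrative only: list the feature flags; flex_bg shows up here if it is enabled.
    dumpe2fs -h /dev/dm-21 | grep -i 'filesystem features' |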
| Comment by David Vasil (Inactive) [ 08/Jul/11 ] |
|
Johann, |
| Comment by Peter Jones [ 08/Jul/11 ] |
|
David, Johann is on vacation today, but I will email you privately about how to get the file to us. Thanks, Peter |
| Comment by Hongchao Zhang [ 12/Jul/11 ] |
|
David, there is a problem decompressing the image file: the output of e2image is a sparse file, and bzip2 can't handle it. Thanks! |
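One possible workaround (a sketch, assuming GNU tar is available; the file and device names are placeholders) is to archive the image with tar's sparse support so the holes survive the round trip, rather than compressing the image file directly:

    # Sketch only: create the metadata image and pack it preserving sparseness.
    e2image /dev/dm-21 ost0018.e2i
    tar -Scjf ost0018.e2i.tar.bz2 ost0018.e2i   # -S keeps the file sparse
    # On the receiving side, -S restores the holes on extraction:
    tar -Sxjf ost0018.e2i.tar.bz2 |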
| Comment by David Vasil (Inactive) [ 12/Jul/11 ] |
|
Hongchao, |
| Comment by David Vasil (Inactive) [ 12/Jul/11 ] |
|
Hongchao,
I'll try to do this in a two-step process, assuming I have enough disk |
| Comment by David Vasil (Inactive) [ 12/Jul/11 ] |
|
Hongchao,

lseek: Invalid argument

The resulting file was 2TB. Do you need a raw e2image image? |
| Comment by Johann Lombardi (Inactive) [ 12/Jul/11 ] |
|
Hongchao, I am running the following command: and it creates a sparse file: still decompressing ... |
| Comment by Johann Lombardi (Inactive) [ 13/Jul/11 ] |
|
ok, decompression is done. First of all, could you please confirm that dm-21 on lfs-oss-0-1 is lfs0-OST0018?

The initial error was:

Jul 2 23:23:20 lfs-oss-0-1 kernel: [4424700.521146] LDISKFS-fs error (device dm-21): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 57, block = 1867778

The state of group 57 in the image is the following:

Group 57: (Blocks 1867776-1900543) [ITABLE_ZEROED]
  Checksum 0xa677, unused inodes 8166
  Block bitmap at 1867776 (+0), Inode bitmap at 1867777 (+1)
  Inode table at 1867778-1868289 (+2)
  18915 free blocks, 8190 free inodes, 0 directories, 8166 unused inodes
  Free blocks: 1868290-1868799, 1869824-1870591, 1870848-1870936, 1870938-1871071, 1871073-1871103, 1871360-1871615, 1871766-1871871, 1875968-1876187, 1876202-1876479, 1876992-1877759, 1877803-1878346, 1878348-1880063, 1880576-1880611, 1880613, 1880615-1880727, 1880731-1881599, 1881899-1885368, 1885370-1885374, 1885376-1885439, 1885696-1887999, 1890304-1892351, 1894400-1894911, 1895424-1896191, 1896198-1896199, 1896201, 1896204-1896208, 1896211-1896212, 1896214-1896225, 1896229-1896231, 1896244-1896246, 1896248, 1896250-1896252, 1896257, 1896265, 1896267-1896270, 1896282-1896300, 1896302-1896309, 1896311-1896320, 1896322-1896323, 1896327-1896331, 1896333-1896342, 1896344-1896351, 1896353-1896383, 1896386-1896387, 1896389, 1896391-1896409, 1896415-1896426, 1896448-1897324, 1897344-1897364, 1897466-1897469, 1897472-1897629, 1897688-1897942, 1897955-1897964, 1897979-1898495, 1899246-1900543
  Free inodes: 466946-466954, 466956-475136

So block 1867778 is in the inode table (1867778-1868289). The related piece of code is the following:

    /* check whether the inode table block number is set */
    bitmap_blk = ldiskfs_inode_table(sb, desc);
    offset = bitmap_blk - group_first_block;
    next_zero_bit = ldiskfs_find_next_zero_bit(bh->b_data,
                            offset + LDISKFS_SB(sb)->s_itb_per_group,
                            offset);
    if (next_zero_bit >= offset + LDISKFS_SB(sb)->s_itb_per_group)
            /* good bitmap for inode tables */
            return 1;

So we check that the range 1867778-1868289 is marked as allocated in the block bitmap and it is ... |
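For reference, the same per-group view can be pulled straight from the device (a sketch only; the device path is a placeholder and the grep context length is arbitrary):

    # Sketch: dumpe2fs without -h prints every group descriptor;
    # grep narrows the output down to group 57.
    dumpe2fs /dev/dm-21 | grep -A 8 '^Group 57:' |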
| Comment by David Vasil (Inactive) [ 13/Jul/11 ] |
|
Johann, |
| Comment by Johann Lombardi (Inactive) [ 13/Jul/11 ] |
|
Unfortunately, I still don't know how the OST got into this situation |
| Comment by Johann Lombardi (Inactive) [ 13/Jul/11 ] |
|
David, would it be possible to remount the OST with errors=panic (it would cause the OST to call panic when the assertion is hit) and to collect a kernel crash dump? |
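A possible way to set that up (a sketch only; the device path is a placeholder, and kdump is assumed to already be configured on the OSS):

    # Sketch: make the backing ldiskfs panic on errors so kdump can capture
    # a vmcore. Ideally run with the OST unmounted.
    tune2fs -e panic /dev/dm-21
    # Confirm the new default error behaviour:
    dumpe2fs -h /dev/dm-21 | grep -i 'errors behavior' |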
| Comment by Johann Lombardi (Inactive) [ 13/Jul/11 ] |
|
Hongchao, BTW, it still makes sense to continue working on improving the debug messages printed in ldiskfs_valid_block_bitmap() when we detect a problem. What we have today is really not enough. |
| Comment by Hongchao Zhang [ 15/Jul/11 ] |
|
the debug patch is at http://review.whamcloud.com/#change,1107 |
| Comment by Peter Jones [ 29/Jul/11 ] |
|
David, has this diagnostic patch been deployed at the affected site? Peter |
| Comment by David Vasil (Inactive) [ 29/Jul/11 ] |
|
Peter, |
| Comment by Peter Jones [ 29/Jul/11 ] |
|
ok thanks for the update David! |
| Comment by Peter Jones [ 13/Oct/11 ] |
|
Has the diagnostic patch been rolled out at the customer site yet? |
| Comment by Nathan Dauchy (Inactive) [ 13/Oct/11 ] |
|
Peter, |
| Comment by Peter Jones [ 13/Oct/11 ] |
|
ok thanks Nathan! |
| Comment by David Vasil (Inactive) [ 06/Jan/12 ] |
|
Peter, Also, the vmcore that was produced by kdump is incomplete and is only 8.9GB, so I'm not sure whether that is useful. |
| Comment by Hongchao Zhang [ 12/Apr/12 ] |
|
Sorry for the delayed response! |
| Comment by Robin Humble [ 21/Jun/12 ] |
|
We hit this too. conman, fsck, and tune2fs -l output are attached. Once the journal was replayed by fsck, there was no on-disk corruption found. We've recently updated from 1.8.5-based to 1.8.7 servers. These OSTs are a few years old with no flex_bg set. Also, we turned on async journals at the same time. Does DDN run with async journals? |
| Comment by Peter Jones [ 21/Jun/12 ] |
|
Yes I believe they do. |
| Comment by Shuichi Ihara (Inactive) [ 21/Jun/12 ] |
|
Async journal was enabled by default on 1.8.4ddn3.1, but after that release we rely on the -wc default, so by default it's disabled. I'm interested in Robin's comment that once async is enabled, we can hit this. |
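For reference, one way to check which mode an OSS is actually running in (a sketch, assuming the 1.8-era obdfilter tunable name sync_journal, where 1 means synchronous commits and 0 means async journal commit):

    # Sketch only: query the current journal commit mode on the OSS.
    lctl get_param obdfilter.*.sync_journal
    # To force synchronous commits again:
    lctl set_param obdfilter.*.sync_journal=1 |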
| Comment by Jason Hill (Inactive) [ 10/Jun/13 ] |
|
ORNL just hit this today – we've hit it in the past, but since an e2fsck fixed the issue we just pushed through it. This is the second time in as many weeks that we have seen this. Yes, this is Lustre 1.8 (1.8.8); the kernel is 2.6.18_308.4.1; and I think this was an original ext3 filesystem, so the flex_bg notes from above aren't coming into play – but here's the dumpe2fs:

[root@widow-oss8b2 ~]# dumpe2fs -h /dev/dm-27

Will post the e2fsck log when it is complete. |
| Comment by Jason Hill (Inactive) [ 10/Jun/13 ] |
|
Not sure if this log of the e2fsck will help or not, but here it is. We are running the following e2fsprogs:
|
| Comment by Bruno Faccini (Inactive) [ 12/Jun/13 ] |
|
Jason, thanks for the e2fsck log, but can you also provide the console/syslog output showing the "Invalid block bitmap"? |
| Comment by Jason Hill (Inactive) [ 21/Jun/13 ] |
|
Bruno: The node was not dumped – preserving the bulk of the OSTs on the node is pretty important to us – and by the time we get to the node it has spewed a lot of messages, so we likely won't get the information you are looking for. Do you prefer to get a full system image on this? We do have a propensity to hit this issue after a filesystem downtime – we had two during the week of June 10 (one in addition to the one from my comment on 6/10), and none this week. I do have syslog and console logs that I'll work on attaching. -J |
| Comment by Jason Hill (Inactive) [ 11/Oct/13 ] |
|
Just a ping on this. We're still running 1.8.9-wc1 here at ORNL on the production system; we've hit this bug 8 times in the last 6 days. Is it more helpful to dump the node? We will likely be running this SW version until decommissioning in February/March 2014. We'd be interested in seeing this one fixed if possible. |
| Comment by James Nunez (Inactive) [ 11/Oct/13 ] |
|
Jason, a quick question: are you running with the debug patch at http://review.whamcloud.com/#/c/1107/ ? Thanks, |
| Comment by Jason Hill (Inactive) [ 11/Oct/13 ] |
|
James, We are not running with that patch. We will have a chance to reboot the entire cluster this weekend – should we download, integrate and run this in production on all 144 OSS, 4 MDS, and 1 MGS servers? – |
| Comment by James Nunez (Inactive) [ 11/Oct/13 ] |
|
That seems like a large task. Would someone on this ticket please comment on whether the proposed debug patch will give the information we need to debug this issue and, to answer Jason's question, which nodes it should be installed on? Is there something else that ORNL can provide us with to better understand this issue? Thank you |
| Comment by Jason Hill (Inactive) [ 11/Oct/13 ] |
|
I will also comment that we've seen the frequency increase as the utilization on the filesystems has gone up – we're over 90% full on the two filesystems that have hit this issue most frequently in the last week. |
| Comment by Jason Hill (Inactive) [ 11/Oct/13 ] |
|
Also – James Simmons has a pretty slick build system where integrating the patch would only take a few minutes, and the RPM build is 30 minutes. We have a scheduled power outage on Saturday, so we're going down anyway. The challenge is: when do we take it all down again to remove the debug patch? Hopefully when we get a fix for this issue and a new set of RPMs. |
| Comment by Hongchao Zhang [ 11/Oct/13 ] |
|
Hi Jason, the debug patch at http://review.whamcloud.com/#/c/1107/ has been updated; could you please apply it when you reboot your system? Thanks! |
| Comment by Bruno Faccini (Inactive) [ 11/Oct/13 ] |
|
Jason, high filesystem/OST usage is a known situation in which this problem is likely to be encountered. |
| Comment by James A Simmons [ 11/Oct/13 ] |
|
I will produce the rpms this morning Jason. |
| Comment by Jason Hill (Inactive) [ 13/Oct/13 ] |
|
These RPMs are in production as of 02:30 am on 10/13/2013, and all devices are mounted with -o errors=panic. |
| Comment by Peter Jones [ 13/Oct/13 ] |
|
Thanks for the update Jason |
| Comment by Blake Caldwell [ 20/Dec/13 ] |
|
We just caught one on this filesystem with the debug patch.

[root@widow-oss13c2 ~]# rpm -qi lustre

[4432932.805736] LDISKFS-fs error (device dm-22): ldiskfs_valid_block_bitmap: Invalid block bitmap - group_first_block = 540311552, block_bitmap = 540311552, inode_bitmap = 540311553 inode_table_bitmap = 540311554, inode_table_block_per_group = 512, block_group = 16489, block = 540311554

KERNEL: /usr/gedi/nfsroot/prod_lustre/usr/lib/debug/lib/modules/2.6.18-348.3.1.el5.widow/vmlinux

crash> bt

[root@widow-oss13c2 ~]# e2fsck -f $lun
widow3-OST0039: ***** FILE SYSTEM WAS MODIFIED ***** |
| Comment by Matt Ezell [ 20/Dec/13 ] |
|
So this tells us (as we've seen before) that part of the inode table is marked as unallocated. It's not clear how much of it is unallocated (of the 512 blocks) – whether there's a single unallocated block or whether the whole thing is zeroed. It would be nice to see the value of next_zero_bit. Ideally, this would have crashed in ldiskfs_valid_block_bitmap() instead of marking the error and then crashing in ldiskfs_journal_start_sb(). Then we could see what function called ldiskfs_read_block_bitmap(), which might be useful for tracing back the in-memory corruption. I guess for consistency's sake, you want to clean up a bit before panicking. I'm not that familiar with the 'crash' utility. This block bitmap should be in cache somewhere; how do I find it? |
| Comment by Hongchao Zhang [ 23/Dec/13 ] |
|
Is the content of the block bitmap buffer not printed as it was in the debug patch?

    printk(KERN_ERR "block bitmap of block_group %d : \n", block_group);
    for (i = 0; i < (sb->s_blocksize >> 3); i++) {
            printk(KERN_ERR "%016lx ", *(((long int *)bh->b_data) + i));
            if (i && ((i % 4) == 0))
                    printk(KERN_ERR "\n");
    }

Could you please look at the syslog file to check whether this info was contained in it or not? Thanks. |
| Comment by Matt Ezell [ 23/Dec/13 ] |
|
No, that message is not on the console or syslog. I think ldiskfs_error (ext4_error) aborted the journal, and then a different thread noticed the journal was aborted and panic()ed the node. It might be nice to refresh the patch to printk the bitmap before calling ext4_error(). Unfortunately, I don't think we will have an opportunity to reboot with an updated image on this file system. And our new stuff (Atlas) was formatted with Lustre 2.4, so it should have flex_bg. I worry that whatever is causing this might still be present in newer versions of ext4/Lustre, but flex_bg will prevent us from noticing right away.

Looking in the crash dump, I have the following process that I think hit the error:

PID: 16218 TASK: ffff8102674b1820 CPU: 0 COMMAND: "ll_ost_io_383" |
| Comment by Matt Ezell [ 23/Dec/13 ] |
|
I was able to find the buffer_head for block 540311552:

crash> struct buffer_head ffff8103a942c430 |
| Comment by Hongchao Zhang [ 26/Dec/13 ] |
|
As per the b_data, there is no zero bit in the first (512+2) bits? The patch is updated to print the content of the bitmap block from |
| Comment by Peter Jones [ 02/May/14 ] |
|
As per DDN, this ticket is no longer relevant. They will open a new ticket if this ever occurs again. |
| Comment by Gerrit Updater [ 17/Dec/14 ] |
|
Shilong Wang (wshilong@ddn.com) uploaded a new patch: http://review.whamcloud.com/13100 |