[LU-1557] Hitting Kernel Bug in mballoc.c Created: 22/Jun/12 Updated: 30/Jul/13 Resolved: 30/Jul/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Joe Mervini | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Chaos 4.4-4.1 (RHEL 5.5) with Lustre 1.8.5.0.6chaos release. Hardware is Dell R710 with IB SRP-connected DDN 9900 backend storage. Client traffic is both IB and 10gigE routed. The problem was also experienced on the same hardware running Chaos 4.4-2 (RHEL 5.5) and Lustre 1.8.5.0.3chaos. |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4048 |
| Description |
|
In the past week we have encountered 3 separate OSS panics with the following message. Two of the incidents occurred with Chaos 4.4-2 and Lustre 1.8.5.0.3; we had already scheduled the update to the 4.4-4.1 release for this week, and when we hit the second panic on Monday we moved that update up. But we have just encountered the same panic with the newer operating environment. We are in the process of trying to identify the codes/users that were running at the time of the panics, but I wanted to get this posted immediately. Unfortunately the lustre.log file was lost because of our diskless environment; I am working to correct that so that logs are sent to an NFS mount in the future.

LDISKFS-fs error (device dm-0): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 38734, block = 126923574

Partial console capture (module list and code bytes; truncated):
mlx4_en(U) joydev(U) mlx4_core(U) shpchp(U) pcspkr(U) cxgb4(U) serio_raw(U) i7core_edac(U) edac_mc(U) ehci_hcd(U) 8021)
Code: 0f 0b 68 69 98 8e 88 c2 82 0e 48 8b 85 e8 fe ff ff f0 44 01 |
| Comments |
| Comment by Cliff White (Inactive) [ 22/Jun/12 ] |
|
This looks somewhat like |
| Comment by Joe Mervini [ 22/Jun/12 ] |
|
I thought the same thing when I encountered the panic, and I believe that the patch was incorporated in all versions > 1.8.4 based on Chris' comments in

The only other tidbit of information is from the syslog:

Jun 22 14:07:17 aoss5 LDISKFS-fs error (device dm-0): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 38734

One other thing: several weeks ago we had a severe power outage that pretty much took down everything, and a number of controllers needed to be replaced. However, once everything was operational again, I ran fscks on each of the OSTs without error. I am only mentioning it because it was a recent event. Is there anything else I can be looking for? |
| Comment by Peter Jones [ 23/Jun/12 ] |
|
Niu, could you please look into this one? Thanks. Peter |
| Comment by Joe Mervini [ 23/Jun/12 ] |
|
We had another server hang this morning with the same kernel bug. In this particular case the console logs produced a LOT more information and identified some other bugs. I attached the syslog info and the console log. Unfortunately it did not capture any debug info at the time of the crash - only reconnect info after the reboot. |
| Comment by Niu Yawei (Inactive) [ 25/Jun/12 ] |
|
The traces shown on the console look like b=16680: an inconsistency was found between the free blocks recorded in the pa (preallocation) descriptor and the free blocks in the bitmap. b=16680 was resolved by the patch from b=22299, which removed the 'set rdonly to device on umount' behavior. One thing that confuses me is why 'set rdonly' could lead to filesystem inconsistency. Theoretically, I think the journal should be able to guarantee fs consistency no matter when we 'set rdonly' on the device, so maybe the 'remove set rdonly on umount' patch just reduced the chance of hitting the problem but didn't fix the defect itself. I suspect the defect could be lurking in jbd or mballoc (some data isn't written back in a transaction)? Unfortunately, I'm not an expert in these areas. Alex, any comments? Could you give some guidance on what the next step should be in tracing this problem? Thanks in advance. |
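To make the pa-versus-bitmap inconsistency concrete, here is a minimal userspace-style sketch of the kind of check being discussed, loosely modeled on the release path in ext4's mballoc (ext4_mb_release_inode_pa()). The structure layout, helper names, and error handling are simplified for illustration and are not the actual ldiskfs code:

#include <stdio.h>

/* Simplified stand-in for ext4's inode preallocation descriptor. */
struct pa_desc {
	unsigned long pa_pstart;  /* physical start block of the preallocation */
	unsigned long pa_lstart;  /* logical start block */
	unsigned int  pa_len;     /* length of the preallocation, in blocks */
	unsigned int  pa_free;    /* blocks the pa believes are still free */
};

/* In ext4/ldiskfs block bitmaps a set bit means "block in use". */
static int block_in_use(const unsigned char *bitmap, unsigned long bit)
{
	return bitmap[bit >> 3] & (1 << (bit & 7));
}

/*
 * Count the blocks that are actually free (bit clear) in the bitmap just
 * read from disk for the pa's range and compare with pa->pa_free.  When
 * the two disagree we get the "free X, pa_free Y" mismatch seen in the
 * console logs on this ticket; the unpatched code BUGs at that point.
 */
static int check_pa_against_bitmap(const unsigned char *bitmap,
				   unsigned long group_start,
				   const struct pa_desc *pa)
{
	unsigned long bit = pa->pa_pstart - group_start;
	unsigned long end = bit + pa->pa_len;
	unsigned int free = 0;

	for (; bit < end; bit++)
		if (!block_in_use(bitmap, bit))
			free++;

	if (free != pa->pa_free) {
		fprintf(stderr, "pa free mismatch: bitmap says %u free, pa_free is %u\n",
			free, pa->pa_free);
		return -1;
	}
	return 0;
}

int main(void)
{
	/* Toy 16-block group: blocks 0-7 in use, blocks 8-15 free. */
	unsigned char bitmap[2] = { 0xff, 0x00 };
	struct pa_desc pa = { .pa_pstart = 8, .pa_lstart = 0,
			      .pa_len = 8, .pa_free = 8 };

	return check_pa_against_bitmap(bitmap, 0, &pa) ? 1 : 0;
}

If the journal and write-back behaved as Niu expects, the count derived from the on-disk bitmap and pa->pa_free should always agree; the b=16680 assertion fires exactly when they do not.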
| Comment by Alex Zhuravlev [ 25/Jun/12 ] |
|
Because rdonly means "all writes will be skipped". So we update the on-disk bitmap (but that doesn't hit the disk), |
| Comment by Andreas Dilger [ 25/Jun/12 ] |
|
The first question here is whether this ldiskfs is built from ext3 or ext4 code. It should print this when the OST is first mounted, like "ldiskfs created from ext3-2.6-rhel5" or similar.

It looks like the ldiskfs code that you are running is missing a patch on master named "ext4-mballoc-pa_free-mismatch.patch". This can be seen because the error message printed here was changed in that patch (note the square brackets around the values, which are not in your output):

-	if (free != pa->pa_free) {
-		printk(KERN_CRIT "pa %p: logic %lu, phys. %lu, len %lu\n",
-		       pa, (unsigned long) pa->pa_lstart,
-		       (unsigned long) pa->pa_pstart,
-		       (unsigned long) pa->pa_len);
+
+	/* "free < pa->pa_free" means we maybe double alloc the same blocks,
+	 * otherwise maybe leave some free blocks unavailable, no need to BUG.*/
+	if ((free > pa->pa_free && !pa->pa_error) || (free < pa->pa_free)) {
+		ext4_error(sb, "pa free mismatch: [pa %p] "
+			   "[phy %lu] [logic %lu] [len %u] [free %u] "
+			   "[error %u] [inode %lu] [freed %u]", pa,
+			   (unsigned long)pa->pa_pstart,
+			   (unsigned long)pa->pa_lstart,
+			   (unsigned)pa->pa_len, (unsigned)pa->pa_free,
+			   (unsigned)pa->pa_error, pa->pa_inode->i_ino,
+			   free);

The logfile shows the old format message:

Jun 23 09:12:50 aoss3 LDISKFS-fs error (device dm-3): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 21178, block = 693960706
Jun 23 09:12:50 aoss3 Aborting journal on device dm-3.
Jun 23 09:12:50 aoss3 pa ffff81034e51b4d8: logic 2562, phys. 694501376, len 510
Jun 23 09:12:50 aoss3 free 510, pa_free 254

However, it isn't clear whether the fix in that patch was included in your version of ldiskfs or not. |
| Comment by Joe Mervini [ 25/Jun/12 ] |
|
Andreas - Our OSTs were built from ext3:

ldiskfs created from ext3-2.6-rhel5
Lustre: Lustre Client File System; http://www.lustre.org/
kjournald starting. Commit interval 5 seconds

We have verified that the patches from bugs 22299 and 29345 have been applied. The "ext4-mballoc-pa_free-mismatch.patch" has not.

Over the weekend we had 9 separate occurrences of hitting the bug. During that time the file system was topping 90% capacity on a petabyte file system, and I was running as many as 50 du processes at any given time to see who the big users were. (We did not initialize quotas early on, and to do so now would probably require a pretty significant downtime for running quotacheck.) Although we hit this bug earlier in the week when du was not being run, the number of times we hit it this weekend makes du suspect. So a couple of questions: 1. The assumption here is that mballoc should only be called when files are created/expanded. Should du have any interaction with mballoc? |
| Comment by Alex Zhuravlev [ 26/Jun/12 ] |
|
du scans the filesystem and allocates memory for new buffers, causing dirty data to get flushed, and hence allocations on the OST. |
| Comment by Joe Mervini [ 26/Jun/12 ] |
|
Please excuse my ignorance, but when you say du scans the file system, do you mean that it is scanning the area of interest, or does it look at the entire OST and only report on the area of interest? Also, with dirty data getting flushed, are you implying that the dirty data is associated with the file being queried, or are all the buffers just being flushed as an interrupt operation outside the normal scheduling?

In any event, although I was pretty much abusing the file system with the number of simultaneous du operations, on our file systems, where we have on the order of 1000 users, the possibility that a large number of du or ls -l operations occur at the same time is not that remote, and in fact the potential for an individual user to check his disk usage during the course of a run is quite high. |
| Comment by Alex Zhuravlev [ 26/Jun/12 ] |
|
No, du just puts a certain amount of pressure on memory (in terms of new buffers/pages), so the kernel flushes dirty data onto the OST(s). This is a regular process and can actually happen w/o du. |
| Comment by Joe Mervini [ 26/Jun/12 ] |
|
(Hit the wrong key...) |
| Comment by Alex Zhuravlev [ 26/Jun/12 ] |
|
Well, not that often, given that it happens only once rdonly is set (at umount). In general, it's sensitive to workload and to fragmentation as well - the higher the fragmentation, the more bitmaps there are to scan and the more likely bitmaps are to be reloaded from disk. And in the end, the assertion happens specifically at bitmap loading. |
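For context on "the assertion happens specifically at bitmap loading": the "Invalid block bitmap" messages in this ticket come from a sanity check that runs whenever a group's block bitmap is read back from disk. Below is a rough userspace-style sketch of that check, loosely modeled on ext4_valid_block_bitmap(); the descriptor fields and names are simplified and are not the actual ldiskfs code:

/* Simplified stand-in for a block group descriptor. */
struct group_desc {
	unsigned long block_bitmap;   /* block number holding the block bitmap */
	unsigned long inode_bitmap;   /* block number holding the inode bitmap */
	unsigned long inode_table;    /* first block of the inode table */
	unsigned int  itable_blocks;  /* number of inode table blocks */
};

static int in_use(const unsigned char *bm, unsigned long bit)
{
	return bm[bit >> 3] & (1 << (bit & 7));
}

/*
 * When a bitmap is (re)loaded from disk, the group's own metadata blocks
 * must be marked in use in it.  If one is not, the real code reports
 * "Invalid block bitmap - block_group = N, block = M" and aborts the
 * journal, as seen in the logs above.  Returns 0 if the bitmap looks
 * sane, otherwise the offending block number.
 */
static unsigned long valid_block_bitmap(const unsigned char *bm,
					unsigned long group_start,
					const struct group_desc *gd)
{
	unsigned long blk;

	if (!in_use(bm, gd->block_bitmap - group_start))
		return gd->block_bitmap;
	if (!in_use(bm, gd->inode_bitmap - group_start))
		return gd->inode_bitmap;
	for (blk = gd->inode_table;
	     blk < gd->inode_table + gd->itable_blocks; blk++)
		if (!in_use(bm, blk - group_start))
			return blk;
	return 0;
}

int main(void)
{
	/* Toy 32-block group: blocks 0-4 (bitmaps + 3 inode table blocks) in use. */
	unsigned char bm[4] = { 0x1f, 0x00, 0x00, 0x00 };
	struct group_desc gd = { .block_bitmap = 0, .inode_bitmap = 1,
				 .inode_table = 2, .itable_blocks = 3 };

	return valid_block_bitmap(bm, 0, &gd) ? 1 : 0;
}

On a fragmented, nearly full filesystem more groups are scanned and bitmaps are evicted and reloaded more often, so this check gets many more opportunities to catch a stale or inconsistent bitmap, which matches the pattern reported here.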
| Comment by Joe Mervini [ 26/Jun/12 ] |
|
In our case the file system (or OST) is not being umounted or going read-only. I'm sure you're aware of that but just wanted to reiterate. Also, since we formatted this file system with ext3, apparently we wouldn't have a fix for any pa_free mismatch. |
| Comment by Alex Zhuravlev [ 26/Jun/12 ] |
|
Sorry then, I was under a wrong assumption... need to think a bit. |
| Comment by Niu Yawei (Inactive) [ 27/Jun/12 ] |
|
Alex, is it possible this is the same problem as the race in ext4_ext_walk_space()? ( |
| Comment by Alex Zhuravlev [ 27/Jun/12 ] |
|
No, I don't think so - the allocation itself is serialized in the callback, though I'll check that anyway. Joe, would it be correct to say the trouble began to happen after that power outage? |
| Comment by Andreas Dilger [ 27/Jun/12 ] |
|
Joe, the reason that I ask about ext3 vs. ext4 for your ldiskfs is that ext3 is the version of mballoc that CFS patched and maintained for a long time, but the mballoc code was merged into RHEL5/6 only for ext4, and we're only using ext4 going forward with 1.8.7+. Do you have any plans to upgrade the servers to Lustre 1.8.8 (with ext4) in the near future? That would definitely determine if this bug is fixed already, since there were definitely several fixes in this area. I don't think that this has any inconsistency on disk, but I could definitely imagine that it relates to filesystem fullness. As Alex mentioned, this inconsistency is only detected when fetching the block allocation bitmaps from disk, which happens much more frequently when the filesystem is full. |
| Comment by Alex Zhuravlev [ 27/Jun/12 ] |
|
Also, would you mind grabbing the output of debugfs -R stats from the dm-3 device (on aoss3) and attaching it here? Thanks. |
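For reference, a typical way to capture that output would be something like the following; the device path /dev/dm-3 and the output file name here are assumptions based on the comments above, not exact instructions from this ticket:

debugfs -R stats /dev/dm-3 > aoss3_dm3_stats.txt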
| Comment by Joe Mervini [ 28/Jun/12 ] |
|
To answer the questions in succession:

Alex - Yes, the trouble began after the power hit. In general the file system has been extremely stable.

Andreas - We do not plan to update 1.8, opting instead to move to 2.X at the beginning of our fiscal year (October time frame). We just had a major push with the users to reduce their usage and have brought the capacity down to ~75%. Since then we have not had the panic condition, with the caveat that a lot of the jobs that were running during the times of failure are not running now. We are still in the process of correlating that data, hoping that we can zero in on a code that's exposing the failure condition. One other bit of detail is that when we looked at the targets that experienced the fault, their capacity was higher than the other OSTs' (~91%), which kind of supports the fullness theory. That brings me back to the question I posted earlier about at what percentage of total capacity we should consider a Lustre file system "full".

Alex - Unfortunately, because we already took the file system away from the users for a couple of days (from a running-jobs standpoint), I really can't take it down again to get the debugfs output on dm-3. If we do have a system panic again I'll be sure to get that, as well as the output for the current failure. |
| Comment by Andreas Dilger [ 30/Jul/13 ] |
|
I'm closing this old bug, since it has not been reported again since last year. |