[LU-1557] Hitting Kernel Bug in mballoc.c Created: 22/Jun/12  Updated: 30/Jul/13  Resolved: 30/Jul/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Joe Mervini Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Chaos 4.4-4.1 (RHEL 5.5) with Lustre 1.8.5.0.6chaos release. Hardware is Dell R710 with IBSRP connected DDN 9900 backend storage. Client traffic is both IB and 10gigE routed. The problem was also experienced on the same hardware running Chaos 4.4-2 (RHEL 5.5) and Lustre 1.8.5.0.3chaos.


Attachments: File console.aoss3     File syslog.aoss3    
Severity: 3
Rank (Obsolete): 4048

 Description   

In the past week we have encountered 3 separate OSS panics with the following message. Two of the incidents occurred with Chaos 4.4-2, Lustre 1.8.5.0.3, but we had already scheduled an update to the 4.4-4.1 release this week; when we encountered the second panic on Monday we moved that update up. But we just encountered the same panic with the newer operating environment.

We are in the process of trying to identify the codes/users that were running at the time of the panics but I wanted to get this posted immediately. Unfortunately the lustre.log file was lost because of our diskless environment. I am working to correct that so that logs are sent to an NFS mount in the future.

LDISKFS-fs error (device dm-0): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 38734, block = 126923574
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/ldiskfs/mballoc.c:3714
invalid opcode: 0000 [1] SMP
last sysfs file: /class/scsi_host/host0/local_ib_port
CPU 13
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) lustre(U) ldiskfs(U) jbd(U) lov(U) crc16(U) mdc(U) lquotr

mlx4_en(U) joydev(U) mlx4_core(U) shpchp(U) pcspkr(U) cxgb4(U) serio_raw(U) i7core_edac(U) edac_mc(U) ehci_hcd(U) 8021)
Pid: 8957, comm: ll_ost_io_175 Tainted: G 2.6.18-108chaos #1
RIP: 0010:[<ffffffff888ddf66>] [<ffffffff888ddf66>] :ldiskfs:ldiskfs_mb_release_inode_pa+0x1eb/0x218
RSP: 0018:ffff8102d2beb200 EFLAGS: 00010206
RAX: 00000000000000d6 RBX: 00000000000001d6 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8032dd80
RBP: ffff8102d2beb330 R08: ffffffff8032dd88 R09: 0000000000000020
R10: ffffffff8049981c R11: 0000000000000000 R12: ffff8101693abdc8
R13: 0000000000003fd6 R14: 0000000000003fd6 R15: 00000000000001d6
FS: 00002ae81dc466e0(0000) GS:ffff81033fdd6d40(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000046bc68 CR3: 0000000000201000 CR4: 00000000000006e0
Process ll_ost_io_175 (pid: 8957, threadinfo ffff8102d2bea000, task ffff8102d2aaa080)
Stack: 000000000000984f ffff8101a34293d0 ffff8102d2beb340 ffff8105fbdc8000
ffff810198036568 ffff81063bb91c00 0000000000003a00 ffff8101269084f8
0000000000000000 0000000000000001 07eee5c3000022fd 00000000002b00c9
Call Trace:
[<ffffffff8001a470>] __getblk+0x28/0x248
[<ffffffff888e0712>] :ldiskfs:ldiskfs_mb_discard_inode_preallocations+0x209/0x2b0
[<ffffffff889cc2af>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x38f/0x620
[<ffffffff888d9bd9>] :ldiskfs:ldiskfs_ext_walk_space+0x182/0x1fa
[<ffffffff889cbf20>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x620
[<ffffffff889c8849>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xe9/0x120
[<ffffffff80066c8c>] __mutex_unlock_slowpath+0x31/0x38
[<ffffffff889c8a7a>] :fsfilt_ldiskfs:fsfilt_ldiskfs_map_ext_inode_pages+0x1fa/0x220
[<ffffffff8002efc1>] __wake_up+0x43/0x50
[<ffffffff888a911e>] :jbd:start_this_handle+0x321/0x3c7
[<ffffffff889c8ae1>] :fsfilt_ldiskfs:fsfilt_ldiskfs_map_inode_pages+0x41/0xb0
[<ffffffff88a057fa>] :obdfilter:filter_direct_io+0x46a/0xd50
[<ffffffff887f09e0>] :lquota:filter_quota_acquire+0x0/0x120
[<ffffffff88a08552>] :obdfilter:filter_commitrw_write+0x17a2/0x2b30
[<ffffffff88a00038>] :obdfilter:filter_commitrw+0x58/0x2a0
[<ffffffff889aac3f>] :ost:ost_brw_write+0x1c2f/0x2410
[<ffffffff8872dea0>] :ptlrpc:lustre_msg_check_version_v2+0x10/0x30
[<ffffffff80093b5a>] default_wake_function+0x0/0xf
[<ffffffff889ae053>] :ost:ost_handle+0x2c33/0x5690
[<ffffffff8015f9f8>] __next_cpu+0x19/0x28
[<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50
[<ffffffff8872dce5>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
[<ffffffff8873d41e>] :ptlrpc:ptlrpc_server_handle_request+0x96e/0xdc0
[<ffffffff8873db7a>] :ptlrpc:ptlrpc_wait_event+0x30a/0x320
[<ffffffff8873eaf6>] :ptlrpc:ptlrpc_main+0xf66/0x1110
[<ffffffff8006101d>] child_rip+0xa/0x11
[<ffffffff8873db90>] :ptlrpc:ptlrpc_main+0x0/0x1110
[<ffffffff80061013>] child_rip+0x0/0x11

Code: 0f 0b 68 69 98 8e 88 c2 82 0e 48 8b 85 e8 fe ff ff f0 44 01
RIP [<ffffffff888ddf66>] :ldiskfs:ldiskfs_mb_release_inode_pa+0x1eb/0x218
RSP <ffff8102d2beb200>
LustreError: dumping log to /var/dumps/lustre-log.1340395410.9106



 Comments   
Comment by Cliff White (Inactive) [ 22/Jun/12 ]

This looks somewhat like LU-38, or https://bugzilla.lustre.org/show_bug.cgi?id=22299 - but per that bug, the fix should have landed on Chaos in the 1.8.4 timeframe. Can you verify you have the patch?

Comment by Joe Mervini [ 22/Jun/12 ]

I thought the same thing when I encountered the panic, and I believe the patch was incorporated in all versions > 1.8.4 based on Chris' comments in LU-38. It was one of the reasons for moving to the later version of the operating environment. I will check to make sure, though. I have a strong suspicion that a new user code is exposing this bug, because until last weekend we had never encountered this particular failure in nearly a year of operation with the chaos-4.4-2 and Lustre 1.8.5 environment.

The only other tidbit of information is from the syslog:

Jun 22 14:07:17 aoss5 LDISKFS-fs error (device dm-0): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 38734, block = 1269235714
Jun 22 14:07:17 aoss5 Aborting journal on device dm-0.
Jun 22 14:07:17 aoss5 pa ffff8101693abdc8: logic 14848, phys. 1279868416, len 470
Jun 22 14:07:17 aoss5 free 470, pa_free 214
Jun 22 14:07:17 aoss5 ----------- [cut here ] --------- [please bite here ] ---------

One thing that occurred several weeks ago: we had a severe power outage that pretty much took down everything, and a number of controllers needed to be replaced. However, once everything was operational again, I ran fscks on each of the OSTs without error. I am only mentioning it because it was a recent event.

Is there anything else I can be looking for?

Comment by Peter Jones [ 23/Jun/12 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Joe Mervini [ 23/Jun/12 ]

We had another server hang this morning with the same kernel bug. In this particular case the console logs produced a LOT more information and identified some other bugs. I attached the syslog info and the console log. Unfortunately it did not capture any debug info at the time of the crash - only reconnect info after the reboot.

Comment by Niu Yawei (Inactive) [ 25/Jun/12 ]

The traces shown on the console look like b=16680: an inconsistency was found between the free block count in the pa descriptor and the free blocks in the bitmap.

And b=16680 was resolved by the patch from b=22299, which removed the 'set rdonly to device on umount' behaviour. One thing that confuses me is why 'set rdonly' could lead to filesystem inconsistency at all. Theoretically the journal should be able to guarantee fs consistency no matter when we set the device read-only, so maybe the 'remove set rdonly on umount' patch just reduced the chance of hitting the problem but didn't fix the underlying defect. I suspect the defect could be lurking in jbd or mballoc (some data isn't written back in the transaction)? Unfortunately, I'm not an expert in these areas.

Alex, any comments? Could you give some guidance on what the next step should be in tracing this problem? Thanks in advance.
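
(For context: the check that fires here compares the free-block count recorded in the in-core preallocation descriptor, pa_free, with what the block bitmap read from disk actually says. Below is a minimal, simplified userspace sketch of that comparison, loosely modelled on the release of an inode preallocation in mballoc. The structure fields mirror the pa values printed in the syslog, but the function and helper names are illustrative assumptions, not the actual ldiskfs source.)

/* Simplified illustration of the consistency check behind the BUG at
 * fs/ldiskfs/mballoc.c:3714 - not the actual kernel code. */
#include <assert.h>
#include <stdio.h>

struct prealloc_space {
    unsigned long pa_lstart;  /* logical start block */
    unsigned long pa_pstart;  /* physical start block */
    unsigned int  pa_len;     /* length in blocks */
    unsigned int  pa_free;    /* blocks not yet used, per the descriptor */
};

/* When a pa is released, count the blocks still marked free in the block
 * bitmap over the pa's range and compare with pa_free.  A mismatch means
 * the bitmap loaded from disk disagrees with the in-core accounting -
 * the "free 470, pa_free 214" lines in the syslog above. */
static void release_inode_pa(const unsigned char *bitmap, /* 1 = in use */
                             unsigned long group_start,
                             const struct prealloc_space *pa)
{
    unsigned long bit = pa->pa_pstart - group_start;
    unsigned int free = 0;

    for (unsigned int i = 0; i < pa->pa_len; i++)
        if (!bitmap[bit + i])
            free++;

    if (free != pa->pa_free) {
        printf("pa %p: logic %lu, phys. %lu, len %u\n",
               (void *)pa, pa->pa_lstart, pa->pa_pstart, pa->pa_len);
        printf("free %u, pa_free %u\n", free, pa->pa_free);
        assert(free == pa->pa_free);  /* the analogue of the kernel BUG() */
    }
}

int main(void)
{
    unsigned char bitmap[16] = { 0 };  /* "disk" bitmap says all 16 blocks free */
    struct prealloc_space pa = {
        .pa_lstart = 0, .pa_pstart = 4, .pa_len = 8, .pa_free = 6,
    };                                 /* descriptor thinks 2 blocks were used */

    release_inode_pa(bitmap, 0, &pa);  /* prints the mismatch and asserts */
    return 0;
}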

Comment by Alex Zhuravlev [ 25/Jun/12 ]

Because rdonly means "all writes will be skipped": we update the on-disk bitmap (but that update never hits the disk), the corresponding bh is released, then we try to load it again and find it inconsistent with our in-core data (the pa). The journal can't help here.
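
(To make that sequence concrete, here is a toy, self-contained simulation - not Lustre code - of how a write-back that is silently dropped on an rdonly-marked device leaves the on-disk bitmap stale, so the next reload disagrees with the in-core pa_free counter. All names below are made up for illustration.)

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PA_LEN 8

static unsigned char disk_bitmap[PA_LEN];    /* 0 = free, 1 = allocated */
static unsigned char cached_bitmap[PA_LEN];  /* in-memory buffer (the bh) */
static unsigned int  pa_free = PA_LEN;       /* in-core pa descriptor */
static bool          dev_rdonly;             /* the old umount-time "set rdonly" */

static void writeback_bitmap(void)
{
    if (dev_rdonly)
        return;                              /* write silently dropped */
    memcpy(disk_bitmap, cached_bitmap, sizeof(disk_bitmap));
}

static void reload_bitmap(void)              /* bh released, bitmap re-read */
{
    memcpy(cached_bitmap, disk_bitmap, sizeof(cached_bitmap));
}

int main(void)
{
    dev_rdonly = true;                       /* device marked rdonly */

    cached_bitmap[0] = cached_bitmap[1] = 1; /* allocate 2 blocks from the pa */
    pa_free -= 2;

    writeback_bitmap();                      /* dropped: on-disk copy unchanged */
    reload_bitmap();                         /* stale bitmap comes back */

    unsigned int free = 0;
    for (int i = 0; i < PA_LEN; i++)
        if (!cached_bitmap[i])
            free++;

    printf("free %u, pa_free %u\n", free, pa_free);  /* prints "free 8, pa_free 6" */
    assert(free == pa_free);                 /* the analogue of the mballoc BUG */
    return 0;
}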

Comment by Andreas Dilger [ 25/Jun/12 ]

The first question here is whether this ldiskfs is built from the ext3 or the ext4 code. It should print this when the OST is first mounted, like "ldiskfs created from ext3-2.6-rhel5" or similar.

It looks like the ldiskfs code that you are running is missing a patch on master named "ext4-mballoc-pa_free-mismatch.patch". This can be seen because the error message printed here was changed in that patch (note square brackets around the values that are not in your output):

-       if (free != pa->pa_free) {
-               printk(KERN_CRIT "pa %p: logic %lu, phys. %lu, len %lu\n",
-                       pa, (unsigned long) pa->pa_lstart,
-                       (unsigned long) pa->pa_pstart,
-                       (unsigned long) pa->pa_len);
+
+        /* "free < pa->pa_free" means we maybe double alloc the same blocks,
+         * otherwise maybe leave some free blocks unavailable, no need to BUG.*/
+        if ((free > pa->pa_free && !pa->pa_error) || (free < pa->pa_free)) {
+                ext4_error(sb,"pa free mismatch: [pa %p] "
+                              "[phy %lu] [logic %lu] [len %u] [free %u] "
+                              "[error %u] [inode %lu] [freed %u]", pa,
+                              (unsigned long)pa->pa_pstart,
+                              (unsigned long)pa->pa_lstart,
+                              (unsigned)pa->pa_len, (unsigned)pa->pa_free,
+                              (unsigned)pa->pa_error, pa->pa_inode->i_ino,
+                              free);

The logfile shows the old format message:

Jun 23 09:12:50 aoss3 LDISKFS-fs error (device dm-3): ldiskfs_valid_block_bitmap: Invalid block bitmap - block_group = 21178, block = 693960706
Jun 23 09:12:50 aoss3 Aborting journal on device dm-3.
Jun 23 09:12:50 aoss3 pa ffff81034e51b4d8: logic 2562, phys. 694501376, len 510
Jun 23 09:12:50 aoss3 free 510, pa_free 254

However, it isn't clear whether the fix from that patch was included in your version of ldiskfs or not.

Comment by Joe Mervini [ 25/Jun/12 ]

Andreas - Our OSTs were built from ext3:

ldiskfs created from ext3-2.6-rhel5
Lustre: Lustre Client File System; http://www.lustre.org/
kjournald starting. Commit interval 5 seconds
LDISKFS FS on dm-4, internal journal

We have verified that patches from bugs 22299 and 29345 have been applied. The "ext4-mballoc-pa_free-mismatch.patch" has not.

Over the weekend we had 9 separate occurrences of hitting the bug. During that time the file system was topping 90% capacity on a petabyte file system, and I was running as many as 50 du processes at any given time to see who the big users were. (We did not initialize quotas early on, and to do so now would probably require a pretty significant downtime for running quotacheck.) Although we hit this bug earlier in the week when du was not being run, the number of times we hit it this weekend makes du suspect.

So a couple of questions:

1. The assumption here is that mballoc should only be called when files are created/expanded. Should du have any interaction with mballoc?
2. There is a theory here that, under heavy load and with the file system at > 90% capacity, finding available blocks could be an issue. Where is the warning track with regard to reaching the usable limits of the file system?

Comment by Alex Zhuravlev [ 26/Jun/12 ]

du scans the filesystem and allocates memory for new buffers, which causes dirty data to get flushed and hence allocations on the OST.

Comment by Joe Mervini [ 26/Jun/12 ]

Please excuse my ignorance, but when you say du scans the file system, do you mean that it scans only the area of interest, or does it look at the entire OST and report only on the area of interest? Also, with dirty data getting flushed, are you implying that the dirty data is associated with the file being queried, or are all the buffers just being flushed outside the normal scheduling?

In any event, although I was pretty much abusing the file system with the number of simultaneous du operations, on file systems like ours with on the order of 1000 users the possibility of a large number of du or ls -l operations occurring at the same time is not that remote, and in fact the potential for an individual user to check his disk usage during the course of a run is quite high.

Comment by Alex Zhuravlev [ 26/Jun/12 ]

No, du just puts pressure on memory (in terms of new buffers/pages), so the kernel flushes dirty data onto the OST(s). This is a regular process and can actually happen without du.

Comment by Joe Mervini [ 26/Jun/12 ]

(Hit the wrong key...)
To follow on, this means that this bug could potentially be hit very often, and the fact that we hadn't encountered it until 2 weekends ago leads me to believe that something else is involved that, coupled with processes like du, is exposing it. As I mentioned earlier, we were concerned that we were reaching the usable limits of the file system, and when I looked at the individual OSTs, fragmentation was around 10%. Could that be a contributor?

Comment by Alex Zhuravlev [ 26/Jun/12 ]

Well, not that often, given that it happens only once rdonly is set (at umount). In general it's sensitive to workload and to fragmentation as well - the higher the fragmentation, the more bitmaps there are to scan and the more likely bitmaps are to be reloaded from disk. And in the end, the assertion happens specifically at bitmap loading.

Comment by Joe Mervini [ 26/Jun/12 ]

In our case the file system (or OST) is not being unmounted or going read-only. I'm sure you're aware of that, but I just wanted to reiterate it. Also, since we formatted this file system with ext3, we apparently wouldn't have a fix for the pa_free mismatch.

Comment by Alex Zhuravlev [ 26/Jun/12 ]

Sorry then, I was under a wrong assumption... I need to think a bit.

Comment by Niu Yawei (Inactive) [ 27/Jun/12 ]

Alex, is it possible this is the same race in ext4_ext_walk_space() (LU-508)?

Comment by Alex Zhuravlev [ 27/Jun/12 ]

No, I don't think so - the allocation itself is serialized in the callback, though I'll check that anyway.

Joe, would it be correct to say the trouble began after that power outage, and that before it there were no occurrences?

Comment by Andreas Dilger [ 27/Jun/12 ]

Joe, the reason I ask about ext3 vs. ext4 for your ldiskfs is that ext3 is the version of mballoc that CFS patched and maintained for a long time, but the mballoc code was merged into RHEL5/6 only for ext4, and we are only using ext4 going forward with 1.8.7+. Do you have any plans to upgrade the servers to Lustre 1.8.8 (with ext4) in the near future? That would help determine whether this bug is already fixed, since there have definitely been several fixes in this area.

I don't think there is any inconsistency on disk, but I could definitely imagine that this relates to filesystem fullness. As Alex mentioned, the inconsistency is only detected when fetching the block allocation bitmaps from disk, which happens much more frequently when the filesystem is full.
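
(For reference: the "Invalid block bitmap" messages in the logs come from the sanity check that runs each time a group's block bitmap is read back from disk. The sketch below is a simplified, userspace rendering of that style of check - the group's own metadata blocks must be marked in-use in the bitmap - with field and helper names chosen for illustration rather than taken from the ldiskfs source.)

/* Simplified illustration (not the actual ldiskfs code) of the bitmap
 * validation that produces "Invalid block bitmap - block_group = N,
 * block = M": the group's block bitmap, inode bitmap and inode table
 * blocks must all be marked as in-use in the bitmap just read from
 * disk, otherwise the bitmap is rejected with the offending block. */
#include <stdio.h>

struct group_desc {
    unsigned long bg_block_bitmap;   /* block number of the block bitmap */
    unsigned long bg_inode_bitmap;   /* block number of the inode bitmap */
    unsigned long bg_inode_table;    /* first block of the inode table */
};

static int block_in_use(const unsigned char *bitmap,
                        unsigned long group_first_block, unsigned long block)
{
    return bitmap[block - group_first_block] != 0;   /* 1 = allocated */
}

static int valid_block_bitmap(const unsigned char *bitmap,
                              unsigned long group_first_block,
                              unsigned int inode_table_blocks,
                              const struct group_desc *desc,
                              unsigned int block_group)
{
    unsigned long bad = 0;

    if (!block_in_use(bitmap, group_first_block, desc->bg_block_bitmap))
        bad = desc->bg_block_bitmap;
    else if (!block_in_use(bitmap, group_first_block, desc->bg_inode_bitmap))
        bad = desc->bg_inode_bitmap;
    else
        for (unsigned int i = 0; i < inode_table_blocks; i++)
            if (!block_in_use(bitmap, group_first_block,
                              desc->bg_inode_table + i)) {
                bad = desc->bg_inode_table + i;
                break;
            }

    if (bad) {
        fprintf(stderr, "Invalid block bitmap - block_group = %u, "
                "block = %lu\n", block_group, bad);
        return 0;   /* caller treats this as a filesystem error */
    }
    return 1;
}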

Comment by Alex Zhuravlev [ 27/Jun/12 ]

Also, would you mind grabbing the output of debugfs -R stats from the dm-3 device (on aoss3) and attaching it here? Thanks.

Comment by Joe Mervini [ 28/Jun/12 ]

To answer the questions in succession,

Alex - Yes, the trouble began after the power hit. In general the file system has been extremely stable.

Andreas - We do not plan to update 1.8, opting instead to move to 2.X at the beginning of our fiscal year (October time frame). We just had a major push with the users to reduce their usage and have brought the capacity down to ~75%. Since then we have not had the panic condition - with the caveat that a lot of the jobs that were running at the times of failure are not running now. We are still in the process of correlating that data, hoping that we can zero in on a code that's exposing the failure condition. One other bit of detail: when we looked at the targets that experienced the fault, their capacity was higher than that of the other OSTs - ~91% - which kind of supports the fullness theory. That brings me back to the question I posted earlier: at what percentage of total capacity should we consider a Lustre file system "full"?

Alex - Unfortunately, because we already took the file system away from the users for a couple of days (from a running-jobs standpoint), I really can't take it down again to get the debugfs output on dm-3. If we do have a system panic again, I'll be sure to get that as well as the output for the current failure.

Comment by Andreas Dilger [ 30/Jul/13 ]

I'm closing this old bug, since it has not been reported again since last year.
