[LU-716] MDT crashed at: Kernel BUG at fs/jbd2/transaction.c:982 Created: 23/Sep/11 Updated: 27/Sep/11 Resolved: 27/Sep/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 6552 |
| Description |
|
This crash was found at ORNL while they were testing IR (Imperative Recovery). Please see the Sep 23 comments at http://jira.whamcloud.com/browse/ORNL-8.
Sep 23 10:57:56 barry-mds1 kernel: ----------- [cut here ] --------- [please bite here ] --------- |
| Comments |
| Comment by Zhenyu Xu [ 23/Sep/11 ] |
|
looks similar to |
| Comment by James A Simmons [ 23/Sep/11 ] |
|
I will try that patch out to see if it stops the problems I'm seeing. If it does, perhaps we should consider it a blocker. |
| Comment by Andreas Dilger [ 23/Sep/11 ] |
|
This line appears to be J_ASSERT_JH(jh, handle->h_buffer_credits > 0), which simply means that the declared journal transaction handle does not have enough blocks reserved for the number of blocks that are being modified.

In both cases the problem was hit during replay of an open/create operation, but I don't see how that can transform into needing llog records for orphan unlink. Especially during early recovery, if llog operations were started for a number of stripes, new llog objects would be allocated and inserted into the catalog, which could consume a large number of blocks from the journal handle. It looks like this is what is happening: ->mdd_create

I'm surprised that VBR is involved during Imperative Recovery, unless some clients are being missed during recovery? That shouldn't be happening, and it implies that recovery is finishing before all the clients have a chance to participate. The other alternative is that some clients performed an RPC but did not get a reply, while later clients DID get a reply, so there is a gap in RPC transaction playback.

The fix for the transaction credit problem is straightforward: during recovery, create operations need to reserve as many journal credits as an unlink does, i.e. enough to include an llog update for every stripe, so that they can handle the case of creating and writing to stripe_count llog files. |
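To make the credit accounting above concrete, here is a minimal toy model in Python. It is only a sketch, not Lustre or jbd2 code: the class `JournalHandle`, the functions `create_credits` and `replay_create`, and the constants `CREATE_BASE_CREDITS` and `LLOG_CREDITS_PER_STRIPE` are hypothetical names with made-up values, chosen only to show why a create replayed during recovery can exhaust a handle that was sized without the per-stripe llog updates.

```python
class JournalHandle:
    """Toy stand-in for a journal transaction handle with a fixed credit budget."""

    def __init__(self, credits):
        self.buffer_credits = credits

    def dirty_block(self):
        # Same idea as J_ASSERT_JH(jh, handle->h_buffer_credits > 0):
        # modifying a block with no credits left is a fatal accounting error.
        if self.buffer_credits <= 0:
            raise RuntimeError("journal handle out of credits")
        self.buffer_credits -= 1


# Hypothetical costs in journal blocks; the real values are not in this ticket.
CREATE_BASE_CREDITS = 8
LLOG_CREDITS_PER_STRIPE = 3


def create_credits(stripe_count, in_recovery):
    """Credits to reserve for a create; recovery adds one llog update per stripe."""
    credits = CREATE_BASE_CREDITS
    if in_recovery:
        credits += stripe_count * LLOG_CREDITS_PER_STRIPE
    return credits


def replay_create(handle, stripe_count):
    """Replay a create that also ends up writing stripe_count llog files."""
    for _ in range(CREATE_BASE_CREDITS):
        handle.dirty_block()
    for _ in range(stripe_count * LLOG_CREDITS_PER_STRIPE):
        handle.dirty_block()
```

In this model, a handle sized with `create_credits(4, in_recovery=False)` runs out of credits partway through `replay_create(handle, 4)`, while one sized with `in_recovery=True` completes. That mirrors the proposed fix of reserving unlink-sized credits for creates during recovery.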
| Comment by Andreas Dilger [ 23/Sep/11 ] |
|
Assign to Bobijam, since I think this is the same as |
| Comment by Zhenyu Xu [ 27/Sep/11 ] |
|
dup of |