[LU-966] post-fsck MDS LBUG during recovery due to missing FID Created: 05/Jan/12 Updated: 19/Nov/12 Resolved: 26/Feb/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Alexandre Louvet | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 4274 | ||||||||||||||||
| Description |
|
We have hit this LBUG twice now. It always occurred after an MDS crash, a shine fsck, a shine start, and then during the client recovery window. If we assume that the missing FID in question was destroyed by the fsck run on the MDT after the MDS crash, and if we consider that no scenario other than such an "external" action can lead to this situation (my opinion, but what do you think?), could we replace this assert/LBUG with just a warning message (at least during the client-recovery phase)? |
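The proposed change can be modeled in a few lines. This is a hypothetical sketch (in Python, not the actual Lustre C code): the `attr_get` function, the `objects` dict, and the recovery flag are illustrative stand-ins for the `mdd_la_get()` path, showing the difference between the current hard LASSERT/LBUG and the suggested warn-and-return behaviour during recovery.

```python
import errno

def attr_get(objects, fid, in_recovery):
    """Illustrative stand-in for the mdd_la_get() lookup path.

    'objects' models the MDT namespace (fid -> attributes). A FID may be
    missing legitimately if fsck removed the object after an MDS crash.
    """
    if fid not in objects:
        if in_recovery:
            # Proposed behaviour: warn and let client recovery continue,
            # failing only the replay that references the destroyed object.
            print(f"warning: FID {fid} missing, possibly removed by fsck")
            return -errno.ENOENT
        # Outside recovery a missing FID still indicates a real bug,
        # so keep the hard assertion (LBUG in the real code).
        raise AssertionError(f"LBUG: FID {fid} has no backing object")
    return objects[fid]
```

With this shape, a replayed open or setxattr against a fsck-destroyed FID would fail with -ENOENT instead of panicking the whole MDS.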
| Comments |
| Comment by Peter Jones [ 05/Jan/12 ] |
|
Bobi Can you please look into this one? Thanks Peter |
| Comment by Zhenyu Xu [ 05/Jan/12 ] |
|
would you mind posting the stackframe as the LBUG is hit? |
| Comment by Alexandre Louvet [ 06/Jan/12 ] |
|
I have at least 2 code paths:

g14:2011-10-12-15:07:45
crash> bt
PID: 66939  TASK: ffff88205b2e0100  CPU: 2  COMMAND: "mdt_33"
 #0 [ffff8820567eb358] machine_kexec at ffffffff8102e77b
 #1 [ffff8820567eb3b8] crash_kexec at ffffffff810a6cd8
 #2 [ffff8820567eb488] panic at ffffffff81466b1b
 #3 [ffff8820567eb508] lbug_with_loc at ffffffffa051beeb [libcfs]
 #4 [ffff8820567eb558] mdd_la_get at ffffffffa09647e6 [mdd]
 #5 [ffff8820567eb598] mdd_iattr_get at ffffffffa0966f91 [mdd]
 #6 [ffff8820567eb5f8] mdd_attr_get_internal at ffffffffa0968883 [mdd]
 #7 [ffff8820567eb688] mdd_attr_get_internal_locked at ffffffffa0968f38 [mdd]
 #8 [ffff8820567eb6c8] mdd_attr_get at ffffffffa0968fa6 [mdd]
 #9 [ffff8820567eb728] cml_attr_get at ffffffffa0a335ef [cmm]
#10 [ffff8820567eb788] mo_attr_get at ffffffffa09d65ea [mdt]
#11 [ffff8820567eb7b8] mdt_reint_open at ffffffffa09dd40c [mdt]
#12 [ffff8820567eb8d8] mdt_reint_rec at ffffffffa09c664f [mdt]
#13 [ffff8820567eb928] mdt_reint_internal at ffffffffa09bda04 [mdt]
#14 [ffff8820567eb9b8] mdt_intent_reint at ffffffffa09be085 [mdt]
#15 [ffff8820567eba38] mdt_intent_policy at ffffffffa09b7270 [mdt]
#16 [ffff8820567ebaa8] ldlm_lock_enqueue at ffffffffa0684a9d [ptlrpc]
#17 [ffff8820567ebb48] ldlm_handle_enqueue0 at ffffffffa06ac4d1 [ptlrpc]
#18 [ffff8820567ebbe8] mdt_enqueue at ffffffffa09b6dea [mdt]
#19 [ffff8820567ebc18] mdt_handle_common at ffffffffa09b29f5 [mdt]
#20 [ffff8820567ebc98] mdt_regular_handle at ffffffffa09b3a05 [mdt]
#21 [ffff8820567ebca8] ptlrpc_server_handle_request at ffffffffa06d75f1 [ptlrpc]
#22 [ffff8820567ebde8] ptlrpc_main at ffffffffa06d8992 [ptlrpc]
#23 [ffff8820567ebf48] kernel_thread at ffffffff8100d1aa

g15:2011-12-09-11:31:12
crash> bt
PID: 11990  TASK: ffff88205aee6ee0  CPU: 13  COMMAND: "mdt_05"
 #0 [ffff88202f0cb668] machine_kexec at ffffffff8102e77b
 #1 [ffff88202f0cb6c8] crash_kexec at ffffffff810a6cd8
 #2 [ffff88202f0cb798] panic at ffffffff81466b1b
 #3 [ffff88202f0cb818] lbug_with_loc at ffffffffa0518eeb
 #4 [ffff88202f0cb868] mdd_la_get at ffffffffa09ea7e6
 #5 [ffff88202f0cb8a8] mdd_xattr_sanity_check at ffffffffa09eae5e
 #6 [ffff88202f0cb908] mdd_xattr_set at ffffffffa09ee42c
 #7 [ffff88202f0cb998] cml_xattr_set at ffffffffa0ad02cf
 #8 [ffff88202f0cba18] mdt_reint_setxattr at ffffffffa0a6110b
 #9 [ffff88202f0cbae8] mdt_reint_rec at ffffffffa0a5767f
#10 [ffff88202f0cbb38] mdt_reint_internal at ffffffffa0a4ea34
#11 [ffff88202f0cbbc8] mdt_reint at ffffffffa0a4ed9c
#12 [ffff88202f0cbc18] mdt_handle_common at ffffffffa0a439f5
#13 [ffff88202f0cbc98] mdt_regular_handle at ffffffffa0a44a05
#14 [ffff88202f0cbca8] ptlrpc_server_handle_request at ffffffffa06e1641
#15 [ffff88202f0cbde8] ptlrpc_main at ffffffffa06e29e2
#16 [ffff88202f0cbf48] kernel_thread at ffffffff8100d1aa

Note that in some cases, before the MDS restart, the workaround described in |
| Comment by Zhenyu Xu [ 09/Jan/12 ] |
|
patch tracking at http://review.whamcloud.com/1928 |
| Comment by Build Master (Inactive) [ 19/Jan/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Peter Jones [ 25/Jan/12 ] |
|
Landed for 2.2 |
| Comment by Sebastien Buisson (Inactive) [ 03/Feb/12 ] |
|
Hi, Do you think it could be possible to have a 2.0 version of this fix? TIA, |
| Comment by Peter Jones [ 03/Feb/12 ] |
|
Sebastien There definitely comes a point when it becomes unproductive and risky to continue to try and cherrypick fixes back into an older release rather than rebaselining on a newer release. I wonder if our time would be better spent looking at how to address this on a 2.1.1 baseline? Peter |
| Comment by Sebastien Buisson (Inactive) [ 03/Feb/12 ] |
|
Well, the backport to 2.1 went off smoothly. Sebastien. |
| Comment by Mikhail Pershin [ 15/Feb/12 ] |
|
This patch breaks VBR recovery; we need to think about it once more. |
| Comment by Mikhail Pershin [ 15/Feb/12 ] |
|
The problem with the landed patch is that the VBR checks are now skipped in several cases. That may produce -ENOENT during recovery instead of a version mismatch, and it can also cause wrong client evictions when VBR expects -ENOENT but the code exits early, causing failure. The initial version of the patch, where the LASSERTs in MDD are just replaced with CERROR, looks safer. |
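The ordering problem described above can be sketched as follows. This is a hypothetical model (in Python, not Lustre's C code): `replay_op`, the `objects` dict, and the `ENOENT_VERSION` sentinel are illustrative assumptions, showing how an early -ENOENT return before the VBR version comparison masks a version mismatch that should have been detected during replay.

```python
import errno

# Assumed sentinel: the "version" VBR compares against when the object
# does not exist, so that a replay expecting a real version mismatches.
ENOENT_VERSION = 0xFFFFFFFF

def replay_op(objects, fid, pre_version, vbr_first):
    """Model of one replayed request.

    objects: fid -> current on-disk version.
    pre_version: the pre-operation version recorded in the replayed request.
    vbr_first: whether the VBR check runs before the missing-object bailout.
    """
    if not vbr_first and fid not in objects:
        # Landed patch's behaviour: bail out early, VBR never runs.
        return -errno.ENOENT
    current = objects.get(fid, ENOENT_VERSION)
    if current != pre_version:
        # VBR detects the client replayed against stale state;
        # in the real code this leads to client eviction.
        return -errno.EOVERFLOW
    return 0
```

With `vbr_first=True` a replay against a missing object is reported as a version mismatch; with the early bailout the same replay silently becomes -ENOENT, which is the masking effect described in the comment.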
| Comment by Mikhail Pershin [ 15/Feb/12 ] |
|
Also, I vote for returning replay-vbr.sh to the normal test set for master patches. It started to fail right after |
| Comment by Build Master (Inactive) [ 16/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = FAILURE
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = ABORTED
|
| Comment by Christopher Morrone [ 23/Feb/12 ] |
|
I am not a fan of this commit:
I really think that should have been two commits: 1) a revert commit (and I would recommend that we use git's own revert message, perhaps with an additional explanation of why it is being reverted). The commit as it is didn't even state WHICH commit it was reverting. Could we please keep reverts as separate commits in the future? |
| Comment by Peter Jones [ 26/Feb/12 ] |
|
Landed for 2.1.1 and 2.2 Chris I do agree with your point |
| Comment by Alexey Lyashkov [ 13/Mar/12 ] |
|
The same issue may be hit without fsck: if some clients are simply absent from a recovery, Lustre hits a gap in the recovery queue. |