[LU-534] (mds_open.c:1323:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: -> LBUG Created: 26/Jul/11 Updated: 09/May/12 Resolved: 26/Jan/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | Lustre 1.8.8 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Frederik Ferner (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL5 on all affected machines, Lustre exported via NFS |
||
| Attachments: |
|
| Severity: | 3 |
| Bugzilla ID: | 17764 |
| Rank (Obsolete): | 6577 |
| Description |
|
We hit this LBUG frequently on one of our production file systems and have now managed to reproduce it reliably on our test file system: we export the Lustre file system via NFS from one Lustre client and run a version of racer on an NFS client inside the exported file system. After a few minutes the LBUG happens on the MDS. We initially saw this on Lustre 1.6.7.2, then on 1.8.3-ddn3.3, and can now reproduce it on the test file system after upgrading the MDS to 1.8.6-wc1, leaving the OSSes and clients at 1.8.3-ddn3.3 for now.
I'll attach the racer scripts and the lustre-log. I'm not sure, but at least the earlier traces looked like they might be this bug, which I'm reporting here because I can still reproduce it with 1.8.6-wc1: https://bugzilla.lustre.org/show_bug.cgi?id=17764
[MDS:]cat /proc/fs/lustre/version |
| Comments |
| Comment by Zhenyu Xu [ 26/Jul/11 ] |
|
patch tracking at http://review.whamcloud.com/1141 |
| Comment by Frederik Ferner (Inactive) [ 01/Aug/11 ] |
|
I've upgraded the MDS to the kernel/lustre version with the patch and can still reproduce the problem. The call trace looks slightly different this time; I'm not sure whether that is relevant:
Jul 29 14:51:28 cs04r-sc-mds02-03 kernel: LustreError: 7210:0:(mds_open.c:1323:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: dchild 1d2773:5b45ac47 (ffff81023a3a8078) inode ffff81023a8f9c30/1910643/1531292743
I have the lustre-log file available and can upload it if required. |
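For anyone comparing several of these assertion messages, the line above can be picked apart mechanically. The following is only a minimal sketch: the helper name `parse_orphan_assert` and the regular expression are hypothetical, and reading the fields as "inode number" and "generation" is an assumption inferred from the numbers themselves (the hex pair after `dchild` equals the two decimal values after the inode pointer), not something confirmed in this ticket.

```python
import re

# Hypothetical pattern for the tail of the LustreError line quoted above.
# Field meanings (inode number / generation) are an assumption.
ASSERT_RE = re.compile(
    r"dchild (?P<ino_hex>[0-9a-f]+):(?P<gen_hex>[0-9a-f]+) "
    r"\((?P<dentry_ptr>[0-9a-f]+)\) "
    r"inode (?P<inode_ptr>[0-9a-f]+)/(?P<ino>\d+)/(?P<gen>\d+)"
)


def parse_orphan_assert(line):
    """Return the numeric fields of an orphan-assertion line, or None."""
    m = ASSERT_RE.search(line)
    if m is None:
        return None
    ino = int(m.group("ino"))
    gen = int(m.group("gen"))
    return {
        "ino": ino,
        "generation": gen,
        # True when the hex pair after "dchild" matches the decimal pair.
        "consistent": int(m.group("ino_hex"), 16) == ino
        and int(m.group("gen_hex"), 16) == gen,
    }


# The message from the 29 July trace in this comment.
line = (
    "LustreError: 7210:0:(mds_open.c:1323:mds_open()) "
    "ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: "
    "dchild 1d2773:5b45ac47 (ffff81023a3a8078) "
    "inode ffff81023a8f9c30/1910643/1531292743"
)
info = parse_orphan_assert(line)
print(info)
```

Run against this trace, the hex and decimal values agree, which suggests both halves of the message describe the same inode.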
| Comment by Cory Spitz [ 04/Aug/11 ] |
|
When we've seen this bug at Cray it has always been related to re-exporting over NFS as was described here. |
| Comment by Frederik Ferner (Inactive) [ 11/Aug/11 ] |
|
I've now reproduced it (using the patched version, build: jenkins-g1b1f5ae-PRISTINE-2.6.18-238.12.1.el5_lustre.gd70e443) after enabling full debugging on the MDS:
lnet.debug = trace inode super ext2 malloc cache info ioctl neterror net warning buffs other dentry nettrace page dlmtrace error emerg ha rpctrace vfstrace reada mmap config console quota sec
The call trace looks slightly different again:
Aug 11 17:17:47 cs04r-sc-mds02-03 kernel: LustreError: 7339:0:(mds_open.c:1323:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: dchild 1d276d:bf466b4d (ffff810224c7f9c0) inode ffff810425970a70/1910637/3209063245
I'll attach the decoded lustre-log in the hope that it might be useful. |
| Comment by Frederik Ferner (Inactive) [ 11/Aug/11 ] |
|
Lustre log taken after the LBUG, with full debugging enabled. |
| Comment by Cory Spitz [ 10/Oct/11 ] |
|
FYI, Vladimir S. has reviewed the logs from Frederik and posted an update in bz 17764. |
| Comment by Vladimir V. Saveliev [ 13/Oct/11 ] |
|
I can reproduce the problem with a set of open()s, read()s, unlink() and close(). Details are in https://bugzilla.lustre.org/show_bug.cgi?id=17764#c109 |
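As background for the assertion name, the sketch below shows how an open-but-unlinked file (an "orphan" in the sense the assertion checks for) arises from the same four calls. It is not Vladimir's reproducer, whose exact sequence is only given in the Bugzilla comment linked above; it runs against a local temporary file on a POSIX system, and the function name `open_unlinked_roundtrip` is hypothetical.

```python
import os
import tempfile


def open_unlinked_roundtrip(payload=b"racer"):
    """Open a file, unlink it while it is still open, then read and close it.

    Illustration only: this uses a local temp file, not a Lustre or NFS mount.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                   # name is gone, inode lives on
        nlink = os.fstat(fd).st_nlink     # link count as seen via the open fd
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))  # data is still readable
    finally:
        os.close(fd)                      # last close releases the inode
    return nlink, data, os.path.exists(path)


nlink, data, still_exists = open_unlinked_roundtrip()
print(nlink, data, still_exists)
```

On a local POSIX file system the link count is expected to drop to zero while the data stays readable until the final close. Per the reports in this ticket, the MDS assertion was only hit when such operations raced through an NFS re-export.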
| Comment by Cory Spitz [ 21/Nov/11 ] |
|
Vladimir has a patch available for inspection at https://bugzilla.lustre.org/attachment.cgi?id=33079 that simply removes what he feels is a bogus assert. |
| Comment by Frederik Ferner (Inactive) [ 28/Nov/11 ] |
|
Vladimir has an updated patch available for inspection at https://bugzilla.lustre.org/attachment.cgi?id=33110. |
| Comment by Cory Spitz [ 05/Dec/11 ] |
|
bz 17764 is marked RESOLVED-FIXED with https://bugzilla.lustre.org/attachment.cgi?id=33110 and https://bugzilla.lustre.org/attachment.cgi?id=33121. |
| Comment by Zhenyu Xu [ 19/Dec/11 ] |
|
Included the bz 17764 patches at http://review.whamcloud.com/1894 (fix patch) and http://review.whamcloud.com/1895 (test patch). |
| Comment by Frederik Ferner (Inactive) [ 09/Jan/12 ] |
|
In view of Johann's comment on the patch, is it worth testing the patch on our side? If so, is there an RPM with the patch available for RHEL5 anywhere? The link to the autobuilt RPM on the review page returns a 404 for me. |
| Comment by Zhenyu Xu [ 09/Jan/12 ] |
|
I've pushed it for another build, and I will also try to reproduce it. |
| Comment by Build Master (Inactive) [ 16/Jan/12 ] |
|
Integrated in Result = SUCCESS
Johann Lombardi : 66cd9a73abc2f075abf7ce78215a1d0cb5038a62
|
| Comment by Peter Jones [ 16/Jan/12 ] |
|
Frederik, it looks like you can now go ahead and test the fix. The RPMs can be obtained at http://build.whamcloud.com/job/lustre-reviews/4163/ Regards, Peter |
| Comment by Peter Jones [ 26/Jan/12 ] |
|
Frederik, have you had a chance to test this fix yet? If not, when do you expect to have an opportunity to do so? Please advise. Peter |
| Comment by Frederik Ferner (Inactive) [ 26/Jan/12 ] |
|
Peter, apologies for my late reply. I've been trying to reproduce this bug on my test system using the unpatched version of Lustre, and it seems I have lost the ability to reproduce it. I'm not sure what has changed on our side, though. I'll keep trying, and I've downloaded the RPMs with the fix so that I'll have them available locally once I can reproduce it. Kind regards, |
| Comment by Peter Jones [ 26/Jan/12 ] |
|
OK Frederik, then let's close this ticket for now and reopen it if you find that the problem recurs and this patch does not address it. |