Details
- Bug
- Resolution: Duplicate
- Critical
- None
- Lustre 2.4.3
- None
- 3
- 14196
Description
Hi,
CEA had 3 consecutive MDS server crashes caused by kernel slab corruption, and these crashes coincided with Robinhood startup.
In fact, in the Lustre Changelogs we can see a line like:
4281746500 01CREAT 17:47:33.690285184 2014.03.25 0x0 t=[0x22cb19e89:0x1f9c9:0x0] p=[0x22cb19e89:0x1f9c8:0x0] GRANDEURS_CENTREES
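For readers less familiar with the record layout, here is a minimal Python sketch that splits such a line into its fields. The field names (`recno`, `rec_type`, `target_fid`, `parent_fid`) and the helper `parse_changelog_line` are hypothetical labels chosen for illustration, and the regular expression only assumes the layout visible in the line above: record number, type, time, date, flags, `t=[FID]`, `p=[FID]`, name.

```python
import re
from typing import NamedTuple, Tuple

Fid = Tuple[int, int, int]  # (sequence, object id, version)


class ChangelogRecord(NamedTuple):
    recno: int
    rec_type: str
    time: str
    date: str
    flags: int
    target_fid: Fid
    parent_fid: Fid
    name: str


# Assumed layout, inferred only from the example line in this ticket.
_LINE = re.compile(
    r"(?P<recno>\d+) (?P<type>\S+) (?P<time>\S+) (?P<date>\S+) "
    r"(?P<flags>0x[0-9a-fA-F]+) "
    r"t=\[(?P<target>[^\]]+)\] p=\[(?P<parent>[^\]]+)\] (?P<name>.+)"
)


def parse_fid(text: str) -> Fid:
    """Turn '0x22cb19e89:0x1f9c9:0x0' into a tuple of three integers."""
    seq, oid, ver = (int(part, 16) for part in text.split(":"))
    return seq, oid, ver


def parse_changelog_line(line: str) -> ChangelogRecord:
    match = _LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized changelog line: {line!r}")
    return ChangelogRecord(
        recno=int(match["recno"]),
        rec_type=match["type"],
        time=match["time"],
        date=match["date"],
        flags=int(match["flags"], 16),
        target_fid=parse_fid(match["target"]),
        parent_fid=parse_fid(match["parent"]),
        name=match["name"],
    )


LINE = (
    "4281746500 01CREAT 17:47:33.690285184 2014.03.25 0x0 "
    "t=[0x22cb19e89:0x1f9c9:0x0] p=[0x22cb19e89:0x1f9c8:0x0] GRANDEURS_CENTREES"
)
record = parse_changelog_line(LINE)
print(record.rec_type, record.name, [hex(x) for x in record.target_fid])
```

The target FID extracted this way is the one a changelog consumer such as Robinhood would then try to resolve.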
The MDS crashed because Robinhood tried to process this Changelog entry. Indeed, looking at the MDT with debugfs, we found that this file was under the /OBJECTS directory (it used to be a regular file that was moved there by Lustre for an unknown reason), and that its link EA contained a zero FID (link = "df f1 ea 11 00 00 00 00 18 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (24)).
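To make the 24 bytes easier to read, the following Python sketch decodes them as a link EA header. The layout used here (a 32-bit magic, a 32-bit record count, a 64-bit total length, then two 32-bit padding words, all little-endian) and the magic value 0x11EAF1DF are assumptions based on my reading of Lustre's `struct link_ea_header`, not something stated in this ticket. The helper name `decode_link_ea_header` is hypothetical.

```python
import struct

# Assumed layout of the link EA header: magic, record count, total length,
# and two padding words, little-endian (4 + 4 + 8 + 4 + 4 = 24 bytes).
LINK_EA_HEADER = struct.Struct("<IIQII")
LINK_EA_MAGIC = 0x11EAF1DF  # assumed value of the link EA magic


def decode_link_ea_header(raw: bytes) -> dict:
    """Decode the fixed-size header at the start of a trusted.link value."""
    magic, reccount, length, _pad1, _pad2 = LINK_EA_HEADER.unpack_from(raw)
    return {
        "magic_ok": magic == LINK_EA_MAGIC,
        "reccount": reccount,
        "length": length,
        "has_records": len(raw) > LINK_EA_HEADER.size,
    }


# The bytes reported by debugfs for the file found under /OBJECTS.
raw = bytes.fromhex(
    "df f1 ea 11 00 00 00 00 18 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00"
)
header = decode_link_ea_header(raw)
print(header)
```

Under these assumptions the value is a bare header: the record count is zero and the length covers only the header itself, so no parent FID or name follows it.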
This issue can be reproduced by doing the following on a Lustre filesystem originally formatted with Lustre 2.1, and upgraded to Lustre 2.4:
- on a Lustre client:
  - touch <lustre dir>/file
  - lfs path2fid <lustre dir>/file
- stop the file system and mount the MDT with ldiskfs
- move the file from <mdt ldiskfs>/ROOT/file to <mdt ldiskfs>/OBJECTS directory
- with setfattr change the link EA to remove all links:
  - setfattr -n trusted.link -v 0xdff1ea110000000018000000000000000000000000000000 <mdt ldiskfs>/OBJECTS/file
- umount ldiskfs MDT and restart the file system
- on a Lustre client:
  - lfs fid2path <lustre dir> <fid obtained at first step>
- the MDS server crashes
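The hex value passed to setfattr in the steps above can be rebuilt rather than copied by hand. This is only a sketch: it assumes the link EA header is a little-endian magic (0x11EAF1DF), record count, 64-bit length, and two padding words, which is my understanding of the Lustre structure and not something the ticket states. The helper name `empty_link_ea_hex` is hypothetical.

```python
import struct

# Assumed header layout: magic, record count, total length, two padding words.
LINK_EA_HEADER = struct.Struct("<IIQII")
LINK_EA_MAGIC = 0x11EAF1DF  # assumed value of the link EA magic


def empty_link_ea_hex() -> str:
    """Return a 0x-prefixed hex string for a link EA holding no records."""
    raw = LINK_EA_HEADER.pack(
        LINK_EA_MAGIC,        # magic
        0,                    # record count: no links
        LINK_EA_HEADER.size,  # total length: header only
        0,                    # padding
        0,                    # padding
    )
    return "0x" + raw.hex()


value = empty_link_ea_hex()
print(value)
```

If the assumed layout is right, the printed string matches the value used with `setfattr -n trusted.link -v` in the reproduction steps.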
The crash can be avoided with the patch from LU-3474 at http://review.whamcloud.com/10464 .
But the scenario at CEA is worse than the error case found by Andreas (lfs fid2path on old IGIF FIDs) because here it crashes the MDS.
So first, we would need this patch to be landed on b2_4, or at least to know whether it is suitable for production use with Lustre 2.4.3.
Secondly, given that a file system at CEA has more than 5,000 files in the OBJECTS directory, we are wondering:
a) why are some regular files moved to the OBJECTS directory? What Lustre mechanism leads to this?
b) why does the link EA contain a zero FID when files are moved to the OBJECTS directory?
c) why does moving files to the OBJECTS directory generate an entry in the Lustre Changelogs?
TIA,
Sebastien.
Issue Links
- is related to LU-3474 MDS LBUG on unlink? (Resolved)
I verified that the 10464 patch matches the fix that was landed for 2.5.0 (in 2.4.52, just after b2_4 was branched off master). I believe the patch should be safe for your production use in 2.4 (and we are planning to land it for the next 2.4.x release).
As for files in OBJECTS, that is confusing, since this directory should only be used for nameless objects such as the quota admin files. Regular files should not end up there. It appears almost as if the OBJECTS directory was being confused with the PENDING directory, which holds files that are open but unlinked. In that case, it would make sense that the "link" xattr has no links anymore, since there are no names for this file anymore. If possible, could you please check the FIDs in both /OBJECTS and /PENDING to see if they conflict? You can do this while the MDT is mounted using: