[LU-1569] Many Files missing and others have no info (uid/gid/permissions) Created: 26/Jun/12 Updated: 06/Nov/13 Resolved: 06/Nov/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Critical |
| Reporter: | Brian Andrus (Inactive) | Assignee: | WC Triage |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS release 5.7 (Final) |
||
| Attachments: |
|
| Epic: | server |
| Rank (Obsolete): | 4002 |
| Description |
|
We have had a catastrophic failure of one of our Lustre filesystems. We are not sure of the exact cause, but in its current state, running lfsck on it gives a huge number of errors. When we run a find on various users' directories, we find many "No such file" errors, and affected entries are listed without any uid/gid/permission information:
?--------- ? ? ? ? ? mcfd_tec.bin.660 |
| Comments |
| Comment by Cliff White (Inactive) [ 26/Jun/12 ] |
|
Have you successfully run 'fsck -fy' on all devices? Are you using the latest version of e2fsprogs? It is available at http://downloads.whamcloud.com/public/e2fsprogs/ |
| Comment by Brian Andrus (Inactive) [ 26/Jun/12 ] |
|
Initially our Lustre filesystem (/work) had one of its OSTs disconnect and not reconnect (there are 10 OSTs of 7.8TB each). This put /work in read-only mode. Currently /work is mounted read-only so that users who still have intact data can copy it to a clean filesystem. |
| Comment by Cliff White (Inactive) [ 26/Jun/12 ] |
|
Okay, thanks |
| Comment by Cliff White (Inactive) [ 26/Jun/12 ] |
|
First, as explained in the Lustre Manual, lustre-logs that are auto-dumped must be pre-processed on site to be useful, so we can't do much with what you attached. What we need in this case are the system logs (typically /var/log/messages) for all OSTs and the MDS/MGS, covering the period from 12 hours before the initial outage to the present time. Please do not filter the logs unless you need to remove IPs for security. |
| Comment by Cliff White (Inactive) [ 26/Jun/12 ] |
|
Please run lfs getstripe on one of the missing files, get the list of stripe objects and check the OSTs to determine if the data actually exists on the OST disk. Debugfs will work for this.
$ debugfs -c -R "stat O/0/d$((818855 % 32))/818855" /dev/<your OST device> |
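The debugfs command above encodes how an object ID maps to a path on the OST backing filesystem: the object sits under O/0/ in a subdirectory named d followed by the object ID modulo 32. A minimal sketch of that mapping follows, in case many object IDs need checking. The function names are hypothetical helpers, and the group 0 and 32-subdirectory defaults are taken from the command above rather than from any other source.

```python
def ost_object_path(object_id: int, group: int = 0, subdirs: int = 32) -> str:
    """Return the path of an OST object relative to the OST's backing filesystem root.

    Mirrors the shell expression O/0/d$((objid % 32))/objid from the debugfs
    command above. The defaults are assumptions copied from that command.
    """
    return f"O/{group}/d{object_id % subdirs}/{object_id}"


def debugfs_stat_command(object_id: int, device: str) -> list[str]:
    """Build (but do not run) the read-only debugfs stat command for one object."""
    return ["debugfs", "-c", "-R", f"stat {ost_object_path(object_id)}", device]


if __name__ == "__main__":
    # Object ID taken from the example command in this ticket.
    print(ost_object_path(818855))
    print(debugfs_stat_command(818855, "/dev/<your OST device>"))
```

The object IDs themselves come from the lfs getstripe output for the affected file. The sketch only builds the command, so it can be reviewed before anything touches the OST device.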
| Comment by Brian Andrus (Inactive) [ 26/Jun/12 ] |
|
Here is a quick check on one file that is showing in an ls, but missing info:
[root@nas-0-1 hale]# ls -l|grep gempak$
[root@nas-0-1 hale]# debugfs -c -R "stat O/0/d$((5703559 % 32))/5703559" /dev/VG_hamming/work_ost0 |
| Comment by Brian Andrus (Inactive) [ 26/Jun/12 ] |
|
Tar file of /var/log/messages for MGS and OSSes |
| Comment by Cliff White (Inactive) [ 26/Jun/12 ] |
|
Thanks - did you keep any logs/output from the first fsck you did after the initial failure? Please attach if so. |
| Comment by Brian Andrus (Inactive) [ 27/Jun/12 ] |
|
The only log I have is the output from lfsck, but it is 7.9GB |
| Comment by Brian Andrus (Inactive) [ 27/Jun/12 ] |
|
Attached output from running lctl df on all the dump logs that were generated (lustre.log) |
| Comment by Cliff White (Inactive) [ 27/Jun/12 ] |
|
We need the fsck data, not the lfsck. |
| Comment by Brian Andrus (Inactive) [ 28/Jun/12 ] |
|
That I do not have. I do know there are many files in lost+found on the backing filesystem. I have not examined them yet since it is now mounted as lustre (albeit read-only). |
| Comment by Cliff White (Inactive) [ 05/Jul/12 ] |
|
Have you run the lost+found recovery script? |
| Comment by Andreas Dilger [ 11/Jul/12 ] |
|
That would be "ll_recover_lost_found_objs", which should be installed on all the OSTs. You need to mount the OST locally using "-t ldiskfs" instead of "-t lustre" to run this tool. It will rebuild the corrupted object directories and move all the objects from lost+found back into their proper locations. |
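The sequence described above could be sketched as follows for a single OST. This is only an outline under stated assumptions: the mount point /mnt/ost is hypothetical, the function name is a made-up helper, and the -d option (pointing at the lost+found directory) is how the tool is usually invoked to the best of my knowledge, so please confirm it against the tool's help output on your installed version. The OST must not be mounted as "-t lustre" at the same time. The sketch prints the commands instead of running them.

```python
import shlex


def lost_found_recovery_commands(ost_device: str, mount_point: str) -> list[list[str]]:
    """Return the commands for recovering lost+found objects on one OST.

    Nothing is executed here; the caller decides whether to run the commands.
    The -d option is an assumption to verify against the installed tool.
    """
    return [
        # Mount the OST backing filesystem directly, not as a Lustre target.
        ["mount", "-t", "ldiskfs", ost_device, mount_point],
        # Move objects from lost+found back into the object directories.
        ["ll_recover_lost_found_objs", "-d", f"{mount_point}/lost+found"],
        # Unmount so the OST can later be remounted as "-t lustre".
        ["umount", mount_point],
    ]


if __name__ == "__main__":
    # Device name from earlier in this ticket; mount point is hypothetical.
    for cmd in lost_found_recovery_commands("/dev/VG_hamming/work_ost0", "/mnt/ost"):
        print(shlex.join(cmd))
```

Printing first gives a chance to check each device name against the OST list before anything is mounted.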