[LU-17169] can't delete corrupted directory Created: 05/Oct/23  Updated: 03/Nov/23  Resolved: 03/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Peter Jones
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File nbp11.lfsck_dry_run.out.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A user has a corrupted directory.

ls -l |grep vol
ls: cannot access 'volcano': No such file or directory
d????????? ? ? ? ? ? volcano

It is a directory on the third MDT; here is the debugfs output:


debugfs: ls -l
....
151872827 40000 (18) 0 0 4096 31-Dec-1969 16:00 volcano
....
debugfs: stat volcano
Inode: 151872827 Type: directory Mode: 0000 Flags: 0x80000
Generation: 2866389135 Version: 0x00000000:00000000
User: 0 Group: 0 Project: 0 Size: 4096
File ACL: 0
Links: 2 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x651736e5:ea8ba3cc – Fri Sep 29 13:43:17 2023
atime: 0x00000000:fffffff8 – Wed Dec 31 16:00:00 1969
mtime: 0x00000000:fffffff8 – Wed Dec 31 16:00:00 1969
crtime: 0x651736e5:ea4e9a98 – Fri Sep 29 13:43:17 2023
Size of extra inode fields: 32
Extended attributes:
lma: fid=[0x28003d638:0x1:0x0] compat=0 incompat=2
EXTENTS:
(0):2357133314

How should we delete this? Should we run an lfsck?



 Comments   
Comment by Andreas Dilger [ 05/Oct/23 ]

The first thing to do before deleting anything is to check whether any errors are reported in the console logs on the client or MDS. Depending on the error, it might make sense to run e2fsck or lfsck to see if the directory can be repaired. You could run a read-only e2fsck to see if this directory inode number reports any errors.
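A read-only check along these lines could look as follows; the device path is a hypothetical placeholder for the actual MDT backing device, and e2fsck with -n makes no changes:

```shell
# Read-only, forced check of the ldiskfs MDT backing device.
# /dev/mdt2_dev is a placeholder for the real MDT0002 device.
e2fsck -fn /dev/mdt2_dev

# Inspect the suspect inode directly; -c opens the device read-only
# (catastrophic mode), -R runs a single debugfs command.
debugfs -c -R 'stat <151872827>' /dev/mdt2_dev
```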

Comment by Mahmoud Hanafi [ 06/Oct/23 ]

I started an lfsck dry-run; there are many layout repair messages like this:
Lustre: nbp11-MDT0001-osd: layout LFSCK master found bad lmm_oi for [0x28000233a:0x46cf:0x0]: rc = 56

layout_mdts_init: 0
layout_mdts_scanning-phase1: 1
layout_mdts_scanning-phase2: 2
layout_mdts_completed: 0
layout_mdts_failed: 0
layout_mdts_stopped: 0
layout_mdts_paused: 0
layout_mdts_crashed: 0
layout_mdts_partial: 0
layout_mdts_co-failed: 0
layout_mdts_co-stopped: 0
layout_mdts_co-paused: 0
layout_mdts_unknown: 0
layout_osts_init: 0
layout_osts_scanning-phase1: 0
layout_osts_scanning-phase2: 69
layout_osts_completed: 0
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 0
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 92227209
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 1
namespace_mdts_scanning-phase2: 2
namespace_mdts_completed: 0
namespace_mdts_failed: 0
namespace_mdts_stopped: 0
namespace_mdts_paused: 0
namespace_mdts_crashed: 0
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 1051
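For reference, a dry-run like the one above is started and monitored with lctl; the target name follows this filesystem (nbp11), but the exact MDT index shown is an assumption:

```shell
# Start a dry-run LFSCK on one MDT (checks only, no repairs applied).
lctl lfsck_start -M nbp11-MDT0000 -t all --dryrun on

# Monitor progress; these counters are where the stats above come from.
lctl get_param -n mdd.nbp11-MDT0000.lfsck_layout
lctl get_param -n mdd.nbp11-MDT0000.lfsck_namespace

# Stop it if needed.
lctl lfsck_stop -M nbp11-MDT0000
```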
Comment by Mahmoud Hanafi [ 06/Oct/23 ]

I am also attaching the full lfsck dry-run output.

We will need to schedule dedicated time to run e2fsck.

Comment by Andreas Dilger [ 06/Oct/23 ]
Lustre: nbp11-MDT0001-osd: layout LFSCK master found bad lmm_oi for [0x28000233a:0x46cf:0x0]: rc = 56

What does "lfs getstripe -v /mnt/nbp11/.lustre/fid/0x28000233a:0x46cf:0x0" report for the file layout? The lmm_oi is the old "backpointer" from the file layout that stores the FID, but it isn't really used for anything these days and doesn't necessarily indicate a problem. If the filesystem is older, it is possible that an earlier bug wrote the lmm_oi in an incorrect format.
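For what it's worth, the FID from the LFSCK message can also be mapped back to a pathname from a client; the mount point here is the /nobackupp11 one used elsewhere in this ticket:

```shell
# Resolve the FID reported by LFSCK to its pathname(s) on a client.
lfs fid2path /nobackupp11 "[0x28000233a:0x46cf:0x0]"
```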

Comment by Mahmoud Hanafi [ 06/Oct/23 ]

/nobackupp11/.lustre/fid/0x28000233a:0x46cf:0x0
lmm_magic:         0x0BD10BD0
lmm_seq:           0x2000264db
lmm_object_id:     0x4f02
lmm_fid:           [0x2000264db:0x4f02:0x0]
lmm_stripe_count:  1
lmm_stripe_size:   4194304
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 22
	obdidx		 objid		 objid		 group
	    22	       6466362	     0x62ab3a	             0


This is a very old filesystem.

Comment by Andreas Dilger [ 06/Oct/23 ]

It looks like the FID stored in the layout is different from the FID of the file. That might be because the file was migrated but the layout FID was not updated; that was an old bug which has since been fixed.
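On a release with that bug fixed, rewriting such a file regenerates its layout, including the stored layout FID. A minimal sketch, assuming a hypothetical path and preserving the single-stripe layout seen above:

```shell
# lfs migrate rewrites the file's data under a freshly allocated
# layout, which also rewrites the stored layout FID.
# The path is hypothetical; -c 1 keeps the one-stripe layout.
lfs migrate -c 1 /nobackupp11/path/to/file
```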

Comment by Peter Jones [ 20/Oct/23 ]

Anything else needed here, Mahmoud, or can we close this ticket out?

Comment by Peter Jones [ 03/Nov/23 ]

There seem to be no further questions.

Generated at Sat Feb 10 03:33:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.