[LU-17369] Missing OST objects after "lfs migrate" Created: 15/Dec/23  Updated: 15/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Story Priority: Minor
Reporter: Sergey Cheremencev Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A race between "lfs migrate" and unlink can leave files without their corresponding OST objects. The scenario is as follows:

1. "lfs migrate" transfers files from MDT0 to MDT1 for directory "dir"
2. client1 removes file "f1" from "dir". This removes the object on MDT1 and the corresponding objects on the OSTs.
3. client1 is disconnected from MDT1
4. MDT1 fails over (probably a kernel panic)
5. MDT1 recovery starts
6. MDT0 resends the "replay" request to create a new object for "f1" on MDT1 (part of lfs migrate)

Because the client was evicted right before the MDT1 failover, it does not participate in recovery and does not replay the unlink for the new object on MDT1. As a result, the object exists on MDT1 but its corresponding OST objects do not.

Such files are usually displayed with "???" instead of attributes:

vm1:~/lustre2$ ls -l | head -3
ls: cannot access 'all_jobs_id': No such file or directory
total 101100
-????????? ? ? ? ? ? all_jobs_id 

Below is an example of how to distinguish this issue from other cases in which a file could lose its OST objects. Because "lfs migrate" copies the file attributes, crtime will always be newer than ctime, atime and mtime:

[root@vm1 logs]# cat stat
debugfs -c -R "stat REMOTE_PARENT_DIR/0x2400013a1:0x1:0x0/f3" /tmp/lustre-mdt2 > stat
Inode: 162   Type: regular    Mode:  0644   Flags: 0x0
Generation: 2069782550    Version: 0x00000000:00000000
User:     0   Group:     0   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x64807111:00000000 -- Wed Jun  7 15:59:13 2023
 atime: 0x64807111:00000000 -- Wed Jun  7 15:59:13 2023
 mtime: 0x64807111:00000000 -- Wed Jun  7 15:59:13 2023
crtime: 0x64807120:b6e87414 -- Wed Jun  7 15:59:28 2023
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x2400013a0:0x3:0x0] compat=0 incompat=0
  trusted.lov (56) = d0 0b d1 0b 01 00 00 00 52 00 00 00 00 00 00 00 02 04 00 00 02 00 00 00 00 00 10 00 01 00 00 00 02 04 00 c0 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 
  trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 
  linkea: idx=0 parent=[0x2400013a1:0x1:0x0] name='f3'
BLOCKS:
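
The crtime check above can be scripted when many inodes have to be inspected. The following is only a minimal sketch, assuming the debugfs "stat" output has been saved as text in the format shown above. The function names (parse_times, looks_migrated) are hypothetical and not part of any Lustre tool. Only the seconds field (the hex value before the colon) is compared, so a migration that completes within the same second as the copied timestamps would not be detected.

```python
import re

# Matches lines such as:
#  ctime: 0x64807111:00000000 -- Wed Jun  7 15:59:13 2023
TIME_RE = re.compile(
    r"^\s*(ctime|atime|mtime|crtime):\s*0x([0-9a-fA-F]+):([0-9a-fA-F]+)",
    re.M,
)

def parse_times(stat_text):
    """Return {name: seconds} for the timestamps found in debugfs "stat" output."""
    return {m.group(1): int(m.group(2), 16) for m in TIME_RE.finditer(stat_text)}

def looks_migrated(stat_text):
    """True if crtime is strictly newer than ctime, atime and mtime."""
    times = parse_times(stat_text)
    names = ("ctime", "atime", "mtime")
    if "crtime" not in times or any(n not in times for n in names):
        return False  # incomplete output, no conclusion possible
    return all(times["crtime"] > times[n] for n in names)

# Timestamps copied from the debugfs output above.
SAMPLE = """\
 ctime: 0x64807111:00000000 -- Wed Jun  7 15:59:13 2023
 atime: 0x64807111:00000000 -- Wed Jun  7 15:59:13 2023
 mtime: 0x64807111:00000000 -- Wed Jun  7 15:59:13 2023
crtime: 0x64807120:b6e87414 -- Wed Jun  7 15:59:28 2023
"""

if __name__ == "__main__":
    print(looks_migrated(SAMPLE))
```

For the inode above, crtime (0x64807120) is 15 seconds later than the other three timestamps (0x64807111), which matches the 15:59:13 and 15:59:28 values printed by debugfs.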

[root@vm1 logs]# lfs getstripe /mnt/lustre/dir/f3
/mnt/lustre/dir/f3
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 1
	obdidx		 objid		 objid		 group
	     1	             4	          0x4	   0x2c0000402

[root@vm1 logs]# debugfs -c -R "stat O/2c0000402/d4/4" /tmp/lustre-ost2
debugfs 1.46.2.wc5 (26-Mar-2022)
/tmp/lustre-ost2: catastrophic mode - not reading inode or group bitmaps
O/2c0000402/d4/4: File not found by ext2_lookup  
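
The debugfs path used in the last command can be derived from the "lfs getstripe" output. The sketch below is a hedged illustration only: the layout O/<group in hex>/d<objid modulo 32>/<objid in hex> is an assumption based on the single example above and on the usual ldiskfs OST layout, and it may not hold for every sequence (for instance the legacy group 0) or Lustre version. The helper names are hypothetical.

```python
def parse_stripe_line(line):
    """Parse one object row of "lfs getstripe" output.

    Expected columns: obdidx, objid (decimal), objid (hex), group (hex).
    Returns (obdidx, objid, group) as integers.
    """
    obdidx, objid, _objid_hex, group = line.split()
    return int(obdidx), int(objid), int(group, 16)

def ost_object_path(group, objid):
    """Build the path passed to debugfs "stat" on the OST.

    Assumption: objects are spread over 32 "dN" subdirectories by
    objid % 32, and both group and objid are written in hex.
    """
    return "O/{:x}/d{}/{:x}".format(group, objid % 32, objid)

if __name__ == "__main__":
    # Row copied from the lfs getstripe output above.
    row = "\t     1\t             4\t          0x4\t   0x2c0000402"
    obdidx, objid, group = parse_stripe_line(row)
    print(obdidx, ost_object_path(group, objid))
```

With the row shown above, this yields OST index 1 and the same path that was passed to debugfs, O/2c0000402/d4/4.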

Generated at Sat Feb 10 03:34:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.