[LU-3420] OI scrubbing could not automatically engage after restoring a secondary MDT from a (file-level) backup Created: 30/May/13  Updated: 13/Sep/13  Resolved: 10/Jul/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.1, Lustre 2.5.0

Type: Bug Priority: Critical
Reporter: Li Wei (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: mq313

Attachments: Text File ls-remote-dir-enoent.log    
Issue Links:
Related
is related to LU-3332 sanity-scrub and sanity-lfsck need to... Resolved
Severity: 3
Rank (Obsolete): 8481

 Description   

When adapting sanity-scrub 4 to exercise not only MDT 0 but also the secondary MDTs, I found that, after restoring a secondary MDT from its file-level backup, looking up the corresponding "remote" directory returns ENOENT on clients:

[root@linux tests]# ls /mnt/lustre/d0.sanity-scrub/d4/mdt1
ls: cannot access /mnt/lustre/d0.sanity-scrub/d4/mdt1: No such file or directory

"mdt1" was created by "lfs mkdir -i 1". And, OI scrubbing did not engage automatically:

[root@linux tests]# cat /proc/fs/lustre/osd-ldiskfs/lustre-MDT0001/oi_scrub
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: init
flags: inconsistent
param:
time_since_last_completed: N/A
time_since_latest_start: N/A
time_since_last_checkpoint: N/A
latest_start_position: N/A
last_checkpoint_position: N/A
first_failure_position: N/A
checked: 0
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 0
run_time: 0 seconds
average_speed: 0 objects/sec
real-time_speed: N/A
current_position: N/A
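
For anyone who wants to check for this state from a test script, here is a minimal Python sketch that parses oi_scrub output like the above and reports the case seen here: the OI is flagged inconsistent but scrubbing never left the "init" state. The function names are hypothetical, and treating "flags" as a comma-separated list is an assumption; only a single flag appears in this report.

```python
def parse_oi_scrub(text):
    """Parse 'key: value' lines of oi_scrub output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that carry no 'key:' prefix
            fields[key.strip()] = value.strip()
    return fields


def scrub_needed_but_idle(fields):
    """True if the OI is marked inconsistent while the scrub is still in 'init'."""
    # Assumption: multiple flags, if present, are separated by commas.
    flags = [f.strip() for f in fields.get("flags", "").split(",")]
    return fields.get("status") == "init" and "inconsistent" in flags


if __name__ == "__main__":
    # Path taken from this report; adjust the target name for other MDTs.
    path = "/proc/fs/lustre/osd-ldiskfs/lustre-MDT0001/oi_scrub"
    with open(path) as f:
        print(scrub_needed_but_idle(parse_oi_scrub(f.read())))
```

On the output quoted above, scrub_needed_but_idle() would return True, which is the symptom reported in this ticket.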

The debug log shows that MDT 0 sent an UPDATE_OBJ OBJ_ATTR_GET RPC to MDT 1. The FID was found in the OI, but the inode number it mapped to was (naturally) stale, since a file-level restore allocates new inodes:

00000004:00000002:0.0:1369882480.737242:0:7229:0:(osd_handler.c:226:osd_iget()) unmatched inode: ino = 102, gen0 = 2698313523, gen1 = 294820613

According to osd_fid_lookup(), OI scrubbing is not triggered in this case.
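
To find other occurrences of the "unmatched inode" message in a debug log, a small Python sketch like the following may help. The regular expression is based only on the single line quoted above, so the exact message layout in other Lustre versions is an assumption, and the function name is hypothetical.

```python
import re

# Pattern derived from the one osd_iget() message quoted in this report.
UNMATCHED_RE = re.compile(
    r"unmatched inode: ino = (\d+), gen0 = (\d+), gen1 = (\d+)"
)


def find_unmatched_inodes(log_text):
    """Return a list of (ino, gen0, gen1) integer tuples found in the log text."""
    return [
        tuple(int(group) for group in match.groups())
        for match in UNMATCHED_RE.finditer(log_text)
    ]
```

Each returned tuple carries the inode number and the two generation values that osd_iget() reported as not matching.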



 Comments   
Comment by Li Wei (Inactive) [ 30/May/13 ]

Attached the debug log. Note that this was a single-node setup.

Comment by Li Wei (Inactive) [ 30/May/13 ]

CC'ed Wang Di and Fan Yong.

Comment by Li Wei (Inactive) [ 30/May/13 ]

This and LU-3332 depend on each other.

Comment by Andreas Dilger [ 30/May/13 ]

Fan Yong, I understand that remote directory checking for DNE MDTs is part of LFSCK Phase III, but could you please investigate what work would be needed to fix the file-level backup/restore?

Li Wei, do you know if this is a problem on mdt0 or mdt1? Were both of them backed up and restored, or just mdt1?

Comment by Li Wei (Inactive) [ 31/May/13 ]

Andreas, all MDTs (MDSCOUNT=2, so both MDT 0 and 1) were backed up and restored during the test. The problem, as I understood from discussing it with Fan Yong yesterday, is on MDT 1: the direct FID lookup (without a prior name lookup) does not trigger OI scrubbing.

Comment by nasf (Inactive) [ 01/Jun/13 ]

I have made a patch to fix it:
http://review.whamcloud.com/#change,6515

The reason is described in the patch commit message.

Comment by nasf (Inactive) [ 10/Jul/13 ]

The patch has landed in Lustre 2.5.
