[LU-3934] Directories gone missing after 2.4 update Created: 11/Sep/13  Updated: 19/Jul/21  Resolved: 26/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug Priority: Blocker
Reporter: Christopher Morrone Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl, yuc2
Environment:

lustre 2.4.0-17chaos (github.com/chaos/lustre)


Issue Links:
Related
is related to LU-14864 osd_fid_lookup() ASSERTION( !updated ... Open
is related to LU-4626 directories missing after upgrade fro... Resolved
is related to LU-4058 Interop 2.4.0<->2.5 failure on test s... Resolved
Severity: 3
Rank (Obsolete): 10401

 Description   

After upgrading our servers from 2.1 to 2.4, our MDS crashed on LU-2842, and we applied the patch. That patch avoided the LBUG, but now it is clear that there is a more basic problem: we can no longer look up many of the top-level subdirectories in this Lustre filesystem.

We are seeing problems like:

2013-09-11 13:01:22 LustreError: 5570:0:(mdt_open.c:1687:mdt_reint_open()) lsc-MDT0000: name purgelogs present, but fid [0x2830891e:0xd1781321:0x0] invalid

It looks to me like the directory entries are still there, but FID lookups do not work on them. We verified that the directory named "purgelogs" appears on the underlying ldiskfs filesystem at ROOT/purgelogs.

We also saw error messages like the following during recovery shortly after the recent boot:

2013-09-11 12:58:27 sumom-mds1 login: LustreError: 4164:0:(mdt_open.c:1497:mdt_reint_open()) @@@ [0x24d18001:0x3db440f0:0x0]/XXXXXX->[0x24d98604:0x2a32454:0x0] cr_flags=0104200200001 mode=0200100000 msg_flag=0x4 not found in open replay.  req@ffff8808263d1000 x1443453865661288/t0(4638566185002) o101->f45d6fab-2c9c-6b39-0090-4935fbe03e32@192.168.115.87@o2ib10:0/0 lens 568/1176 e 0 to 0 dl 1378929568 ref 1 fl Interpret:/4/0 rc 0/0

(I X'ed out the user name there, but everything else is cut-and-paste.)

Any ideas on the next step to get these directories accessible again?



 Comments   
Comment by Peter Jones [ 11/Sep/13 ]

Di is looking into this

Comment by Di Wang [ 11/Sep/13 ]

Hmm, apparently these FIDs are IGIF FIDs (generated under 1.8?). It seems to me that OI scrub did not handle this well: either it did not insert the IGIF FIDs into the OI correctly, or something else got broken.
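For context: an IGIF FID embeds the old 1.8 inode identity directly in the FID, with the sequence field carrying the inode number and the oid field carrying the generation. Below is a minimal illustrative sketch of that encoding, using the helper names that appear in the patches later in this ticket (not a verbatim copy of the Lustre headers):

/* Illustrative sketch of the IGIF encoding; names follow the osd
 * code referenced in this ticket, but this is not the real header.
 * __u64/__u32 are the usual kernel integer typedefs. */
struct lu_fid {
        __u64 f_seq;    /* for an IGIF FID: the inode number */
        __u32 f_oid;    /* for an IGIF FID: the inode generation */
        __u32 f_ver;    /* for an IGIF FID: always 0 */
};

static inline __u64 lu_igif_ino(const struct lu_fid *fid)
{
        return fid->f_seq;      /* inode number */
}

static inline __u32 lu_igif_gen(const struct lu_fid *fid)
{
        return fid->f_oid;      /* inode generation */
}

Decoded this way, the FID [0x2830891e:0xd1781321:0x0] from the log above would map to inode 0x2830891e with generation 0xd1781321.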

Comment by Christopher Morrone [ 11/Sep/13 ]

This filesystem was indeed formatted under 1.8 to the best of my knowledge.

Comment by Christopher Morrone [ 11/Sep/13 ]

OI scrub...wasn't that part of LFSCK? Don't assume that we have used that.

Comment by Di Wang [ 11/Sep/13 ]

It is my understanding that OI scrub will try to insert all IGIF FIDs into the OI during the first startup after an upgrade. So I guess it did not handle that well here, but Fan Yong is the expert. I think you can probably work around this with this temporary patch:

[root@testnode lustre-release]# git diff
diff --git a/lustre/osd-ldiskfs/osd_oi.c b/lustre/osd-ldiskfs/osd_oi.c
index 8eff0a3..28f4c4c 100644
--- a/lustre/osd-ldiskfs/osd_oi.c
+++ b/lustre/osd-ldiskfs/osd_oi.c
@@ -528,7 +528,7 @@ int osd_oi_lookup(struct osd_thread_info *info, struct osd_device *osd,
        if (unlikely(fid_is_acct(fid)))
                return osd_acct_obj_lookup(info, osd, fid, id);
 
-       if (!osd->od_igif_inoi && fid_is_igif(fid)) {
+       if (/*!osd->od_igif_inoi &&*/ fid_is_igif(fid)) {
                osd_id_gen(id, lu_igif_ino(fid), lu_igif_gen(fid));
                return 0;
        }
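With the check commented out, every IGIF FID is mapped directly to its inode number and generation, bypassing the OI lookup entirely.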

But I really need to check with Fan Yong (the OI scrub author) to get his opinion.

Comment by Christopher Morrone [ 11/Sep/13 ]

Ah, ok, thanks.

Comment by Andreas Dilger [ 11/Sep/13 ]

The following is probably a bit more correct, though it wouldn't make any difference for the current situation unless the 2.4 filesystem has been backed up and then restored again:

        rc = __osd_oi_lookup(info, osd, fid, id);
        if (rc == -ENOENT && fid_is_igif(fid)) {
                osd_id_gen(id, lu_igif_ino(fid), lu_igif_gen(fid));
                rc = 0;
        }
        
        return rc;

The main difference is that it should always do the FID lookup in the OI first, and only if that entry is missing should it fall back to mapping the IGIF FID directly to the underlying inode. That ordering matters because a file-level backup and restore assigns new inode numbers: the ino/gen encoded in an old IGIF FID would then point at the wrong inode, while the OI (rebuilt by scrub after the restore) holds the correct new mapping. To reiterate, however, I don't think this is critical to the situation at hand.

Comment by Andreas Dilger [ 11/Sep/13 ]

Looking closer into the OI Scrub code, the root of this problem might be that the "od_igif_inoi" flag is set incorrectly right after mount:

        rc = osd_initial_OI_scrub(info, dev);
        if (rc == 0) {
                if (sf->sf_flags & SF_UPGRADE ||
                    !(sf->sf_internal_flags & SIF_NO_HANDLE_OLD_FID ||
                      sf->sf_success_count > 0)) {
                        dev->od_igif_inoi = 0;
                        dev->od_check_ff = 1;
                } else {
                        dev->od_igif_inoi = 1;
                        dev->od_check_ff = 0;
                }

                if (sf->sf_flags & SF_INCONSISTENT)
                        /* The 'od_igif_inoi' will be set under the
                         * following cases:
                         * 1) new created system, or
                         * 2) restored from file-level backup, or
                         * 3) the upgrading completed.
                         *
                         * The 'od_igif_inoi' may be cleared by OI scrub
                         * later if found that the system is upgrading. */
                        dev->od_igif_inoi = 1;

        /* OI files are invalid, should be rebuild ASAP */
        SF_INCONSISTENT = 0x0000000000000002ULL,

The "initial scrub" is the quick check of the MDT root-level Lustre files and directories (e.g. OI, ROOT/, etc) to see if they are sane, or if a full LFSCK needs to be run to rebuild the OI files.

In this case, it isn't clear to me that od_igif_inoi should be set when a filesystem was upgraded from 2.1 to 2.4, even though the previous check for SF_UPGRADE had just turned it off. It would be safer to leave it turned off until the full OI Scrub has completed (inserting all of the IGIF entries into the OI) before turning it back on.

In conjunction with the above change that always does the OI lookup first even for IGIF FIDs, I think this will avoid the problem being seen here. It may also be that this problem would have "solved itself" over time, once the OI Scrub had completed, but I'm not sure how long that would have taken. It might also be useful to get the output of "lctl get_param osd-ldiskfs.*MDT*.oi_scrub" from the MDT as well.
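A minimal sketch of that safer ordering, assuming the flag and field names from the excerpt above (illustrative only, not the landed fix):

        rc = osd_initial_OI_scrub(info, dev);
        if (rc == 0) {
                if (sf->sf_flags & SF_UPGRADE ||
                    sf->sf_success_count == 0) {
                        /* upgrading, or a full scrub has never completed:
                         * IGIF FIDs may still be missing from the OI, so
                         * do not rely on OI-only lookups yet */
                        dev->od_igif_inoi = 0;
                        dev->od_check_ff = 1;
                } else {
                        /* a full scrub completed at least once, so all
                         * IGIF entries have been inserted into the OI */
                        dev->od_igif_inoi = 1;
                        dev->od_check_ff = 0;
                }
        }

The key difference from the excerpt is that SF_INCONSISTENT alone would no longer force od_igif_inoi back on before a scrub has ever completed.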

Comment by Christopher Morrone [ 11/Sep/13 ]

> lctl get_param 'osd-ldiskfs.\*.oi_scrub'
osd-ldiskfs.lsc-MDT0000.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 0
status: init
flags:
param:
time_since_last_completed: N/A
time_since_latest_start: N/A
time_since_last_checkpoint: N/A
latest_start_position: N/A
last_checkpoint_position: N/A
first_failure_position: N/A
checked: 0
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 0
run_time: 0 seconds
average_speed: 0 objects/sec
real-time_speed: N/A
current_position: N/A

Comment by Andreas Dilger [ 11/Sep/13 ]

It looks like OI Scrub has never been run on this MDT. I just did a test on a new filesystem with some files copied into it, ran "lctl lfsck_start -M testfs-MDT0000" to run a manual OI Scrub, then remounted the MDT; it still reported that a scrub had previously been run:

lctl get_param osd-ldiskfs.*MDT*.oi_scrub
osd-ldiskfs.testfs-MDT0000.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 64
status: completed
flags:
param:
time_since_last_completed: 127 seconds
time_since_latest_start: 127 seconds
time_since_last_checkpoint: 127 seconds
latest_start_position: 12
last_checkpoint_position: 524289
first_failure_position: N/A
checked: 1490
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 1
run_time: 0 seconds
average_speed: 1490 objects/sec
real-time_speed: N/A
current_position: N/A

I would have thought that for a 2.1->2.4 upgrade OI Scrub would have run automatically in order to add the IGIF FIDs to the OI file; maybe I'm misremembering when this code landed? It was part of the LFSCK 1.5 project for handling 1.8->2.4 upgrades, since 2.4 was the last release to support upgrades from 1.8.

Comment by Christopher Morrone [ 12/Sep/13 ]

We would certainly like this to work automatically before our next filesystem upgrade. But in the meantime, perhaps the best option is simply to kick off a manual OI scrub using lfsck_start?

If the automatic upgrade and the manual command are just going to have the same effect on the filesystem, there is probably no reason to be overly fearful of running lfsck.

Comment by Christopher Morrone [ 12/Sep/13 ]

Is there any obvious way to tell the lfsck_start --dryrun OI scrub apart from a non-dryrun OI scrub? I decided to give the dryrun version a try, and it looks like it's finding things to update (as expected):

/proc/fs/lustre/osd-ldiskfs/lsc-MDT0000 > cat oi_scrub 
name: OI_scrub
magic: 0x4c5fd252
oi_files: 0
status: scanning
flags: upgrade
param:
time_since_last_completed: N/A
time_since_latest_start: 718 seconds
time_since_last_checkpoint: 58 seconds
latest_start_position: 12
last_checkpoint_position: 30227006
first_failure_position: N/A
checked: 11297618
updated: 924507
failed: 0
prior_updated: 0
noscrub: 480
igif: 924512
success_count: 0
run_time: 718 seconds
average_speed: 15734 objects/sec
real-time_speed: 16568 objects/sec
current_position: 34190024

I haven't seen any change to the top-level subdirectories, so I am working under the assumption that dryrun is really being honored at the moment.

Comment by Di Wang [ 12/Sep/13 ]

Reassigning to Fan Yong.

Comment by nasf (Inactive) [ 12/Sep/13 ]

Sorry, there is no working dryrun mode for OI scrub at the moment. You can pause/stop the current OI scrub at any time via "lctl lfsck_stop -M lustre-MDT0000", but until the OI scrub finishes we do not know whether the IGIF <=> ino# mappings for the top-level subdirectories have been processed or not. You can estimate the OI scrub running time from the "average_speed".
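As a rough worked example: at the average_speed of ~15,700 objects/sec shown in the dryrun output above, scanning about 300 million inodes works out to roughly 300,000,000 / 15,700 ≈ 19,000 seconds, i.e. a little over five hours.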

Comment by nasf (Inactive) [ 12/Sep/13 ]

The OI scrub should have been triggered automatically in such an upgrade case, but it was not, because the current detection mechanism works as follows:

1) When the device mounts, the initial OI scrub runs automatically and checks /ROOT/.lustre; if it does not exist, this is regarded as an upgrade case.

2) After mount, if OI scrub is triggered (manually or by some inconsistency) and finds an IGIF-mode object, this is regarded as an upgrade case.

For this failure, the upgrade path went from 1.8 to 2.1 first. Lustre 2.1 does not support OI scrub, but it does create /ROOT/.lustre. So when the system was then upgraded from 2.1 to 2.4, the initial OI scrub could not detect the upgrade. I will fix this by checking the OI file names and the count of successful OI scrub runs: if there is only one OI file, "oi.16", and OI scrub has never run on the device, then it is quite probably an upgrade case.
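A minimal sketch of that heuristic (illustrative only; osd_has_only_oi16() is a hypothetical helper, and the real patch is linked below):

        /* Sketch: a device with only the single old-style OI file
         * "oi.16" that has never completed an OI scrub is quite
         * probably an upgraded filesystem. osd_has_only_oi16() is
         * a hypothetical helper used for illustration. */
        static bool osd_upgrade_suspected(struct osd_device *dev,
                                          struct scrub_file *sf)
        {
                return osd_has_only_oi16(dev) &&
                       sf->sf_success_count == 0;
        }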

I will push the patch soon.

Comment by nasf (Inactive) [ 12/Sep/13 ]

This is the patch:
http://review.whamcloud.com/#/c/7625/

Comment by Christopher Morrone [ 12/Sep/13 ]

FYI, the scrub that I inadvertently started finished after a little over 4 hours (just shy of 300 million files scanned, and a little over 12 million files updated). It looks like the directories are all accessible again, so I think this filesystem is OK again.

Comment by nasf (Inactive) [ 12/Sep/13 ]

That is really good news. We still need the above patch to make the auto-detection mechanism more robust and avoid similar issues next time.

Comment by Christopher Morrone [ 12/Sep/13 ]

What will happen when the automatic OI scrub is made to work?

When we boot our MGS/MDS node after upgrading the software, what should we expect to see? Does the OI scrub make the mount of the MDT hang for several hours, or does it happen in the background?

If the OI scrub happens in the background and clients are permitted to mount the filesystem, I presume that there would be a period of time when users would still see inaccessible files and directories.

Comment by nasf (Inactive) [ 13/Sep/13 ]

The full-system OI scrub runs in the background, so it will not cause the MDT to hang. If a client accesses the system before the OI scrub has finished, there are several cases:

1) Name-based access: the client sends a lookup by name first, and then accesses the object by the returned FID. This works.

2) FID-based access: the client connected to the MDT before, already knows the object, cached its FID before the MDT upgrade, and sends the cached FID directly to the MDT after the MDT is remounted for the upgrade.
2.1) If that FID mapping has already been processed by OI scrub, it works.
2.2) Otherwise, the client may get failures.

In theory, the MDT could revoke all of the locks held by clients when an upgrade is detected. But there are race cases in which the FIDs currently in use by a client may still hit the above failures.

Comment by nasf (Inactive) [ 21/Sep/13 ]

The patch for master to detect the upgrade:

http://review.whamcloud.com/#/c/7719/

Comment by Jodi Levi (Inactive) [ 24/Sep/13 ]

Patch landed to master, so closing the ticket. Please let me know if anything additional is needed and I will reopen.

Comment by Christopher Morrone [ 24/Sep/13 ]

It looks like the 2.4 patch assumes the existence of this patch:

448a0fb 2013-08-08 LU-3420 scrub: trigger OI scrub properly

which did not exist at 2.4.0. I assume you are suggesting that I cherry-pick that as well?

Comment by nasf (Inactive) [ 25/Sep/13 ]

First, you need this patch (http://review.whamcloud.com/#/c/7625/) on Lustre-2.4 to resolve LU-3934.

Then, if possible, please also consider the patch (http://review.whamcloud.com/#/c/6515/), which mainly focuses on triggering OI scrub properly under DNE mode. The patch is based on master (Lustre-2.5). I am not sure whether it can be applied to your patch stack directly; please try. If not, we can back-port it.

Comment by Christopher Morrone [ 25/Sep/13 ]

http://review.whamcloud.com/#/c/6515/ was also landed on b2_4, and you therefore based http://review.whamcloud.com/#/c/7625/ on that. 7625 does not apply cleanly without 6515, so I'll just take both.

Comment by nasf (Inactive) [ 26/Sep/13 ]

6515 has been on b2_4 already, but not on b2_4_0, so you need to backport 6515 to b2_4_0, then apply 7625.

Comment by Peter Jones [ 26/Sep/13 ]

Closing, as LLNL have pulled the fix(es) into their release and the fix has landed for 2.5.0.
