[LU-16655] Files not accessible after 2.12 -> 2.14/2.15 upgrade Created: 22/Mar/23  Updated: 29/Jan/24  Resolved: 18/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Bug Priority: Blocker
Reporter: Alexander Zarochentsev Assignee: Alexander Zarochentsev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14864 osd_fid_lookup() ASSERTION( !updated ... Open
is related to LU-16677 wirecheck.c not compilable and has di... Resolved
is related to LU-16689 upgrade to 2.15.2 lost sever top leve... Resolved
is related to LU-16873 LU-16655 fix doesn't update OI_Scrub ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Most of the files are not accessible after 2.12 -> 2.15 upgrade:

[root@hpcln01 ~]#  ls /mnt/lustre
ls: cannot access '/mnt/lustre/README': No such file or directory
ls: cannot access '/mnt/lustre/other': No such file or directory
ls: cannot access '/mnt/lustre/vol0': No such file or directory
ls: cannot access '/mnt/lustre/fc': No such file or directory
README  fc  other  vol0
[root@hpcln01 ~]#  ls -l /mnt/lustre
ls: cannot access '/mnt/lustre/README': No such file or directory
ls: cannot access '/mnt/lustre/other': No such file or directory
ls: cannot access '/mnt/lustre/vol0': No such file or directory
ls: cannot access '/mnt/lustre/fc': No such file or directory
total 0
-????????? ? ? ? ?            ? README
d????????? ? ? ? ?            ? fc
d????????? ? ? ? ?            ? other
d????????? ? ? ? ?            ? vol0
[root@hpcln01 ~]#

The system had been functioning correctly before the upgrade.

The server's kernel log contains messages like the following:

Mar 20 11:33:04 hpcln02 kernel: LustreError: 1764712:0:(mdt_open.c:1519:mdt_reint_open()) hpcln-MDT0000: name 'README' present, but FID [0x200016cab:0x1e5:0x0] is invalid


 Comments   
Comment by Alexander Zarochentsev [ 22/Mar/23 ]

The reason for the fs corruption is that information about number of OI files on the target was lost during upgrade:

osd-ldiskfs.hpcln-MDT0000.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 1
status: completed
flags:
param:
time_since_last_completed: 547 seconds
time_since_latest_start: 4525 seconds
time_since_last_checkpoint: 547 seconds
latest_start_position: 12
last_checkpoint_position: 4294148041
first_failure_position: N/A
checked: 52509028
updated: 51670016
failed: 0
prior_updated: 0
noscrub: 38
igif: 1
success_count: 1
run_time: 3978 seconds
average_speed: 13199 objects/sec
real_time_speed: N/A
current_position: N/A
lf_scanned: 2777
lf_repaired: 0
lf_failed: 0

number of OI files became 1, i.e. the system forgot about 63 out of 64 OI files and about all FID->mapping info in those files.

After some investigation I found a not intentioned on-disk format change, specifically in struct scrub_file.

commit a114f6b8c52bb97dcad093e5cb32b903ae9f602f
Author: James Simmons <jsimmons@infradead.org>
Date:   Tue Mar 17 07:31:08 2020 -0400

    LU-13344 servers: change request timeouts to s32
    
    In the latest kernels time_t has been removed since time_t has
    been a 64 bit value just like time64_t so no need for it anymore.
    The field sf_run_time is stored on local media and its expected
    to be 32 bit in size so change it to s32 explictly. The fields
    the MDT coordinator that are currently time_t are a simple change
    to s32 since they represent a time period and not a time stamp.
    Lastly just update a comment in llog_swab.c since time_t will no
    longer exist but we want to point out time in the llog is 32 bits
    in size.
    
    Change-Id: I0849fc9798c191d1fed77f7bba1357a209924418
    Signed-off-by: James Simmons <jsimmons@infradead.org>
    Reviewed-on: https://review.whamcloud.com/37950
    Tested-by: jenkins <devops@whamcloud.com>
    Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
    Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
    Tested-by: Maloo <maloo@whamcloud.com>
    Reviewed-by: Oleg Drokin <green@whamcloud.com>

diff --git a/lustre/include/lustre_scrub.h b/lustre/include/lustre_scrub.h
index 38d4fb5b44..313af0f9eb 100644
--- a/lustre/include/lustre_scrub.h
+++ b/lustre/include/lustre_scrub.h
@@ -240,7 +240,7 @@ struct scrub_file {
         * to time64_t since this breaks backwards compatibility.
         * It shouldn't take more than 136 years to complete :-)
         */
-       time_t  sf_run_time;
+       s32     sf_run_time;
 
        /* How many completed OI scrub ran on the device. */
        __u32   sf_success_count;
...

Length of the sf_run_time field was changed without adding a padding , so the sf_oi_count field changed its offset.

For some systems after upgrade to 2.15, an incorrect value in sf_oi_count may successfully pass sanity checks in osd_oi_init() and became new value of sf_oi_count, with consequences showed in the description of this ticket.

Comment by Andreas Dilger [ 22/Mar/23 ]

The 2.14+ struct scrub_file format is actually better than the 2.12 one, because the fields are properly aligned and it does not have a hole in it. The 2.12.x struct format according to pahole is:

        /* --- cacheline 2 boundary (128 bytes) --- */
        time_t                     sf_run_time;          /*   128     8 */
        __u32                      sf_success_count;     /*   136     4 */
        __u16                      sf_oi_count;          /*   140     2 */
        __u16                      sf_internal_flags;    /*   142     2 */
        __u32                      sf_reserved_1;        /*   144     4 */

        /* XXX 4 bytes hole, try to pack */

while the 2.14.x and later (and 2.10 and earlier, see below) format is:

        /* --- cacheline 2 boundary (128 bytes) --- */
        s32                        sf_run_time;          /*   128     4 */
        __u32                      sf_success_count;     /*   132     4 */
        __u16                      sf_oi_count;          /*   136     2 */
        __u16                      sf_internal_flags;    /*   138     2 */
        __u32                      sf_reserved_1;        /*   140     4 */

Ironically, the comment in that struct says:

        /* How long the OI scrub has run in seconds. Do NOT change
         * to time64_t since this breaks backwards compatibility.
         * It shouldn't take more than 136 years to complete :-)
         */
        time_t  sf_run_time;

Unfortunately, "time_t" is itself a signed long, so it was a 64-bit value despite what the comment says. It looks like that was itself changed in commit v2_10_58_0-54-g4c2f028a95 so it appears that upgrades from 2.10 to 2.12 may also have had a similar issue:

@@ -237,8 +235,11 @@ struct scrub_file {
        /* How many IGIF objects. */
        __u64   sf_items_igif;
 
-       /* How long the OI scrub has run. */
-       __u32   sf_run_time;
+       /* How long the OI scrub has run in seconds. Do NOT change
+        * to time64_t since this breaks backwards compatibility.
+        * It shouldn't take more than 136 years to complete :-)
+        */
+       time_t  sf_run_time;

What it looks like is happening is that the sf_oi_count in memory is being shifted to read from sf_success_count on disk in the 2.12->2.14+ upgrade case. Many filesystems will never have run OI Scrub, so they will have sf_success_count=0. Some filesystems will a low number of OI Scrub runs, so "sf_oi_count = sf_success_count = 1" or similar is used. In some cases, osd_oi_init() will detect a non power-of-two value in sf_oi_count with an error like "testfs-MDT0000: Invalid OI count in scrub file 5", and reset it to 0 (which will try to read all of the oi.16.* files). In the 2.10->2.12 upgrade case this bug was not noticed, since it shifted sf_internal_flags (always zero?) to sf_oi_count and all oi.16.* files were read.

On a related note, it does look like osd_oi_table_open() can trigger "LASSERT(oi_count <= OSD_OI_FID_NR_MAX)" in osd_oi_table_open() if there is bad data in the OI_scrub file (e.g. upgrade system with sf_success_count > 1024), so this should also be checked in osd_oi_init() and reset sf_oi_count=0 in this case (and potentially any other important fields in this struct):

-        if (unlikely(sf->sf_oi_count & (sf->sf_oi_count - 1)) != 0) {
+        if (unlikely(sf->sf_oi_count & (sf->sf_oi_count - 1) != 0 ||
+                     sf->sf_oi_count > OSD_OI_FID_NR_MAX)) {
                sf->sf_oi_count = 0;
        }

One option to fix this problem would be to change struct scrub_file.sf_magic to use a new value, which could be used to detect "bad" scrub files and reset the struct and start scrub from the beginning:

#define SCRUB_MAGIC_V1                  0x4C5FD252
#define SCRUB_MAGIC_V2                  0x4C5FD253

int scrub_file_load(const struct lu_env *env, struct lustre_scrub *scrub)
{
        if (sf->sf_magic == SCRUB_MAGIC_V1) {
                sf->sf_oi_count = 0;  /* reset OI count, LU-16655 */
        } else if (sf->sf_magic != SCRUB_MAGIC_V2) {
                CDEBUG(D_LFSCK, "%s: invalid scrub magic 0x%x != 0x%x\n",
                       scrub->os_name, sf->sf_magic, SCRUB_MAGIC_V2);
                return -EFAULT;
        }

It seems possible to "fix" old OI_scrub files by just resetting the "sf_oi_scrub=0" if SCRUB_MAGIC_V1 is found.

The OI_scrub file itself is only useful for temporarily storing the scrub state so that scrub doesn't start from inode 1 on each mount. This is useful if scrub is taking a long time to complete, or the server crashes in the middle. Resetting the scrub state on an upgrade is totally fine and would avoid this issue.

Alternately, or in addition, it makes sense for osd_oi_table_open() to not blindly trust sf_oi_count and try to access at least the default number of OI files like osd_oi_probe() so that it doesn't do something similarly bad to the OI files if the OI_scrub file is corrupted on disk in some other way.

Since this is an on-disk structure it should be checked in wirecheck/wiretest to avoid such breakage in the future. It may also make sense to make "struct scrub_file" a fixed size (e.g. 1KB) and also store the magic at the end to be able to detect other kinds of corruption.

Comment by Gerrit Updater [ 28/Mar/23 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50455
Subject: LU-16655 scrub: upgrade scrub_file from 2.12 format
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bc6debf1a352611a22e9024443d9422c42f3e64d

Comment by Gerrit Updater [ 31/Mar/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50480
Subject: LU-16655 scrub: upgrade scrub_file from 2.12 format
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: e0359d4746e54daeab01f9f173a3db3d42476adc

Comment by Gerrit Updater [ 31/Mar/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50482
Subject: LU-16655 build: add wirecheck/wiretest for disk structs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 357f65dedeb6637171d1e7bc9da2591e9a15d66c

Comment by Etienne Aujames [ 31/Mar/23 ]

The issue related to the change of sf_run_time format between 2.10 to 2.12 (from s32 to time_t) was:

  • LU-14864: osd_fid_lookup() ASSERTION( !updated ) failed
Comment by Gerrit Updater [ 31/Mar/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50496
Subject: LU-16655 scrub: change sf_uuid to guid_t
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ccc72993aaddbd42bb2692848784690086b8e241

Comment by Gerrit Updater [ 04/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50480/
Subject: LU-16655 scrub: upgrade scrub_file from 2.12 format
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: a209505361cf858ec91b0a89992691c5b005ab72

Comment by Gerrit Updater [ 05/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50455/
Subject: LU-16655 scrub: upgrade scrub_file from 2.12 format
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 126275ba8339540e46f1c517decd3d69ad1cc42c

Comment by Gerrit Updater [ 18/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50496/
Subject: LU-16655 scrub: change sf_uuid to guid_t
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 20621c761b787cd7b08a9d0abc0fc362f0bb0122

Comment by Gerrit Updater [ 18/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50482/
Subject: LU-16655 build: add wirecheck/wiretest for disk structs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a3a713c8e1a47edc625963092b6b3f27079cdf2d

Comment by Peter Jones [ 18/Apr/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:28:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.