[LU-16655] Files not accessible after 2.12 -> 2.14/2.15 upgrade Created: 22/Mar/23 Updated: 29/Jan/24 Resolved: 18/Apr/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.3 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Alexander Zarochentsev | Assignee: | Alexander Zarochentsev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
Most of the files are not accessible after 2.12 -> 2.15 upgrade: [root@hpcln01 ~]# ls /mnt/lustre ls: cannot access '/mnt/lustre/README': No such file or directory ls: cannot access '/mnt/lustre/other': No such file or directory ls: cannot access '/mnt/lustre/vol0': No such file or directory ls: cannot access '/mnt/lustre/fc': No such file or directory README fc other vol0 [root@hpcln01 ~]# ls -l /mnt/lustre ls: cannot access '/mnt/lustre/README': No such file or directory ls: cannot access '/mnt/lustre/other': No such file or directory ls: cannot access '/mnt/lustre/vol0': No such file or directory ls: cannot access '/mnt/lustre/fc': No such file or directory total 0 -????????? ? ? ? ? ? README d????????? ? ? ? ? ? fc d????????? ? ? ? ? ? other d????????? ? ? ? ? ? vol0 [root@hpcln01 ~]# The system had been functioning correctly before the upgrade. The server's kernel log contains messages like the following: Mar 20 11:33:04 hpcln02 kernel: LustreError: 1764712:0:(mdt_open.c:1519:mdt_reint_open()) hpcln-MDT0000: name 'README' present, but FID [0x200016cab:0x1e5:0x0] is invalid |
| Comments |
| Comment by Alexander Zarochentsev [ 22/Mar/23 ] |
|
The reason for the fs corruption is that information about number of OI files on the target was lost during upgrade: osd-ldiskfs.hpcln-MDT0000.oi_scrub= name: OI_scrub magic: 0x4c5fd252 oi_files: 1 status: completed flags: param: time_since_last_completed: 547 seconds time_since_latest_start: 4525 seconds time_since_last_checkpoint: 547 seconds latest_start_position: 12 last_checkpoint_position: 4294148041 first_failure_position: N/A checked: 52509028 updated: 51670016 failed: 0 prior_updated: 0 noscrub: 38 igif: 1 success_count: 1 run_time: 3978 seconds average_speed: 13199 objects/sec real_time_speed: N/A current_position: N/A lf_scanned: 2777 lf_repaired: 0 lf_failed: 0 number of OI files became 1, i.e. the system forgot about 63 out of 64 OI files and about all FID->mapping info in those files. After some investigation I found a not intentioned on-disk format change, specifically in struct scrub_file. commit a114f6b8c52bb97dcad093e5cb32b903ae9f602f
Author: James Simmons <jsimmons@infradead.org>
Date: Tue Mar 17 07:31:08 2020 -0400
LU-13344 servers: change request timeouts to s32
In the latest kernels time_t has been removed since time_t has
been a 64 bit value just like time64_t so no need for it anymore.
The field sf_run_time is stored on local media and its expected
to be 32 bit in size so change it to s32 explictly. The fields
the MDT coordinator that are currently time_t are a simple change
to s32 since they represent a time period and not a time stamp.
Lastly just update a comment in llog_swab.c since time_t will no
longer exist but we want to point out time in the llog is 32 bits
in size.
Change-Id: I0849fc9798c191d1fed77f7bba1357a209924418
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/37950
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
diff --git a/lustre/include/lustre_scrub.h b/lustre/include/lustre_scrub.h
index 38d4fb5b44..313af0f9eb 100644
--- a/lustre/include/lustre_scrub.h
+++ b/lustre/include/lustre_scrub.h
@@ -240,7 +240,7 @@ struct scrub_file {
* to time64_t since this breaks backwards compatibility.
* It shouldn't take more than 136 years to complete :-)
*/
- time_t sf_run_time;
+ s32 sf_run_time;
/* How many completed OI scrub ran on the device. */
__u32 sf_success_count;
...
Length of the sf_run_time field was changed without adding a padding , so the sf_oi_count field changed its offset. For some systems after upgrade to 2.15, an incorrect value in sf_oi_count may successfully pass sanity checks in osd_oi_init() and became new value of sf_oi_count, with consequences showed in the description of this ticket. |
| Comment by Andreas Dilger [ 22/Mar/23 ] |
|
The 2.14+ struct scrub_file format is actually better than the 2.12 one, because the fields are properly aligned and it does not have a hole in it. The 2.12.x struct format according to pahole is:
/* --- cacheline 2 boundary (128 bytes) --- */
time_t sf_run_time; /* 128 8 */
__u32 sf_success_count; /* 136 4 */
__u16 sf_oi_count; /* 140 2 */
__u16 sf_internal_flags; /* 142 2 */
__u32 sf_reserved_1; /* 144 4 */
/* XXX 4 bytes hole, try to pack */
while the 2.14.x and later (and 2.10 and earlier, see below) format is:
/* --- cacheline 2 boundary (128 bytes) --- */
s32 sf_run_time; /* 128 4 */
__u32 sf_success_count; /* 132 4 */
__u16 sf_oi_count; /* 136 2 */
__u16 sf_internal_flags; /* 138 2 */
__u32 sf_reserved_1; /* 140 4 */
Ironically, the comment in that struct says:
/* How long the OI scrub has run in seconds. Do NOT change
* to time64_t since this breaks backwards compatibility.
* It shouldn't take more than 136 years to complete :-)
*/
time_t sf_run_time;
Unfortunately, "time_t" is itself a signed long, so it was a 64-bit value despite what the comment says. It looks like that was itself changed in commit v2_10_58_0-54-g4c2f028a95 so it appears that upgrades from 2.10 to 2.12 may also have had a similar issue:
@@ -237,8 +235,11 @@ struct scrub_file {
/* How many IGIF objects. */
__u64 sf_items_igif;
- /* How long the OI scrub has run. */
- __u32 sf_run_time;
+ /* How long the OI scrub has run in seconds. Do NOT change
+ * to time64_t since this breaks backwards compatibility.
+ * It shouldn't take more than 136 years to complete :-)
+ */
+ time_t sf_run_time;
What it looks like is happening is that the sf_oi_count in memory is being shifted to read from sf_success_count on disk in the 2.12->2.14+ upgrade case. Many filesystems will never have run OI Scrub, so they will have sf_success_count=0. Some filesystems will a low number of OI Scrub runs, so "sf_oi_count = sf_success_count = 1" or similar is used. In some cases, osd_oi_init() will detect a non power-of-two value in sf_oi_count with an error like "testfs-MDT0000: Invalid OI count in scrub file 5", and reset it to 0 (which will try to read all of the oi.16.* files). In the 2.10->2.12 upgrade case this bug was not noticed, since it shifted sf_internal_flags (always zero?) to sf_oi_count and all oi.16.* files were read. On a related note, it does look like osd_oi_table_open() can trigger "LASSERT(oi_count <= OSD_OI_FID_NR_MAX)" in osd_oi_table_open() if there is bad data in the OI_scrub file (e.g. upgrade system with sf_success_count > 1024), so this should also be checked in osd_oi_init() and reset sf_oi_count=0 in this case (and potentially any other important fields in this struct): - if (unlikely(sf->sf_oi_count & (sf->sf_oi_count - 1)) != 0) { + if (unlikely(sf->sf_oi_count & (sf->sf_oi_count - 1) != 0 || + sf->sf_oi_count > OSD_OI_FID_NR_MAX)) { sf->sf_oi_count = 0; } One option to fix this problem would be to change struct scrub_file.sf_magic to use a new value, which could be used to detect "bad" scrub files and reset the struct and start scrub from the beginning: #define SCRUB_MAGIC_V1 0x4C5FD252 #define SCRUB_MAGIC_V2 0x4C5FD253 int scrub_file_load(const struct lu_env *env, struct lustre_scrub *scrub) { if (sf->sf_magic == SCRUB_MAGIC_V1) { sf->sf_oi_count = 0; /* reset OI count, LU-16655 */ } else if (sf->sf_magic != SCRUB_MAGIC_V2) { CDEBUG(D_LFSCK, "%s: invalid scrub magic 0x%x != 0x%x\n", scrub->os_name, sf->sf_magic, SCRUB_MAGIC_V2); return -EFAULT; } It seems possible to "fix" old OI_scrub files by just resetting the "sf_oi_scrub=0" if SCRUB_MAGIC_V1 is found. The OI_scrub file itself is only useful for temporarily storing the scrub state so that scrub doesn't start from inode 1 on each mount. This is useful if scrub is taking a long time to complete, or the server crashes in the middle. Resetting the scrub state on an upgrade is totally fine and would avoid this issue. Alternately, or in addition, it makes sense for osd_oi_table_open() to not blindly trust sf_oi_count and try to access at least the default number of OI files like osd_oi_probe() so that it doesn't do something similarly bad to the OI files if the OI_scrub file is corrupted on disk in some other way. Since this is an on-disk structure it should be checked in wirecheck/wiretest to avoid such breakage in the future. It may also make sense to make "struct scrub_file" a fixed size (e.g. 1KB) and also store the magic at the end to be able to detect other kinds of corruption. |
| Comment by Gerrit Updater [ 28/Mar/23 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50455 |
| Comment by Gerrit Updater [ 31/Mar/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50480 |
| Comment by Gerrit Updater [ 31/Mar/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50482 |
| Comment by Etienne Aujames [ 31/Mar/23 ] |
|
The issue related to the change of sf_run_time format between 2.10 to 2.12 (from s32 to time_t) was:
|
| Comment by Gerrit Updater [ 31/Mar/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50496 |
| Comment by Gerrit Updater [ 04/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50480/ |
| Comment by Gerrit Updater [ 05/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50455/ |
| Comment by Gerrit Updater [ 18/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50496/ |
| Comment by Gerrit Updater [ 18/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50482/ |
| Comment by Peter Jones [ 18/Apr/23 ] |
|
Landed for 2.16 |