[LU-9788] upgrading ldiskfs on-disk format from 2.4.3 lustre version to 2.8.0 Created: 20/Jul/17  Updated: 10/Mar/18  Resolved: 10/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question/Request Priority: Major
Reporter: James A Simmons Assignee: Andreas Dilger
Resolution: Done Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7410 After downgrade from 2.8 to 2.5.5, hi... Resolved
is related to LU-8605 Downgrading from 2.8 to 2.5: fsck_nam... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

ORNL's main production file system was formatted during the Lustre 2.4.3 time frame. Since then we have moved to 2.5 and now to Lustre 2.8.0 without updating the ldiskfs on-disk format. This ticket is a request for information on what has changed and the impact of those changes. Lastly, we need to ensure the upgrade is done correctly.



 Comments   
Comment by Peter Jones [ 20/Jul/17 ]

Andreas

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 20/Jul/17 ]

We work hard to maintain upgrade and downgrade compatibility of the on-disk format between Lustre releases. New features that change the on-disk format in a way that prevents downgrading to the previous Lustre version typically require explicit action from the administrator to enable. This allows the system to be upgraded without affecting the disk format, and the new feature to be enabled only once the new Lustre release is known to be stable in your environment.

It would be good to get the output of "dumpe2fs -h" from the MDT and one OST (assuming they are the same) to see what ldiskfs features are currently enabled, and check if there may be performance improvements possible after the upgrade. 

Secondly, in addition to upgrading the servers, will you also be upgrading the clients, or will you be running with different client versions?  I believe that you may already be running 2.8 clients on your system. 

There are two issues that I'm aware of that would affect upgrade+downgrade to a 2.4 MDS. One is that the client multiple metadata request feature ("multi-slot last_rcvd") sets a flag on the MDT for the new recovery file format that prevents mounting on an unsupported version of Lustre (LU-7410). This has a simple workaround if you hit it during a downgrade, as described in that ticket.

The second is related to LFSCK (LU-8605), and can be avoided by having the fix applied to your 2.8 code before the upgrade. You may already have a fix for this issue.

If you want to be prudent, it makes sense to create a backup of the MDT prior to the upgrade. This can be done with "dd" of the raw MDT filesystem to a backup device before installing the new Lustre release. DDN has also been testing the use of "e2image" to make copies of the ldiskfs metadata, which have the advantage of only backing up the in-use parts of the device, and are stored more compactly. It would be possible to make an e2image backup of the OSTs as well, since the actual space used would be relatively small.
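As a minimal sketch of what such a backup could look like, assuming the MDT device is /dev/mapper/atlas1-mdt1 and /backup is a scratch destination (both placeholders; the target must be unmounted first):

# full raw copy of the unmounted MDT with dd
dd if=/dev/mapper/atlas1-mdt1 of=/backup/atlas1-mdt1.img bs=4M
# metadata-only image with e2image; -r writes a sparse raw image
e2image -r /dev/mapper/atlas1-mdt1 /backup/atlas1-mdt1.e2i

Since the e2image output is sparse, copying or archiving it with a sparse-aware tool (e.g. tar --sparse) keeps the on-disk footprint small.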
 

Comment by Brad Hoagland (Inactive) [ 22/Aug/17 ]

Hi simmonsja,

Does this answer your question(s)?

Regards,

Brad

Comment by James A Simmons [ 31/Aug/17 ]

I need to talk to Andreas in detail about this.

Comment by James A Simmons [ 07/Sep/17 ]

Both our clients and our server back end are running Lustre 2.8.1. It's just the ldiskfs format that hasn't been upgraded since our 2.5 days. Here is the dumpe2fs output from our MDS server:

[root@atlas1-mds1 ~]# dumpe2fs -h /dev/mapper/atlas1-mdt1
dumpe2fs 1.42.13.wc5 (15-Apr-2016)
Filesystem volume name: atlas1-MDT0000
Last mounted on: /
Filesystem UUID: 182b7f08-b0ff-4803-aa46-c110ac95acfe
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg ea_inode dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags: signed_directory_hash
Default mount options: user_xattr
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 1073741824
Block count: 536870912
Reserved block count: 26843545
Free blocks: 272442789
Free inodes: 682054233
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1024
Blocks per group: 16384
Fragments per group: 16384
Inodes per group: 32768
Inode blocks per group: 4096
Flex block group size: 16
Filesystem created: Tue Oct 1 11:02:25 2013
Last mount time: Tue Aug 22 12:14:37 2017
Last write time: Tue Aug 22 12:14:37 2017
Mount count: 4
Maximum mount count: -1
Last checked: Tue Jun 20 08:52:59 2017
Check interval: 0 (<none>)
Lifetime writes: 46 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 512
Required extra isize: 28
Desired extra isize: 28
Journal UUID: 33c503b6-30a3-481b-8ac6-121e2144bb79
Journal device: 0xfd06
Default directory hash: half_md4
Directory Hash Seed: bfef46fe-273d-43ea-95e3-d9ebcd11a516
Journal backup: inode blocks
User quota inode: 3
Group quota inode: 4

and here is the output from one of our OSS servers:

[root@atlas-oss1a1 ~]# dumpe2fs -h /dev/mapper/atlas-ddn1a-l0
dumpe2fs 1.42.13.wc5 (15-Apr-2016)
Filesystem volume name: atlas1-OST0000
Last mounted on: /
Filesystem UUID: f28766dc-9235-45e4-9ddc-f502ad57276c
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 29343744
Block count: 3755999232
Reserved block count: 187799961
Free blocks: 1333048954
Free inodes: 28069821
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 127
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 256
Inode blocks per group: 16
Flex block group size: 256
Filesystem created: Tue Oct 1 11:02:26 2013
Last mount time: Tue Aug 22 12:14:42 2017
Last write time: Tue Aug 22 12:14:42 2017
Mount count: 23
Maximum mount count: -1
Last checked: Sun Feb 7 16:31:09 2016
Check interval: 0 (<none>)
Lifetime writes: 210 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: c9240b0e-be05-4fb9-a604-69e902802452
Journal backup: inode blocks
MMP block number: 5638
MMP update interval: 5
User quota inode: 3
Group quota inode: 4
Journal features: journal_incompat_revoke
Journal size: 400M
Journal length: 102400
Journal sequence: 0x01fe13aa
Journal start: 33589

Comment by Andreas Dilger [ 08/Sep/17 ]

If you are already running Lustre 2.8.x on the MDS/OSS then there isn't a huge amount to be done. You already have the dirdata feature (which has been available since 2.1) and flex_bg, which are the major performance gains compared to upgraded 1.8-formatted MDT filesystems. The above referenced issues were only relevant in case of a downgrade, but since you are already running 2.8, presumably without problems that would make you want to downgrade, I don't think they are relevant.

The only other possible issue that would come up going forward is the size of the inodes on the MDT and OST. With Lustre 2.10+ we have bumped the default MDT inode size to 1024 bytes (from 512) and the default OST inode size to 512 (from 256) to facilitate usage of PFL in the future. If you are going to make PFL layouts the default on an ldiskfs MDT then you might consider doing a backup/restore (the inode size can only be changed at format time), but this is a non-issue for ZFS (which has dynamic dnode sizing as of 0.7.x).
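As a rough sketch only, a reformat with the larger inode sizes would pass the inode size through --mkfsoptions; the fsname, index, and device below are taken from the dumpe2fs output above, and other options (e.g. --mgsnode) are omitted:

# reformat the MDT with 1024-byte inodes (destructive; restore from a file-level backup afterwards)
mkfs.lustre --reformat --mdt --fsname=atlas1 --index=0 \
    --mkfsoptions="-I 1024" /dev/mapper/atlas1-mdt1
# OSTs would analogously use --mkfsoptions="-I 512"

The larger inode leaves room for the bigger PFL layout xattr to stay inside the inode rather than spilling into an external xattr block.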

Comment by James A Simmons [ 19/Sep/17 ]

Yes, our production system is running Lustre 2.8 clients and Lustre 2.8 servers. We have no plans to move the current center-wide file system to Lustre 2.10. When the file system was created it was formatted at the Lustre 2.5 level, so we are looking to see what needs to be enabled to get to the 2.8 support level. The big issue we have hit is users creating 20+ million files per directory, which crashes our MDS server. I believe the large directory hash work in newer Lustre versions fixes this.

Comment by Andreas Dilger [ 19/Sep/17 ]

James, we've never supported more than ~10M files per directory with ldiskfs (https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#settinguplustresystem.tab2), except with DNE2 directories striped over multiple MDTs (each with < 10M entries). The per-directory limit depends on the length of the filenames being used. Hitting this limit definitely shouldn't crash the MDS, which should be filed as a separate ticket with stack traces, etc.
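For reference, a DNE2 striped directory is created from a client with lfs mkdir; the stripe count and path below are illustrative and assume a filesystem with multiple MDTs:

# spread a new directory's entries across 4 MDTs
lfs mkdir -c 4 /lustre/atlas1/big_striped_dir

Each MDT then holds a shard of the directory, so each shard can stay under the per-directory entry limit.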

The large_dir feature hasn't been tested in production yet, as it has only landed in the development e2fsprogs release upstream, and the patches for e2fsprogs-wc need to be updated to include fixes made to those upstream patches. This definitely isn't something included as part of 2.8.

Comment by James A Simmons [ 26/Sep/17 ]

So even though ldiskfs has the code to support large_dir, it is off by default and has never really been tested. You can't even set large_dir with the current ldiskfs version of e2fsprogs?

Comment by Andreas Dilger [ 27/Sep/17 ]

Correct. Until recently, there was no e2fsck support for the large_dir feature, so it has not been safe to enable.
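For future reference only, once an e2fsprogs release with e2fsck support for large_dir is in use, enabling the feature on an existing unmounted target would presumably be a tune2fs operation along these lines (a sketch; do not run this with the e2fsprogs versions discussed above):

# hypothetical: set the large_dir feature flag, then confirm it is listed
tune2fs -O large_dir /dev/mapper/atlas1-mdt1
dumpe2fs -h /dev/mapper/atlas1-mdt1 | grep -i features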

Comment by Andreas Dilger [ 10/Mar/18 ]

Closing this issue; I don't think there is anything more to be done here.
