Details
Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.15.3
Labels: None
Environment: CentOS 7.9 (3.10.0-1160.90.1.el7_lustre.pl1.x86_64)
Severity: 3
Rank: 9223372036854775807
Description
Happy New Year
We are seeing a new problem on our Fir filesystem (full 2.15.3) when lfs migrating some files. The symptom is an ENOSPC error when trying to lfs migrate, which makes me think of LU-12852. Here is an example:
[root@fir-rbh03 ~]# lfs migrate -c 1 /fir/users/anovosel/Seisbench_DATA/stead_mem.csv
lfs migrate: cannot get group lock: No space left on device (28)
error: lfs migrate: /fir/users/anovosel/Seisbench_DATA/stead_mem.csv: cannot get group lock: No space left on device
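If useful, one way to see exactly which ioctl fails with ENOSPC would be to strace the migrate; a diagnostic sketch, assuming strace is available on the client:

# trace the ioctls issued by lfs migrate and keep the one failing with ENOSPC
strace -f -e trace=ioctl lfs migrate -c 1 /fir/users/anovosel/Seisbench_DATA/stead_mem.csv 2>&1 | grep ENOSPC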
These files use PFL, and a common point between them is that both the first and second components are initialized but NOT the last one. For example:
[root@fir-rbh03 ~]# lfs getstripe /fir/users/anovosel/Seisbench_DATA/stead_mem.csv
/fir/users/anovosel/Seisbench_DATA/stead_mem.csv
  lcm_layout_gen:    6
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  1
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 125
      lmm_pool:          ssd
      lmm_objects:
      - 0: { l_ost_idx: 125, l_fid: [0x1007d0000:0x3e14ca6:0x0] }

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   17179869184
      lmm_stripe_count:  2
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 74
      lmm_pool:          hdd
      lmm_objects:
      - 0: { l_ost_idx: 74, l_fid: [0x1004a0000:0x778f9c9:0x0] }
      - 1: { l_ost_idx: 75, l_fid: [0x1004b0000:0x73f371a:0x0] }

    lcme_id:             3
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 17179869184
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  16
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
      lmm_pool:          hdd
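For reference, a layout with these components would normally be created with something like the following sketch, reconstructed from the extents above (illustrative only, not necessarily the exact command used on Fir; <file> is a placeholder):

# 4M first component on ssd, up to 16G on hdd, then a 16-stripe hdd tail
lfs setstripe -E 4M  -c 1  -S 4M -p ssd \
              -E 16G -c 2  -S 4M -p hdd \
              -E eof -c 16 -S 4M -p hdd <file>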
We have four ldiskfs MDTs, and I have examples of files like that on MDTs 0, 2 and 3. We don't have the ea_inode feature set, but our inode size is 1KB:
[root@fir-md1-s1 Seisbench_DATA]# dumpe2fs -h /dev/mapper/md1-rbod1-mdt0
dumpe2fs 1.47.0-wc2 (25-May-2023)
Filesystem volume name:   fir-MDT0000
Last mounted on:          /
Filesystem UUID:          2f44ac0b-e931-4a58-90a4-d4f1765176bb
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit mmp flex_bg dirdata large_dir sparse_super large_file huge_file uninit_bg dir_nlink quota project
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3745217760
Block count:              4681213440
Reserved block count:     234060672
Free blocks:              3721821762
Free inodes:              3623118029
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         26216
Inode blocks per group:   6554
Flex block group size:    16
Filesystem created:       Tue Dec 1 09:29:39 2020
Last mount time:          Wed Jul 5 22:09:02 2023
Last write time:          Wed Jul 5 22:09:02 2023
Mount count:              26
Maximum mount count:      -1
Last checked:             Tue Dec 1 09:29:39 2020
Check interval:           0 (<none>)
Lifetime writes:          35 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               1024
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      b8d9b0f5-1004-482d-83a0-44b8305a24cd
Journal backup:           inode blocks
MMP block number:         28487
MMP update interval:      5
User quota inode:         3
Group quota inode:        4
Project quota inode:      12
Journal features:         journal_incompat_revoke journal_64bit
Total journal size:       4096M
Total journal blocks:     1048576
Max transaction length:   1048576
Fast commit length:       0
Journal sequence:         0x0e6dad3b
Journal start:            356385
MMP_block:
    mmp_magic: 0x4d4d50
    mmp_check_interval: 10
    mmp_sequence: 0x3131f5
    mmp_update_date: Mon Jan 8 11:02:45 2024
    mmp_update_time: 1704740565
    mmp_node_name: fir-md1-s1
    mmp_device_name: dm-0
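The two relevant points above (no ea_inode in the feature list, 1KB inodes) can be pulled out of the same dump quickly:

dumpe2fs -h /dev/mapper/md1-rbod1-mdt0 2>/dev/null | grep -E 'Filesystem features|Inode size'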
Under ldiskfs:
[root@fir-md1-s1 Seisbench_DATA]# pwd
/mnt/fir/ldiskfs/mdt/0/ROOT/users/[0x200000400:0x5:0x0]:0/anovosel/Seisbench_DATA
[root@fir-md1-s1 Seisbench_DATA]# stat stead_mem.csv
  File: ‘stead_mem.csv’
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: fd00h/64768d    Inode: 419466      Links: 1
Access: (0644/-rw-r--r--)  Uid: (419500/anovosel)   Gid: (18036/ beroza)
Access: 2023-10-12 17:49:40.000000000 -0700
Modify: 2023-10-11 15:45:05.000000000 -0700
Change: 2023-10-30 04:12:08.000000000 -0700
 Birth: -
[root@fir-md1-s1 Seisbench_DATA]# getfattr -m '.*' -d stead_mem.csv
# file: stead_mem.csv
trusted.link=0s3/HqEQEAAAA3AAAAAAAAAAAAAAAAAAAAAB8AAAACAAU7gAAAQQIAAAAAc3RlYWRfbWVtLmNzdg==
trusted.lma=0sAAAAAAAAAADBnAUAAgAAABcFAAAAAAAA
trusted.lov=0s0AvWC6ABAAAGAAAAAAADAAAAAAAAAAAAAAAAAAAAAAABAAAAEAAAAAAAAAAAAAAAAABAAAAAAACwAAAASAAAAAAAAABrAAAAAAAAAAAAAAACAAAAEAAAAAAAQAAAAAAAAAAAAAQAAAD4AAAAYAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAAAAAAAAAEAAAA//////////9YAQAASAAAAAAAAAAAAAAAAAAAAAAAAADQC9MLAQAAABcFAAAAAAAAwZwFAAIAAAAAAEAAAQAAAHNzZAAAAAAAAAAAAAAAAACmTOEDAAAAAAAAAAAAAAAAAAAAAH0AAADQC9MLAQAAABcFAAAAAAAAwZwFAAIAAAAAAEAAAgAAAGhkZAAQAP//AAAAAAAAAADJ+XgHAAAAAAAAAAAAAAAAAAAAAEoAAAAaNz8HAAAAAAAAAAAAAAAAAAAAAEsAAADQC9MLAQAAABcFAAAAAAAAwZwFAAIAAAAAAEAAEAD//2hkZAD/////IGiiAv////8AAAAAAAAAAAAAAAB1AAAAN8UpBv////8=
trusted.projid="419500"
trusted.som=0sBAAAAAAAAADUpL4WAAAAAGhfCwAAAAAA
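Since trusted.lov has to fit inside the 1KB inode when ea_inode is not enabled, the actual size of the layout EA may be relevant; a quick check in the same ldiskfs directory (sketch):

# print the raw byte length of the lov EA on the affected file
getfattr --only-values -n trusted.lov stead_mem.csv | wc -c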
Out of the tens of millions of files migrated like that in the last few months, I could find only a few hundred like this, so it's rare and appeared only recently with 2.15.3. We have to replace old storage chassis and won't have much time to troubleshoot, so let me know if you think of anything I could try. My current workaround for this problem is to make a copy and unlink the files manually instead.
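For the record, the manual workaround looks roughly like this (a sketch; $f stands for the affected file, and cp -a is used to preserve ownership and timestamps):

f=/fir/users/anovosel/Seisbench_DATA/stead_mem.csv
cp -a "$f" "$f.new"   # writes the data to new objects under the directory's default layout
mv "$f.new" "$f"      # renames over the original, which unlinks the old file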
Note: the hdd pool (last component) only has OSTs with max_create_count=0, but this PFL setting is very common and has worked on many other files.
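For completeness, that create-disabled state can be checked from the MDS side (assuming the usual OSP device naming; the commented set_param line shows how such an OST is typically drained):

lctl get_param osp.fir-OST*-osc-MDT*.max_create_count
# a drained hdd OST reports 0 here, as set with e.g.:
# lctl set_param osp.fir-OST004a-osc-MDT0000.max_create_count=0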