Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.15.3
- Labels: None
- Environment: CentOS 7.9 (3.10.0-1160.90.1.el7_lustre.pl1.x86_64)
- Severity: 3
- Rank: 9223372036854775807
Description
Happy New Year!
We are seeing a new problem on our Fir filesystem (fully on 2.15.3) when running lfs migrate on some files. The symptom is ENOSPC from lfs migrate, which makes me think of LU-12852. Here is an example:
[root@fir-rbh03 ~]# lfs migrate -c 1 /fir/users/anovosel/Seisbench_DATA/stead_mem.csv
lfs migrate: cannot get group lock: No space left on device (28)
error: lfs migrate: /fir/users/anovosel/Seisbench_DATA/stead_mem.csv: cannot get group lock: No space left on device
These files use PFL, and a common point between them is that the first and second components are initialized but NOT the last one. For example:
[root@fir-rbh03 ~]# lfs getstripe /fir/users/anovosel/Seisbench_DATA/stead_mem.csv
/fir/users/anovosel/Seisbench_DATA/stead_mem.csv
lcm_layout_gen: 6
lcm_mirror_count: 1
lcm_entry_count: 3
lcme_id: 1
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 4194304
lmm_stripe_count: 1
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 125
lmm_pool: ssd
lmm_objects:
- 0: { l_ost_idx: 125, l_fid: [0x1007d0000:0x3e14ca6:0x0] }
lcme_id: 2
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 4194304
lcme_extent.e_end: 17179869184
lmm_stripe_count: 2
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 74
lmm_pool: hdd
lmm_objects:
- 0: { l_ost_idx: 74, l_fid: [0x1004a0000:0x778f9c9:0x0] }
- 1: { l_ost_idx: 75, l_fid: [0x1004b0000:0x73f371a:0x0] }
lcme_id: 3
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 17179869184
lcme_extent.e_end: EOF
lmm_stripe_count: 16
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: -1
lmm_pool: hdd
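To locate other files in this state, something along the following lines might work. This is only a sketch: it assumes `lfs find`'s component-matching options (`--component-start`, `--component-flags` with `^` negation) behave as documented in 2.15, and that the affected files all share this layout's 16 GiB last-component boundary; the output path is arbitrary.

```shell
# Sketch: list regular files whose EOF component (starting at 16 GiB in
# this layout) exists but carries no "init" flag, i.e. was never
# instantiated. ^init matches components WITHOUT the init flag.
lfs find /fir/users -type f \
    --component-start=17179869184 \
    --component-flags=^init \
    > /tmp/uninit-last-component.list
wc -l /tmp/uninit-last-component.list
```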
We have four ldiskfs MDTs, and I have examples of files like this on MDTs 0, 2 and 3. We don't have the ea_inode feature enabled, but our inode size is 1KB:
[root@fir-md1-s1 Seisbench_DATA]# dumpe2fs -h /dev/mapper/md1-rbod1-mdt0
dumpe2fs 1.47.0-wc2 (25-May-2023)
Filesystem volume name: fir-MDT0000
Last mounted on: /
Filesystem UUID: 2f44ac0b-e931-4a58-90a4-d4f1765176bb
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit mmp flex_bg dirdata large_dir sparse_super large_file huge_file uninit_bg dir_nlink quota project
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 3745217760
Block count: 4681213440
Reserved block count: 234060672
Free blocks: 3721821762
Free inodes: 3623118029
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 26216
Inode blocks per group: 6554
Flex block group size: 16
Filesystem created: Tue Dec 1 09:29:39 2020
Last mount time: Wed Jul 5 22:09:02 2023
Last write time: Wed Jul 5 22:09:02 2023
Mount count: 26
Maximum mount count: -1
Last checked: Tue Dec 1 09:29:39 2020
Check interval: 0 (<none>)
Lifetime writes: 35 TB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 1024
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b8d9b0f5-1004-482d-83a0-44b8305a24cd
Journal backup: inode blocks
MMP block number: 28487
MMP update interval: 5
User quota inode: 3
Group quota inode: 4
Project quota inode: 12
Journal features: journal_incompat_revoke journal_64bit
Total journal size: 4096M
Total journal blocks: 1048576
Max transaction length: 1048576
Fast commit length: 0
Journal sequence: 0x0e6dad3b
Journal start: 356385
MMP_block:
mmp_magic: 0x4d4d50
mmp_check_interval: 10
mmp_sequence: 0x3131f5
mmp_update_date: Mon Jan 8 11:02:45 2024
mmp_update_time: 1704740565
mmp_node_name: fir-md1-s1
mmp_device_name: dm-0
Under ldiskfs:
[root@fir-md1-s1 Seisbench_DATA]# pwd
/mnt/fir/ldiskfs/mdt/0/ROOT/users/[0x200000400:0x5:0x0]:0/anovosel/Seisbench_DATA
[root@fir-md1-s1 Seisbench_DATA]# stat stead_mem.csv
  File: ‘stead_mem.csv’
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: fd00h/64768d    Inode: 419466      Links: 1
Access: (0644/-rw-r--r--)  Uid: (419500/anovosel)   Gid: (18036/  beroza)
Access: 2023-10-12 17:49:40.000000000 -0700
Modify: 2023-10-11 15:45:05.000000000 -0700
Change: 2023-10-30 04:12:08.000000000 -0700
 Birth: -
[root@fir-md1-s1 Seisbench_DATA]# getfattr -m '.*' -d stead_mem.csv
# file: stead_mem.csv
trusted.link=0s3/HqEQEAAAA3AAAAAAAAAAAAAAAAAAAAAB8AAAACAAU7gAAAQQIAAAAAc3RlYWRfbWVtLmNzdg==
trusted.lma=0sAAAAAAAAAADBnAUAAgAAABcFAAAAAAAA
trusted.lov=0s0AvWC6ABAAAGAAAAAAADAAAAAAAAAAAAAAAAAAAAAAABAAAAEAAAAAAAAAAAAAAAAABAAAAAAACwAAAASAAAAAAAAABrAAAAAAAAAAAAAAACAAAAEAAAAAAAQAAAAAAAAAAAAAQAAAD4AAAAYAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAAAAAAAAAEAAAA//////////9YAQAASAAAAAAAAAAAAAAAAAAAAAAAAADQC9MLAQAAABcFAAAAAAAAwZwFAAIAAAAAAEAAAQAAAHNzZAAAAAAAAAAAAAAAAACmTOEDAAAAAAAAAAAAAAAAAAAAAH0AAADQC9MLAQAAABcFAAAAAAAAwZwFAAIAAAAAAEAAAgAAAGhkZAAQAP//AAAAAAAAAADJ+XgHAAAAAAAAAAAAAAAAAAAAAEoAAAAaNz8HAAAAAAAAAAAAAAAAAAAAAEsAAADQC9MLAQAAABcFAAAAAAAAwZwFAAIAAAAAAEAAEAD//2hkZAD/////IGiiAv////8AAAAAAAAAAAAAAAB1AAAAN8UpBv////8=
trusted.projid="419500"
trusted.som=0sBAAAAAAAAADUpL4WAAAAAGhfCwAAAAAA
Out of the tens of millions of files migrated this way over the last few months, I could find only a few hundred like this, so it's rare and appeared only recently with 2.15.3. We have to replace some old storage chassis, so we won't have much time to troubleshoot; let me know if you can think of anything I could try. My current workaround for this problem is to copy + unlink the affected files manually instead.
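For reference, the copy + unlink workaround can be scripted roughly as below. This is a sketch using plain POSIX tools, nothing Lustre-specific; it deliberately omits any `lfs setstripe` on the copy, so the new file simply inherits the directory's default layout. The function name is ours, not a standard tool.

```shell
# Sketch of the manual workaround: copy the file aside, verify the copy,
# then rename it over the original. The rename unlinks the old inode,
# which releases its problematic layout.
migrate_by_copy() {
    f="$1"
    tmp="$f.migrate.$$"
    cp -p "$f" "$tmp" || return 1
    # Belt and braces: only replace the original if the copy is identical.
    cmp -s "$f" "$tmp" || { rm -f "$tmp"; return 1; }
    mv "$tmp" "$f"
}
```

Note this is not atomic with respect to concurrent writers the way a group-locked `lfs migrate` is, so it is only safe on files that are no longer being written.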
Note: the hdd pool (last component) only has OSTs with max_create_count=0, but this PFL setting is very common and has worked on many other files.
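If it helps to rule the pool state in or out, the per-OST creation limits as seen from the MDS side can be dumped with `lctl get_param`; the wildcard pattern below is an assumption based on the usual osp device naming (`osp.<fsname>-OSTxxxx-osc-MDTxxxx`).

```shell
# On each MDS: show which OSTs the MDTs are allowed to allocate
# new objects on. max_create_count=0 means no new object creation
# on that OST (e.g. an OST being drained).
lctl get_param osp.fir-OST*.max_create_count
lctl get_param osp.fir-OST*.create_count
```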