  Lustre / LU-17403

lfs migrate: cannot get group lock: No space left on device


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: Lustre 2.15.3
    • Environment: CentOS 7.9 (3.10.0-1160.90.1.el7_lustre.pl1.x86_64)
    • Severity: 3

    Description

      Happy New Year

      We are seeing a new problem on our Fir filesystem (full 2.15.3) when lfs migrating some files. The symptom is ENOSPC when trying to lfs migrate, which makes me think of LU-12852. Here is an example:

      [root@fir-rbh03 ~]# lfs migrate -c 1 /fir/users/anovosel/Seisbench_DATA/stead_mem.csv
      lfs migrate: cannot get group lock: No space left on device (28)
      error: lfs migrate: /fir/users/anovosel/Seisbench_DATA/stead_mem.csv: cannot get group lock: No space left on device
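
      In case it helps to pin down where the ENOSPC is returned: as far as I understand, lfs migrate takes the group lock through an ioctl on the file, so tracing the command should show the failing call. A rough sketch, assuming strace is available on the client:

      # trace the ioctls issued by lfs migrate and look for the one that returns ENOSPC
      strace -f -e trace=ioctl lfs migrate -c 1 /fir/users/anovosel/Seisbench_DATA/stead_mem.csv 2>&1 | grep ENOSPC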
      

      These files are using PFL, and a common point between them is that the first and second components are initialized but NOT the last one. For example:

      [root@fir-rbh03 ~]# lfs getstripe /fir/users/anovosel/Seisbench_DATA/stead_mem.csv
      /fir/users/anovosel/Seisbench_DATA/stead_mem.csv
        lcm_layout_gen:    6
        lcm_mirror_count:  1
        lcm_entry_count:   3
          lcme_id:             1
          lcme_mirror_id:      0
          lcme_flags:          init
          lcme_extent.e_start: 0
          lcme_extent.e_end:   4194304
            lmm_stripe_count:  1
            lmm_stripe_size:   4194304
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 125
            lmm_pool:          ssd
            lmm_objects:
            - 0: { l_ost_idx: 125, l_fid: [0x1007d0000:0x3e14ca6:0x0] }
      
          lcme_id:             2
          lcme_mirror_id:      0
          lcme_flags:          init
          lcme_extent.e_start: 4194304
          lcme_extent.e_end:   17179869184
            lmm_stripe_count:  2
            lmm_stripe_size:   4194304
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 74
            lmm_pool:          hdd
            lmm_objects:
            - 0: { l_ost_idx: 74, l_fid: [0x1004a0000:0x778f9c9:0x0] }
            - 1: { l_ost_idx: 75, l_fid: [0x1004b0000:0x73f371a:0x0] }
      
          lcme_id:             3
          lcme_mirror_id:      0
          lcme_flags:          0
          lcme_extent.e_start: 17179869184
          lcme_extent.e_end:   EOF
            lmm_stripe_count:  16
            lmm_stripe_size:   4194304
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: -1
            lmm_pool:          hdd
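
      For reference, a layout like the one above would come from a PFL setting along these lines (a sketch reconstructed from the getstripe output; the actual directory default may differ, and <dir> is a placeholder):

      # first 4 MiB on the ssd pool, then 2 stripes on hdd up to 16 GiB, then 16 stripes on hdd to EOF
      lfs setstripe \
        -E 4M  -c 1  -S 4M -p ssd \
        -E 16G -c 2  -S 4M -p hdd \
        -E eof -c 16 -S 4M -p hdd \
        <dir>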
      

      We have four ldiskfs MDTs, and I have examples of files like that on MDTs 0, 2 and 3. We don't have the ea_inode feature set, but our inode size is 1KB:

      [root@fir-md1-s1 Seisbench_DATA]# dumpe2fs -h /dev/mapper/md1-rbod1-mdt0
      dumpe2fs 1.47.0-wc2 (25-May-2023)
      Filesystem volume name:   fir-MDT0000
      Last mounted on:          /
      Filesystem UUID:          2f44ac0b-e931-4a58-90a4-d4f1765176bb
      Filesystem magic number:  0xEF53
      Filesystem revision #:    1 (dynamic)
      Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit mmp flex_bg dirdata large_dir sparse_super large_file huge_file uninit_bg dir_nlink quota project
      Filesystem flags:         signed_directory_hash 
      Default mount options:    user_xattr acl
      Filesystem state:         clean
      Errors behavior:          Continue
      Filesystem OS type:       Linux
      Inode count:              3745217760
      Block count:              4681213440
      Reserved block count:     234060672
      Free blocks:              3721821762
      Free inodes:              3623118029
      First block:              0
      Block size:               4096
      Fragment size:            4096
      Group descriptor size:    64
      Blocks per group:         32768
      Fragments per group:      32768
      Inodes per group:         26216
      Inode blocks per group:   6554
      Flex block group size:    16
      Filesystem created:       Tue Dec  1 09:29:39 2020
      Last mount time:          Wed Jul  5 22:09:02 2023
      Last write time:          Wed Jul  5 22:09:02 2023
      Mount count:              26
      Maximum mount count:      -1
      Last checked:             Tue Dec  1 09:29:39 2020
      Check interval:           0 (<none>)
      Lifetime writes:          35 TB
      Reserved blocks uid:      0 (user root)
      Reserved blocks gid:      0 (group root)
      First inode:              11
      Inode size:	          1024
      Required extra isize:     32
      Desired extra isize:      32
      Journal inode:            8
      Default directory hash:   half_md4
      Directory Hash Seed:      b8d9b0f5-1004-482d-83a0-44b8305a24cd
      Journal backup:           inode blocks
      MMP block number:         28487
      MMP update interval:      5
      User quota inode:         3
      Group quota inode:        4
      Project quota inode:      12
      Journal features:         journal_incompat_revoke journal_64bit
      Total journal size:       4096M
      Total journal blocks:     1048576
      Max transaction length:   1048576
      Fast commit length:       0
      Journal sequence:         0x0e6dad3b
      Journal start:            356385
      MMP_block:
          mmp_magic: 0x4d4d50
          mmp_check_interval: 10
          mmp_sequence: 0x3131f5
          mmp_update_date: Mon Jan  8 11:02:45 2024
          mmp_update_time: 1704740565
          mmp_node_name: fir-md1-s1
          mmp_device_name: dm-0
      

      Under ldiskfs:

      [root@fir-md1-s1 Seisbench_DATA]# pwd
      /mnt/fir/ldiskfs/mdt/0/ROOT/users/[0x200000400:0x5:0x0]:0/anovosel/Seisbench_DATA
      
      [root@fir-md1-s1 Seisbench_DATA]# stat stead_mem.csv
        File: ‘stead_mem.csv’
        Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
      Device: fd00h/64768d	Inode: 419466      Links: 1
      Access: (0644/-rw-r--r--)  Uid: (419500/anovosel)   Gid: (18036/  beroza)
      Access: 2023-10-12 17:49:40.000000000 -0700
      Modify: 2023-10-11 15:45:05.000000000 -0700
      Change: 2023-10-30 04:12:08.000000000 -0700
       Birth: -
      
      [root@fir-md1-s1 Seisbench_DATA]# getfattr -m '.*' -d stead_mem.csv
      # file: stead_mem.csv
      trusted.link=0s3/HqEQEAAAA3AAAAAAAAAAAAAAAAAAAAAB8AAAACAAU7gAAAQQIAAAAAc3RlYWRfbWVtLmNzdg==
      trusted.lma=0sAAAAAAAAAADBnAUAAgAAABcFAAAAAAAA
      trusted.lov=0s0AvWC6ABAAAGAAAAAAADAAAAAAAAAAAAAAAAAAAAAAABAAAAEAAAAAAAAAAAAAAAAABAAAAAAACwAAAASAAAAAAAAABrAAAAAAAAAAAAAAACAAAAEAAAAAAAQAAAAAAAAAAAAAQAAAD4AAAAYAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAAAAAAAAAEAAAA//////////9YAQAASAAAAAAAAAAAAAAAAAAAAAAAAADQC9MLAQAAABcFAAAAAAAAwZwFAAIAAAAAAEAAAQAAAHNzZAAAAAAAAAAAAAAAAACmTOEDAAAAAAAAAAAAAAAAAAAAAH0AAADQC9MLAQAAABcFAAAAAAAAwZwFAAIAAAAAAEAAAgAAAGhkZAAQAP//AAAAAAAAAADJ+XgHAAAAAAAAAAAAAAAAAAAAAEoAAAAaNz8HAAAAAAAAAAAAAAAAAAAAAEsAAADQC9MLAQAAABcFAAAAAAAAwZwFAAIAAAAAAEAAEAD//2hkZAD/////IGiiAv////8AAAAAAAAAAAAAAAB1AAAAN8UpBv////8=
      trusted.projid="419500"
      trusted.som=0sBAAAAAAAAADUpL4WAAAAAGhfCwAAAAAA
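
      Since the inode size is 1KB and ea_inode is not enabled, it may be worth checking how much xattr space this file's layout actually takes on the MDT. A possible check with debugfs (inode 419466 and the device path are taken from the stat and dumpe2fs output above):

      # read-only debugfs; the extended attributes section of the output lists each EA and its size
      debugfs -c -R 'stat <419466>' /dev/mapper/md1-rbod1-mdt0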
      

       
      Out of the tens of millions of files migrated like that in the last months, I could find only a few hundred like this, so it's rare and appeared only recently with 2.15.3. We have to replace old storage chassis and won't have much time to troubleshoot, so let me know if you think of anything I could try. My current workaround for this problem is to make a copy and unlink the files manually instead.
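
      For the record, the manual copy + unlink workaround looks roughly like this (a sketch: unlike lfs migrate it is not atomic, the FID changes, and it is only safe while nobody has the file open; the .migrate.tmp suffix is arbitrary):

      F=/fir/users/anovosel/Seisbench_DATA/stead_mem.csv
      lfs setstripe -c 1 "$F.migrate.tmp"   # create the replacement file with the target layout
      cp -p "$F" "$F.migrate.tmp"           # copy the data, no group lock involved
      mv "$F.migrate.tmp" "$F"              # rename over the original, which unlinks it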

      Note: the hdd pool (last component) only has OSTs with max_create_count=0, but this PFL setting is very common and has worked on many other files.
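
      (For completeness, the per-OST creation setting can be checked on the MDS nodes with something like the following, assuming the usual osp parameter names:)

      # on each MDS: show max_create_count for every OST as seen by that MDT
      lctl get_param "osp.fir-OST*.max_create_count"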


            People

              Assignee: WC Triage (wc-triage)
              Reporter: Stephane Thiell (sthiell)
              Votes: 0
              Watchers: 2