LU-19652: lfs migrate: cannot swap layout: Stale file handle


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.6
    • Labels: None
    • Environment: EL9.5, 2.15.6+ clients; EL7.9 2.15.4+ servers
    • Severity: 3

    Description

      When running many lfs find and lfs migrate commands in parallel against different files (unless there is a defect in our tool, no duplicate lfs migrate should be running against the same file), I'm seeing a significant number of unexpected failures from lfs migrate with ESTALE. Re-running the same lfs migrate command later works just fine.
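
      For context, our driver is roughly equivalent to the following (a hypothetical, simplified sketch; the real tool builds disjoint file lists, and the lfs find criteria and concurrency level shown here are placeholders):

      # Sketch only: distinct files from lfs find, several lfs migrate workers in parallel
      lfs find /oak/... -type f --print0 |
          xargs -0 -P 8 -n 1 lfs migrate -D -n --copy=/oak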

      Example of one error:

      failed-oak-h05v17.log:'/oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc'
      
      # RC=116
      # FID [0x2f80002c8e:0x1fd36:0x0]
      # parent MDT=2
      
      error: lfs migrate: /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc: cannot swap layout: Stale file handle 
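
      For reference, rc=116 on Linux is ESTALE, which matches the "Stale file handle" message above:

      $ python3 -c 'import errno, os; print(errno.errorcode[116], os.strerror(116))'
      ESTALE Stale file handle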

       

      On the client (oak-h05v17), the Lustre debug log shows a single hit for this FID:

      00000080:00020000:5.0:1764639414.857890:0:2439796:0:(file.c:241:ll_close_inode_openhandle()) oak-clilmv-ffff9b58dedc7000: inode [0x2f80002c8e:0x1fd36:0x0] mdc close failed: rc = -116 

      There is no hit on the MDT side, though (without enabling additional debug levels).
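
      If it helps, I can reproduce with more verbose debugging enabled on the MDS serving MDT0002 and grep for the FID there, along these lines (a sketch; the exact debug mask to use is a guess on my part):

      # On the MDS serving MDT0002 (sketch; adjust the debug mask as needed)
      lctl set_param debug="+rpctrace +dlmtrace +vfstrace"
      lctl clear                        # empty the debug buffer
      # ... reproduce the parallel lfs migrate failures ...
      lctl dk /tmp/mds_debug.log        # dump the debug buffer to a file
      grep '0x2f80002c8e:0x1fd36' /tmp/mds_debug.log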

       

      Launching the lfs migrate command manually succeeds:

      # lfs getstripe /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
      /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 254
      	obdidx		 objid		 objid		 group
      	   254	       1492854	     0x16c776	  0x4280000402
      
      
      # lfs migrate -D -n --copy=/oak /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
      
      # lfs getstripe /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
      /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
        lcm_layout_gen:    3
        lcm_mirror_count:  1
        lcm_entry_count:   2
          lcme_id:             1
          lcme_mirror_id:      0
          lcme_flags:          init
          lcme_extent.e_start: 0
          lcme_extent.e_end:   549755813888
            lmm_stripe_count:  1
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 515
            lmm_objects:
            - 0: { l_ost_idx: 515, l_fid: [0x8440000400:0xe47f93:0x0] }
      
      
          lcme_id:             2
          lcme_mirror_id:      0
          lcme_flags:          0
          lcme_extent.e_start: 549755813888
          lcme_extent.e_end:   EOF
            lmm_stripe_count:  32
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: -1 
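
      Since a later retry succeeds, a simple retry around lfs migrate would likely work around this on our side. Rough sketch (hypothetical, not our actual tool):

      # Sketch of a retry wrapper: re-attempt a failed lfs migrate a few times
      migrate_with_retry() {
          local file=$1 attempt
          for attempt in 1 2 3; do
              lfs migrate -D -n --copy=/oak "$file" && return 0
              echo "migrate of $file failed, retry $attempt" >&2
              sleep 5
          done
          return 1
      }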

      Is it possible that there is a race condition when many lfs migrate commands run in parallel, invalidating some cache? Could the -116 reported by ll_close_inode_openhandle() be a clue?


          People

            Assignee: WC Triage
            Reporter: Stephane Thiell
            Votes: 0
            Watchers: 3
