Details
- Type: Bug
- Resolution: Unresolved
- Priority: Medium
- Fix Version/s: None
- Affects Version/s: Lustre 2.15.6
- Environment: EL9.5, 2.15.6+ clients; EL7.9 2.15.4+ servers
- Severity: 3
Description
When running many lfs find and lfs migrate commands in parallel against different files (barring a defect in our tool, no duplicate lfs migrate of the same file should be occurring), I'm seeing a significant number of unexpected lfs migrate failures with ESTALE. Re-running the same lfs migrate command later works just fine.
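For context, the parallel pattern is roughly what the sketch below reproduces: one lfs find feeding many concurrent non-blocking lfs migrate processes, each on a distinct file. This is illustrative only, not our actual tool; the pool name and the xargs -P width are made up:

lfs find /oak -type f --pool old_pool -0 | xargs -0 -P 16 -n 1 lfs migrate -n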
Example of one error:
failed-oak-h05v17.log:'/oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc' # RC=116 # FID [0x2f80002c8e:0x1fd36:0x0] # parent MDT=2 error: lfs migrate: /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc: cannot swap layout: Stale file handle
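For what it's worth, the FID from the error can be mapped back to a path on the client with lfs fid2path, to confirm it still resolves to the same file (mount point and FID taken from the report above):

# lfs fid2path /oak '[0x2f80002c8e:0x1fd36:0x0]'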
On the client (oak-h05v17), the Lustre debug log shows a single hit for this FID:
00000080:00020000:5.0:1764639414.857890:0:2439796:0:(file.c:241:ll_close_inode_openhandle()) oak-clilmv-ffff9b58dedc7000: inode [0x2f80002c8e:0x1fd36:0x0] mdc close failed: rc = -116
No corresponding hit in the MDT logs, though (without extra debug levels enabled).
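If it would help, we can widen the debug mask on the MDS before the next reproduction and dump the buffer afterwards; a minimal sketch, where the buffer size and mask choices are just examples:

# lctl set_param debug_mb=1024
# lctl set_param debug="+vfstrace +rpctrace +dlmtrace"
... reproduce the parallel migrations ...
# lctl dk /tmp/lustre-debug-$(hostname).log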
Running the same lfs migrate command manually afterwards succeeds:
# lfs getstripe /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
/oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 254
    obdidx     objid      objid       group
       254   1492854   0x16c776   0x4280000402
# lfs migrate -D -n --copy=/oak /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
# lfs getstripe /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
/oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
  lcm_layout_gen:    3
  lcm_mirror_count:  1
  lcm_entry_count:   2
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   549755813888
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 515
      lmm_objects:
      - 0: { l_ost_idx: 515, l_fid: [0x8440000400:0xe47f93:0x0] }

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 549755813888
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  32
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
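The migrated file now has a two-component composite layout, which can be cross-checked per component with standard lfs getstripe options:

# lfs getstripe --component-count /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
# lfs getstripe -I2 /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc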
Is it possible that there is a race condition when many lfs migrate commands run in parallel, invalidating some cache? Could the rc = -116 from ll_close_inode_openhandle() be a clue?
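In the meantime, our tool could simply retry on ESTALE; a minimal sketch, assuming lfs migrate propagates the errno as its exit status (as the RC=116 captured in our logs suggests), with arbitrary retry count and delay:

migrate_with_retry() {
    # Hypothetical helper: retry only on ESTALE (116), since re-running
    # the same command a little later succeeds in our experience.
    local f=$1 rc try
    for try in 1 2 3; do
        lfs migrate -n "$f" && return 0
        rc=$?
        [ "$rc" -ne 116 ] && return "$rc"
        sleep 1
    done
    return "$rc"
}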