Details
- Type: Bug
- Resolution: Unresolved
- Priority: Medium
- Fix Version/s: None
- Affects Version/s: Lustre 2.15.6
- Environment: EL9.5, 2.15.6+ clients; EL7.9 2.15.4+ servers
- Severity: 3
Description
When running many lfs find and lfs migrate commands in parallel against different files (barring a defect in our tool, no duplicate lfs migrate of the same file should be occurring), I'm seeing a significant number of unexpected lfs migrate failures with ESTALE. Re-running the same lfs migrate command later works just fine.
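For context, the parallel pattern is roughly what the sketch below reproduces: one lfs find feeding many concurrent non-blocking lfs migrate processes, each on a distinct file. This is illustrative only, not our actual tool; the pool name and the xargs -P width are made up:

lfs find /oak -type f --pool old_pool -0 | xargs -0 -P 16 -n 1 lfs migrate -n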
Example of one error:
failed-oak-h05v17.log:'/oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc' # RC=116 # FID [0x2f80002c8e:0x1fd36:0x0] # parent MDT=2 error: lfs migrate: /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc: cannot swap layout: Stale file handle
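For what it's worth, the FID from the error can be mapped back to a path on the client with lfs fid2path, to confirm it still resolves to the same file (mount point and FID taken from the report above):

# lfs fid2path /oak '[0x2f80002c8e:0x1fd36:0x0]'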
On the client (oak-h05v17), the Lustre debug log shows a single hit for this FID:
00000080:00020000:5.0:1764639414.857890:0:2439796:0:(file.c:241:ll_close_inode_openhandle()) oak-clilmv-ffff9b58dedc7000: inode [0x2f80002c8e:0x1fd36:0x0] mdc close failed: rc = -116
No corresponding hit in the MDT logs, though (without extra debug levels enabled).
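If it would help, we can widen the debug mask on the MDS before the next reproduction and dump the buffer afterwards; a minimal sketch, where the buffer size and mask choices are just examples:

# lctl set_param debug_mb=1024
# lctl set_param debug="+vfstrace +rpctrace +dlmtrace"
... reproduce the parallel migrations ...
# lctl dk /tmp/lustre-debug-$(hostname).log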
Running the same lfs migrate command manually afterwards succeeds:
# lfs getstripe /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
/oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 254
    obdidx     objid      objid       group
       254   1492854   0x16c776   0x4280000402
# lfs migrate -D -n --copy=/oak /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
# lfs getstripe /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
/oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
  lcm_layout_gen:    3
  lcm_mirror_count:  1
  lcm_entry_count:   2
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   549755813888
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 515
      lmm_objects:
      - 0: { l_ost_idx: 515, l_fid: [0x8440000400:0xe47f93:0x0] }

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 549755813888
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  32
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
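The migrated file now has a two-component composite layout, which can be cross-checked per component with standard lfs getstripe options:

# lfs getstripe --component-count /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc
# lfs getstripe -I2 /oak/.../Lib5_CKDL200148350-1a-5_H7HGYCCX2_L8_1.nodup.pr2_x_ctl_for_rep2.300K.bfilt.num_peak.qc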
Is it possible that there is a race condition when many lfs migrate commands run in parallel, invalidating some cache? Could the rc = -116 from ll_close_inode_openhandle() be a clue?
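In the meantime, our tool could simply retry on ESTALE; a minimal sketch, assuming lfs migrate propagates the errno as its exit status (as the RC=116 captured in our logs suggests), with arbitrary retry count and delay:

migrate_with_retry() {
    # Hypothetical helper: retry only on ESTALE (116), since re-running
    # the same command a little later succeeds in our experience.
    local f=$1 rc try
    for try in 1 2 3; do
        lfs migrate -n "$f" && return 0
        rc=$?
        [ "$rc" -ne 116 ] && return "$rc"
        sleep 1
    done
    return "$rc"
}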