Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.5.3, Lustre 2.8.0
-
None
-
OLCF Atlas production system: clients running 2.8.0+ (with patches), server running 2.5.5+ (with patches)
-
3
-
9223372036854775807
Description
On the atlas2 file system, there is a particular directory in which any operation, such as "ls" or "stat", hangs the process completely. The client side reports no OS error or Lustre error. On the server side, we observed OI scrub messages a few times, which may suggest an MDS data inconsistency that the scrub is trying, without success, to repair. We cannot correlate the two yet.
The ops team collected traces on the client side as follows:
mount -t lustre 10.36.226.77@o2ib:/atlas2 /lustre/atlas2 -o rw,flock,nosuid,nodev
lctl set_param osc/*/checksums 0
echo "all" > /proc/sys/lnet/debug
echo "1024" > /proc/sys/lnet/debug_mb
Step1: lctl dk > /dev/null
Step2: cd /lustre/atlas2/path/to/offending_directory/
Step3: ls
Step4: Wait 30 seconds
Step5: lctl dk > atlas2-mds3_ls_for_fprof.out
The log is attached.
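For anyone repeating the capture, the steps above could be wrapped in a small script along the following lines. This is only a sketch, not what the ops team actually ran: the `run` and `collect_trace` helpers and the `DRY_RUN`, `TARGET_DIR`, `OUT_FILE`, and `WAIT_SECS` variables are hypothetical names introduced here. It assumes the file system is already mounted, and that backgrounding the "ls" is acceptable so a hung process does not block the final `lctl dk`. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the trace-collection steps from this ticket.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to
# execute them on a real Lustre client as root.
DRY_RUN="${DRY_RUN:-1}"
TARGET_DIR="${TARGET_DIR:-/lustre/atlas2/path/to/offending_directory/}"
OUT_FILE="${OUT_FILE:-atlas2-mds3_ls_for_fprof.out}"
WAIT_SECS="${WAIT_SECS:-30}"

# Print the command in dry-run mode, otherwise hand it to a shell.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        sh -c "$*"
    fi
}

collect_trace() {
    # Debug settings, as listed in the ticket.
    run "lctl set_param osc/*/checksums 0"
    run "echo all > /proc/sys/lnet/debug"
    run "echo 1024 > /proc/sys/lnet/debug_mb"
    # Step 1: discard whatever is already in the debug buffer.
    run "lctl dk > /dev/null"
    # Steps 2-3: trigger the hang in the background (an assumption of this
    # sketch, so the script itself can continue to the dump).
    run "(cd '$TARGET_DIR' && ls) > /dev/null 2>&1 &"
    # Step 4: give the client time to log the stuck request.
    run "sleep $WAIT_SECS"
    # Step 5: dump the debug buffer to a file.
    run "lctl dk > '$OUT_FILE'"
}

collect_trace
```

Running it unchanged prints the seven commands in order, which allows a review before setting `DRY_RUN=0`.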
Attachments
Issue Links
- Clones
-
LU-8696 "ls" hangs on a particular directory on production system
-
- Resolved
-
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30903/
Subject: LU-10237 mdc: interruptable during RPC retry for EINPROGRESS
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 822d5ce80dd357b53c0414cc299fadef0db076d1