[LU-10237] "ls" hangs on a particular directory Created: 13/Nov/17  Updated: 09/Feb/18  Resolved: 14/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3, Lustre 2.8.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Critical
Reporter: Andreas Dilger Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

OLCF Atlas production system: clients running 2.8.0+ (with patches), server running 2.5.5+ (with patches)


Issue Links:
Cloners
Clones LU-8696 "ls" hangs on a particular directory ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On the atlas2 file system, we have a particular directory on which any operation such as "ls" or "stat" hangs the process completely. This produces no OS error or Lustre error on the client side. On the server side, we observed OI scrub messages a few times, which may suggest an MDS data inconsistency that the scrub is trying to fix, but to no avail. We cannot correlate the two yet.

Ops teams have collected traces on the client side by:

mount -t lustre 10.36.226.77@o2ib:/atlas2 /lustre/atlas2 -o rw,flock,nosuid,nodev
lctl set_param osc/*/checksums 0
echo "all" > /proc/sys/lnet/debug
echo "1024" > /proc/sys/lnet/debug_mb

Step1: lctl dk > /dev/null
Step2: cd /lustre/atlas2/path/to/offending_directory/
Step3: ls
Step4: Wait 30 seconds
Step5: lctl dk > atlas2-mds3_ls_for_fprof.out
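
The collection steps above could be wrapped in a small script so they run in a fixed order. This is only a sketch, not what the ops team actually ran. The helper names (run, collect_trace) and the variables DRY_RUN, WAIT_SECS, TARGET_DIR and OUT_FILE are illustrative assumptions, the directory path is the same elided placeholder used in the report, and the file system is assumed to be mounted already. Because "ls" is expected to hang, the sketch starts it in the background before waiting.

```shell
#!/bin/sh
# Sketch only: wraps the manual trace-collection steps from this report.
# TARGET_DIR keeps the elided path from the report as a placeholder.
# DRY_RUN, WAIT_SECS and the helper names are illustrative assumptions.
TARGET_DIR=${TARGET_DIR:-/lustre/atlas2/path/to/offending_directory}
OUT_FILE=${OUT_FILE:-atlas2-mds3_ls_for_fprof.out}
WAIT_SECS=${WAIT_SECS:-30}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it via sh.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

collect_trace() {
    # Setup from the report: checksums off, full debug mask, 1024 MB buffer.
    run "lctl set_param osc/*/checksums 0"
    run "echo all > /proc/sys/lnet/debug"
    run "echo 1024 > /proc/sys/lnet/debug_mb"
    # Step1: clear the kernel debug buffer.
    run "lctl dk > /dev/null"
    # Step2 and Step3: run ls in the background, since it is expected to hang.
    run "(cd '$TARGET_DIR' && ls) > /dev/null 2>&1 &"
    # Step4: give the hang time to show up in the debug log.
    run "sleep $WAIT_SECS"
    # Step5: dump the debug buffer to the output file.
    run "lctl dk > '$OUT_FILE'"
}
```

Calling collect_trace with the default DRY_RUN=1 only prints the commands for review. Setting DRY_RUN=0 and running as root on a client would execute them.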

The log is attached.



 Comments   
Comment by Andreas Dilger [ 13/Nov/17 ]

While LU-8696 fixed the actual problem of the MDT inconsistency, it would also be useful to fix the client-side handling of this error, so that the userspace process can be interrupted if there is a problem.
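
One way to check from userspace whether a stuck command can be interrupted is to run it under a time limit and see which signal ends it. The sketch below is an assumption-laden illustration, not part of the fix. It relies on GNU coreutils timeout (its -s and -k options and its documented exit statuses 124 and 137), and the function name check_interruptible is hypothetical. If the process cannot be killed at all, timeout itself may not return.

```shell
#!/bin/sh
# Sketch only: classify how a command reacts to a time limit.
# Usage: check_interruptible LIMIT GRACE CMD [ARGS...]
#   LIMIT  seconds before SIGTERM is sent
#   GRACE  further seconds before SIGKILL is sent
check_interruptible() {
    limit=$1
    grace=$2
    shift 2
    # GNU timeout: send TERM after LIMIT, then KILL after GRACE more seconds.
    timeout -k "$grace" -s TERM "$limit" "$@" > /dev/null 2>&1
    rc=$?
    case $rc in
        124) echo "timed out, exited on SIGTERM" ;;
        137) echo "timed out, needed SIGKILL" ;;
        *)   echo "finished on its own (status $rc)" ;;
    esac
}
```

For example, check_interruptible 30 10 ls /lustre/atlas2/path/to/offending_directory/ would report whether the "ls" ended on SIGTERM, needed SIGKILL, or completed normally.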

Comment by Gerrit Updater [ 19/Nov/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30166
Subject: LU-10237 mdc: interruptable during RPC retry for EINPROGRESS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0a84ebd38c71747b44bad7a1c00ee39f4b7ff759

Comment by Gerrit Updater [ 14/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30166/
Subject: LU-10237 mdc: interruptable during RPC retry for EINPROGRESS
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9c596a4996ee242aa1b954f5f2f19101d3941bf0

Comment by Peter Jones [ 14/Jan/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 17/Jan/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30903
Subject: LU-10237 mdc: interruptable during RPC retry for EINPROGRESS
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 588cb51b4a26cd07c036ee68451bc151e7eb73bd

Comment by Gerrit Updater [ 09/Feb/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30903/
Subject: LU-10237 mdc: interruptable during RPC retry for EINPROGRESS
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 822d5ce80dd357b53c0414cc299fadef0db076d1

Generated at Sat Feb 10 02:33:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.