[LU-8696] "ls" hangs on a particular directory on production system - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.5.3, Lustre 2.8.0
Labels:
None
Environment:
OLCF Atlas production system: clients running 2.8.0+ (with patches), server running 2.5.5+ (with patches)

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

On the atlas2 file system there is a particular directory on which any operation, such as "ls" or "stat", hangs the process completely. No OS error or Lustre error is reported on the client side. On the server side, we observed OI scrub messages a few times, which may suggest some MDS data inconsistency that the scrub is trying, without success, to repair. We have not yet been able to correlate the two.

Ops teams have collected traces on the client side by:

mount -t lustre 10.36.226.77@o2ib:/atlas2 /lustre/atlas2 -o rw,flock,nosuid,nodev
lctl set_param osc/*/checksums 0
echo "all" > /proc/sys/lnet/debug
echo "1024" > /proc/sys/lnet/debug_mb

Step1: lctl dk > /dev/null
Step2: cd /lustre/atlas2/path/to/offending_directory/
Step3: ls
Step4: Wait 30 seconds
Step5: lctl dk > atlas2-mds3_ls_for_fprof.out

The log is attached.
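The numbered steps can be wrapped in a small script so that a hung "ls" does not block the collection itself. This is only a sketch: the function name `collect_ls_trace` and the `LCTL` variable are illustrative, it assumes every step runs on the same client node, and it assumes the debug mask and buffer size have already been set as shown above.

```shell
#!/bin/sh
# Sketch of the trace collection steps; names here are illustrative only.
# LCTL can be overridden, e.g. to point at a specific lctl binary.
LCTL="${LCTL:-lctl}"

collect_ls_trace() {
    dir="$1"               # directory whose listing hangs
    out="$2"               # file that receives the debug log
    wait_secs="${3:-30}"   # how long to let the hung "ls" run

    # Step 1: discard whatever is already in the kernel debug buffer.
    "$LCTL" dk > /dev/null || return 1

    # Steps 2-3: run "ls" in the background so a hang cannot block this script.
    ( cd "$dir" && ls ) > /dev/null 2>&1 &

    # Step 4: give the operation time to generate debug messages.
    sleep "$wait_secs"

    # Step 5: dump the debug buffer to the output file.
    "$LCTL" dk > "$out"
}

# Example invocation (paths taken from the report):
# collect_ls_trace /lustre/atlas2/path/to/offending_directory/ atlas2-mds3_ls_for_fprof.out 30
```

The background "ls" is left running after the function returns. If the process is stuck inside the kernel it may not respond to signals, so the sketch does not try to kill it.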

Attachments

atlas2-mds3_ls_for_fprof.out.gz
2.41 MB
12/Oct/16 3:59 PM

Issue Links

is cloned by

LU-10237 "ls" hangs on a particular directory

Resolved

Activity

James A Simmons added a comment - 09/Mar/17 6:19 PM

lfsck run seems to have fixed this.

James A Simmons added a comment - 28/Feb/17 4:55 PM

Since this is the case we will leave this open until the lfsck run.

nasf (Inactive) added a comment - 28/Feb/17 4:13 PM - edited

Sorry for the confusion: "lctl set_param fail_loc=0x1505" is used to bypass FID-in-dirent when it is broken. So if your system hung before but works well with "lctl set_param fail_loc=0x1505", it is quite possible that some FID-in-dirent is broken. In that case, you need to run namespace LFSCK (with "lctl set_param fail_loc=0") to repair the FID-in-dirent. Otherwise, bypassing FID-in-dirent will slow down lookup() performance.
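The sequence described in this comment might look roughly like the following on the MDS. It is a sketch, not a verified procedure: the function names and the `LCTL` variable are illustrative, the MDT target name in the usage comment is a hypothetical placeholder, and the `lfsck_start` options should be checked against the Lustre version actually installed.

```shell
#!/bin/sh
# Sketch only; run on the MDS. Function names are illustrative.
LCTL="${LCTL:-lctl}"

# Diagnostic step: bypass FID-in-dirent so lookups no longer depend on it.
# Lookups are slower while this is set.
bypass_fid_in_dirent() {
    "$LCTL" set_param fail_loc=0x1505
}

# Repair step: clear the bypass first, then start a namespace LFSCK
# on the given MDT so that the broken FID-in-dirent entries are rebuilt.
repair_fid_in_dirent() {
    mdt="$1"   # MDT target name; the caller must supply the real one
    "$LCTL" set_param fail_loc=0 || return 1
    "$LCTL" lfsck_start -M "$mdt" -t namespace
}

# Example (hypothetical target name):
# bypass_fid_in_dirent                  # then retry "ls" from a client
# repair_fid_in_dirent atlas2-MDT0000   # if "ls" worked with the bypass
```

Keeping the two steps as separate functions mirrors the comment: the bypass only confirms the diagnosis, and the LFSCK run is what removes the need for it.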

James A Simmons added a comment - 28/Feb/17 3:48 PM

The command "lctl set_param fail_loc=0x1505" was run on the MDS and it fixed the problem. Thanks nasf

Dustin Leverman added a comment - 24/Jan/17 3:42 PM

Sorry for the delays, Nasf! We will have an outage on Feb. 07 to test lustre-2.8 servers and will hopefully leave it in production. After this outage we can run an online lfsck to see if this problem gets resolved.

nasf (Inactive) added a comment - 24/Jan/17 2:57 AM

Ping.

nasf (Inactive) added a comment - 26/Nov/16 8:21 AM

Dustin, do you have more logs or any feedback about trying "lctl set_param fail_loc=0x1505" on the MDS? Thanks!

nasf (Inactive) added a comment - 27/Oct/16 2:09 AM

Another possible reason is that the FID-in-dirent is corrupted, which would explain why the OI scrub was triggered but no inconsistent OI mapping was found. This can be verified by running "lctl set_param fail_loc=0x1505" on the MDS and then trying "ls" again. If it still hangs, this is NOT the cause; otherwise, we have found the reason.

nasf (Inactive) added a comment - 25/Oct/16 2:39 PM

It is strange that the OI scrub has not found any inconsistency. It is probably some OI scrub issue.
Do you have the MDS-side -1 level Lustre kernel debug logs from when the "ls" hung? Also, would you please use "debugfs" to dump the directory, and its sub-items, that caused the hang on "ls"? Thanks!
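One way the requested "debugfs" dump could be taken is sketched below. It assumes an ldiskfs-backed MDT, where the client-visible namespace sits under /ROOT on the backing device. The function name `dump_mdt_dir`, the `DEBUGFS` variable, and the device path in the usage comment are illustrative and not taken from the ticket.

```shell
#!/bin/sh
# Sketch only; run on the MDS that holds the affected directory.
DEBUGFS="${DEBUGFS:-debugfs}"

dump_mdt_dir() {
    dev="$1"    # block device backing the MDT (site-specific)
    path="$2"   # directory path relative to the Lustre mount point

    # -c opens the device in catastrophic mode, which is read-only,
    # and -R runs a single debugfs request and exits.
    "$DEBUGFS" -c -R "ls -l /ROOT/$path" "$dev"
}

# Example (hypothetical device path):
# dump_mdt_dir /dev/mapper/mdt_device path/to/offending_directory > dir_dump.txt
```

The same wrapper could issue a "stat" request instead of "ls -l" for the directory inode or individual entries if more detail is needed.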

People

Assignee: nasf (Inactive)

Reporter: Feiyi Wang

Votes: 0

Watchers: 10

Dates

Created: 12/Oct/16 3:58 PM

Updated: 13/Nov/17 11:08 PM

Resolved: 09/Mar/17 6:19 PM