
"ls" hangs on a particular directory on production system

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.5.3, Lustre 2.8.0
    • Environment: OLCF Atlas production system: clients running 2.8.0+ (with patches), servers running 2.5.5+ (with patches)
    • Severity: 3

    Description

      On the atlas2 file system, we have a particular directory where any operation such as "ls" or "stat" completely hangs the process. No OS error or Lustre error is reported on the client side. On the server side, we did observe OI scrub messages a few times, which may suggest some MDS data inconsistency that the scrub is "trying" to fix, but to no avail. We can't correlate the two yet.

      The ops team collected traces on the client side as follows:

      mount -t lustre 10.36.226.77@o2ib:/atlas2 /lustre/atlas2 -o rw,flock,nosuid,nodev
      lctl set_param osc/*/checksums 0
      echo "all" > /proc/sys/lnet/debug
      echo "1024" > /proc/sys/lnet/debug_mb

      Step 1: lctl dk > /dev/null
      Step 2: cd /lustre/atlas2/path/to/offending_directory/
      Step 3: ls
      Step 4: Wait 30 seconds
      Step 5: lctl dk > atlas2-mds3_ls_for_fprof.out

      The resulting log is attached.
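
      For reference, a consolidated sketch of the same capture in step order; the only addition is running "ls" in the background so the shell can continue past the hang (host, mount options, and paths are the ones given above, including the placeholder directory path):

      # Sketch: client-side Lustre debug capture around the hanging "ls"
      mount -t lustre 10.36.226.77@o2ib:/atlas2 /lustre/atlas2 -o rw,flock,nosuid,nodev
      lctl set_param osc/*/checksums 0          # disable data checksums, as above
      echo "all" > /proc/sys/lnet/debug         # enable all debug flags
      echo "1024" > /proc/sys/lnet/debug_mb     # 1024 MB debug buffer
      lctl dk > /dev/null                       # flush old debug messages
      cd /lustre/atlas2/path/to/offending_directory/
      ls &                                      # backgrounded because it hangs
      sleep 30                                  # give the hang time to show in the log
      lctl dk > atlas2-mds3_ls_for_fprof.out    # dump the client debug trace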

          Activity

            simmonsja James A Simmons added a comment - lfsck run seems to have fixed this.

            simmonsja James A Simmons added a comment - Since this is the case we will leave this open until the lfsck run.
            yong.fan nasf (Inactive) added a comment (edited) - Sorry, that was misleading: "lctl set_param fail_loc=0x1505" is used to bypass the FID-in-dirent when it is broken. So if your system hung before but works well with "lctl set_param fail_loc=0x1505", it is quite possible that some FID-in-dirent is broken. In that case, you need to run the namespace LFSCK (with "lctl set_param fail_loc=0") to repair the FID-in-dirent. Otherwise, bypassing the FID-in-dirent will slow down lookup() performance.
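
            A minimal sketch of that repair sequence, run on the MDS; the MDT name atlas2-MDT0000 is taken from the oi_scrub output further down in this ticket, and the commands assume a server new enough to support namespace LFSCK (e.g. the 2.8 servers discussed elsewhere here):

            lctl set_param fail_loc=0                              # stop bypassing FID-in-dirent
            lctl lfsck_start -M atlas2-MDT0000 -t namespace        # start the namespace LFSCK
            lctl get_param -n mdd.atlas2-MDT0000.lfsck_namespace   # watch progress / final status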

            simmonsja James A Simmons added a comment - The command "lctl set_param fail_loc=0x1505" was run on the MDS and it fixed the problem. Thanks nasf

            dustb100 Dustin Leverman added a comment - Sorry for the delays, Nasf! We will be having an outage on Feb. 07 to test lustre-2.8 servers and will hopefully leave it in production. After this outage we can run an online lfsck to see if that resolves the problem.

            yong.fan nasf (Inactive) added a comment - Ping.

            yong.fan nasf (Inactive) added a comment - Dustin, do you have more logs or any feedback about trying "lctl set_param fail_loc=0x1505" on the MDS? Thanks!

            yong.fan nasf (Inactive) added a comment - Another possible reason is that the FID-in-dirent is corrupted, which would explain why the OI scrub was triggered but no inconsistent OI mapping was found. It can be verified via "lctl set_param fail_loc=0x1505" on the MDS and trying "ls" again after the setting. If it still hangs, then this is NOT the case; otherwise, we have found the reason.
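
            A sketch of that check; fail_loc is a node-wide setting on the MDS, so it should be cleared again afterwards (the directory path is the placeholder used in the description):

            # On the MDS: bypass FID-in-dirent lookups
            lctl set_param fail_loc=0x1505

            # On a client: retry the operation that previously hung
            ls /lustre/atlas2/path/to/offending_directory/

            # On the MDS: restore normal behaviour once the test is done
            lctl set_param fail_loc=0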

            yong.fan nasf (Inactive) added a comment - It is strange that the OI scrub has not found any inconsistency. There may be some OI scrub issue. Do you have the MDS-side -1 level Lustre kernel debug logs from when the "ls" hung? On the other hand, would you please use "debugfs" to dump the directory and its sub-items that cause the hang on "ls"? Thanks!
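
            A sketch of gathering both items on the MDS; the MDT block device path is a placeholder, and the debugfs path assumes the MDT's ldiskfs namespace is rooted at /ROOT with the same placeholder directory path as above:

            # Full (-1 level) Lustre kernel debug log around the hang
            lctl set_param debug=-1
            lctl dk > /dev/null                        # clear old messages
            # ... reproduce the "ls" hang from a client here ...
            lctl dk > mds_debug_during_ls_hang.log

            # Read-only debugfs dump of the offending directory (device path is a placeholder)
            debugfs -c -R 'ls -l /ROOT/path/to/offending_directory' /dev/mapper/mdt0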

            dustb100 Dustin Leverman added a comment -

            Nasf,
            Per Intel's recommendation, we ran an e2fsck during our last test shot to see if the problem would be fixed (despite the OI scrubber messages that we were seeing in the logs). We did find some non-critical issues, but we are still seeing the same hanging behavior with this directory. We will have to take a downtime to temporarily upgrade to lustre-2.8 in order to use a functional LFSCK. I'm not 100% sure when we will get this opportunity, but I will keep it on our radar. For your reference, this is the OI scrub lctl get_param info you were wanting:

            [root@atlas2-mds1 mdt]# lctl get_param -n osd-ldiskfs.atlas2-MDT0000.oi_scrub
            name: OI_scrub
            magic: 0x4c5fd252
            oi_files: 64
            status: completed
            flags:
            param:
            time_since_last_completed: 559 seconds
            time_since_latest_start: 5295 seconds
            time_since_last_checkpoint: 559 seconds
            latest_start_position: 12
            last_checkpoint_position: 1073741825
            first_failure_position: N/A
            checked: 406401957
            updated: 0
            failed: 0
            prior_updated: 0
            noscrub: 192023
            igif: 158502
            success_count: 1140
            run_time: 4736 seconds
            average_speed: 85811 objects/sec
            real-time_speed: N/A
            current_position: N/A
            lf_scanned: 0
            lf_reparied: 0
            lf_failed: 0

            Thanks,
            Dustin


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: fwang2 Feiyi Wang
              Votes: 0
              Watchers: 10
