  1. Lustre
  2. LU-8696

"ls" hangs on a particular directory on production system


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.5.3, Lustre 2.8.0
    • None
    • OLCF Atlas production system: clients running 2.8.0+ (with patches), server running 2.5.5+ (with patches)
    • 3

    Description

      On the atlas2 file system, there is a particular directory on which any operation, such as "ls" or "stat", hangs the calling process completely. The client reports no OS error and no Lustre error. On the server side, we observed OI scrub messages a few times, which may suggest an MDS data inconsistency that the scrub is trying, without success, to repair. We have not yet been able to correlate the two.

      Ops teams have collected traces on the client side by:

      mount -t lustre 10.36.226.77@o2ib:/atlas2 /lustre/atlas2 -o rw,flock,nosuid,nodev
      lctl set_param osc/*/checksums 0
      echo "all" > /proc/sys/lnet/debug
      echo "1024" > /proc/sys/lnet/debug_mb

      Step1: lctl dk > /dev/null
      Step2: cd /lustre/atlas2/path/to/offending_directory/
      Step3: ls
      Step4: Wait 30 seconds
      Step5: lctl dk > atlas2-mds3_ls_for_fprof.out
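
      The sequence above could be wrapped in a small script so that it can be repeated consistently. The following is only a sketch under stated assumptions, not the procedure the ops team actually ran: the names collect_trace, run, and DRY_RUN are hypothetical, and running "ls" in the background (so that a hung "ls" does not block the final dump) is an assumption not stated in the report. With DRY_RUN=1, the default, the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch of the trace-collection steps. All function and variable
# names here are hypothetical helpers, not part of Lustre.

# Print the command in dry-run mode (the default); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

# collect_trace TARGET_DIR OUT_FILE [WAIT_SECS]
# Note: paths containing single quotes are not handled by this sketch.
collect_trace() {
    target_dir=$1
    out_file=$2
    wait_secs=${3:-30}

    # Enable all debug flags and enlarge the debug buffer to 1024 MB.
    run "echo all > /proc/sys/lnet/debug"
    run "echo 1024 > /proc/sys/lnet/debug_mb"
    # Step 1: discard whatever is already in the kernel debug buffer.
    run "lctl dk > /dev/null"
    # Steps 2-3: trigger the hang. Backgrounding is an assumption, made so
    # that a hung "ls" cannot prevent the dump in step 5.
    run "(cd '$target_dir' && ls) > /dev/null 2>&1 &"
    # Step 4: give the client time to log the stuck operation.
    run "sleep $wait_secs"
    # Step 5: dump the debug buffer to the output file.
    run "lctl dk > '$out_file'"
}
```

      Calling collect_trace /lustre/atlas2/path/to/offending_directory/ atlas2-mds3_ls_for_fprof.out with DRY_RUN left unset prints the six commands for review; setting DRY_RUN=0 would execute them, which requires root on a Lustre client.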

      The log is attached.

            People

              yong.fan nasf (Inactive)
              fwang2 Feiyi Wang