  1. Lustre
  2. LU-274

Client delayed file status (cache meta-data) causing job failures

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6
    • None
    • Servers are running 1.8.5.54 and 1.8.5.55; both also carry a patch that increases the number of MDS threads.
      OS version: CentOS 5.5.
      Clients experiencing the problem are running the 1.8.5 release from Oracle.
    • 3
    • 5021

    Description

      A key user at the site runs a weather forecasting model whose post-processing gathers information from Lustre. He noticed that the job would sometimes fail, or that tar would report errors, because of zero-length files. On further investigation, the problem turned out to be visible with a simple "ls -l" command: the first listing shows files with zero length, and repeating the "ls -l" after a short, variable delay (roughly 10-15 seconds) shows the correct sizes.
      The basic sequence the user runs is:
      1. Base data is in a directory from a completed run
      2. Execute the mppcombine program, which takes this data and creates a set of joined output data
      3. Create a list of output files from the mppcombine run to feed to tar, using "ls -l" on the newly created data
      4. Take that list and begin tarring up the data
      5. If you see "xt5-widow2-mpp-test.o406112:tar: atmos_scalar.nc: file changed as we read it", you have hit the problem.

      Needless to say, this gives the impression of data corruption, since the tar file can't be trusted. In reality the data is there but not apparent to tar. Placing a "sync" command between steps 2 and 3 appears to avoid the problem. However, this is not acceptable: we can't be expected to tell all users that they must issue a sync before certain operations or risk not getting all their data.
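      For illustration only, the sync workaround and a guard against archiving zero-length files could be sketched as below. This is not the user's actual script: the function names (zero_length_files, archive_outputs) are hypothetical, Python's tarfile module stands in for the tar command, and the sketch assumes that no output file is legitimately empty.

```python
import os
import tarfile
from pathlib import Path


def zero_length_files(directory):
    """Return the sorted names of regular files in `directory` that report size 0."""
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.stat().st_size == 0
    )


def archive_outputs(directory, archive_path):
    """Flush dirty data, then archive `directory` unless a file still reports size 0.

    Assumption: a zero-length output is never valid, so it is treated as
    "size not visible yet" and the archive is not created.
    """
    os.sync()  # equivalent of running `sync` between steps 2 and 3 (Unix only)
    suspect = zero_length_files(directory)
    if suspect:
        raise RuntimeError(f"zero-length files, not archiving: {suspect}")
    # archive_path should live outside `directory`, otherwise the archive
    # would be added to itself while it is being written.
    with tarfile.open(archive_path, "w") as tar:
        tar.add(directory, arcname=".")
```

      A guard like this only turns a silently short archive into a visible failure; it does not address why the client reports a stale size.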

      An example of the ls -l output:

      >> ls -ltr
      .
      .
      rw------ 1 XX.XXXX user 130991 2011-04-12 20:57 time_FS.o564793
      rw------ 1 XX.XXXX user 130988 2011-04-12 21:28 time_FS.o564868
      rw------ 1 XX.XXXX user 0 2011-04-12 22:01 time_FS.o564927
      rw------ 1 XX.XXXX user 0 2011-04-12 22:36 time_FS.o565015

      Wait an undetermined amount of time, then repeat:

      >> ls -ltr
      .
      .
      rw------ 1 XX.XXXX user 130991 2011-04-12 20:57 time_FS.o564793
      rw------ 1 XX.XXXX user 130988 2011-04-12 21:28 time_FS.o564868
      rw------ 1 XX.XXXX user 130991 2011-04-12 22:01 time_FS.o564927
      rw------ 1 XX.XXXX user 130956 2011-04-12 22:36 time_FS.o565015

      Data size is now correct.
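      To put a number on the delay instead of repeating "ls -l" by hand, a small polling helper along these lines could be used. It is a minimal sketch: the name wait_for_nonzero_size and the default timeout and interval are assumptions, and it only observes what stat() reports on the client where it runs.

```python
import os
import time


def wait_for_nonzero_size(path, timeout=60.0, interval=1.0):
    """Poll stat() until `path` reports a non-zero size.

    Returns the number of seconds waited, or None if `timeout`
    seconds pass while the file still reports size 0.
    """
    start = time.monotonic()
    while True:
        if os.stat(path).st_size > 0:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

      Run against one of the freshly written files right after the job finishes, the returned value gives a rough measure of how long the stale size stays visible.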

      I have made a couple of attempts at a local test using Lustre 1.8.4 and so far have been unable to reproduce the problem, but I am not convinced this is conclusive. I will be running more tests.

      In the Oracle Bugzilla, bug 24501 is basically the same problem, and a search of their database suggests that bug 24458 may also be the same issue.

      I will attach a test case containing the basic commands used. The actual test data is available but extremely large, and I would need to be pointed to an FTP site for that.

      Attachments

        1. mppnccombine.c
          40 kB
        2. tar_fail.tar
          7.22 MB

            People

              niu Niu Yawei (Inactive)
              woods Steven Woods