Lustre LU-4714

Occasional corruption detected when copying a large number of files

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.1.3, Lustre 2.1.5, Lustre 2.1.6, Lustre 2.4.2
    • None
    • 3
    • 12956

    Description

      Hi,

      The customer site uses Lustre 2.1.5 with RHEL 6.3 clients on the new cluster, and 2.1.3 with SLES 11 clients on the old cluster.

      The problem occurs only on the 2.1.5 cluster with RHEL 6.3 clients.

      Initially, I reported that the customer was seeing random file corruption.
      After drilling down, I found that corrupted files appear occasionally when converting a 'GRIB' file on some nodes.
      We also determined that the original files, as written, are not corrupted. Rather, one or more other nodes cannot read one or more of the files correctly.

      No corruption was observed when handling a small number of files (fewer than 100; a single file is usually around 55 MiB). As the number of files grows, up to 660 files, more corruption is observed (checked with diff or md5sum after copying the files).
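
      The post-copy check described above can be sketched as follows. This is only an illustration, not the script used at the customer site: the function names `md5_of` and `find_mismatches` and the directory arguments are hypothetical. It walks the source tree, compares the MD5 digest of each file with its copy, and lists the files that are missing or differ.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_dir, dst_dir):
    """Compare every regular file under src_dir with its copy under dst_dir.

    Returns a list of (relative_path, reason) tuples for copies that are
    missing or whose checksum differs from the source file.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    mismatches = []
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(src_dir)
        dst = dst_dir / rel
        if not dst.is_file():
            mismatches.append((rel.as_posix(), "missing"))
        elif md5_of(src) != md5_of(dst):
            mismatches.append((rel.as_posix(), "checksum differs"))
    return mismatches
```

      Since the written files appear intact and only some nodes read them incorrectly, it may be worth running such a check from several client nodes and comparing the results per node.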

      I found LU-4380 and LU-3219, and suspect this issue is also related to FIEMAP and recent coreutils behavior.

      I tried upgrading both the servers and the clients to 2.1.6. It did not help.
      I also tried 2.4.2 on the servers with 2.1.5/2.1.6/2.1.3 on the clients. None of these combinations helped.

      I found that disabling the server-side read cache reduces the rate of corruption.
      However, it does not eliminate the corruption.

      What should I try next?

          People

            wc-triage WC Triage
            jhxlee Jay Lee