Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.4.3
    • None
    • Client: 2.4.3
      server: 2.4.3
    • 3

    Description

      One of our filesystems is experiencing what we believe are short reads, which result in NaNs when using the MPI-IO function call 'MPI_FILE_READ_AT_ALL'.

      This can be reproduced every time if the data is read from disk rather than from cache. Running echo 1 > /proc/sys/vm/drop_caches and then running the code produces the error every time, but running the code a second or third time without dropping caches does not.
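One way to check a suspect region of the data file outside of MPI-IO is sketched below. This is not the reporter's reproducer: the helper name `check_region` is hypothetical, and the sketch assumes the file holds native-endian 64-bit floats. It issues a single `os.pread` and reports how many bytes came back and how many of the decoded values are NaN, so a cold-cache run (after dropping caches) can be compared with a warm-cache run.

```python
import array
import math
import os


def check_region(path, offset, nbytes):
    """Read nbytes at offset with one pread call and summarize the result.

    Hypothetical helper for illustration; assumes native-endian float64 data.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # A single pread may legitimately return fewer bytes than requested.
        data = os.pread(fd, nbytes, offset)
    finally:
        os.close(fd)

    # Decode only whole 8-byte values; ignore any trailing partial value.
    usable = len(data) - (len(data) % 8)
    values = array.array("d")
    values.frombytes(data[:usable])
    nan_count = sum(1 for v in values if math.isnan(v))

    return {
        "requested": nbytes,
        "returned": len(data),
        "short": len(data) < nbytes,
        "nan_count": nan_count,
    }
```

A short count is expected when the requested range runs past the end of the file, so the `short` flag is only interesting for ranges that lie entirely inside the file.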

      NOTE:
      This occurs only when the file is striped across more than one OST.
      In the debug logs, the data file has a FID of [0x2000b2ebc:0x358:0x0].
      I disabled read-ahead during debugging.
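To illustrate why the stripe count matters, the sketch below maps a byte range onto stripe indices under round-robin (RAID0-style) striping. The ticket does not state the file's actual layout, so the 1 MiB stripe size and stripe count of 2 are illustrative assumptions, and the function name `stripes_touched` is hypothetical. With a stripe count of 1 every read is served by a single object, whereas with more than one stripe a single read can span objects on different OSTs.

```python
def stripes_touched(offset, length, stripe_size=1 << 20, stripe_count=2):
    """Return the sorted stripe indices covered by [offset, offset + length).

    Assumes round-robin striping; the defaults are illustrative, not taken
    from the ticket. Stripe indices are positions in the layout, not OST IDs.
    """
    if length <= 0:
        return []

    first_chunk = offset // stripe_size
    last_chunk = (offset + length - 1) // stripe_size

    # A range covering at least stripe_count chunks touches every stripe.
    if last_chunk - first_chunk + 1 >= stripe_count:
        return list(range(stripe_count))

    indices = {chunk % stripe_count for chunk in range(first_chunk, last_chunk + 1)}
    return sorted(indices)
```

For example, an 8 KiB read that starts 4 KiB before a stripe boundary touches two stripes, so its data must be assembled from two objects.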

      I have captured a full Lustre debug trace on the client and will upload it to the tftp site.

          Activity

            [LU-6545] MPIIO short reads
            pjones Peter Jones added a comment - As per NASA, the fix worked.

            mhanafi Mahmoud Hanafi added a comment - I will upload the reproducer next week

            mhanafi Mahmoud Hanafi added a comment - We tested a 2.4.3 build and it fixed the issue. Debugging showed that the read restarts are getting triggered. We will test the 2.5.3 client next week.
            pjones Peter Jones added a comment - As per the discussion on today's call with NASA they are going to try out the fix from LU-6389 to see if it fixes the issues exposed by their reproducer.

            paf Patrick Farrell (Inactive) added a comment - Mahmoud - Are you able to share a reproducer publicly? It would be nice to have if so.

            mhanafi Mahmoud Hanafi added a comment - It is the same, but we have a 100% reproducer.
            jay Jinshan Xiong (Inactive) added a comment - Is this LU-6389?

            mhanafi Mahmoud Hanafi added a comment - debug logs uploaded to /uploads/LU6545/r401i0n14.failure2.gz

            People

              bobijam Zhenyu Xu
              mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 6
