Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3062

Multiple clients writing to the same file caused mpi application to fail

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.3.0
    • Lustre server 2.1.4 centos 6.3
      Lustre clients 2.3.0 sles11sp1
    • 2
    • 7461

    Description

      After we upgraded our clients from 2.1.3 to 2.3.0, some users (the crowd is increasing) started seeing their application to fail, to hang, or even crash. The servers run 2.1.4. In all cases, same application ran OK with 2.1.3.

      Since we do not have reproducer on the hang and the crash cases, we here attach a reproducer that can cause application to fail. The test were executed with stripe count of 1, 2, 4, 8, 16. The higher number the stripe count the more likely application fails.

      The 'reproducer1.scr' is a PBS script to start 1024 mpi tests.
      'reproducer1.scr.o1000145' is PBS output of the execution.
      '1000145.pbspl1.0.log.txt' is an output of one of our tools to collect /var/log/messages from servers and clients related to the specified job.

      The PBS specific argument lines start with "#PBS " string and are ignored if executed without PBS. The script use SGI MPT, but can be converted to openmpi or intel mpi.

      Attachments

        1. 1000145.pbspl1.0.log.txt
          227 kB
        2. 1000145.pbspl1.0.log.txt.-pbs
          26 kB
        3. lu-3062-reproducer-logs.tgz
          0.2 kB
        4. nbp2-server-logs.LU-3062
          5 kB
        5. reproducer_debug_r311i1n10_log
          56 kB
        6. reproducer_debug_r311i1n9_log
          56 kB
        7. reproducer_full_debug_log
          2.01 MB
        8. reproducer_full_debug_xaa.bz2
          0.2 kB
        9. reproducer_full_debug_xab.bz2
          5.00 MB
        10. reproducer_full_debug_xac.bz2
          0.2 kB
        11. reproducer_full_debug_xad.bz2
          0.2 kB
        12. reproducer_full_debug_xae.bz2
          3.02 MB
        13. reproducer1.scr
          0.8 kB
        14. reproducer1.scr.o1000145
          4 kB
        15. reproducer2.scr
          0.8 kB

        Activity

          [LU-3062] Multiple clients writing to the same file caused mpi application to fail

          full debug turned on this time.. debug logs were over 500MB.. jira has only 10MB limit so i took the portion of the log that made sense, that shows the OST disconnect. let me know if you want the whole client logs and we can figure out how to get them to you.

          qm137 James Karellas (Inactive) added a comment - full debug turned on this time.. debug logs were over 500MB.. jira has only 10MB limit so i took the portion of the log that made sense, that shows the OST disconnect. let me know if you want the whole client logs and we can figure out how to get them to you.

          Is there an interop issue between 2.3.0 client and 2.1.4 server? Any change in 2.3.0 client requires same change at the server?

          jaylan Jay Lan (Inactive) added a comment - Is there an interop issue between 2.3.0 client and 2.1.4 server? Any change in 2.3.0 client requires same change at the server?

          Client debug logs. Server will be tougher to get. We may have to switch to our test filesystem to get that to work. Please look at client side logs and determine if you want me to still get server logs.

          qm137 James Karellas (Inactive) added a comment - Client debug logs. Server will be tougher to get. We may have to switch to our test filesystem to get that to work. Please look at client side logs and determine if you want me to still get server logs.

          original reproducer had a bug in it.

          qm137 James Karellas (Inactive) added a comment - original reproducer had a bug in it.

          I will try to reproduce the problem with increased debugging.

          The f90 program does not open with O_APPEND. All instances writes 12 bytes to the file. The content of the file:
          0000000 0004 0000 0042 0000 0004 0000
          contains three 4-byte word. The first and the last words probably are envelop. The second byte is the hex value of the number "66". Not 66 bytes. The program just write a number "66" to the output file.

          jaylan Jay Lan (Inactive) added a comment - I will try to reproduce the problem with increased debugging. The f90 program does not open with O_APPEND. All instances writes 12 bytes to the file. The content of the file: 0000000 0004 0000 0042 0000 0004 0000 contains three 4-byte word. The first and the last words probably are envelop. The second byte is the hex value of the number "66". Not 66 bytes. The program just write a number "66" to the output file.

          Hi Jay Lan,

          I already took a look at those files, and I need more detail information about, can you please turn on more debug options, especially LNET on the client and server side and collect it again? The most interesting thing is that even the clients lost connection to the MGS which is not involved in the IO path at all. If I guess it correctly, this is likely a LNET problem. But I'd like to make it clear before pointing my finger to others.

          Do you know if f90 opens the file with O_APPEND, and "write(9999) 66" just writes 66 bytes to the file?

          THank you.

          jay Jinshan Xiong (Inactive) added a comment - Hi Jay Lan, I already took a look at those files, and I need more detail information about, can you please turn on more debug options, especially LNET on the client and server side and collect it again? The most interesting thing is that even the clients lost connection to the MGS which is not involved in the IO path at all. If I guess it correctly, this is likely a LNET problem. But I'd like to make it clear before pointing my finger to others. Do you know if f90 opens the file with O_APPEND, and "write(9999) 66" just writes 66 bytes to the file? THank you.

          This is a shorten version of 1000145.pbspl1.0.log.txt with pbs_mom and epilogue messages removed.

          jaylan Jay Lan (Inactive) added a comment - This is a shorten version of 1000145.pbspl1.0.log.txt with pbs_mom and epilogue messages removed.

          Hi Jinshan,

          The client side logs are in 1000145.pbspl1.0.log.txt. You may want to filter out pbs information. The nbp2-server-logs.LU-3062 is the tarball of all server side logs:

          linux39.jlan 109> tar -tzf nbp2-server-logs.LU-3062
          service160
          service161
          service162
          service163
          service164
          service165
          service166
          service167
          service168

          Service160 is the mds/mgs, the rest are oss.

          jaylan Jay Lan (Inactive) added a comment - Hi Jinshan, The client side logs are in 1000145.pbspl1.0.log.txt. You may want to filter out pbs information. The nbp2-server-logs. LU-3062 is the tarball of all server side logs: linux39.jlan 109> tar -tzf nbp2-server-logs. LU-3062 service160 service161 service162 service163 service164 service165 service166 service167 service168 Service160 is the mds/mgs, the rest are oss.

          (pbspl1,241) od -x fort.9999
          0000000 0004 0000 0042 0000 0004 0000
          0000014
          (pbspl1,242) ls -l fort.9999
          rw-rr- 1 jlan g1099 12 Mar 28 18:39 fort.9999
          (pbspl1,243)

          It is 12 bytes.

          jaylan Jay Lan (Inactive) added a comment - (pbspl1,241) od -x fort.9999 0000000 0004 0000 0042 0000 0004 0000 0000014 (pbspl1,242) ls -l fort.9999 rw-r r - 1 jlan g1099 12 Mar 28 18:39 fort.9999 (pbspl1,243) It is 12 bytes.

          Hi Jay Lan,

          What does "write(9999) 66" mean in the reproducer? I mean how much data it will write to the file by this command.

          Can you please collect lustre logs on the client and server side while running the reproducer?

          jay Jinshan Xiong (Inactive) added a comment - Hi Jay Lan, What does "write(9999) 66" mean in the reproducer? I mean how much data it will write to the file by this command. Can you please collect lustre logs on the client and server side while running the reproducer?

          This tarball contains syslog between [Thu Mar 28 12:00:00 2013] and [Thu Mar 28 13:00:00 2013].

          service160 is mds/mgs. The rest are oss'es.

          jaylan Jay Lan (Inactive) added a comment - This tarball contains syslog between [Thu Mar 28 12:00:00 2013] and [Thu Mar 28 13:00:00 2013] . service160 is mds/mgs. The rest are oss'es.

          People

            green Oleg Drokin
            jaylan Jay Lan (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: