[LU-3062] Multiple clients writing to the same file caused MPI application to fail

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.3.0
    • Environment: Lustre servers 2.1.4 on CentOS 6.3; Lustre clients 2.3.0 on SLES 11 SP1
    • Severity: 2
    • 7461

    Description

      After we upgraded our clients from 2.1.3 to 2.3.0, some users (and the number of affected users is increasing) started seeing their applications fail, hang, or even crash. The servers run 2.1.4. In all cases, the same applications ran fine with 2.1.3.

      Since we do not have a reproducer for the hang and crash cases, we attach here a reproducer that can cause an application to fail. The tests were executed with stripe counts of 1, 2, 4, 8, and 16. The higher the stripe count, the more likely the application is to fail.

      'reproducer1.scr' is a PBS script that starts 1024 MPI tests.
      'reproducer1.scr.o1000145' is the PBS output of the execution.
      '1000145.pbspl1.0.log.txt' is the output of one of our tools, which collects the /var/log/messages entries from servers and clients related to the specified job.

      The PBS-specific argument lines start with the "#PBS " string and are ignored if the script is executed without PBS. The script uses SGI MPT, but can be converted to OpenMPI or Intel MPI.
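      For context, the contended I/O pattern described above looks roughly like the sketch below: every MPI rank opens the same striped file with O_TRUNC and writes a small record at offset 0. This is only an illustrative sketch, not the attached reproducer; the file name, record size, and exact calls are assumptions.

      /*
       * Illustrative sketch only (not the attached reproducer): every MPI
       * rank opens the same file on the Lustre mount with O_TRUNC and
       * writes a small record at offset 0, so all writes overlap and all
       * ranks race to truncate the file.
       */
      #include <fcntl.h>
      #include <mpi.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
          int rank;
          char buf[12];

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          /* "sharedfile" is a placeholder path on the Lustre file system. */
          int fd = open("sharedfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (fd < 0) {
              perror("open");
              MPI_Abort(MPI_COMM_WORLD, 1);
          }

          snprintf(buf, sizeof(buf), "rank %06d", rank);
          if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
              perror("pwrite");

          close(fd);
          MPI_Finalize();
          return 0;
      }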

      Attachments

        1. 1000145.pbspl1.0.log.txt
          227 kB
        2. 1000145.pbspl1.0.log.txt.-pbs
          26 kB
        3. lu-3062-reproducer-logs.tgz
          0.2 kB
        4. nbp2-server-logs.LU-3062
          5 kB
        5. reproducer_debug_r311i1n10_log
          56 kB
        6. reproducer_debug_r311i1n9_log
          56 kB
        7. reproducer_full_debug_log
          2.01 MB
        8. reproducer_full_debug_xaa.bz2
          0.2 kB
        9. reproducer_full_debug_xab.bz2
          5.00 MB
        10. reproducer_full_debug_xac.bz2
          0.2 kB
        11. reproducer_full_debug_xad.bz2
          0.2 kB
        12. reproducer_full_debug_xae.bz2
          3.02 MB
        13. reproducer1.scr
          0.8 kB
        14. reproducer1.scr.o1000145
          4 kB
        15. reproducer2.scr
          0.8 kB

        Activity

          pjones Peter Jones added a comment -

          ok Jay

          jaylan Jay Lan (Inactive) added a comment -

          We have not seen this for a long while and we are running 2.7, so please close it.

          jaylan Jay Lan (Inactive) added a comment -

          Jim Karellas tested a 2.4 client (against a 2.1.5 server) before, and he was still able to reproduce the problem with ease. Since 2.4 contains the http://review.whamcloud.com/5208 patch that Jinshan mentioned and wanted us to test, I think we need to feed back the test result.

          jay Jinshan Xiong (Inactive) added a comment -

          After taking a further look, this appears to be a problem with truncate, because the test program opens the file with O_TRUNC. This is why a higher stripe count makes the problem more likely.

          Can you please try this patch, http://review.whamcloud.com/5208, to see if it helps?

          jay Jinshan Xiong (Inactive) added a comment -

          It seems that this is due to a deadlock on the client side.

          Is it possible to do an experiment as follows:
          1. write 66 to the shared file as usual;
          2. if the write can't be finished in 5 minutes (this is only true on the NASA cluster, because the lock timeout there was 1900 seconds; on our local site we should make it shorter, say 60 seconds), have the script dump the stack traces of the processes running on the node with 'echo t > /proc/sysrq-trigger'.
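          A rough sketch of such a watchdog wrapper, assuming the usual write step is packaged as a separate program and that the node permits writing to /proc/sysrq-trigger; the timeout value and the "./write_shared" command are placeholders, not part of the attached scripts.

          /*
           * Sketch of the suggested experiment: run the shared-file write in
           * a child process and, if it has not finished within the timeout,
           * ask the kernel to dump all task stacks (the equivalent of
           * 'echo t > /proc/sysrq-trigger'). Timeout and command are
           * placeholders.
           */
          #include <stdio.h>
          #include <sys/types.h>
          #include <sys/wait.h>
          #include <unistd.h>

          #define TIMEOUT_SEC 300  /* 5 minutes here; something like 60s on a local test cluster */

          int main(void)
          {
              pid_t pid = fork();
              if (pid < 0)
                  return 1;
              if (pid == 0) {
                  /* "./write_shared" stands in for the usual write step. */
                  execl("./write_shared", "write_shared", (char *)NULL);
                  _exit(127);
              }

              for (int waited = 0; waited < TIMEOUT_SEC; waited++) {
                  int status;
                  if (waitpid(pid, &status, WNOHANG) == pid)
                      return 0;  /* the write finished in time */
                  sleep(1);
              }

              /* Timed out: dump the stack traces of all tasks on this node. */
              FILE *f = fopen("/proc/sysrq-trigger", "w");
              if (f) {
                  fputc('t', f);
                  fclose(f);
              }
              fprintf(stderr, "write did not finish within %d seconds; sysrq-t issued\n",
                      TIMEOUT_SEC);
              return 1;
          }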

          bobbielind Bobbie Lind (Inactive) added a comment -

          I'm attaching the logs from running the reproducer on the OpenSFS cluster. It was a 16-node test with 1 MDS and 2 OSSs.

          qm137 James Karellas (Inactive) added a comment -

          Sent Andreas an email on the 18th, but didn't add it to Jira.

          Hey Andreas,

          Our position is that users doing this type of work should not cause an eviction. We agree that it is sub-optimal at best (we have a different term for it: stupid), but our users continue to do it. There are various reasons why users can't/won't change their code here at NASA. Word from management is that we need to get this fixed. I've copied our local Lustre team in case anyone has anything else to add.

          Thanks,

          jdk

          adilger Andreas Dilger added a comment -

          So, just to clarify, the problem here is that the reproducer program is starting 1024 tasks to write 12 bytes to the same offset=0 of the same file (striped over 16 OSTs?), and there is a lot of contention? Or am I misunderstanding, and each thread will write to non-overlapping ranges of the file (i.e. like O_APPEND)?

          This isn't terribly surprising, because either case is a pathologically bad IO pattern. If they are all writing to the same offset, it is completely serialized by the locking, while using O_APPEND actually gets worse with increasing numbers of stripes, since it needs to lock all stripes to get the current file size.

          Do you have any idea what the application is actually trying to accomplish with these overlapping writes? Is there any chance to modify the application to do (whatever it is trying to do) in a more sensible manner? Depending on what the application is actually trying to accomplish, there may be many more filesystem-friendly ways of doing this.

          The servers should definitely not fail in this case, though I can imagine that the clients might time out waiting for their chance to overwrite the same bytes again. The clients should reconnect and complete the writes, however.

          It might be possible to optimize pathological cases like this by using OST-side locking for the RPCs, though there is still a difficulty with sending sub-page writes that also need to be handled.
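          If the intent is simply for every rank to record a small result, one of the friendlier patterns alluded to above would be to give each rank its own non-overlapping offset, so no two writes contend for the same extent lock. The sketch below is written under that assumption; the file name and record size are placeholders, and rank 0 would truncate the file once beforehand if needed.

          /*
           * Sketch of a less contended alternative: each rank writes its
           * 12-byte record at a distinct offset, so writes never overlap.
           * O_TRUNC is dropped so the ranks do not race to truncate the file.
           */
          #include <fcntl.h>
          #include <mpi.h>
          #include <stdio.h>
          #include <unistd.h>

          int main(int argc, char **argv)
          {
              int rank;
              char buf[12];

              MPI_Init(&argc, &argv);
              MPI_Comm_rank(MPI_COMM_WORLD, &rank);

              snprintf(buf, sizeof(buf), "rank %06d", rank);

              /* "sharedfile" is a placeholder path on the Lustre file system. */
              int fd = open("sharedfile", O_WRONLY | O_CREAT, 0644);
              if (fd >= 0) {
                  /* Offset rank * 12: one private, non-overlapping slot per rank. */
                  pwrite(fd, buf, sizeof(buf), (off_t)rank * (off_t)sizeof(buf));
                  close(fd);
              }

              MPI_Finalize();
              return 0;
          }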

          bobbielind Bobbie Lind (Inactive) added a comment -

          I have received my account on Rosso and expect to complete the testing over the coming week.

          pjones Peter Jones added a comment -

          Bobbie

          Could you please set up the reproducer supplied on April 1st above?

          Thanks

          Peter

          People

            Assignee: green Oleg Drokin
            Reporter: jaylan Jay Lan (Inactive)
            Votes: 0
            Watchers: 9
