
LU-3062: Multiple clients writing to the same file caused MPI application to fail

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.3.0
    • Environment: Lustre servers 2.1.4 (CentOS 6.3); Lustre clients 2.3.0 (SLES 11 SP1)
    • 2
    • 7461

    Description

      After we upgraded our clients from 2.1.3 to 2.3.0, some users (the number is increasing) started seeing their applications fail, hang, or even crash. The servers run 2.1.4. In all cases, the same applications ran OK with 2.1.3.

      Since we do not have a reproducer for the hang and crash cases, we attach here a reproducer that can cause an application to fail. The tests were executed with stripe counts of 1, 2, 4, 8, and 16; the higher the stripe count, the more likely the application is to fail.

      'reproducer1.scr' is a PBS script that starts 1024 MPI tests.
      'reproducer1.scr.o1000145' is the PBS output of the execution.
      '1000145.pbspl1.0.log.txt' is the output of one of our tools that collects /var/log/messages entries from the servers and clients related to the specified job.

      The PBS-specific argument lines start with the "#PBS " string and are ignored if the script is executed without PBS. The script uses SGI MPT, but can be converted to OpenMPI or Intel MPI.

      Attachments

        1. 1000145.pbspl1.0.log.txt
          227 kB
        2. 1000145.pbspl1.0.log.txt.-pbs
          26 kB
        3. lu-3062-reproducer-logs.tgz
          0.2 kB
        4. nbp2-server-logs.LU-3062
          5 kB
        5. reproducer_debug_r311i1n10_log
          56 kB
        6. reproducer_debug_r311i1n9_log
          56 kB
        7. reproducer_full_debug_log
          2.01 MB
        8. reproducer_full_debug_xaa.bz2
          0.2 kB
        9. reproducer_full_debug_xab.bz2
          5.00 MB
        10. reproducer_full_debug_xac.bz2
          0.2 kB
        11. reproducer_full_debug_xad.bz2
          0.2 kB
        12. reproducer_full_debug_xae.bz2
          3.02 MB
        13. reproducer1.scr
          0.8 kB
        14. reproducer1.scr.o1000145
          4 kB
        15. reproducer2.scr
          0.8 kB

        Activity


          qm137 James Karellas (Inactive) added a comment -

          Sent Andreas an email on the 18th, but didn't add it to Jira.

          Hey Andreas,

          Our position is that users doing this type of work should
          not cause an eviction. We agree that it is sub-optimal
          at best (we have a different term for it: stupid), but
          our users continue to do it. There are various reasons
          why users can't/won't change their code here at NASA.
          Word from management is that we need to get this fixed.
          I've copied our local Lustre team in case anyone has
          anything else to add.

          Thanks,

          jdk


          adilger Andreas Dilger added a comment -

          So, just to clarify, the problem here is that the reproducer program is starting 1024 tasks to write 12 bytes to the same offset=0 of the same file (striped over 16 OSTs?), and there is a lot of contention? Or am I misunderstanding and each thread will write to non-overlapping ranges of the file (i.e. like O_APPEND)?

          This isn't terribly surprising, because either case is a pathologically bad IO pattern. If they are all writing to the same offset it is completely serialized by the locking, while using O_APPEND actually gets worse with increasing numbers of stripes, since it needs to lock all stripes to get the current file size.

          Do you have any idea what the application is actually trying to accomplish with these overlapping writes? Is there any chance to modify the application to do (whatever it is trying to do) in a more sensible manner? Depending on what the application is actually trying to accomplish, there may be many more filesystem-friendly ways of doing this.

          The servers should definitely not fail in this case, though I can imagine that the clients might time out waiting for their chance to overwrite the same bytes again. The clients should reconnect and complete the writes, however.

          It might be possible to optimize pathological cases like this by using OST-side locking for the RPCs, though there is still a difficulty with sending sub-page writes that also need to be handled.
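
          As an illustration of the pattern under discussion, here is a minimal C sketch (not the attached reproducer, which is a Fortran program launched under PBS/MPT) in which every task writes the same 12-byte record at offset 0 of one shared file; the file name and the use of plain POSIX I/O are assumptions made purely for illustration:

              #include <fcntl.h>
              #include <stdint.h>
              #include <unistd.h>

              int main(void)
              {
                      /* 12 bytes mimicking a Fortran unformatted "write(9999) 66":
                       * a 4-byte record marker, the 4-byte integer 66 (0x42),
                       * and a trailing 4-byte record marker. */
                      uint32_t rec[3] = { 4, 66, 4 };

                      /* Hypothetical path on the Lustre mount; every task
                       * (1024 of them in the reproducer) opens the same file. */
                      int fd = open("shared.out", O_WRONLY | O_CREAT, 0644);
                      if (fd < 0)
                              return 1;

                      /* Every task writes the same byte range [0, 12), so the
                       * extent lock for that range bounces between all the
                       * clients and the writes are completely serialized. */
                      if (pwrite(fd, rec, sizeof(rec), 0) != (ssize_t)sizeof(rec))
                              return 1;

                      close(fd);
                      return 0;
              }

          With O_APPEND instead of a fixed offset the cost would grow with the stripe count, as noted above, because each append also has to lock every stripe to determine the current file size.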


          bobbielind Bobbie Lind (Inactive) added a comment -

          I have received my account on Rosso and expect to complete the testing over the coming week.

          pjones Peter Jones added a comment -

          Bobbie

          Could you please set up the reproducer supplied on April 1st above?

          Thanks

          Peter


          qm137 James Karellas (Inactive) added a comment -

          Uploading the full debug log file. I split the file first, then compressed each segment using bzip2 to get below the maximum file size of 10 MB.
          The order of the files is xaa, xab, xac, xad, xae.


          qm137 James Karellas (Inactive) added a comment -

          Full debug was turned on this time. The debug logs were over 500 MB, and Jira has only a 10 MB limit, so I took the portion of the log that made sense, which shows the OST disconnect. Let me know if you want the whole client logs and we can figure out how to get them to you.


          jaylan Jay Lan (Inactive) added a comment -

          Is there an interop issue between the 2.3.0 client and the 2.1.4 server? Does a change in the 2.3.0 client require the same change on the server?


          qm137 James Karellas (Inactive) added a comment -

          Client debug logs attached. Server logs will be tougher to get; we may have to switch to our test filesystem to get that to work. Please look at the client-side logs and determine whether you still want me to get the server logs.


          qm137 James Karellas (Inactive) added a comment -

          The original reproducer had a bug in it.


          jaylan Jay Lan (Inactive) added a comment -

          I will try to reproduce the problem with increased debugging.

          The f90 program does not open the file with O_APPEND. All instances write 12 bytes to the file. The content of the file:
          0000000 0004 0000 0042 0000 0004 0000
          contains three 4-byte words. The first and the last words are probably the record envelope. The second word is the hex value (0x42) of the number 66, not 66 bytes. The program just writes the number 66 to the output file.
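
          To make the od output above concrete, here is a small C sketch (the file name is a placeholder) that reads the 12-byte record back as three 32-bit words; assuming the usual 4-byte record markers written by Fortran compilers for unformatted sequential files, and a little-endian client, the expected output is 00000004 00000042 00000004:

              #include <stdio.h>
              #include <stdint.h>

              int main(void)
              {
                      uint32_t rec[3];  /* leading marker, payload, trailing marker */

                      /* Placeholder name for the file written by the reproducer. */
                      FILE *f = fopen("shared.out", "rb");
                      if (f == NULL || fread(rec, sizeof(rec), 1, f) != 1)
                              return 1;

                      /* Expect 00000004 00000042 00000004: a 4-byte record length,
                       * the integer 66 (hex 0x42), and the trailing record length. */
                      printf("%08x %08x %08x\n",
                             (unsigned)rec[0], (unsigned)rec[1], (unsigned)rec[2]);

                      fclose(f);
                      return 0;
              }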


          jay Jinshan Xiong (Inactive) added a comment -

          Hi Jay Lan,

          I already took a look at those files, and I need more detailed information. Can you please turn on more debug options, especially LNET, on the client and server side and collect the logs again? The most interesting thing is that the clients even lost their connection to the MGS, which is not involved in the IO path at all. If I am guessing correctly, this is likely an LNET problem, but I'd like to make that clear before pointing my finger at others.

          Do you know whether the f90 program opens the file with O_APPEND, and whether "write(9999) 66" just writes 66 bytes to the file?

          Thank you.


          People

            Assignee: green Oleg Drokin
            Reporter: jaylan Jay Lan (Inactive)
            Votes: 0
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved: