
LU-3062: Multiple clients writing to the same file caused MPI application to fail

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.3.0
    • Environment: Lustre servers 2.1.4 on CentOS 6.3; Lustre clients 2.3.0 on SLES 11 SP1
    • Severity: 2
    • 7461

    Description

      After we upgraded our clients from 2.1.3 to 2.3.0, some users (and their number keeps growing) started seeing their applications fail, hang, or even crash. The servers run 2.1.4. In all cases, the same applications ran fine with 2.1.3 clients.

      Since we do not have a reproducer for the hang and crash cases, we attach here a reproducer that can cause an application to fail. The tests were executed with stripe counts of 1, 2, 4, 8, and 16; the higher the stripe count, the more likely the application is to fail.
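      For context, one common way to set up such a series of runs is to pin the stripe count of the shared output directory before each job; a minimal sketch, where the directory path and loop are illustrative and not taken from the attached script:

          for sc in 1 2 4 8 16; do
              testdir=/scratch/lu3062-stripe$sc          # illustrative path
              mkdir -p $testdir
              lfs setstripe -c $sc $testdir              # new files in $testdir inherit this stripe count
              # ... run or submit the MPI job that writes its shared file in $testdir ...
          done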

      'reproducer1.scr' is a PBS script that starts 1024 MPI test processes.
      'reproducer1.scr.o1000145' is the PBS output of that run.
      '1000145.pbspl1.0.log.txt' is the output of one of our tools, which collects /var/log/messages entries from the servers and clients related to the specified job.

      The PBS-specific argument lines start with the "#PBS " string and are ignored if the script is executed without PBS. The script uses SGI MPT, but can be converted to OpenMPI or Intel MPI.
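      For readers without the attachment, a minimal sketch of what such a PBS launch script generally looks like; the resource selection, rank count, and executable name are illustrative assumptions, not the contents of reproducer1.scr:

          #!/bin/bash
          #PBS -l select=64:ncpus=16:mpiprocs=16       # illustrative: 64 nodes x 16 ranks = 1024
          #PBS -l walltime=00:30:00
          cd $PBS_O_WORKDIR
          # every rank writes to the same shared file (e.g. a Fortran "write(9999) 66")
          mpiexec -np 1024 ./reproducer.exe            # "reproducer.exe" is a stand-in name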

      Attachments

        1. 1000145.pbspl1.0.log.txt
          227 kB
        2. 1000145.pbspl1.0.log.txt.-pbs
          26 kB
        3. lu-3062-reproducer-logs.tgz
          0.2 kB
        4. nbp2-server-logs.LU-3062
          5 kB
        5. reproducer_debug_r311i1n10_log
          56 kB
        6. reproducer_debug_r311i1n9_log
          56 kB
        7. reproducer_full_debug_log
          2.01 MB
        8. reproducer_full_debug_xaa.bz2
          0.2 kB
        9. reproducer_full_debug_xab.bz2
          5.00 MB
        10. reproducer_full_debug_xac.bz2
          0.2 kB
        11. reproducer_full_debug_xad.bz2
          0.2 kB
        12. reproducer_full_debug_xae.bz2
          3.02 MB
        13. reproducer1.scr
          0.8 kB
        14. reproducer1.scr.o1000145
          4 kB
        15. reproducer2.scr
          0.8 kB

        Activity


           jaylan Jay Lan (Inactive) added a comment -

           (pbspl1,241) od -x fort.9999
           0000000 0004 0000 0042 0000 0004 0000
           0000014
           (pbspl1,242) ls -l fort.9999
           -rw-r--r-- 1 jlan g1099 12 Mar 28 18:39 fort.9999
           (pbspl1,243)

           It is 12 bytes.
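           That size is consistent with an unformatted sequential Fortran write of a single default 4-byte integer (an assumption about how the reproducer's "write(9999) 66" is compiled): a 4-byte leading record marker, 4 bytes of data (0x42 = 66), and a 4-byte trailing record marker, for 12 bytes total. Read as 32-bit little-endian integers, the same 12 bytes show the three values directly:

               od -An -td4 fort.9999
               #   4   66   4        <- leading marker, the value 66, trailing marker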

           jay Jinshan Xiong (Inactive) added a comment -

           Hi Jay Lan,

           What does "write(9999) 66" mean in the reproducer? I mean, how much data will this command write to the file?

           Can you please collect Lustre logs on the client and server side while running the reproducer?
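           For reference, one common way to capture Lustre debug logs around a reproducer run; a minimal sketch, where the debug mask, buffer size, and output path are illustrative and the same steps apply on both clients and servers:

               lctl set_param debug=-1                       # enable all debug flags (very verbose)
               lctl set_param debug_mb=1024                  # enlarge the in-memory debug buffer
               lctl clear                                    # discard anything already buffered
               # ... run the reproducer here ...
               lctl dk > /tmp/lustre-debug.$(hostname).log   # dump the debug buffer to a file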

           jaylan Jay Lan (Inactive) added a comment -

           This tarball contains syslog entries between [Thu Mar 28 12:00:00 2013] and [Thu Mar 28 13:00:00 2013].

           service160 is the MDS/MGS. The rest are OSSes.

           jaylan Jay Lan (Inactive) added a comment -

           The PCIe corrected errors seem to be related to Sandy Bridge PCIe 3.0. We have seen tens of thousands of those errors a day. However, the same applications did not fail when run with 2.1.3 clients.

           Can you identify those AST timeout patches? Are they client side? Note that we run 2.1.4 on the servers. Is there any issue with the 2.1.4 server + 2.3.0 client combination?

          About the "PCIE Bus Error: severity=Corrected" errors, I checked with our admins. He said it was normal and not indicative of an IB problem.

          I will collect and provide logs at the servers.

          BTW, the test was run using 1024 cpus distributed to 64 nodes. However, I was able to reproduce the problem with only 4 sandy bridge nodes, 4*16=64 processes.

          jaylan Jay Lan (Inactive) added a comment - About the "PCIE Bus Error: severity=Corrected" errors, I checked with our admins. He said it was normal and not indicative of an IB problem. I will collect and provide logs at the servers. BTW, the test was run using 1024 cpus distributed to 64 nodes. However, I was able to reproduce the problem with only 4 sandy bridge nodes, 4*16=64 processes.
          green Oleg Drokin added a comment -

           OK, so from the logs we can see that the client was evicted by the server for some reason.
           Why it was evicted is not clear, because there seem to be no server logs included, but I imagine it is due to AST timeouts. We included multiple patches in 2.4 to help with this.

           In addition, I cannot stop wondering about this message:

          Thu Mar 28 12:43:49 2013 M r325i4n13 kernel: [569506.266405] pcieport 0000:00:02.0: AER: Multiple Corrected error received: id=0010
          Thu Mar 28 12:43:49 2013 M r325i4n13 kernel: [569506.266572] pcieport 0000:00:02.0: PCIE Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)
          Thu Mar 28 12:43:49 2013 M r325i4n13 kernel: [569506.276986] pcieport 0000:00:02.0:   device [8086:3c04] error status/mask=00000040/00002000
          Thu Mar 28 12:43:49 2013 M r325i4n13 kernel: [569506.285434] pcieport 0000:00:02.0:    [ 6] Bad TLP            
          

           Hopefully it did not result in any dropped messages.

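           Where server logs are available, the eviction path described above usually leaves recognizable console messages; a minimal sketch of what to look for on the MDS/OSS side (the exact wording varies between Lustre versions):

               # lock callback (AST) timeouts are normally followed by an eviction of the client NID
               grep -iE 'lock callback timer expired|evicting client' /var/log/messages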

          People

            Assignee: green Oleg Drokin
            Reporter: jaylan Jay Lan (Inactive)
            Votes: 0
            Watchers: 9
