Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.3.0
-
Lustre server 2.1.4 centos 6.3
Lustre clients 2.3.0 sles11sp1
-
2
-
7461
Description
After we upgraded our clients from 2.1.3 to 2.3.0, some users (a growing number) started seeing their applications fail, hang, or even crash. The servers run 2.1.4. In all cases, the same application ran fine with 2.1.3.
Since we do not have a reproducer for the hang and crash cases, we attach here a reproducer that can cause an application to fail. The tests were executed with stripe counts of 1, 2, 4, 8, and 16. The higher the stripe count, the more likely the application is to fail.
'reproducer1.scr' is a PBS script that starts 1024 MPI tests.
'reproducer1.scr.o1000145' is the PBS output of the execution.
'1000145.pbspl1.0.log.txt' is the output of one of our tools, which collects the /var/log/messages entries related to the specified job from the servers and clients.
The PBS-specific argument lines start with the string "#PBS " and are ignored if the script is executed without PBS. The script uses SGI MPT, but can be converted to Open MPI or Intel MPI.
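For anyone trying to repeat the stripe-count sweep described above, the per-count test directories could be prepared along the following lines. This is only a sketch, not part of the attached reproducer: the base directory and the `stripe_<n>` naming are hypothetical, and the only Lustre-specific call assumed is `lfs setstripe -c <count> <dir>`, which sets the default stripe count for new files in a directory. The helper only builds the commands and does not run them.

```python
import shlex

# Stripe counts used in the test runs described above.
STRIPE_COUNTS = [1, 2, 4, 8, 16]


def stripe_commands(base_dir, counts=STRIPE_COUNTS):
    """Return the commands that would create one test directory per stripe count.

    base_dir and the stripe_<n> naming are hypothetical placeholders.
    """
    cmds = []
    for count in counts:
        test_dir = f"{base_dir}/stripe_{count}"
        cmds.append(["mkdir", "-p", test_dir])
        # New files created in test_dir inherit this stripe count.
        cmds.append(["lfs", "setstripe", "-c", str(count), test_dir])
    return cmds


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in stripe_commands("/path/to/lustre/testdir"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

The reproducer would then be run once in each directory, so that failures can be compared across stripe counts.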
Hi Jay Lan,
I have already looked at those files, but I need more detailed information. Could you please turn on more debug options, especially LNET, on both the client and server side and collect the logs again? The most interesting thing is that the clients even lost their connection to the MGS, which is not involved in the I/O path at all. If I am guessing correctly, this is likely an LNET problem, but I'd like to confirm that before pointing my finger at others.
Do you know whether f90 opens the file with O_APPEND, and whether "write(9999) 66" just writes 66 bytes to the file?
Thank you.