Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.3.0
-
Lustre server 2.1.4 centos 6.3
Lustre clients 2.3.0 sles11sp1
-
2
-
7461
Description
After we upgraded our clients from 2.1.3 to 2.3.0, some users (and their number is growing) started seeing their applications fail, hang, or even crash. The servers run 2.1.4. In all cases, the same application ran OK with the 2.1.3 clients.
Since we do not have a reproducer for the hang and crash cases, we attach here a reproducer that causes an application to fail. The tests were executed with stripe counts of 1, 2, 4, 8, and 16. The higher the stripe count, the more likely the application is to fail.
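For anyone repeating the stripe-count sweep, the directories could be prepared roughly as sketched below. This is not taken from reproducer1.scr: the base directory and the stripe_N naming are hypothetical, and the script only prints the command when `lfs` is not installed. `lfs setstripe -c` is the standard Lustre command for setting a directory's default stripe count.

```shell
#!/bin/sh
# Sketch only: one test directory per stripe count used in the tests above.
# BASE_DIR is a hypothetical location; in practice it must be on the Lustre mount.
BASE_DIR="${BASE_DIR:-${TMPDIR:-/tmp}/stripe_test}"

for count in 1 2 4 8 16; do
    dir="$BASE_DIR/stripe_$count"
    mkdir -p "$dir"
    if command -v lfs >/dev/null 2>&1; then
        # Files created in $dir afterwards inherit this stripe count.
        lfs setstripe -c "$count" "$dir" || echo "lfs setstripe failed for $dir" >&2
    else
        echo "lfs not found; would run: lfs setstripe -c $count $dir"
    fi
done
```

The reproducer would then be run once per directory so that failures can be compared across stripe counts.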
'reproducer1.scr' is a PBS script that starts 1024 MPI tests.
'reproducer1.scr.o1000145' is the PBS output of the execution.
'1000145.pbspl1.0.log.txt' is the output of one of our tools, which collects the /var/log/messages entries related to the specified job from the servers and clients.
The PBS-specific argument lines start with the string "#PBS " and are ignored if the script is executed without PBS. The script uses SGI MPT, but can be converted to Open MPI or Intel MPI.
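To illustrate why the "#PBS " lines are harmless outside PBS: the shell treats them as ordinary comments, while PBS reads them as job options at submission time. The sketch below is not the attached reproducer; the job name and walltime are hypothetical values, and the check relies on PBS setting the PBS_JOBID environment variable for jobs it starts.

```shell
#!/bin/sh
# Hypothetical directives; the shell sees these two lines as plain comments.
#PBS -N reproducer_sketch
#PBS -l walltime=00:10:00

# PBS sets PBS_JOBID for jobs it launches; it is unset when the script is run by hand.
if [ -n "${PBS_JOBID:-}" ]; then
    mode="pbs"
else
    mode="standalone"
fi
echo "running in $mode mode"
```

Run directly with `sh`, this reports standalone mode; submitted through `qsub`, the same file would pick up the directives and report PBS mode.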
Hi Jinshan,
The client-side logs are in 1000145.pbspl1.0.log.txt. You may want to filter out the PBS information.
nbp2-server-logs.LU-3062 is the tarball of all server-side logs:
linux39.jlan 109> tar -tzf nbp2-server-logs.LU-3062
service160
service161
service162
service163
service164
service165
service166
service167
service168
service160 is the MDS/MGS; the rest are OSSes.