[LU-1871] MDS oops in mdsrate-create-small.sh: Thread overran stack, or stack corrupted Created: 23/Aug/12 Updated: 12/Sep/12 Resolved: 12/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sarah Liu | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | releases | ||
| Environment: |
server/client: lustre-b2_3/build #1/RHEL6 |
||
| Severity: | 3 |
| Rank (Obsolete): | 2219 |
| Description |
|
I think this is an MPI-related issue:

[[55105,1],10]: A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: client-10vm1.lab.whamcloud.com
Another transport will be used instead, although this may result in lower performance.
--------------------------------------------------------------------------
CMA: no RDMA devices found
(the line above is repeated 16 times)
r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1345680120: No such file or directory
r= 0 l=0000: WR A 203927/0x031c97, AP a 157830/0x026886, TR@ 308317/0x04b45d
[client-10vm1.lab.whamcloud.com:12004] 15 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[client-10vm1.lab.whamcloud.com:12004] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
r= 0 l=1000: WR M 391981/0x05fb2d, AP m 363671/0x058c97, TR@ 495994/0x07917a |
| Comments |
| Comment by Minh Diep [ 10/Sep/12 ] |
|
Are there more logs (dmesg, console)? The test should not fail even with these messages. |
| Comment by Chris Gearing (Inactive) [ 10/Sep/12 ] |
|
From Minh's comments this seems like a Lustre test issue: the test needs to be fixed to fall back appropriately to a slower transport. It is very hard to be sure, because no link to the failing tests was provided; it's not even possible to know which test was running. |
| Comment by Minh Diep [ 10/Sep/12 ] |
|
The 'lower performance' warning can be ignored: http://cac.engin.umich.edu/faq.html |
| Comment by Minh Diep [ 10/Sep/12 ] |
|
What has changed:

before:
+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh -mca btl tcp,self -np 1 -machinefile /tmp/mdsrate-create-small.machines /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 129674 --dir /mnt/lustre/mdsrate/single --filefmt 'f%%d' "

today:
+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh -np 6 -machinefile /tmp/mdsrate-create-small.machines /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 128386 --dir /mnt/lustre/mdsrate/multi --filefmt 'f%%d' "

We removed "-mca btl tcp,self". However, those options were added only to suppress the message; they do not cause any issue. If I recall correctly, we removed them because of an issue when running on IB. Perhaps we need to check whether we are running on IB or TCP and set the appropriate options. |
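The check Minh suggests could be sketched as below. This is only an illustration, not part of the Lustre test scripts: `choose_btl`, its sysfs-path argument, and the default `/sys/class/infiniband` probe are assumptions about how one might detect an RDMA-capable node before building the mpirun command line.

```shell
#!/bin/sh
# Hypothetical helper: pick the Open MPI btl list depending on whether an
# InfiniBand HCA is visible in sysfs. The path argument exists only so the
# probe can be tested; a real caller would use the default.
choose_btl() {
    ib_dir=${1:-/sys/class/infiniband}
    # any entry under /sys/class/infiniband means an RDMA device is present
    if [ -d "$ib_dir" ] && [ -n "$(ls -A "$ib_dir" 2>/dev/null)" ]; then
        echo "openib,self"
    else
        echo "tcp,self"
    fi
}
```

Usage would then look like `mpirun -mca boot ssh -mca btl "$(choose_btl)" ...` in place of the hard-coded (or missing) `-mca btl` option in the commands above.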
| Comment by Minh Diep [ 10/Sep/12 ] |
|
This test passed a couple of days before: https://maloo.whamcloud.com/test_sets/506f3352-fa9b-11e1-887d-52540035b04c

The client shows that mdsrate was hung:
20:46:53:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 3 previous similar messages

Perhaps it is related to this Oops on the MDS:
23:02:04:Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh |
| Comment by Andreas Dilger [ 10/Sep/12 ] |
|
Need to run "checkstack" on the current master, as well as on an older Lustre (maybe b2_1), so that we can see which functions have grown in stack usage. Unfortunately, there is no stack trace that shows what the call path is. |
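The comparison Andreas describes could be sketched as follows. This is an assumption-laden illustration: it presumes checkstack reports in the kernel's `scripts/checkstack.pl` format ("0xADDR function [object]: SIZE", e.g. from `objdump -d lustre.ko | perl scripts/checkstack.pl x86_64`), and the helper names `cs_extract`/`stack_growth` and the report file names are hypothetical.

```shell
#!/bin/sh
# Extract "function size" pairs from a checkstack report, sorted by function.
# Field 2 is the symbol name, the last field the stack frame size in bytes.
cs_extract() {
    awk '{ print $2, $NF }' "$1" | sort -u -k1,1
}

# Print every function whose stack frame grew between the old report ($1)
# and the new report ($2), joining the two reports on the function name.
stack_growth() {
    old_tmp=$(mktemp); new_tmp=$(mktemp)
    cs_extract "$1" > "$old_tmp"
    cs_extract "$2" > "$new_tmp"
    join "$old_tmp" "$new_tmp" | awk '$3 > $2 { printf "%s: %d -> %d\n", $1, $2, $3 }'
    rm -f "$old_tmp" "$new_tmp"
}
```

Run against reports from b2_1 and master (e.g. `stack_growth b2_1.txt master.txt`), this would list the candidate functions responsible for the stack overrun.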
| Comment by Peter Jones [ 10/Sep/12 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Lai Siyao [ 11/Sep/12 ] |
|
I reproduced it once locally, but the result doesn't print a backtrace either. I'll try to reproduce it again and gather more useful information before the MDS oops. |
| Comment by Lai Siyao [ 12/Sep/12 ] |
|
This is a duplicate of |