
MDS oops in mdsrate-create-small.sh: Thread overran stack, or stack corrupted

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • None
    • Environment: server/client: lustre-b2_3/build #1/RHEL6
    • Severity: 3
    • 2219

    Description

      I think this is an MPI-related issue.
      Test log from client 1:

      [[55105,1],10]: A high-performance Open MPI point-to-point messaging module
      was unable to find any relevant network interfaces:
      
      Module: OpenFabrics (openib)
        Host: client-10vm1.lab.whamcloud.com
      
      Another transport will be used instead, although this may result in
      lower performance.
      --------------------------------------------------------------------------
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1345680120: No such file or directory
      r= 0 l=0000: WR A  203927/0x031c97, AP a  157830/0x026886, TR@  308317/0x04b45d
      [client-10vm1.lab.whamcloud.com:12004] 15 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
      [client-10vm1.lab.whamcloud.com:12004] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
      r= 0 l=1000: WR M  391981/0x05fb2d, AP m  363671/0x058c97, TR@  495994/0x07917a
      

      Attachments

        Activity

          laisiyao Lai Siyao added a comment -

          This is a duplicate of LU-1881, and this test can pass with the patch for 1881.

          laisiyao Lai Siyao added a comment -

          I reproduced it once locally, but the result doesn't print a backtrace either. I'll try to reproduce it again and gather more useful information before the MDS oops.

          pjones Peter Jones added a comment -

          Lai

          Could you please look into this one?

          Thanks

          Peter


          adilger Andreas Dilger added a comment -

          We need to run "checkstack" on the current master, as well as on an older Lustre (maybe b2_1), so that we can see which functions have grown in stack usage. Unfortunately, there is no stack trace that shows what the call path is.
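
          A minimal sketch of the comparison suggested above, assuming the kernel tree's scripts/checkstack.pl is available on the build node; the module path and report file names are only illustrative:

              # Dump per-function stack usage for the Lustre modules built from master
              # (the module path here is illustrative).
              for ko in /lib/modules/$(uname -r)/updates/kernel/fs/lustre/*.ko; do
                  objdump -d "$ko" | perl ./scripts/checkstack.pl x86_64
              done > stack-usage-master.txt

              # Run the same loop on a node with the older b2_1 build, saving the output
              # to stack-usage-b2_1.txt, then compare the reports to see which functions
              # gained stack usage.
              diff stack-usage-b2_1.txt stack-usage-master.txt | less
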
          mdiep Minh Diep added a comment -

          This test passed a couple of days before: https://maloo.whamcloud.com/test_sets/506f3352-fa9b-11e1-887d-52540035b04c
          It started failing on 9/9: https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c

          The client logs show that mdsrate was hung:

          20:46:53:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
          20:47:24:INFO: task mdsrate:21390 blocked for more than 120 seconds.
          20:47:24:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          20:47:24:mdsrate D 0000000000000000 0 21390 21387 0x00000080
          20:47:24: ffff88004c1d7be8 0000000000000082 ffff88000000012d ffffffff0000012d
          20:47:24: ffff880000000065 ffffffffa0b404a0 000000000000012d ffffffffa0715cc5
          20:47:24: ffff880048d5c638 ffff88004c1d7fd8 000000000000fb88 ffff880048d5c638
          20:47:24:Call Trace:
          20:47:24: [<ffffffffa06c042d>] ? lustre_msg_buf+0x5d/0x60 [ptlrpc]
          20:47:24: [<ffffffffa06ed4a6>] ? __req_capsule_get+0x176/0x750 [ptlrpc]
          20:47:24: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
          20:47:24: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
          20:47:24: [<ffffffffa06609b3>] mdc_close+0x193/0x9a0 [mdc]
          20:47:24: [<ffffffffa061b8c6>] lmv_close+0x2d6/0x5a0 [lmv]
          20:47:24: [<ffffffffa099c7af>] ll_close_inode_openhandle+0x30f/0x1050 [lustre]
          20:47:24: [<ffffffffa099d69a>] ll_md_real_close+0x1aa/0x220 [lustre]
          20:47:24: [<ffffffffa099d96b>] ll_md_close+0x25b/0x760 [lustre]
          20:47:24: [<ffffffff814ff546>] ? down_read+0x16/0x30
          20:47:24: [<ffffffffa099df8b>] ll_file_release+0x11b/0x3e0 [lustre]
          20:47:24: [<ffffffff8117ca65>] __fput+0xf5/0x210
          20:47:24: [<ffffffff8117cba5>] fput+0x25/0x30
          20:47:24: [<ffffffff811785cd>] filp_close+0x5d/0x90
          20:47:24: [<ffffffff811786a5>] sys_close+0xa5/0x100
          20:47:24: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347162469/real 1347162469] req@ffff88007a52dc00 x1412601028282197/t0(0) o250->MGC10.10.4.170@tcp@10.10.4.170@tcp:26/25 lens 400/544 e 0 to 1 dl 1347162494 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
          20:49:26:INFO: task mdsrate:21390 blocked for more than 120 seconds.
          20:49:26:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

          Perhaps it is related to this Oops on the MDS:
          https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c
          https://maloo.whamcloud.com/test_logs/6a1db8f4-fa51-11e1-887d-52540035b04c/show_text

          23:02:04:Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh
          23:02:04:BUG: unable to handle kernel paging request at 0000000781bbc060
          23:02:04:IP: [<ffffffff8104f1c7>] resched_task+0x17/0x80
          23:02:04:PGD 374e5067 PUD 0
          23:02:04:Thread overran stack, or stack corrupted
          23:02:04:Oops: 0000 [#1] SMP
          23:02:04:last sysfs file: /sys/devices/system/cpu/possible
          23:02:04:CPU 0
          23:02:04:Modules linked in: osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) lustre(U) obdfilter(U) ost(U) cmm(U) mdt(U) mdd(U) mds(U) mgs(U) obdecho(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
          23:02:04:
          23:02:04:Pid: 21654, comm: mdt00_001 Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1 Red Hat KVM
          23:02:04:RIP: 0010:[<ffffffff8104f1c7>] [<ffffffff8104f1c7>] resched_task+0x17/0x80
          23:02:04:RSP: 0018:ffff880002203de8 EFLAGS: 00010087
          23:02:04:RAX: 00000000000166c0 RBX: ffff880068462b18 RCX: 00000000ffff8800
          23:02:04:RDX: ffff880031c2a000 RSI: 0000000000000400 RDI: ffff880068462ae0
          23:02:04:RBP: ffff880002203de8 R08: 0000000000989680 R09: 0000000000000000
          23:02:04:R10: 0000000000000010 R11: 0000000000000000 R12: ffff880002216728
          23:02:04:R13: 0000000000000000 R14: 0000000000000000 R15: ffff880068462ae0
          23:02:04:FS: 00007f0480a26700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
          23:02:04:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          23:02:04:CR2: 0000000781bbc060 CR3: 000000007089b000 CR4: 00000000000006f0
          23:02:04:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          23:02:04:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          23:02:04:Process mdt00_001 (pid: 21654, threadinfo ffff880031c2a000, task ffff880068462ae0)
          23:02:04:Stack:
          23:02:04: ffff880002203e18 ffffffff8105484c ffff8800022166c0 0000000000000000
          23:02:04:<d> 00000000000166c0 0000000000000000 ffff880002203e58 ffffffff81057fa1
          23:02:04:<d> ffff880002203e58 ffff880068462ae0 0000000000000000 0000000000000000
          23:02:04:Call Trace:
          23:02:04: <IRQ>

          mdiep Minh Diep added a comment -

          What changed was:

          before: + su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh -mca btl tcp,self -np 1 -machinefile /tmp/mdsrate-create-small.machines /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 129674 --dir /mnt/lustre/mdsrate/single --filefmt 'f%%d' "

          today: + su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh -np 6 -machinefile /tmp/mdsrate-create-small.machines /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 128386 --dir /mnt/lustre/mdsrate/multi --filefmt 'f%%d' "

          We removed '-mca btl tcp,self'. However, those options were added only to suppress the warning message; removing them should not cause any issue by itself.

          If I recall correctly, we removed them because of an issue when running on IB. Perhaps we need to check whether we are running on IB or TCP and set the appropriate options.
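
          A hedged sketch of the check described above, assuming the test environment exposes the Lustre network type in a variable such as NETTYPE; the variable name and option strings are assumptions, not the actual mdsrate-create-small.sh code:

              # Pick Open MPI transport options that match the Lustre network type.
              if [ "${NETTYPE:-tcp}" = "o2ib" ]; then
                  # RDMA hardware is present: let Open MPI choose the openib BTL itself.
                  MPI_BTL_OPTS=""
              else
                  # No RDMA devices (as on these VMs): force TCP and silence the
                  # "unable to find any relevant network interfaces" warning.
                  MPI_BTL_OPTS="-mca btl tcp,self"
              fi

              su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh $MPI_BTL_OPTS \
                  -np 6 -machinefile /tmp/mdsrate-create-small.machines \
                  /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 128386 \
                  --dir /mnt/lustre/mdsrate/multi --filefmt 'f%%d'"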

          mdiep Minh Diep added a comment -

          The 'lower performance' warning can be ignored: http://cac.engin.umich.edu/faq.html
          Maybe this will help: http://www.open-mpi.org/community/lists/users/2012/02/18465.php

          chris Chris Gearing (Inactive) added a comment -

          From Minh's comments, this seems like a Lustre test issue. The Lustre test needs to be fixed so that it falls back appropriately to a slower device.

          It is very hard to be sure, because no link to the failing tests was provided; it is not even possible to tell which test was running.
          mdiep Minh Diep added a comment -

          Are there more logs (dmesg, console)? The test should not fail even with these messages.


          People

            Assignee: laisiyao Lai Siyao
            Reporter: sarah Sarah Liu
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: