
MDS oops in mdsrate-create-small.sh: Thread overran stack, or stack corrupted

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • None
    • Environment: server/client: lustre-b2_3/build #1/RHEL6
    • Severity: 3
    • 2219

    Description

      I think this is an MPI-related issue.
      Test log from client 1:

      [[55105,1],10]: A high-performance Open MPI point-to-point messaging module
      was unable to find any relevant network interfaces:
      
      Module: OpenFabrics (openib)
        Host: client-10vm1.lab.whamcloud.com
      
      Another transport will be used instead, although this may result in
      lower performance.
      --------------------------------------------------------------------------
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      CMA: no RDMA devices found
      r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1345680120: No such file or directory
      r= 0 l=0000: WR A  203927/0x031c97, AP a  157830/0x026886, TR@  308317/0x04b45d
      [client-10vm1.lab.whamcloud.com:12004] 15 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
      [client-10vm1.lab.whamcloud.com:12004] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
      r= 0 l=1000: WR M  391981/0x05fb2d, AP m  363671/0x058c97, TR@  495994/0x07917a
      

      Attachments

        Activity

          laisiyao Lai Siyao added a comment -

          This is a duplicate of LU-1881, and this test can pass with the patch for 1881.

          laisiyao Lai Siyao added a comment -

          I reproduced it once locally, but the result doesn't print a backtrace either. I'll try to reproduce it again and gather more useful information before the MDS oops.

          pjones Peter Jones added a comment -

          Lai

          Could you please look into this one?

          Thanks

          Peter


          adilger Andreas Dilger added a comment -

          We need to run "checkstack" on the current master, as well as on an older Lustre (maybe b2_1), so that we can see which functions have grown in stack usage. Unfortunately, there is no stack trace that shows what the call path is.
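
          A minimal sketch of the comparison suggested above, assuming the kernel tree's scripts/checkstack.pl is available on the build node; the module path and report file names are only illustrative:

              # Dump per-function stack usage for the Lustre modules built from master
              # (the module path here is illustrative).
              for ko in /lib/modules/$(uname -r)/updates/kernel/fs/lustre/*.ko; do
                  objdump -d "$ko" | perl ./scripts/checkstack.pl x86_64
              done > stack-usage-master.txt

              # Run the same loop on a node with the older b2_1 build, saving the output
              # to stack-usage-b2_1.txt, then compare the reports to see which functions
              # gained stack usage.
              diff stack-usage-b2_1.txt stack-usage-master.txt | less
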
          mdiep Minh Diep added a comment -

          This test passed a couple of days before: https://maloo.whamcloud.com/test_sets/506f3352-fa9b-11e1-887d-52540035b04c
          It started failing on 9/9: https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c

          The client logs show that mdsrate was hung:

          20:46:53:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
          20:47:24:INFO: task mdsrate:21390 blocked for more than 120 seconds.
          20:47:24:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          20:47:24:mdsrate D 0000000000000000 0 21390 21387 0x00000080
          20:47:24: ffff88004c1d7be8 0000000000000082 ffff88000000012d ffffffff0000012d
          20:47:24: ffff880000000065 ffffffffa0b404a0 000000000000012d ffffffffa0715cc5
          20:47:24: ffff880048d5c638 ffff88004c1d7fd8 000000000000fb88 ffff880048d5c638
          20:47:24:Call Trace:
          20:47:24: [<ffffffffa06c042d>] ? lustre_msg_buf+0x5d/0x60 [ptlrpc]
          20:47:24: [<ffffffffa06ed4a6>] ? __req_capsule_get+0x176/0x750 [ptlrpc]
          20:47:24: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
          20:47:24: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
          20:47:24: [<ffffffffa06609b3>] mdc_close+0x193/0x9a0 [mdc]
          20:47:24: [<ffffffffa061b8c6>] lmv_close+0x2d6/0x5a0 [lmv]
          20:47:24: [<ffffffffa099c7af>] ll_close_inode_openhandle+0x30f/0x1050 [lustre]
          20:47:24: [<ffffffffa099d69a>] ll_md_real_close+0x1aa/0x220 [lustre]
          20:47:24: [<ffffffffa099d96b>] ll_md_close+0x25b/0x760 [lustre]
          20:47:24: [<ffffffff814ff546>] ? down_read+0x16/0x30
          20:47:24: [<ffffffffa099df8b>] ll_file_release+0x11b/0x3e0 [lustre]
          20:47:24: [<ffffffff8117ca65>] __fput+0xf5/0x210
          20:47:24: [<ffffffff8117cba5>] fput+0x25/0x30
          20:47:24: [<ffffffff811785cd>] filp_close+0x5d/0x90
          20:47:24: [<ffffffff811786a5>] sys_close+0xa5/0x100
          20:47:24: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347162469/real 1347162469] req@ffff88007a52dc00 x1412601028282197/t0(0) o250->MGC10.10.4.170@tcp@10.10.4.170@tcp:26/25 lens 400/544 e 0 to 1 dl 1347162494 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
          20:49:26:INFO: task mdsrate:21390 blocked for more than 120 seconds.
          20:49:26:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

          Perhaps it is related to this Oops on the MDS:
          https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c
          https://maloo.whamcloud.com/test_logs/6a1db8f4-fa51-11e1-887d-52540035b04c/show_text

          23:02:04:Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh
          23:02:04:BUG: unable to handle kernel paging request at 0000000781bbc060
          23:02:04:IP: [<ffffffff8104f1c7>] resched_task+0x17/0x80
          23:02:04:PGD 374e5067 PUD 0
          23:02:04:Thread overran stack, or stack corrupted
          23:02:04:Oops: 0000 [#1] SMP
          23:02:04:last sysfs file: /sys/devices/system/cpu/possible
          23:02:04:CPU 0
          23:02:04:Modules linked in: osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) lustre(U) obdfilter(U) ost(U) cmm(U) mdt(U) mdd(U) mds(U) mgs(U) obdecho(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
          23:02:04:
          23:02:04:Pid: 21654, comm: mdt00_001 Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1 Red Hat KVM
          23:02:04:RIP: 0010:[<ffffffff8104f1c7>] [<ffffffff8104f1c7>] resched_task+0x17/0x80
          23:02:04:RSP: 0018:ffff880002203de8 EFLAGS: 00010087
          23:02:04:RAX: 00000000000166c0 RBX: ffff880068462b18 RCX: 00000000ffff8800
          23:02:04:RDX: ffff880031c2a000 RSI: 0000000000000400 RDI: ffff880068462ae0
          23:02:04:RBP: ffff880002203de8 R08: 0000000000989680 R09: 0000000000000000
          23:02:04:R10: 0000000000000010 R11: 0000000000000000 R12: ffff880002216728
          23:02:04:R13: 0000000000000000 R14: 0000000000000000 R15: ffff880068462ae0
          23:02:04:FS: 00007f0480a26700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
          23:02:04:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          23:02:04:CR2: 0000000781bbc060 CR3: 000000007089b000 CR4: 00000000000006f0
          23:02:04:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          23:02:04:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          23:02:04:Process mdt00_001 (pid: 21654, threadinfo ffff880031c2a000, task ffff880068462ae0)
          23:02:04:Stack:
          23:02:04: ffff880002203e18 ffffffff8105484c ffff8800022166c0 0000000000000000
          23:02:04:<d> 00000000000166c0 0000000000000000 ffff880002203e58 ffffffff81057fa1
          23:02:04:<d> ffff880002203e58 ffff880068462ae0 0000000000000000 0000000000000000
          23:02:04:Call Trace:
          23:02:04: <IRQ>

          mdiep Minh Diep added a comment -

          What changed was:

          before: + su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh -mca btl tcp,self -np 1 -machinefile /tmp/mdsrate-create-small.machines /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 129674 --dir /mnt/lustre/mdsrate/single --filefmt 'f%%d' "

          today: + su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh -np 6 -machinefile /tmp/mdsrate-create-small.machines /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 128386 --dir /mnt/lustre/mdsrate/multi --filefmt 'f%%d' "

          We removed '-mca btl tcp,self'. However, those options were added only to suppress the warning message; removing them should not cause any issue by itself.

          If I recall correctly, we removed them because of an issue when running on IB. Perhaps we need to check whether we are running on IB or TCP and set the appropriate options.
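
          A hedged sketch of the check described above, assuming the test environment exposes the Lustre network type in a variable such as NETTYPE; the variable name and option strings are assumptions, not the actual mdsrate-create-small.sh code:

              # Pick Open MPI transport options that match the Lustre network type.
              if [ "${NETTYPE:-tcp}" = "o2ib" ]; then
                  # RDMA hardware is present: let Open MPI choose the openib BTL itself.
                  MPI_BTL_OPTS=""
              else
                  # No RDMA devices (as on these VMs): force TCP and silence the
                  # "unable to find any relevant network interfaces" warning.
                  MPI_BTL_OPTS="-mca btl tcp,self"
              fi

              su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh $MPI_BTL_OPTS \
                  -np 6 -machinefile /tmp/mdsrate-create-small.machines \
                  /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 128386 \
                  --dir /mnt/lustre/mdsrate/multi --filefmt 'f%%d'"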

          mdiep Minh Diep added a comment -

          The 'lower performance' warning can be ignored: http://cac.engin.umich.edu/faq.html
          Maybe this will help: http://www.open-mpi.org/community/lists/users/2012/02/18465.php

          chris Chris Gearing (Inactive) added a comment -

          From Minh's comments, this seems like a Lustre test issue. The Lustre test needs to be fixed so that it falls back appropriately to a slower device.

          It is very hard to be sure, because no link to the failing tests was provided; it is not even possible to tell which test was running.
          mdiep Minh Diep added a comment -

          Are there more logs (dmesg, console)? The test should not fail even with these messages.


          People

            Assignee: laisiyao Lai Siyao
            Reporter: sarah Sarah Liu
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: