Lustre / LU-1871

MDS oops in mdsrate-create-small.sh: Thread overran stack, or stack corrupted

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Environment: server/client: lustre-b2_3/build #1/RHEL6

    Description

      I think this is an MPI-related issue.
      Test log from client 1:

      [[55105,1],10]: A high-performance Open MPI point-to-point messaging module
      was unable to find any relevant network interfaces:
      
      Module: OpenFabrics (openib)
        Host: client-10vm1.lab.whamcloud.com
      
      Another transport will be used instead, although this may result in
      lower performance.
      --------------------------------------------------------------------------
      CMA: no RDMA devices found
      (15 identical "CMA: no RDMA devices found" lines omitted)
      r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1345680120: No such file or directory
      r= 0 l=0000: WR A  203927/0x031c97, AP a  157830/0x026886, TR@  308317/0x04b45d
      [client-10vm1.lab.whamcloud.com:12004] 15 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
      [client-10vm1.lab.whamcloud.com:12004] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
      r= 0 l=1000: WR M  391981/0x05fb2d, AP m  363671/0x058c97, TR@  495994/0x07917a
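
      For context, the "no RDMA devices found" / OpenFabrics warnings above are typically benign on TCP-only test VMs like these. A hedged sketch of how such warnings are usually silenced — the MCA flags are standard Open MPI options, but their applicability to this cluster, and the mpirun invocation itself, are assumptions (only the binary name and file path come from the log):

```shell
# Sketch (not from the ticket): restrict Open MPI to TCP/shared-memory
# transports so it does not probe for RDMA devices on these VMs.
mpirun --mca btl tcp,sm,self -np 16 ./write_append_truncate \
    /mnt/lustre/d0.write_append_truncate/f0.wat

# Equivalently, exclude only the OpenFabrics BTL:
mpirun --mca btl ^openib -np 16 ./write_append_truncate \
    /mnt/lustre/d0.write_append_truncate/f0.wat

# As the log output itself suggests, disable help-message aggregation
# to see every per-process warning instead of the "15 more processes
# have sent help message" summary:
mpirun --mca orte_base_help_aggregate 0 -np 16 ./write_append_truncate \
    /mnt/lustre/d0.write_append_truncate/f0.wat
```

      Either way, these warnings only affect transport selection and performance; they are unrelated to the MDS oops itself.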
      

      Attachments

        Activity

          laisiyao Lai Siyao made changes -
          Resolution New: Duplicate [ 3 ]
          Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
          laisiyao Lai Siyao added a comment -

          This is a duplicate of LU-1881; this test passes with the patch for LU-1881.

          laisiyao Lai Siyao added a comment -

          I reproduced this once locally, but that run didn't print a backtrace either. I'll try to reproduce it again and gather more useful information from before the MDS oops.

          laisiyao Lai Siyao made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]
          pjones Peter Jones made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Lai Siyao [ laisiyao ]
          pjones Peter Jones added a comment -

          Lai

          Could you please look into this one?

          Thanks

          Peter

          adilger Andreas Dilger made changes -
          Summary Original: MDS oops in mdsrate-create-small.sh: stack overflow New: MDS oops in mdsrate-create-small.sh: Thread overran stack, or stack corrupted

          adilger Andreas Dilger added a comment -

          We need to run "checkstack" on the current master, as well as on an older Lustre release (maybe b2_1), so that we can see which functions have grown in stack usage. Unfortunately, there is no stack trace that shows what the call path is.
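
          The "checkstack" comparison suggested here can be sketched roughly as follows — an editor's sketch, not part of the ticket. checkstack.pl ships with the Linux kernel source under scripts/; the kernel-scripts location, source-tree layout, and module paths below are all assumptions for illustration:

```shell
# Sketch (not from the ticket): compare per-function stack usage of a
# Lustre server module between two builds using the kernel's checkstack.pl.
KSCRIPTS=/usr/src/kernels/$(uname -r)/scripts   # assumed kernel-scripts path

# Per-function stack usage of the MDT module on master...
objdump -d lustre-master/lustre/mdt/mdt.ko \
    | perl "$KSCRIPTS"/checkstack.pl x86_64 > /tmp/stack-master.txt

# ...and on the older branch (b2_1):
objdump -d lustre-b2_1/lustre/mdt/mdt.ko \
    | perl "$KSCRIPTS"/checkstack.pl x86_64 > /tmp/stack-b2_1.txt

# Frames that appeared or grew between the two builds:
diff /tmp/stack-b2_1.txt /tmp/stack-master.txt
```

          checkstack.pl lists functions ordered by stack frame size, so a plain diff of the two listings highlights which functions' frames grew between releases, even without a backtrace from the oops.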
          adilger Andreas Dilger made changes -
          Summary Original: CMA: no RDMA devices found New: MDS oops in mdsrate-create-small.sh: stack overflow
          mdiep Minh Diep added a comment -

          This test passed a couple of days earlier: https://maloo.whamcloud.com/test_sets/506f3352-fa9b-11e1-887d-52540035b04c
          It started failing on 9/9: https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c

          The client logs show that mdsrate was hung:

          20:46:53:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
          20:47:24:INFO: task mdsrate:21390 blocked for more than 120 seconds.
          20:47:24:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          20:47:24:mdsrate D 0000000000000000 0 21390 21387 0x00000080
          20:47:24: ffff88004c1d7be8 0000000000000082 ffff88000000012d ffffffff0000012d
          20:47:24: ffff880000000065 ffffffffa0b404a0 000000000000012d ffffffffa0715cc5
          20:47:24: ffff880048d5c638 ffff88004c1d7fd8 000000000000fb88 ffff880048d5c638
          20:47:24:Call Trace:
          20:47:24: [<ffffffffa06c042d>] ? lustre_msg_buf+0x5d/0x60 [ptlrpc]
          20:47:24: [<ffffffffa06ed4a6>] ? __req_capsule_get+0x176/0x750 [ptlrpc]
          20:47:24: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
          20:47:24: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
          20:47:24: [<ffffffffa06609b3>] mdc_close+0x193/0x9a0 [mdc]
          20:47:24: [<ffffffffa061b8c6>] lmv_close+0x2d6/0x5a0 [lmv]
          20:47:24: [<ffffffffa099c7af>] ll_close_inode_openhandle+0x30f/0x1050 [lustre]
          20:47:24: [<ffffffffa099d69a>] ll_md_real_close+0x1aa/0x220 [lustre]
          20:47:24: [<ffffffffa099d96b>] ll_md_close+0x25b/0x760 [lustre]
          20:47:24: [<ffffffff814ff546>] ? down_read+0x16/0x30
          20:47:24: [<ffffffffa099df8b>] ll_file_release+0x11b/0x3e0 [lustre]
          20:47:24: [<ffffffff8117ca65>] __fput+0xf5/0x210
          20:47:24: [<ffffffff8117cba5>] fput+0x25/0x30
          20:47:24: [<ffffffff811785cd>] filp_close+0x5d/0x90
          20:47:24: [<ffffffff811786a5>] sys_close+0xa5/0x100
          20:47:24: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
          20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347162469/real 1347162469] req@ffff88007a52dc00 x1412601028282197/t0(0) o250->MGC10.10.4.170@tcp@10.10.4.170@tcp:26/25 lens 400/544 e 0 to 1 dl 1347162494 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
          20:49:26:INFO: task mdsrate:21390 blocked for more than 120 seconds.
          20:49:26:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

          Perhaps it is related to this oops on the MDS:
          https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c
          https://maloo.whamcloud.com/test_logs/6a1db8f4-fa51-11e1-887d-52540035b04c/show_text

          23:02:04:Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh
          23:02:04:BUG: unable to handle kernel paging request at 0000000781bbc060
          23:02:04:IP: [<ffffffff8104f1c7>] resched_task+0x17/0x80
          23:02:04:PGD 374e5067 PUD 0
          23:02:04:Thread overran stack, or stack corrupted
          23:02:04:Oops: 0000 [#1] SMP
          23:02:04:last sysfs file: /sys/devices/system/cpu/possible
          23:02:04:CPU 0
          23:02:04:Modules linked in: osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) lustre(U) obdfilter(U) ost(U) cmm(U) mdt(U) mdd(U) mds(U) mgs(U) obdecho(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
          23:02:04:
          23:02:04:Pid: 21654, comm: mdt00_001 Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1 Red Hat KVM
          23:02:04:RIP: 0010:[<ffffffff8104f1c7>] [<ffffffff8104f1c7>] resched_task+0x17/0x80
          23:02:04:RSP: 0018:ffff880002203de8 EFLAGS: 00010087
          23:02:04:RAX: 00000000000166c0 RBX: ffff880068462b18 RCX: 00000000ffff8800
          23:02:04:RDX: ffff880031c2a000 RSI: 0000000000000400 RDI: ffff880068462ae0
          23:02:04:RBP: ffff880002203de8 R08: 0000000000989680 R09: 0000000000000000
          23:02:04:R10: 0000000000000010 R11: 0000000000000000 R12: ffff880002216728
          23:02:04:R13: 0000000000000000 R14: 0000000000000000 R15: ffff880068462ae0
          23:02:04:FS: 00007f0480a26700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
          23:02:04:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          23:02:04:CR2: 0000000781bbc060 CR3: 000000007089b000 CR4: 00000000000006f0
          23:02:04:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          23:02:04:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          23:02:04:Process mdt00_001 (pid: 21654, threadinfo ffff880031c2a000, task ffff880068462ae0)
          23:02:04:Stack:
          23:02:04: ffff880002203e18 ffffffff8105484c ffff8800022166c0 0000000000000000
          23:02:04:<d> 00000000000166c0 0000000000000000 ffff880002203e58 ffffffff81057fa1
          23:02:04:<d> ffff880002203e58 ffff880068462ae0 0000000000000000 0000000000000000
          23:02:04:Call Trace:
          23:02:04: <IRQ>


          People

            Assignee: laisiyao Lai Siyao
            Reporter: sarah Sarah Liu
            Votes: 0
            Watchers: 1
