[LU-5773] obdfilter-survey test 1c: oom occurred on OSS

Details


    Description

      While running obdfilter-survey test 1c, an OOM failure occurred on the OSS:

      21:17:56:Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch == 02:50:56 (1412823056)
      21:17:56:Lustre: DEBUG MARKER: lctl dl | grep obdfilter
      21:17:56:Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
      21:17:56:Lustre: Echo OBD driver; http://www.lustre.org/
      21:17:56:hrtimer: interrupt took 7516 ns
      21:17:56:lctl invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
      21:17:56:lctl cpuset=/ mems_allowed=0
      21:17:56:Pid: 19467, comm: lctl Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      21:17:56:Call Trace:
      21:17:56: [<ffffffff810d07b1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      21:17:56: [<ffffffff81122b80>] ? dump_header+0x90/0x1b0
      21:17:56: [<ffffffff81122cee>] ? check_panic_on_oom+0x4e/0x80
      21:17:56: [<ffffffff811233db>] ? out_of_memory+0x1bb/0x3c0
      21:17:56: [<ffffffff8112fd5f>] ? __alloc_pages_nodemask+0x89f/0x8d0
      21:17:56: [<ffffffff81167dea>] ? alloc_pages_vma+0x9a/0x150
      21:17:56: [<ffffffff811499dd>] ? do_wp_page+0xfd/0x920
      21:17:56: [<ffffffff8133e4f5>] ? misc_open+0x1d5/0x330
      21:17:56: [<ffffffff8114a9fd>] ? handle_pte_fault+0x2cd/0xb00
      21:17:56: [<ffffffff8118d495>] ? chrdev_open+0x125/0x230
      21:17:56: [<ffffffff811ab840>] ? mntput_no_expire+0x30/0x110
      21:17:56: [<ffffffff8118d370>] ? chrdev_open+0x0/0x230
      21:17:56: [<ffffffff811863bf>] ? __dentry_open+0x23f/0x360
      21:17:56: [<ffffffff812284ef>] ? security_inode_permission+0x1f/0x30
      21:17:56: [<ffffffff8114b45a>] ? handle_mm_fault+0x22a/0x300
      21:17:56: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
      21:17:56: [<ffffffff8152f25e>] ? do_page_fault+0x3e/0xa0
      21:17:56: [<ffffffff8152f25e>] ? do_page_fault+0x3e/0xa0
      21:17:56: [<ffffffff8152c615>] ? page_fault+0x25/0x30
      

      Maloo report: https://testing.hpdd.intel.com/test_sets/973e0216-4fcd-11e4-8e65-5254006e85c2

    Activity

            yujian Jian Yu added a comment - One more instance on Lustre b2_5 branch: https://testing.hpdd.intel.com/test_sets/dc5dabda-8074-11e4-a434-5254006e85c2

            niu Niu Yawei (Inactive) added a comment - Looks like a dup of LU-3366. Perhaps the VM (OSS) can't afford such a test (7 OSTs, 128 brw threads per OST)?
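            The thread count in Niu's comment implies a large aggregate buffer footprint on a small VM. A rough back-of-the-envelope sketch (assuming 1 MiB of bulk I/O buffer per brw thread, a typical Lustre RPC size but an assumption here, not a figure taken from this ticket):

```python
# Hypothetical estimate of the peak bulk-buffer demand implied by the test
# configuration cited above: 7 OSTs, 128 brw threads per OST. The 1 MiB
# per-thread figure is an assumption (a common Lustre bulk RPC size).
osts = 7
threads_per_ost = 128
rpc_size_mib = 1

total_threads = osts * threads_per_ost      # 896 concurrent brw threads
peak_mib = total_threads * rpc_size_mib     # roughly 896 MiB of bulk buffers

print(total_threads, peak_mib)  # prints: 896 896
```

            Under that assumption, the bulk buffers alone approach a gigabyte, which would plausibly exhaust a small autotest VM and trigger the oom-killer.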

            jamesanunez James Nunez (Inactive) added a comment - Patch http://review.whamcloud.com/#/c/11971/ modified the ost-survey script, not obdfilter-survey. Is there still a connection between this ticket and the 11971 patch?

            adilger Andreas Dilger added a comment - It appears that http://review.whamcloud.com/11971 changed the obdfilter-survey script; it landed on Oct 5th, and this bug was filed on Oct 20th.

            jlevi Jodi Levi (Inactive) added a comment - Niu, could you please have a look at this one? Thank you!

            adilger Andreas Dilger added a comment - I found something strange in the OST logs: hundreds of lctl processes are running on the node, as if it were a fork bomb:

            11:04:49:[28448]     0 28448     3820      230   0       0             0 lctl
            11:04:49:[28449]     0 28449     3820      229   0       0             0 lctl
            11:04:49:[28450]     0 28450     3820      228   1       0             0 lctl
            11:04:49:[28451]     0 28451     3820      230   0       0             0 lctl
            11:04:49:[28452]     0 28452     3820      229   1       0             0 lctl
            [repeats]
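            When triaging a dump like the one above, it helps to count tasks by name rather than eyeball the list. A hypothetical helper (the field layout assumed is the standard oom-killer task dump, "[pid] uid pid total_vm rss ... name", with the process name as the trailing field):

```python
# Hypothetical helper: count how many tasks with a given name appear in an
# oom-killer task-dump excerpt. The trailing field of each "[pid] ..." line
# is the process name (comm).
import re

def count_tasks(log: str, name: str) -> int:
    count = 0
    for line in log.splitlines():
        # Match lines such as "11:04:49:[28448] 0 28448 3820 230 0 0 0 lctl"
        m = re.search(r"\[\d+\]\s+\d+\s+\d+.*\s(\S+)\s*$", line)
        if m and m.group(1) == name:
            count += 1
    return count

sample = """\
11:04:49:[28448]     0 28448     3820      230   0       0             0 lctl
11:04:49:[28449]     0 28449     3820      229   0       0             0 lctl
11:04:49:[28450]     0 28450     3820      228   1       0             0 lctl
"""
print(count_tasks(sample, "lctl"))  # prints: 3
```

            A count in the hundreds for a short-lived tool like lctl, as seen here, points at a script spawning it in a loop faster than the instances exit.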
            yujian Jian Yu added a comment - One more instance on master branch: https://testing.hpdd.intel.com/test_sets/19e98088-7e99-11e4-ab67-5254006e85c2

            yujian Jian Yu added a comment - edited - More instances on Lustre b2_5 branch:
            https://testing.hpdd.intel.com/test_sets/12401f30-7a4e-11e4-b9fd-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/393c5232-7a55-11e4-807e-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/ee460ed2-7980-11e4-aa22-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/f55941c6-6a58-11e4-b203-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/87b3b698-5cdd-11e4-8561-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/bce60eac-4eef-11e4-872e-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/6702eeee-7d55-11e4-943c-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/6502c6d6-7d33-11e4-943c-5254006e85c2

            yujian Jian Yu added a comment - One more instance on master branch: https://testing.hpdd.intel.com/test_sets/e6ec8538-6b45-11e4-88ff-5254006e85c2
            yujian Jian Yu added a comment -

            Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/100/
            Distro/Arch: RHEL6.5/x86_64 + SLES11SP3/x86_64 (Server + Client)

            The same OOM failure occurred on the OSS while running obdfilter-survey test 1c:

            automount invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
            automount cpuset=/ mems_allowed=0
            Pid: 10749, comm: automount Not tainted 2.6.32-431.29.2.el6_lustre.gf7f3864.x86_64 #1
            Call Trace:
             [<ffffffff810d07b1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
             [<ffffffff81122b80>] ? dump_header+0x90/0x1b0
             [<ffffffff81122cee>] ? check_panic_on_oom+0x4e/0x80
             [<ffffffff811233db>] ? out_of_memory+0x1bb/0x3c0
             [<ffffffff8112fd5f>] ? __alloc_pages_nodemask+0x89f/0x8d0
             [<ffffffff81167dea>] ? alloc_pages_vma+0x9a/0x150
             [<ffffffff8114ae6d>] ? handle_pte_fault+0x73d/0xb00
             [<ffffffff811f4570>] ? proc_delete_inode+0x0/0x80
             [<ffffffff8128b96e>] ? number+0x2ee/0x320
             [<ffffffff8114b45a>] ? handle_mm_fault+0x22a/0x300
             [<ffffffff8128b96e>] ? number+0x2ee/0x320
             [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
             [<ffffffff8128c210>] ? string+0x40/0x100
             [<ffffffff8128d776>] ? vsnprintf+0x336/0x5e0
             [<ffffffff8152f25e>] ? do_page_fault+0x3e/0xa0
             [<ffffffff8152c615>] ? page_fault+0x25/0x30
             [<ffffffff8128e71e>] ? copy_user_generic_unrolled+0x3e/0xb0
             [<ffffffff811aedb2>] ? seq_read+0x2d2/0x400
             [<ffffffff81189a95>] ? vfs_read+0xb5/0x1a0
             [<ffffffff81189bd1>] ? sys_read+0x51/0x90
             [<ffffffff810e204e>] ? __audit_syscall_exit+0x25e/0x290
             [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
            

            Maloo report: https://testing.hpdd.intel.com/test_sets/4bf3d7bc-6874-11e4-acbe-5254006e85c2

            The patch for LU-5079 was included in this build, so this is not a duplicate issue.


            adilger Andreas Dilger added a comment - The patch to fix this landed under LU-5079.

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: yujian Jian Yu