
Test failure obdfilter-survey, subtest test_1c: oom-killer

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.4.2, Lustre 2.5.1
    • None
    • 3
    • 8329

    Description

      This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4ac8e14a-bf36-11e2-88e0-52540035b04c.

      The sub-test test_1c failed with the following error:

      test failed to respond and timed out

      I see the following several times in the OSS console log:

      06:25:50:lctl invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
      

      OSS stack trace looks like:

      08:48:13: [<ffffffff81195731>] sys_ioctl+0x81/0xa0
      08:48:13: [<ffffffff810dc645>] ? __audit_syscall_exit+0x265/0x290
      08:48:13: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      08:48:13:INFO: task lctl:11345 blocked for more than 120 seconds.
      08:48:13:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      08:48:13:lctl          D 0000000000000000     0 11345  11038 0x00000080
      08:48:13: ffff880031179588 0000000000000082 ffff8800ffffffff 000088f8cde20d22
      08:48:13: ffff88007a65b380 ffff880073cded70 0000000001306642 ffffffffae6a97e1
      08:48:13: ffff880053b15ab8 ffff880031179fd8 000000000000fb88 ffff880053b15ab8
      08:48:13:Call Trace:
      08:48:13: [<ffffffff810a1ac9>] ? ktime_get_ts+0xa9/0xe0
      08:48:13: [<ffffffff8150e723>] io_schedule+0x73/0xc0
      08:48:13: [<ffffffff8125ea08>] get_request_wait+0x108/0x1d0
      08:48:13: [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
      08:48:13: [<ffffffff81255c8b>] ? elv_merge+0x1cb/0x200
      08:48:13: [<ffffffff8125eb4d>] blk_queue_bio+0x7d/0x5a0
      08:48:13: [<ffffffff8125d1fe>] generic_make_request+0x26e/0x550
      08:48:13: [<ffffffff8111c713>] ? mempool_alloc+0x63/0x140
      08:48:13: [<ffffffff8125d56d>] submit_bio+0x8d/0x120
      08:48:13: [<ffffffffa103e39e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
      08:48:13: [<ffffffffa168adac>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
      08:48:13: [<ffffffffa168b1cc>] osd_do_bio+0x3dc/0x800 [osd_ldiskfs]
      08:48:13: [<ffffffffa001702c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
      08:48:13: [<ffffffffa00172d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
      08:48:13: [<ffffffffa168d788>] osd_read_prep+0x338/0x3b0 [osd_ldiskfs]
      08:48:13: [<ffffffffa0494a43>] ofd_preprw_read+0x253/0x7f0 [ofd]
      08:48:13: [<ffffffffa049574a>] ofd_preprw+0x76a/0x13c0 [ofd]
      08:48:13: [<ffffffffa05734eb>] echo_client_iocontrol+0x207b/0x3bd0 [obdecho]
      08:48:13: [<ffffffff81143767>] ? handle_pte_fault+0xf7/0xb50
      08:48:13: [<ffffffffa103247f>] class_handle_ioctl+0x12cf/0x1e90 [obdclass]
      08:48:13: [<ffffffffa101a2ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
      08:48:13: [<ffffffff81195012>] vfs_ioctl+0x22/0xa0
      08:48:13: [<ffffffff8103c7b8>] ? pvclock_clocksource_read+0x58/0xd0
      08:48:13: [<ffffffff811951b4>] do_vfs_ioctl+0x84/0x580
      08:48:13: [<ffffffff81195731>] sys_ioctl+0x81/0xa0
      08:48:13: [<ffffffff810dc645>] ? __audit_syscall_exit+0x265/0x290
      08:48:13: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      08:48:13:
      08:48:13:<ConMan> Console [wtm-14vm8] disconnected from <wtm-14:6007> at 05-17 08:47.
      

      Info required for matching: obdfilter-survey 1c

Attachments

Issue Links

Activity
            yujian Jian Yu added a comment -

            Lustre client build: http://build.whamcloud.com/job/lustre-b2_5/13/
            Lustre server build: http://build.whamcloud.com/job/lustre-b2_4/70/ (2.4.2)

            The same issue occurred while running obdfilter-survey test 1c:
            https://maloo.whamcloud.com/test_sets/cc27c530-7ed5-11e3-8a9b-52540035b04c

            Console log on OSS:

            Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch == 23:34:40 (1389857680)
            Lustre: DEBUG MARKER: lctl dl | grep obdfilter
            Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
            Lustre: Echo OBD driver; http://www.lustre.org/
            INFO: task lctl:23681 blocked for more than 120 seconds.
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            lctl          D 0000000000000001     0 23681  23655 0x00000080
             ffff880046571588 0000000000000086 0000000000000000 00001e43189c6042
             ffff88007d773e00 ffff880076e4cad0 0000000000651b0a ffffffffad288d24
             ffff880051459098 ffff880046571fd8 000000000000fb88 ffff880051459098
            Call Trace:
             [<ffffffff8150ed93>] io_schedule+0x73/0xc0
             [<ffffffff8125ed08>] get_request_wait+0x108/0x1d0
             [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
             [<ffffffff8125ee6b>] blk_queue_bio+0x9b/0x5d0
             [<ffffffff8125d51e>] generic_make_request+0x25e/0x520
             [<ffffffff8111c763>] ? mempool_alloc+0x63/0x140
             [<ffffffff8125d86d>] submit_bio+0x8d/0x120
             [<ffffffffa05c041e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
             [<ffffffffa0ce2ffc>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
             [<ffffffffa0ce341c>] osd_do_bio+0x3dc/0x800 [osd_ldiskfs]
             [<ffffffffa0d4702c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
             [<ffffffffa0d472d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
             [<ffffffffa0ce59d8>] osd_read_prep+0x338/0x3b0 [osd_ldiskfs]
             [<ffffffffa0db5bd3>] ofd_preprw_read+0x253/0x7f0 [ofd]
             [<ffffffffa0db68ea>] ofd_preprw+0x77a/0x1480 [ofd]
             [<ffffffffa06ec473>] echo_client_iocontrol+0x2003/0x3b40 [obdecho]
             [<ffffffff81281876>] ? vsnprintf+0x336/0x5e0
             [<ffffffffa05b449f>] class_handle_ioctl+0x12ff/0x1ec0 [obdclass]
             [<ffffffffa059c2ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
             [<ffffffff81195382>] vfs_ioctl+0x22/0xa0
             [<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
             [<ffffffff81195524>] do_vfs_ioctl+0x84/0x580
             [<ffffffff81195aa1>] sys_ioctl+0x81/0xa0
             [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
             [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            
            green Oleg Drokin added a comment -

            So it seems the run is on less than 2 GB of RAM on all nodes. We should not OOM, but on the other hand it probably does not make much sense to run a performance benchmark in a small-memory setup, so it might be worth disabling the test on low-memory nodes.

            In addition, somebody should dig into what used up all the RAM on the OSS.
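
            The low-memory skip suggested above could be sketched roughly as follows. This is a hypothetical illustration, not a patch: `MIN_MB`, the threshold value, and the `mem_mb`/`check_mem` helper names are assumptions, not anything from this ticket or from test-framework.sh.

            ```shell
            #!/bin/sh
            # Hypothetical sketch: skip obdfilter-survey test_1c on nodes with
            # too little RAM. MIN_MB and the helper names are assumptions.
            MIN_MB=2048

            # Total RAM in MB; /proc/meminfo reports MemTotal in kB.
            mem_mb() {
                awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
            }

            check_mem() {
                if [ "$(mem_mb)" -lt "$MIN_MB" ]; then
                    echo "SKIP: need >= ${MIN_MB} MB RAM for obdfilter-survey test_1c"
                    return 1
                fi
                return 0
            }

            check_mem || true
            ```

            In the actual Lustre test framework, the skip would presumably go through the existing `skip_env`-style helpers rather than a plain echo, and the check would need to run on the OSS nodes, not just the client.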

            yujian Jian Yu added a comment -

            Lustre Branch: master (tag 2.4.50)

            Another instance: https://maloo.whamcloud.com/test_sets/d369af0a-c154-11e2-8769-52540035b04c


People

  wc-triage WC Triage
  maloo Maloo
  Votes: 0
  Watchers: 4