LU-2812

parallel-scale test_compilebench hung: task ldlm_poold:16196 blocked for more than 120 seconds

Details

    • Bug
    • Resolution: Won't Fix
    • Blocker
    • None
    • Lustre 1.8.9
    • None
    • 3
    • 6814

    Description

      The async journal commit and cancel-lock-before-replay features are disabled by default on the Lustre b1_8 branch. With both features enabled, the parallel-scale compilebench test hung as follows:

      == parallel-scale test compilebench: compilebench == 09:24:53 (1360776293)
      OPTIONS:
      cbench_DIR=/usr/bin
      cbench_IDIRS=2
      cbench_RUNS=2
      client-24vm1
      client-24vm2.lab.whamcloud.com
      ./compilebench -D /mnt/lustre/d0.compilebench -i 2         -r 2 --makej
      using working directory /mnt/lustre/d0.compilebench, 2 intial dirs 2 runs
      native unpatched native-0 222MB in 522.93 seconds (0.43 MB/s)
      native patched native-0 109MB in 424.16 seconds (0.26 MB/s)
      native patched compiled native-0 691MB in 69.30 seconds (9.98 MB/s)
      create dir kernel-0 222MB in 1110.55 seconds (0.20 MB/s)
      create dir kernel-1 222MB in 3047.05 seconds (0.07 MB/s)
      

      The console log on the client node showed:

      09:24:59:Lustre: DEBUG MARKER: == parallel-scale test compilebench: compilebench == 09:24:53 (1360776293)
      09:24:59:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2         -r 2 --makej
      09:24:59:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
      09:33:01:INFO: task ldlm_poold:16196 blocked for more than 120 seconds.
      09:33:01:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      09:33:01:ldlm_poold    D 0000000000000000     0 16196      2 0x00000080
      09:33:01: ffff88000e4bd9b0 0000000000000046 ffff88000e4bd960 ffffffff810097cc
      09:33:01: ffff8800117460b8 0000000000000000 00000000004bd970 ffff880002214200
      09:33:01: ffff880037871058 ffff88000e4bdfd8 000000000000fb88 ffff880037871058
      09:33:01:Call Trace:
      09:33:01: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
      09:33:01: [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
      09:33:01: [<ffffffff814eaac5>] schedule_timeout+0x215/0x2e0
      09:33:01: [<ffffffff8105f8ac>] ? try_to_wake_up+0x24c/0x3e0
      09:33:01: [<ffffffff814ea743>] wait_for_common+0x123/0x180
      09:33:01: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
      09:33:01: [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20
      09:33:01: [<ffffffffa04dddcd>] __ldlm_bl_to_thread+0x19d/0x1b0 [ptlrpc]
      09:33:01: [<ffffffffa04d672b>] ? ldlm_cli_cancel_local+0xab/0x350 [ptlrpc]
      09:33:01: [<ffffffffa04e35b9>] ldlm_bl_to_thread+0x379/0x5f0 [ptlrpc]
      09:33:01: [<ffffffffa04d88e1>] ? ldlm_cancel_list+0xf1/0x240 [ptlrpc]
      09:33:01: [<ffffffffa04e384e>] ldlm_bl_to_thread_list+0x1e/0xa0 [ptlrpc]
      09:33:01: [<ffffffffa04d999a>] ldlm_cancel_lru+0x7a/0x1f0 [ptlrpc]
      09:33:01: [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
      09:33:01: [<ffffffffa04ea36c>] ldlm_cli_pool_recalc+0x1fc/0x2a0 [ptlrpc]
      09:33:01: [<ffffffff8107d4eb>] ? try_to_del_timer_sync+0x7b/0xe0
      09:33:02: [<ffffffffa04ea508>] ldlm_pool_recalc+0xf8/0x130 [ptlrpc]
      09:33:02: [<ffffffffa04eb0ec>] ldlm_pools_recalc+0x9c/0x2d0 [ptlrpc]
      09:33:02: [<ffffffffa04ec714>] ldlm_pools_thread_main+0xb4/0x2f0 [ptlrpc]
      09:33:02: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
      09:33:02: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      09:33:02: [<ffffffffa04ec660>] ? ldlm_pools_thread_main+0x0/0x2f0 [ptlrpc]
      09:33:02: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
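
      When reading a trace like the one above, it can help to separate the frames the kernel marks with "?" (addresses found on the stack that the unwinder could not confirm as part of the call chain) from the unmarked ones. The following is a minimal sketch for doing that, not part of the Lustre test suite; the names FRAME_RE, parse_frames, and reliable_module_frames are hypothetical, and the regular expression assumes frames are formatted exactly as in this console log.

```python
import re

# Matches frames such as:
#   09:33:01: [<ffffffffa04dddcd>] __ldlm_bl_to_thread+0x19d/0x1b0 [ptlrpc]
# Assumption: the frame layout is the one seen in the console log above.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<uncertain>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return one dict per stack frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "symbol": m.group("symbol"),
                "module": m.group("module"),  # None for core kernel symbols
                "reliable": m.group("uncertain") is None,  # no "?" marker
            })
    return frames


def reliable_module_frames(log_text, module):
    """Symbols from the given module that are not marked with '?'."""
    return [f["symbol"] for f in parse_frames(log_text)
            if f["module"] == module and f["reliable"]]


# A few frames copied from the trace in this report.
SAMPLE = """\
09:33:01: [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20
09:33:01: [<ffffffffa04dddcd>] __ldlm_bl_to_thread+0x19d/0x1b0 [ptlrpc]
09:33:01: [<ffffffffa04d672b>] ? ldlm_cli_cancel_local+0xab/0x350 [ptlrpc]
09:33:01: [<ffffffffa04e35b9>] ldlm_bl_to_thread+0x379/0x5f0 [ptlrpc]
"""

if __name__ == "__main__":
    print(reliable_module_frames(SAMPLE, "ptlrpc"))
```

      Applied to the full trace, a filter like this leaves the chain from ldlm_pools_thread_main through ldlm_cancel_lru and ldlm_bl_to_thread_list down to __ldlm_bl_to_thread, where ldlm_poold is blocked in wait_for_completion.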
      

      Maloo report: https://maloo.whamcloud.com/test_sets/83638de2-7667-11e2-bc2f-52540035b04c

          People

            wc-triage WC Triage
            yujian Jian Yu