[LU-2812] parallel-scale test_compilebench hung: task ldlm_poold:16196 blocked for more than 120 seconds
Created: 14/Feb/13  Updated: 14/Aug/16  Resolved: 14/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.9
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Jian Yu
Assignee: WC Triage
Resolution: Won't Fix
Votes: 0
Labels: None
Environment:

Lustre Tag: v1_8_9_WC1_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
Distro/Arch: RHEL5.9/x86_64(server), RHEL6.3/x86_64(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes

The async journal commit feature and cancel lock before replay feature are enabled:
http://review.whamcloud.com/1526

filter->fo_syncjournal = 0;
ldlm_cancel_unused_locks_before_replay = 1;


Severity: 3
Rank (Obsolete): 6814

 Description   

The async journal commit feature and the cancel lock before replay feature are disabled by default on the Lustre b1_8 branch. With both enabled, the parallel-scale compilebench test hung as follows:

== parallel-scale test compilebench: compilebench == 09:24:53 (1360776293)
OPTIONS:
cbench_DIR=/usr/bin
cbench_IDIRS=2
cbench_RUNS=2
client-24vm1
client-24vm2.lab.whamcloud.com
./compilebench -D /mnt/lustre/d0.compilebench -i 2         -r 2 --makej
using working directory /mnt/lustre/d0.compilebench, 2 intial dirs 2 runs
native unpatched native-0 222MB in 522.93 seconds (0.43 MB/s)
native patched native-0 109MB in 424.16 seconds (0.26 MB/s)
native patched compiled native-0 691MB in 69.30 seconds (9.98 MB/s)
create dir kernel-0 222MB in 1110.55 seconds (0.20 MB/s)
create dir kernel-1 222MB in 3047.05 seconds (0.07 MB/s)

The console log on the client node showed:

09:24:59:Lustre: DEBUG MARKER: == parallel-scale test compilebench: compilebench == 09:24:53 (1360776293)
09:24:59:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2         -r 2 --makej
09:24:59:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
09:33:01:INFO: task ldlm_poold:16196 blocked for more than 120 seconds.
09:33:01:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
09:33:01:ldlm_poold    D 0000000000000000     0 16196      2 0x00000080
09:33:01: ffff88000e4bd9b0 0000000000000046 ffff88000e4bd960 ffffffff810097cc
09:33:01: ffff8800117460b8 0000000000000000 00000000004bd970 ffff880002214200
09:33:01: ffff880037871058 ffff88000e4bdfd8 000000000000fb88 ffff880037871058
09:33:01:Call Trace:
09:33:01: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
09:33:01: [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
09:33:01: [<ffffffff814eaac5>] schedule_timeout+0x215/0x2e0
09:33:01: [<ffffffff8105f8ac>] ? try_to_wake_up+0x24c/0x3e0
09:33:01: [<ffffffff814ea743>] wait_for_common+0x123/0x180
09:33:01: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
09:33:01: [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20
09:33:01: [<ffffffffa04dddcd>] __ldlm_bl_to_thread+0x19d/0x1b0 [ptlrpc]
09:33:01: [<ffffffffa04d672b>] ? ldlm_cli_cancel_local+0xab/0x350 [ptlrpc]
09:33:01: [<ffffffffa04e35b9>] ldlm_bl_to_thread+0x379/0x5f0 [ptlrpc]
09:33:01: [<ffffffffa04d88e1>] ? ldlm_cancel_list+0xf1/0x240 [ptlrpc]
09:33:01: [<ffffffffa04e384e>] ldlm_bl_to_thread_list+0x1e/0xa0 [ptlrpc]
09:33:01: [<ffffffffa04d999a>] ldlm_cancel_lru+0x7a/0x1f0 [ptlrpc]
09:33:01: [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
09:33:01: [<ffffffffa04ea36c>] ldlm_cli_pool_recalc+0x1fc/0x2a0 [ptlrpc]
09:33:01: [<ffffffff8107d4eb>] ? try_to_del_timer_sync+0x7b/0xe0
09:33:02: [<ffffffffa04ea508>] ldlm_pool_recalc+0xf8/0x130 [ptlrpc]
09:33:02: [<ffffffffa04eb0ec>] ldlm_pools_recalc+0x9c/0x2d0 [ptlrpc]
09:33:02: [<ffffffffa04ec714>] ldlm_pools_thread_main+0xb4/0x2f0 [ptlrpc]
09:33:02: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
09:33:02: [<ffffffff8100c0ca>] child_rip+0xa/0x20
09:33:02: [<ffffffffa04ec660>] ? ldlm_pools_thread_main+0x0/0x2f0 [ptlrpc]
09:33:02: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
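
Frames prefixed with "?" in the trace above are addresses the kernel found on the stack but could not confirm as part of the call chain, so the blocking path is easier to read with them filtered out. The following is a minimal Python sketch of that filtering, not part of Lustre or the test framework: the names FRAME_RE and reliable_frames are hypothetical, and SAMPLE is a short excerpt of the trace.

```python
import re

# Matches kernel call-trace frames such as
#   " [<ffffffff814eaac5>] schedule_timeout+0x215/0x2e0"
# The optional "?" marks a frame the kernel could not confirm as part of
# the call chain. FRAME_RE is a hypothetical name used for illustration.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(console_log):
    """Return (symbol, module) for each frame not marked '?', innermost first.

    module is None for frames that belong to the core kernel.
    """
    frames = []
    for line in console_log.splitlines():
        match = FRAME_RE.search(line)
        if match and not match.group("unreliable"):
            frames.append((match.group("symbol"), match.group("module")))
    return frames


# Short excerpt of the client console log from this report.
SAMPLE = """\
09:33:01: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
09:33:01: [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20
09:33:01: [<ffffffffa04dddcd>] __ldlm_bl_to_thread+0x19d/0x1b0 [ptlrpc]
09:33:01: [<ffffffffa04d672b>] ? ldlm_cli_cancel_local+0xab/0x350 [ptlrpc]
09:33:01: [<ffffffffa04e35b9>] ldlm_bl_to_thread+0x379/0x5f0 [ptlrpc]
"""

for symbol, module in reliable_frames(SAMPLE):
    print(f"{symbol} [{module or 'kernel'}]")
```

Applied to the full trace, the remaining frames run from schedule_timeout and wait_for_completion through __ldlm_bl_to_thread, ldlm_bl_to_thread, ldlm_bl_to_thread_list, ldlm_cancel_lru and the ldlm pool recalc functions up to ldlm_pools_thread_main.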

Maloo report: https://maloo.whamcloud.com/test_sets/83638de2-7667-11e2-bc2f-52540035b04c



 Comments   
Comment by Jian Yu [ 14/Feb/13 ]

It seems this is related to LU-1376.

Comment by Oleg Drokin [ 14/Feb/13 ]

I suspect this is the same problem as LU-874, with this particular patch aimed at addressing it: http://review.whamcloud.com/1900

Comment by James A Simmons [ 14/Aug/16 ]

Old blocker for unsupported version
