Details
Type: Bug
Resolution: Won't Fix
Priority: Blocker
Fix Version/s: None
Affects Version/s: Lustre 1.8.9

Environment:
Lustre Tag: v1_8_9_WC1_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
Distro/Arch: RHEL5.9/x86_64(server), RHEL6.3/x86_64(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes
The async journal commit and cancel-locks-before-replay features are enabled (see the sketch below):
http://review.whamcloud.com/1526
filter->fo_syncjournal = 0;
ldlm_cancel_unused_locks_before_replay = 1;
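For reference, a minimal sketch of how these two settings are typically applied on a Lustre 1.8 system. The parameter name obdfilter.*.sync_journal and the sysfs module-parameter path below are assumptions for illustration, not taken from this report:

# On each OSS: disable synchronous journal commits, i.e. clear filter->fo_syncjournal
# (parameter name is an assumption)
lctl set_param obdfilter.*.sync_journal=0

# On each client: cancel unused locks before replay; the path below is an assumption,
# and the value can also be set via modprobe options for the ptlrpc module
echo 1 > /sys/module/ptlrpc/parameters/ldlm_cancel_unused_locks_before_replay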
Description
The async journal commit and cancel-locks-before-replay features are disabled by default on the Lustre b1_8 branch. After enabling them, the parallel-scale compilebench test hung as follows:
== parallel-scale test compilebench: compilebench == 09:24:53 (1360776293)
OPTIONS:
cbench_DIR=/usr/bin
cbench_IDIRS=2
cbench_RUNS=2
client-24vm1 client-24vm2.lab.whamcloud.com
./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
using working directory /mnt/lustre/d0.compilebench, 2 intial dirs 2 runs
native unpatched native-0 222MB in 522.93 seconds (0.43 MB/s)
native patched native-0 109MB in 424.16 seconds (0.26 MB/s)
native patched compiled native-0 691MB in 69.30 seconds (9.98 MB/s)
create dir kernel-0 222MB in 1110.55 seconds (0.20 MB/s)
create dir kernel-1 222MB in 3047.05 seconds (0.07 MB/s)
The console log on the client node showed the following:
09:24:59:Lustre: DEBUG MARKER: == parallel-scale test compilebench: compilebench == 09:24:53 (1360776293)
09:24:59:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
09:24:59:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
09:33:01:INFO: task ldlm_poold:16196 blocked for more than 120 seconds.
09:33:01:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
09:33:01:ldlm_poold D 0000000000000000 0 16196 2 0x00000080
09:33:01: ffff88000e4bd9b0 0000000000000046 ffff88000e4bd960 ffffffff810097cc
09:33:01: ffff8800117460b8 0000000000000000 00000000004bd970 ffff880002214200
09:33:01: ffff880037871058 ffff88000e4bdfd8 000000000000fb88 ffff880037871058
09:33:01:Call Trace:
09:33:01: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
09:33:01: [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
09:33:01: [<ffffffff814eaac5>] schedule_timeout+0x215/0x2e0
09:33:01: [<ffffffff8105f8ac>] ? try_to_wake_up+0x24c/0x3e0
09:33:01: [<ffffffff814ea743>] wait_for_common+0x123/0x180
09:33:01: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
09:33:01: [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20
09:33:01: [<ffffffffa04dddcd>] __ldlm_bl_to_thread+0x19d/0x1b0 [ptlrpc]
09:33:01: [<ffffffffa04d672b>] ? ldlm_cli_cancel_local+0xab/0x350 [ptlrpc]
09:33:01: [<ffffffffa04e35b9>] ldlm_bl_to_thread+0x379/0x5f0 [ptlrpc]
09:33:01: [<ffffffffa04d88e1>] ? ldlm_cancel_list+0xf1/0x240 [ptlrpc]
09:33:01: [<ffffffffa04e384e>] ldlm_bl_to_thread_list+0x1e/0xa0 [ptlrpc]
09:33:01: [<ffffffffa04d999a>] ldlm_cancel_lru+0x7a/0x1f0 [ptlrpc]
09:33:01: [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
09:33:01: [<ffffffffa04ea36c>] ldlm_cli_pool_recalc+0x1fc/0x2a0 [ptlrpc]
09:33:01: [<ffffffff8107d4eb>] ? try_to_del_timer_sync+0x7b/0xe0
09:33:02: [<ffffffffa04ea508>] ldlm_pool_recalc+0xf8/0x130 [ptlrpc]
09:33:02: [<ffffffffa04eb0ec>] ldlm_pools_recalc+0x9c/0x2d0 [ptlrpc]
09:33:02: [<ffffffffa04ec714>] ldlm_pools_thread_main+0xb4/0x2f0 [ptlrpc]
09:33:02: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
09:33:02: [<ffffffff8100c0ca>] child_rip+0xa/0x20
09:33:02: [<ffffffffa04ec660>] ? ldlm_pools_thread_main+0x0/0x2f0 [ptlrpc]
09:33:02: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Maloo report: https://maloo.whamcloud.com/test_sets/83638de2-7667-11e2-bc2f-52540035b04c