[LU-9156] parallel-scale test_metabench: test failed to respond and timed out
| Created: | 24/Feb/17 | Updated: | 09/Oct/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Casper | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | trevis-36vm1/2/7/8, Full Group test |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
https://testing.hpdd.intel.com/test_sessions/e072c616-f8f8-11e6-aac4-5254006e85c2

From Client 2 console:

    05:47:50:[15560.679336] kworker/0:1 D 0000000000000000 0 6749 2 0x00000080
    05:47:50:[15560.680977] Workqueue: events key_garbage_collector
    05:47:50:[15560.682439] ffff88003604bc40 0000000000000046 ffff88007aafce70 ffff88003604bfd8
    05:47:50:[15560.684148] ffff88003604bfd8 ffff88003604bfd8 ffff88007aafce70 ffff88003604bd80
    05:47:50:[15560.685840] ffff88003604bd88 7fffffffffffffff ffff88007aafce70 0000000000000000
    05:47:50:[15560.687529] Call Trace:
    05:47:50:[15560.688780] [<ffffffff8168b979>] schedule+0x29/0x70
    05:47:50:[15560.690227] [<ffffffff816893c9>] schedule_timeout+0x239/0x2d0
    05:47:50:[15560.691732] [<ffffffff811dc341>] ? __slab_free+0x81/0x2f0
    05:47:50:[15560.693210] [<ffffffff810a5cdc>] ? insert_work+0x4c/0xa0
    05:47:50:[15560.694666] [<ffffffff811dc341>] ? __slab_free+0x81/0x2f0
    05:47:50:[15560.696107] [<ffffffff8168bd56>] wait_for_completion+0x116/0x170
    05:47:50:[15560.697604] [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
    05:47:50:[15560.699063] [<ffffffff81138a30>] ? __call_rcu+0x2c0/0x2c0
    05:47:50:[15560.700518] [<ffffffff810ad1bb>] wait_rcu_gp+0x5b/0x80
    05:47:50:[15560.701979] [<ffffffff810ad140>] ? ftrace_raw_output_rcu_utilization+0x50/0x50
    05:47:50:[15560.703560] [<ffffffff8113862b>] synchronize_sched+0x3b/0x50
    05:47:50:[15560.705020] [<ffffffff8129b47d>] key_garbage_collector+0x1cd/0x390
    05:47:50:[15560.706513] [<ffffffff810a805b>] process_one_work+0x17b/0x470
    05:47:50:[15560.707970] [<ffffffff810a8e96>] worker_thread+0x126/0x410
    05:47:50:[15560.709415] [<ffffffff810a8d70>] ? rescuer_thread+0x460/0x460
    05:47:50:[15560.710882] [<ffffffff810b064f>] kthread+0xcf/0xe0
    05:47:50:[15560.712267] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
    05:47:50:[15560.713767] [<ffffffff81696818>] ret_from_fork+0x58/0x90
    05:47:50:[15560.715157] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
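The stack shows the kernel keyring garbage collector blocked in synchronize_sched(), i.e. waiting for an RCU grace period to complete. As a minimal sketch (assuming root console access to the affected client, e.g. trevis-36vm1, while the hang is in progress), the following commands could capture additional state around that worker:

    # Sketch only; run as root on the client showing the blocked kworker.

    # Dump all uninterruptible (D-state) tasks to the console so any other
    # blocked thread that might be holding up the RCU grace period is
    # captured alongside the key_garbage_collector worker.
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 300 > /tmp/blocked-tasks.txt

    # Snapshot the kernel keyring state; a large backlog of dead keys can
    # keep key_garbage_collector busy in synchronize_sched().
    wc -l /proc/keys
    cat /proc/key-users

The sysrq 'w' dump in particular would show whether the stall is limited to this one worker or is system-wide.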
| Comments |
| Comment by Andreas Dilger [ 28/Feb/17 ] |
|
Jim, could you please elaborate on why this particular client stack is the important one? |
| Comment by James Casper [ 09/Mar/17 ] |
|
Have also seen "kworker D" with mmp and sanity-sec.

Latest timeout with mmp, where test_7 stops at the mds umount:

    Stopping /mnt/lustre-mds1failover (opts:) on trevis-44vm3
    CMD: trevis-44vm3 umount -d /mnt/lustre-mds1failover

A passing test_7 continues after the mds umount:

    Stopping /mnt/lustre-mds1failover (opts:) on onyx-37vm3
    CMD: onyx-37vm3 umount -d /mnt/lustre-mds1failover
    CMD: onyx-37vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
    CMD: onyx-37vm4 grep -c /mnt/lustre-ost1failover' ' /proc/mounts
    Stopping /mnt/lustre-ost1failover (opts:) on onyx-37vm4
    CMD: onyx-37vm4 umount -d /mnt/lustre-ost1failover
    CMD: onyx-37vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
    CMD: onyx-37vm7 grep -c /mnt/lustre-mds1' ' /proc/mounts
    CMD: onyx-37vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
    CMD: onyx-37vm8 grep -c /mnt/lustre-ost1' ' /proc/mounts
    CMD: onyx-37vm8 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
    Resetting fail_loc on all nodes...CMD: onyx-37vm1.onyx.hpdd.intel.com,onyx-37vm5,onyx-37vm6,onyx-37vm7,onyx-37vm8 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
    done.
    22:35:01 (1485498901) waiting for onyx-37vm1.onyx.hpdd.intel.com network 5 secs ...
    22:35:01 (1485498901) network interface is UP
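As a sketch (assuming console access to the MDS node, trevis-44vm3 in the failing run, while the test is still stalled at that step), one could check whether the umount itself is stuck in the kernel and on what:

    # Sketch only; run as root on the MDS while mmp test_7 hangs at the umount.

    # Look for the umount (and any kworker) sitting in D state and its wait channel.
    ps axo pid,stat,wchan:32,cmd | grep -E 'umount|kworker' | grep -v grep

    # Print the kernel stack of each umount process to see where it is blocked.
    for pid in $(pgrep -x umount); do
        echo "== umount pid $pid =="
        cat /proc/$pid/stack
    done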
| Comment by Saurabh Tandan (Inactive) [ 15/Mar/17 ] |
|
An instance on EE 3.1.1.0, b_ieel3_0, build# 185
| Comment by Jian Yu [ 18/May/17 ] |
|
While testing patch https://review.whamcloud.com/25336 on the master branch, many tests failed with an issue similar to this one. I did not find any abnormal messages on the consoles, only this stack:

    kworker/0:0 D 0000000000000000 0 7535 2 0x00000080
    Workqueue: events key_garbage_collector
    ffff88004e1f3c40 0000000000000046 ffff880058ec2f10 ffff88004e1f3fd8
    ffff88004e1f3fd8 ffff88004e1f3fd8 ffff880058ec2f10 ffff88004e1f3d80
    ffff88004e1f3d88 7fffffffffffffff ffff880058ec2f10 0000000000000000
    Call Trace:
    [<ffffffff8168c169>] schedule+0x29/0x70
    [<ffffffff81689bc9>] schedule_timeout+0x239/0x2c0
    [<ffffffff811dc861>] ? __slab_free+0x81/0x2f0
    [<ffffffff810b1c17>] ? wake_bit_function_rh+0x37/0x40
    [<ffffffff8168c546>] wait_for_completion+0x116/0x170
    [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
    [<ffffffff81138f20>] ? __call_rcu+0x2c0/0x2c0
    [<ffffffff810ad5bb>] wait_rcu_gp+0x5b/0x80
    [<ffffffff810ad540>] ? ftrace_raw_output_rcu_utilization+0x50/0x50
    [<ffffffff81138b1b>] synchronize_sched+0x3b/0x50
    [<ffffffff8129b9bd>] key_garbage_collector+0x1cd/0x390
    [<ffffffff810a845b>] process_one_work+0x17b/0x470
    [<ffffffff810a9296>] worker_thread+0x126/0x410
    [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460
    [<ffffffff810b0a4f>] kthread+0xcf/0xe0
    [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
    [<ffffffff816970d8>] ret_from_fork+0x58/0x90
    [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
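Since this trace matches the one in the description, a quick way to see how widespread the signature is across a test session would be to scan the collected console logs for it. A sketch, where "console-logs" is only a placeholder for wherever the session's per-node console output was saved:

    # Sketch; counts occurrences of the key_garbage_collector frame per log file.
    grep -rl 'Workqueue: events key_garbage_collector' console-logs/ |
    while read -r f; do
        echo "== $f =="
        grep -c 'key_garbage_collector+0x' "$f"
    done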
| Comment by James Casper [ 23/May/17 ] |
|
Still looking for guidance with this ticket. If the kworker hang is not the root cause, please point us in the right direction. This is the only indication we can find of something that might be wrong. |
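If more evidence than the single kworker stack is wanted, one option (a sketch, assuming root on the test nodes; the values are only examples) is to make the kernel's hung-task detector report blocked tasks more aggressively during the run:

    # Sketch; run on the clients/servers before starting the test group.
    sysctl -w kernel.hung_task_timeout_secs=120   # report D-state tasks after 2 minutes
    sysctl -w kernel.hung_task_warnings=100       # allow more than the default 10 reports

    # On a node where the timeout has already happened, dump every task's
    # stack to the console for comparison:
    echo t > /proc/sysrq-trigger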
| Comment by James Casper [ 24/May/17 ] |
|
2.9.57, b3575: |
| Comment by Jian Yu [ 22/Sep/17 ] |
|
+1 on master branch: |