[LU-9156] parallel-scale test_metabench: test failed to respond and timed out Created: 24/Feb/17  Updated: 09/Oct/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Casper Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

trevis-36vm1/2/7/8, Full Group test,
RHEL7.3/zfs, branch master, v2.9.53.19.gdad9ce4, b3534


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/e072c616-f8f8-11e6-aac4-5254006e85c2

From Client 2 console:

05:47:50:[15560.679336] kworker/0:1     D 0000000000000000     0  6749      2 0x00000080
05:47:50:[15560.680977] Workqueue: events key_garbage_collector
05:47:50:[15560.682439]  ffff88003604bc40 0000000000000046 ffff88007aafce70 ffff88003604bfd8
05:47:50:[15560.684148]  ffff88003604bfd8 ffff88003604bfd8 ffff88007aafce70 ffff88003604bd80
05:47:50:[15560.685840]  ffff88003604bd88 7fffffffffffffff ffff88007aafce70 0000000000000000
05:47:50:[15560.687529] Call Trace:
05:47:50:[15560.688780]  [<ffffffff8168b979>] schedule+0x29/0x70
05:47:50:[15560.690227]  [<ffffffff816893c9>] schedule_timeout+0x239/0x2d0
05:47:50:[15560.691732]  [<ffffffff811dc341>] ? __slab_free+0x81/0x2f0
05:47:50:[15560.693210]  [<ffffffff810a5cdc>] ? insert_work+0x4c/0xa0
05:47:50:[15560.694666]  [<ffffffff811dc341>] ? __slab_free+0x81/0x2f0
05:47:50:[15560.696107]  [<ffffffff8168bd56>] wait_for_completion+0x116/0x170
05:47:50:[15560.697604]  [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
05:47:50:[15560.699063]  [<ffffffff81138a30>] ? __call_rcu+0x2c0/0x2c0
05:47:50:[15560.700518]  [<ffffffff810ad1bb>] wait_rcu_gp+0x5b/0x80
05:47:50:[15560.701979]  [<ffffffff810ad140>] ? ftrace_raw_output_rcu_utilization+0x50/0x50
05:47:50:[15560.703560]  [<ffffffff8113862b>] synchronize_sched+0x3b/0x50
05:47:50:[15560.705020]  [<ffffffff8129b47d>] key_garbage_collector+0x1cd/0x390
05:47:50:[15560.706513]  [<ffffffff810a805b>] process_one_work+0x17b/0x470
05:47:50:[15560.707970]  [<ffffffff810a8e96>] worker_thread+0x126/0x410
05:47:50:[15560.709415]  [<ffffffff810a8d70>] ? rescuer_thread+0x460/0x460
05:47:50:[15560.710882]  [<ffffffff810b064f>] kthread+0xcf/0xe0
05:47:50:[15560.712267]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
05:47:50:[15560.713767]  [<ffffffff81696818>] ret_from_fork+0x58/0x90
05:47:50:[15560.715157]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
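The trace above shows the `events` workqueue's `key_garbage_collector` blocked in `synchronize_sched()`, i.e. waiting for an RCU grace period to complete, which leaves the kworker in uninterruptible sleep (state `D`). As a generic triage step (not part of the Lustre test framework), similarly blocked tasks on a node can be spotted by listing processes in state `D`:

```shell
# Generic triage sketch: list tasks currently in uninterruptible sleep
# (state "D"), the state the hung kworker appears in above.
ps -eo state,pid,comm | awk '$1 == "D" {print $2, $3}'

# With root, "echo w > /proc/sysrq-trigger" dumps the kernel stacks of all
# blocked tasks to the console/dmesg, producing traces like the one above.
```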


 Comments   
Comment by Andreas Dilger [ 28/Feb/17 ]

Jim, could you please elaborate on why this particular client stack is the important one?

Comment by James Casper [ 09/Mar/17 ]

We have also seen the "kworker D" hang with mmp and sanity-sec. The latest timeout was with mmp:

test_7: stops at mds umount:

Stopping /mnt/lustre-mds1failover (opts:) on trevis-44vm3
CMD: trevis-44vm3 umount -d /mnt/lustre-mds1failover

A passing test_7 continues after the mds umount:

Stopping /mnt/lustre-mds1failover (opts:) on onyx-37vm3
CMD: onyx-37vm3 umount -d /mnt/lustre-mds1failover
CMD: onyx-37vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: onyx-37vm4 grep -c /mnt/lustre-ost1failover' ' /proc/mounts
Stopping /mnt/lustre-ost1failover (opts:) on onyx-37vm4
CMD: onyx-37vm4 umount -d /mnt/lustre-ost1failover
CMD: onyx-37vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: onyx-37vm7 grep -c /mnt/lustre-mds1' ' /proc/mounts
CMD: onyx-37vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
CMD: onyx-37vm8 grep -c /mnt/lustre-ost1' ' /proc/mounts
CMD: onyx-37vm8 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
Resetting fail_loc on all nodes...CMD: onyx-37vm1.onyx.hpdd.intel.com,onyx-37vm5,onyx-37vm6,onyx-37vm7,onyx-37vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
done.
22:35:01 (1485498901) waiting for onyx-37vm1.onyx.hpdd.intel.com network 5 secs ...
22:35:01 (1485498901) network interface is UP
Comment by Saurabh Tandan (Inactive) [ 15/Mar/17 ]

An instance on EE 3.1.1.0, b_ieel3_0, build# 185:
https://testing.hpdd.intel.com/test_sets/8d28da46-0798-11e7-8406-5254006e85c2

Comment by Jian Yu [ 18/May/17 ]

While testing patch https://review.whamcloud.com/25336 on the master branch, many tests failed with an issue similar to this one.
https://testing.hpdd.intel.com/test_sessions/c145c4a3-b193-4734-b4c0-1aec2522f09a
https://testing.hpdd.intel.com/test_sessions/6fdca020-ca43-42c8-9fc3-3ffac60c1ba1

I did not find any abnormal messages on the consoles other than this stack:

kworker/0:0     D 0000000000000000     0  7535      2 0x00000080
Workqueue: events key_garbage_collector
 ffff88004e1f3c40 0000000000000046 ffff880058ec2f10 ffff88004e1f3fd8
 ffff88004e1f3fd8 ffff88004e1f3fd8 ffff880058ec2f10 ffff88004e1f3d80
 ffff88004e1f3d88 7fffffffffffffff ffff880058ec2f10 0000000000000000
Call Trace:
 [<ffffffff8168c169>] schedule+0x29/0x70
 [<ffffffff81689bc9>] schedule_timeout+0x239/0x2c0
 [<ffffffff811dc861>] ? __slab_free+0x81/0x2f0
 [<ffffffff810b1c17>] ? wake_bit_function_rh+0x37/0x40
 [<ffffffff8168c546>] wait_for_completion+0x116/0x170
 [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
 [<ffffffff81138f20>] ? __call_rcu+0x2c0/0x2c0
 [<ffffffff810ad5bb>] wait_rcu_gp+0x5b/0x80
 [<ffffffff810ad540>] ? ftrace_raw_output_rcu_utilization+0x50/0x50
 [<ffffffff81138b1b>] synchronize_sched+0x3b/0x50
 [<ffffffff8129b9bd>] key_garbage_collector+0x1cd/0x390
 [<ffffffff810a845b>] process_one_work+0x17b/0x470
 [<ffffffff810a9296>] worker_thread+0x126/0x410
 [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460
 [<ffffffff810b0a4f>] kthread+0xcf/0xe0
 [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff816970d8>] ret_from_fork+0x58/0x90
 [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Comment by James Casper [ 23/May/17 ]

Still looking for guidance with this ticket. If the kworker hang is not the root cause, please point us in the right direction. This is the only indication we can find of something that might be wrong.
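To gather more data on nodes that reproduce this, one option is the kernel's hung-task detector, which logs blocked tasks and their stacks to dmesg. This is a generic diagnostic sketch (assumes root on the affected node), not an instruction from the test framework:

```shell
# Diagnostic sketch: have the hung-task detector report blocked tasks sooner,
# then pull any reported stacks out of the kernel log. Reports appear as
# "INFO: task <name>:<pid> blocked for more than N seconds."
sysctl -w kernel.hung_task_timeout_secs=60
dmesg | grep -B 2 -A 25 'blocked for more than'
```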

Comment by James Casper [ 24/May/17 ]

2.9.57, b3575:
https://testing.hpdd.intel.com/test_sessions/0800ff00-8d4c-4627-878e-566e8a697c01
(Interop w/2.10 on server)

Comment by Jian Yu [ 22/Sep/17 ]

+1 on master branch:
https://testing.hpdd.intel.com/test_sets/401fc94c-9ef8-11e7-b778-5254006e85c2

Generated at Sat Feb 10 02:23:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.