[LU-10008] ost-pools test_12: test failed to respond and timed out Created: 19/Sep/17  Updated: 13/Oct/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9777 replay-single test_70f: test failed t... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/508403d8-9b4d-11e7-ba27-5254006e85c2.

The sub-test test_12 failed with the following error:

test failed to respond and timed out

cannot find error msg

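Maloo could not extract a more specific error message, so diagnosis falls back to the console logs. As a hedged illustration (the node name here is hypothetical), the usual first step for a "failed to respond" timeout is to scan the client's dmesg for hung-task reports, which is exactly what the comment below does:

# Hypothetical client node name; scan the console log for hung-task
# reports like the one quoted in the comment below.
ssh onyx-client 'dmesg | grep -B2 -A35 "blocked for more than"'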


 Comments   
Comment by James Nunez (Inactive) [ 21/Sep/17 ]

In the client stack_trace log, we see:

18:42:30:[24905.438996] Lustre: DEBUG MARKER: dmesg
18:42:30:[24929.865694] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.testpool 2>/dev/null || echo foo
18:42:30:[24929.873323] Lustre: DEBUG MARKER: lctl get_param -n lov.lustre-*.pools.testpool 2>/dev/null || echo foo
18:42:30:[24960.427629] INFO: task tee:26113 blocked for more than 120 seconds.
18:42:30:[24960.428399] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
18:42:30:[24960.429173] tee             D ffffffff816a7610     0 26113   1528 0x00000080
18:42:30:[24960.430038]  ffff88006560f9b0 0000000000000082 ffff880064303f40 ffff88006560ffd8
18:42:30:[24960.430865]  ffff88006560ffd8 ffff88006560ffd8 ffff880064303f40 ffff88007fd16cc0
18:42:30:[24960.431698]  0000000000000000 7fffffffffffffff ffff88007ff5d728 ffffffff816a7610
18:42:30:[24960.432571] Call Trace:
18:42:30:[24960.432841]  [<ffffffff816a7610>] ? bit_wait+0x50/0x50
18:42:30:[24960.433414]  [<ffffffff816a94e9>] schedule+0x29/0x70
18:42:30:[24960.433932]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
18:42:30:[24960.434571]  [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
18:42:30:[24960.435191]  [<ffffffff816a7610>] ? bit_wait+0x50/0x50
18:42:30:[24960.435772]  [<ffffffff816a8b6d>] io_schedule_timeout+0xad/0x130
18:42:30:[24960.436366]  [<ffffffff816a8c08>] io_schedule+0x18/0x20
18:42:30:[24960.436892]  [<ffffffff816a7621>] bit_wait_io+0x11/0x50
18:42:30:[24960.437492]  [<ffffffff816a7145>] __wait_on_bit+0x65/0x90
18:42:30:[24960.438031]  [<ffffffff816a7610>] ? bit_wait+0x50/0x50
18:42:30:[24960.438623]  [<ffffffff816a71f1>] out_of_line_wait_on_bit+0x81/0xb0
18:42:30:[24960.439242]  [<ffffffff810b19d0>] ? wake_bit_function+0x40/0x40
18:42:30:[24960.439838]  [<ffffffffc04dcdb3>] nfs_wait_on_request+0x33/0x40 [nfs]
18:42:30:[24960.440550]  [<ffffffffc04e1eb3>] nfs_updatepage+0x153/0x8d0 [nfs]
18:42:30:[24960.441175]  [<ffffffffc04d1371>] nfs_write_end+0x141/0x350 [nfs]
18:42:30:[24960.441850]  [<ffffffff81182539>] generic_file_buffered_write+0x189/0x2a0
18:42:30:[24960.442520]  [<ffffffff811848f2>] __generic_file_aio_write+0x1e2/0x400
18:42:30:[24960.443173]  [<ffffffff8121cf41>] ? update_time+0x81/0xd0
18:42:30:[24960.443813]  [<ffffffff81184b69>] generic_file_aio_write+0x59/0xa0
18:42:30:[24960.444430]  [<ffffffffc04d07db>] nfs_file_write+0xbb/0x1e0 [nfs]
18:42:30:[24960.445038]  [<ffffffff8120027d>] do_sync_write+0x8d/0xd0
18:42:30:[24960.445649]  [<ffffffff81200d3d>] vfs_write+0xbd/0x1e0
18:42:30:[24960.446153]  [<ffffffff81201b4f>] SyS_write+0x7f/0xe0
18:42:30:[24960.446675]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b

The blocked task is tee, waiting in the NFS client write path (nfs_updatepage -> nfs_wait_on_request), so the harness's log redirection stalled on NFS rather than on Lustre itself. This looks like LU-9777, which should have been fixed by DCO-7492; the underlying issue there was contention on the onyx mgmt node.
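For context, the repeated lctl get_param markers in the log come from the test framework polling for the pool to disappear after destroy. Below is a minimal sketch of that style of wait loop, loosely modeled on the wait_update helper in Lustre's test-framework.sh; the function name, poll interval, and timeout are assumptions, not the actual implementation:

# Poll until the pool parameter vanishes or a timeout expires. The
# "|| echo foo" sentinel makes "pool gone" (get_param fails) print "foo",
# matching the DEBUG MARKER lines captured above.
wait_pool_gone() {
    local pool=$1 timeout=${2:-90} elapsed=0 out
    while [ "$elapsed" -lt "$timeout" ]; do
        out=$(lctl get_param -n "lov.lustre-*.pools.$pool" 2>/dev/null || echo foo)
        [ "$out" = "foo" ] && return 0
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 1   # the "test failed to respond and timed out" case
}

The sentinel pattern avoids treating get_param's nonzero exit status as a test failure while still giving the poller a stable string to match. In this run, though, the poll loop never got the chance to time out cleanly: with tee blocked in NFS writeback, the whole test simply appeared unresponsive.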
