[LU-6461] Interop 2.5.3<->master recovery-small test_9: task ll_ost_io00_0 in D state
Created: 13/Apr/15  Updated: 10/Oct/21  Resolved: 10/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Won't Fix
Votes: 0
Labels: None
Environment:

server: lustre-master build #2983
client: 2.5.3


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/95224732-dfb8-11e4-b5b0-5254006e85c2.

The sub-test test_9 failed with the following error:

test failed to respond and timed out

OST console

03:45:52:Lustre: DEBUG MARKER: == recovery-small test 9: pause bulk on OST (bug 1420) == 20:39:30 (1428637170)
03:45:52:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x214
03:45:52:LustreError: 11613:0:(fail.c:132:__cfs_fail_timeout_set()) cfs_fail_timeout id 214 sleeping for 20000000ms
03:45:52:INFO: task ll_ost_io00_002:11613 blocked for more than 120 seconds.
03:45:52:      Not tainted 2.6.32-504.12.2.el6_lustre.x86_64 #1
03:45:52:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
03:45:52:ll_ost_io00_0 D 0000000000000001     0 11613      2 0x00000080
03:45:52: ffff88005081bad0 0000000000000046 ffffffffa04bce48 0000000000000000
03:45:52: ffff88005081bb30 0000000061492ae0 ffffffffa04e5110 ffff88007047448e
03:45:52: 0000004e379f4150 ffffffffa04bce3e ffff880061493098 ffff88005081bfd8
03:45:52:Call Trace:
03:45:52: [<ffffffff8152b162>] schedule_timeout+0x192/0x2e0
03:45:52: [<ffffffff810874f0>] ? process_timeout+0x0/0x10
03:45:52: [<ffffffffa04a6141>] __cfs_fail_timeout_set+0xe1/0x160 [libcfs]
03:45:52: [<ffffffffa0f01b27>] cfs_fail_timeout_set.clone.2+0x27/0x40 [ptlrpc]
03:45:52: [<ffffffffa0f0854b>] tgt_brw_write+0x36b/0x1530 [ptlrpc]
03:45:52: [<ffffffffa04a9161>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:45:52: [<ffffffffa04a9161>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:45:52: [<ffffffffa04a5798>] ? libcfs_log_return+0x28/0x40 [libcfs]
03:45:52: [<ffffffffa0f046cd>] ? tgt_request_preprocess+0x20d/0x1370 [ptlrpc]
03:45:53: [<ffffffffa04a9161>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:45:53: [<ffffffffa0f07a9e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
03:45:53: [<ffffffffa0eb7a51>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
03:45:53: [<ffffffffa0eb6c10>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
03:45:53: [<ffffffff8109e66e>] kthread+0x9e/0xc0
03:45:53: [<ffffffff8100c20a>] child_rip+0xa/0x20
03:45:53: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
03:45:53: [<ffffffff8100c200>] ? child_rip+0x0/0x20
03:45:53:INFO: task ll_ost_io00_002:11613 blocked for more than 120 seconds.
03:45:53:      Not tainted 2.6.32-504.12.2.el6_lustre.x86_64 #1
03:45:54:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.


 Comments   
Comment by Andreas Dilger [ 21/Apr/15 ]

It seems that pause_bulk() was changed in 2.5.52 by the patch "LU-793 ptlrpc: allow client to reconnect with RPC in progress" (http://review.whamcloud.com/4960), so this may just be a test interop issue. That patch changed pause_bulk() to use fail_val for the timeout value, and in this case waiting 20000s (about 5.5 hours) is probably not what the test wants. Is fail_val not being set correctly?
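
As a quick check of the duration discussed in the comment above, here is a minimal Python sketch that parses the cfs_fail_timeout line from the OST console and converts the sleep into hours, minutes and seconds. The regular expression and the helper name parse_fail_timeout are hypothetical, written only for this one message format, and reading the id as hexadecimal is an assumption based on the preceding fail_loc=0x214.

```python
import re
from datetime import timedelta

# Console line copied verbatim from the OST log in this ticket.
LINE = ("03:45:52:LustreError: 11613:0:(fail.c:132:__cfs_fail_timeout_set()) "
        "cfs_fail_timeout id 214 sleeping for 20000000ms")

# Hypothetical pattern for this single message format; not part of Lustre.
PATTERN = re.compile(r"cfs_fail_timeout id ([0-9a-fA-F]+) sleeping for (\d+)ms")

# Threshold taken from "blocked for more than 120 seconds" in the log.
HUNG_TASK_TIMEOUT_S = 120


def parse_fail_timeout(line):
    """Return (fail id, sleep duration) or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Assumption: the id is printed in hex, matching fail_loc=0x214.
    fail_id = int(match.group(1), 16)
    sleep = timedelta(milliseconds=int(match.group(2)))
    return fail_id, sleep


fail_id, sleep = parse_fail_timeout(LINE)
exceeds_watchdog = sleep.total_seconds() > HUNG_TASK_TIMEOUT_S
print(hex(fail_id), sleep, exceeds_watchdog)
```

For the line in this ticket, the sleep works out to 20000 seconds (5 hours 33 minutes 20 seconds), far beyond the 120-second hung-task threshold, which is consistent with the repeated "blocked for more than 120 seconds" reports in the console log.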

Generated at Sat Feb 10 02:00:26 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.