[LU-10525] racer test_1: Failure to initialize cl object Created: 16/Jan/18  Updated: 14/Aug/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6758 racer test_1: test failed to respond ... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

racer test_1 - test_1 failed with 2
^^^^^^^^^^^^^ DO NOT REMOVE LINE ABOVE ^^^^^^^^^^^^^

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/60989662-f935-11e7-a7cd-52540065bddc

test_1 failed with the following error:

test_1 failed with 2

server: 2.10.3 RC1 EL7
client: 2.10.3 RC1 EL6.9

test log

Lustre: Skipped 3 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff880041847c00: Connection restored to 10.9.5.195@tcp (at 10.9.5.195@tcp)
Lustre: Skipped 3 previous similar messages
LustreError: 16503:0:(lcommon_cl.c:181:cl_file_inode_init()) Failure to initialize cl object [0x200000401:0x2366:0x0]: -16
LustreError: 32509:0:(lcommon_cl.c:181:cl_file_inode_init()) Failure to initialize cl object [0x200000401:0x2486:0x0]: -16
INFO: task dir_create.sh:12824 blocked for more than 120 seconds.
      Tainted: G        W  -- ------------    2.6.32-696.18.7.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dir_create.sh D 0000000000000000     0 12824  12691 0x00000080
 ffff88007a03fb68 0000000000000086 ffff88007a03fba8 0000000000000080
 0000000000000000 ffff880059833149 0000004b00000000 ffffffffa103b393
 0000000000000098 0020000000000080 ffff88002c1a1068 ffff88007a03ffd8
Call Trace:
 [<ffffffff8154d146>] __mutex_lock_slowpath+0x96/0x210
 [<ffffffff811b5f68>] ? __d_lookup+0xd8/0x150
 [<ffffffff8154cc6b>] mutex_lock+0x2b/0x50
 [<ffffffff811aa63b>] do_lookup+0x11b/0x230
 [<ffffffff811aace0>] __link_path_walk+0x200/0x1060
 [<ffffffff81342f5c>] ? memory_open+0x3c/0xa0
 [<ffffffff811abdfa>] path_walk+0x6a/0xe0
 [<ffffffff811ad5da>] do_filp_open+0x1fa/0xd20
 [<ffffffff8115aeca>] ? handle_mm_fault+0x2aa/0x3f0
 [<ffffffff812a97fa>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff811bac52>] ? alloc_fd+0x92/0x160
 [<ffffffff81197507>] do_sys_open+0x67/0x130
 [<ffffffff8155660b>] ? system_call_after_swapgs+0x16b/0x220
 [<ffffffff81556604>] ? system_call_after_swapgs+0x164/0x220
 [<ffffffff815565fd>] ? system_call_after_swapgs+0x15d/0x220
 [<ffffffff81197610>] sys_open+0x20/0x30
 [<ffffffff815566d6>] system_call_fastpath+0x16/0x1b
 [<ffffffff8155656a>] ? system_call_after_swapgs+0xca/0x220
INFO: task ls:18508 blocked for more than 120 seconds.
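
The two LustreError lines above are emitted by cl_file_inode_init() (lcommon_cl.c:181) on the client; the trailing -16 is a negative errno. As a minimal decoding sketch, assuming a host with the errno(1) utility from moreutils (not part of the original report):

errno 16     # prints: EBUSY 16 Device or resource busy

The "blocked for more than 120 seconds" messages that follow are the kernel hung-task watchdog: dir_create.sh (and later ls) is stuck in do_lookup() waiting on a directory mutex inside sys_open(), consistent with the open path waiting on the MDS.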


 Comments   
Comment by Oleg Drokin [ 17/Jan/18 ]

The message quoted here is just a symptom on the client that something is hogging MDS resources, so the client cannot send more than one request, and that one request is stuck in processing on the MDS.

I looked at the logs on the MDS and this seems to be somewhat confirmed there by all the "cannot add more time, not sending early replies" messages, but there is no definite indication of what is stuck where, and since it was not a crash, I guess we did not collect any crashdumps either. As such, we don't really know much about what was going on here that led to the MDS being stuck.
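
As a speculative triage sketch for a future reproduction (none of this was run here, and the output file names are placeholders): capturing MDS-side state while the node is still hung would leave more to analyze than console messages alone.

lctl dk /tmp/lustre-debug.log      # dump the Lustre kernel debug buffer on the MDS
echo w > /proc/sysrq-trigger       # log stack traces of all blocked (D state) tasks to dmesg
dmesg > /tmp/mds-dmesg.log
# echo c > /proc/sysrq-trigger     # with kdump configured, force a panic to collect a crashdump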

Comment by Sarah Liu [ 21/May/18 ]

+1 on b2_10 https://testing.hpdd.intel.com/test_sets/03bf84be-5c0e-11e8-b9d3-52540065bddc

Comment by James Nunez (Inactive) [ 14/Aug/18 ]

racer test 1 times out with dir_create, dd, mv, lfs, etc. hung in a D state, for EL7 servers and client, with logs at https://testing.whamcloud.com/test_sets/a3be8a94-9c02-11e8-a9f7-52540065bddc. We don't see the 'initialize cl object' error in this hang.
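
For reference, a quick way to confirm which processes are hung in a D state on the affected nodes (standard procps usage, not taken from the linked logs):

ps axo stat,pid,ppid,comm,wchan | awk '$1 ~ /^D/'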
