[LU-1542] Failure on sanity.sh, subtest test_132 Created: 19/Jun/12  Updated: 03/Sep/13  Resolved: 03/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4076

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/dd62db8a-b9da-11e1-86c2-52540035b04c.

The sub-test test_132 failed with the following error:

test failed to respond and timed out

This may relate to the startup issue in LU-1541, since this subtest remounts the servers with SOM enabled. However, I'm filing it separately for now, both for tracking and in case it turns out to be a separate bug.

Info required for matching: sanity 132
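
For context on the "remounting the servers with SOM enabled" step mentioned above, the sequence is roughly the one sketched below: SOM is toggled through the MGS and the MDS is remounted so the new mount-conf setting takes effect. This is a sketch, not a verbatim excerpt from sanity.sh; the parameter name mdt.som, the mount point and the device name are assumptions based on the 2.x SOM preview.

# Rough sketch of the SOM toggle/remount performed by test_132
# (assumed parameter name mdt.som; mount point and device are illustrative).
FSNAME=${FSNAME:-lustre}

# Current SOM state on the MDS (enabled/disabled) - assumed parameter name.
lctl get_param -n mdt.*.som

# Flip the setting through the MGS, then remount the MDS so the new
# mount-conf parameter actually takes effect.
lctl conf_param ${FSNAME}.mdt.som=enabled
umount /mnt/mds1                         # illustrative MDS mount point
mount -t lustre /dev/mds1_dev /mnt/mds1  # illustrative MDS device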



 Comments   
Comment by Ian Colle (Inactive) [ 28/Jun/12 ]

23:48:19:Lustre: MGS has stopped.
23:48:20:LustreError: 8956:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
23:48:20:LustreError: 8956:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108

Comment by Ian Colle (Inactive) [ 28/Jun/12 ]

https://maloo.whamcloud.com/test_sets/747c424e-c166-11e1-9055-52540035b04c

From the client console:
08:15:04:LustreError: 6547:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
08:15:04:LustreError: 6547:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Skipped 2 previous similar messages
08:15:04:LustreError: 6547:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
08:15:05:LustreError: 6547:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) Skipped 2 previous similar messages
08:15:05:Lustre: Unmounted lustre-client
08:16:03:LNet: 7089:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
08:16:03:LNet: 7089:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 1 previous similar message
08:16:11:LustreError: 152-6: Ignoring deprecated mount option 'acl'.
08:16:11:Lustre: MGC10.10.4.110@tcp: Reactivating import
08:16:11:Lustre: Increasing default stripe size to min 1048576
08:16:12:Lustre: Mounted lustre-client
08:16:12:LNet: 7509:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
08:16:12:LNet: 7509:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 1 previous similar message
08:16:14:Lustre: DEBUG MARKER: Using TIMEOUT=20
08:16:15:LustreError: 7803:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
08:16:19:Lustre: DEBUG MARKER: cancel_lru_locks osc start
08:16:20:LustreError: 7498:0:(cl_lock.c:2171:cl_lock_hold_add()) ASSERTION( lock->cll_state != CLS_FREEING ) failed:
08:16:20:LustreError: 7498:0:(cl_lock.c:2171:cl_lock_hold_add()) LBUG
08:16:20:Pid: 7498, comm: ll_close

Comment by Li Wei (Inactive) [ 31/Jul/12 ]

https://maloo.whamcloud.com/test_sets/fb5f0dea-daf8-11e1-9ebb-52540035b04c

Comment by Ian Colle (Inactive) [ 13/Aug/12 ]

https://maloo.whamcloud.com/test_sets/53d27ff8-e561-11e1-ae4e-52540035b04c

Comment by Keith Mannthey (Inactive) [ 06/Feb/13 ]

https://maloo.whamcloud.com/test_sessions/13798a36-6f5a-11e2-93c1-52540035b04c

I'm not 100% sure this is the same issue, but it is an assertion failure in the same spot that causes the MDS to reboot while test_132 times out.

The logs show 4 failures out of 100 runs on Feb 06.

14:06:17:LustreError: 11-0: lustre-OST0004-osc-MDT0000: Communicating with 10.10.4.195@tcp, operation ost_connect failed with -19.
14:06:18:Lustre: DEBUG MARKER: lctl get_param -n timeout
14:06:19:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
14:06:19:Lustre: DEBUG MARKER: Using TIMEOUT=20
14:06:19:Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
14:06:19:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.sys.jobid_var=procname_uid
14:07:12:Lustre: MGS: haven't heard from client 2b7f8516-fc0a-afb9-790c-1965aaaa46c2 (at 10.10.4.197@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff880078f2e800, cur 1360015629 expire 1360015599 last 1360015579
14:07:23:Lustre: lustre-MDT0000: haven't heard from client 5fcc94dc-d9c0-7c5c-7665-6b8afe791bb0 (at 10.10.4.197@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007832ec00, cur 1360015634 expire 1360015604 last 1360015584
14:07:23:LustreError: 17820:0:(lu_object.c:1982:lu_ucred_assert()) ASSERTION( uc != ((void *)0) ) failed: 
14:07:23:LustreError: 17820:0:(lu_object.c:1982:lu_ucred_assert()) LBUG
14:07:23:Pid: 17820, comm: ll_evictor
14:07:23:
14:07:23:Call Trace:
14:07:23: [<ffffffffa04d7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
14:07:23: [<ffffffffa04d7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
14:07:23: [<ffffffffa0664755>] lu_ucred_assert+0x45/0x50 [obdclass]
14:07:23: [<ffffffffa0c52c66>] mdd_xattr_sanity_check+0x36/0x1f0 [mdd]
14:07:23: [<ffffffffa0c58221>] mdd_xattr_del+0xf1/0x540 [mdd]
14:07:23: [<ffffffffa0e3fe0a>] mdt_som_attr_set+0xfa/0x390 [mdt]
14:07:23: [<ffffffffa0e401ec>] mdt_ioepoch_close_on_eviction+0x14c/0x170 [mdt]
14:07:23: [<ffffffffa0f100c9>] ? osp_key_init+0x59/0x1a0 [osp]
14:07:23: [<ffffffffa0e40c4b>] mdt_ioepoch_close+0x2ab/0x3b0 [mdt]
14:07:23: [<ffffffffa0e411fe>] mdt_mfd_close+0x4ae/0x6e0 [mdt]
14:07:23: [<ffffffffa0e1297e>] mdt_obd_disconnect+0x3ae/0x4d0 [mdt]
14:07:23: [<ffffffffa061cd78>] class_fail_export+0x248/0x580 [obdclass]
14:07:23: [<ffffffffa07f9079>] ping_evictor_main+0x249/0x640 [ptlrpc]
14:07:23: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
14:07:23: [<ffffffff8100c0ca>] child_rip+0xa/0x20
14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
14:07:23: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
14:07:23:
14:07:23:Kernel panic - not syncing: LBUG
.....
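
For matching purposes, the two distinct crash signatures collected on this ticket (the client-side cl_lock_hold_add() LBUG from the 28/Jun/12 run and the MDS-side lu_ucred_assert() LBUG above) can be told apart by grepping the saved console logs. A minimal sketch; the log path is illustrative, not a real Maloo location:

# Distinguish the two LBUG signatures seen on this ticket in saved console
# logs; point the path at wherever the console logs were downloaded.
grep -nE 'cl_lock_hold_add.*ASSERTION|lu_ucred_assert.*ASSERTION' \
    /path/to/console-logs/*.log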
Comment by Keith Mannthey (Inactive) [ 06/Feb/13 ]

It seems the above may have been caused by the patch being tested: http://review.whamcloud.com/5222

Comment by Andreas Dilger [ 03/Sep/13 ]

Closing this old Orion bug for now. I don't think the last comments were related to this problem.
