[LU-12985] sanity test_60g: Timeout occurred after 140 mins Created: 20/Nov/19  Updated: 12/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14307 sanity test 60g crashes with ‘unable ... Resolved
is related to LU-15644 failed llog cancel should not generat... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for liuying <emoly.liu@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b24183b8-0ab1-11ea-bbc3-52540065bddc

test_60g failed with the following error:

Timeout occurred after 140 mins, last suite running was sanity, restarting cluster to continue tests
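
The debug markers in the log below show the test repeatedly arming fail point 0x19a on the MDS (the test title associates it with forcing a transaction abort) and then driving metadata operations into it; the 0x80000000 bit requests one-shot triggering. A minimal sketch of that pattern follows — the mount point and directory name are assumptions for illustration, not taken from sanity.sh:

# Sketch of the fault-injection pattern visible in the markers below.
# 0x19a is the fail point test_60g arms; OR-ing in 0x80000000 (the
# "fail once" flag) makes it fire a single time and then clear itself.
# /mnt/lustre and the directory are hypothetical.
lctl set_param fail_loc=0x8000019a     # arm the fail point on the MDS
mkdir -p /mnt/lustre/d60g/subdir       # client metadata op hits the abort
lctl set_param fail_loc=0              # disarm when finished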

The dmesg on MDS1 (https://testing.whamcloud.com/test_logs/bd780c0c-0ab1-11ea-bbc3-52540065bddc/show_text) showed:

[ 4806.548607] Lustre: DEBUG MARKER: == sanity test 60g: transaction abort won't cause MDT hung =========================================== 05:13:15 (1574140395)
[ 4806.746423] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4807.187510] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4807.774430] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4808.259070] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4808.472999] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4809.183645] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4809.491198] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4809.492745] Lustre: Skipped 1 previous similar message
[ 4809.493924] LustreError: 16969:0:(llog_cat.c:744:llog_cat_cancel_arr_rec()) lustre-MDT0000-osd: fail to cancel 1 llog-records: rc = -5
[ 4809.496232] LustreError: 16969:0:(llog_cat.c:781:llog_cat_cancel_records()) lustre-MDT0000-osd: fail to cancel 1 of 1 llog-records: rc = -5
[ 4809.880666] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4810.586256] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4810.762596] LustreError: 19797:0:(llog_cat.c:744:llog_cat_cancel_arr_rec()) lustre-MDT0001-osp-MDT0002: fail to cancel 1 llog-records: rc = -116
[ 4810.765173] LustreError: 19797:0:(llog_cat.c:781:llog_cat_cancel_records()) lustre-MDT0001-osp-MDT0002: fail to cancel 1 of 1 llog-records: rc = -116
[ 4811.290016] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4811.828688] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4811.829600] Lustre: Skipped 2 previous similar messages
[ 4811.842495] LustreError: 21702:0:(llog_osd.c:248:llog_osd_read_header()) lustre-MDT0001-osp-MDT0002: error reading [0x240000402:0x3:0x0] log header size 92: rc = -14
...
[ 4851.890821] Lustre: mdt00_003: service thread pid 21702 was inactive for 40.059 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 4851.893891] Pid: 21702, comm: mdt00_003 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Fri Nov 8 17:23:54 UTC 2019
[ 4851.895471] Call Trace:
[ 4851.896004]  [<ffffffffa1788ce8>] call_rwsem_down_read_failed+0x18/0x30
[ 4851.897198]  [<ffffffffc0ae01de>] llog_cat_prep_log+0x33e/0x360 [obdclass]
[ 4851.898438]  [<ffffffffc0ae0279>] llog_cat_declare_add_rec+0x79/0x260 [obdclass]
[ 4851.899728]  [<ffffffffc0ad7178>] llog_declare_add+0x78/0x1a0 [obdclass]
[ 4851.900974]  [<ffffffffc0e85cde>] top_trans_start+0x17e/0x940 [ptlrpc]
[ 4851.902438]  [<ffffffffc12f32b4>] lod_trans_start+0x34/0x40 [lod]
[ 4851.903651]  [<ffffffffc13aef9a>] mdd_trans_start+0x1a/0x20 [mdd]
[ 4851.904822]  [<ffffffffc13932f2>] mdd_create+0xbe2/0x1630 [mdd]
[ 4851.905840]  [<ffffffffc121ce84>] mdt_create+0xb54/0x10e0 [mdt]
[ 4851.906916]  [<ffffffffc121d57b>] mdt_reint_create+0x16b/0x360 [mdt]
[ 4851.908040]  [<ffffffffc1222ab3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 4851.909115]  [<ffffffffc11fc9e0>] mdt_reint_internal+0x7b0/0xba0 [mdt]
[ 4851.910272]  [<ffffffffc12086d7>] mdt_reint+0x67/0x140 [mdt]
[ 4851.911292]  [<ffffffffc0e6ffda>] tgt_request_handle+0x97a/0x1620 [ptlrpc]
[ 4851.912527]  [<ffffffffc0e16ab6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 4851.913969]  [<ffffffffc0e1a5ec>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 4851.915083]  [<ffffffffa14c2e81>] kthread+0xd1/0xe0
[ 4851.915998]  [<ffffffffa1b77c37>] ret_from_fork_nospec_end+0x0/0x39
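
The stack trace shows service thread mdt00_003 blocked in call_rwsem_down_read_failed() under llog_cat_prep_log(), i.e. waiting on an rw-semaphore that some earlier path never released; the failed record cancels logged just before the hang (rc = -5/-EIO and rc = -116/-ESTALE, compare the linked LU-15644) are plausible suspects. If this reproduces on a live node, a quick way to confirm the thread is in uninterruptible sleep and re-capture its stack — a hedged sketch using standard Linux interfaces, not part of the test; the pid comes from the console message above:

# Hedged diagnostic sketch for a hung MDT service thread.
pid=21702
grep State /proc/$pid/status    # expect "D (disk sleep)" for a hung thread
cat /proc/$pid/stack            # current kernel stack of the stuck thread
echo w > /proc/sysrq-trigger    # dump all D-state tasks to the console log

The sysrq-w dump lists every uninterruptible task, which also shows whether other mdt threads are queued behind the same semaphore.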




 Comments   
Comment by Emoly Liu [ 21/Nov/19 ]

This issue has occurred six times since September 2019 in review-dne-zfs-part-1 testing on master.

Comment by Chris Horn [ 05/Jun/20 ]

+1 on master: https://testing.whamcloud.com/test_sessions/1442755c-07f6-43cb-8623-8fcc99595c73

Comment by Bruno Faccini (Inactive) [ 08/Jun/20 ]

+1 on latest master: https://testing.whamcloud.com/test_sessions/8423dc4b-8f77-4f29-857b-096fefea780e (review-dne-zfs-part-1)

Comment by Andreas Dilger [ 10/Jun/20 ]

+1 on master https://testing.whamcloud.com/test_sets/e8e97342-30de-40f3-af85-2a44a7a06a51 (review-dne-zfs-part-1)
