Details
- Bug
- Resolution: Unresolved
- Minor
- None
- None
- None
- 3
- 9223372036854775807
Description
This issue was created by maloo for liuying <emoly.liu@intel.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b24183b8-0ab1-11ea-bbc3-52540065bddc
test_60g failed with the following error:
Timeout occurred after 140 mins, last suite running was sanity, restarting cluster to continue tests
The dmesg log on MDS1 (https://testing.whamcloud.com/test_logs/bd780c0c-0ab1-11ea-bbc3-52540065bddc/show_text) shows:
[ 4806.548607] Lustre: DEBUG MARKER: == sanity test 60g: transaction abort won't cause MDT hung =========================================== 05:13:15 (1574140395)
[ 4806.746423] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4807.187510] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4807.774430] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4808.259070] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4808.472999] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4809.183645] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4809.491198] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4809.492745] Lustre: Skipped 1 previous similar message
[ 4809.493924] LustreError: 16969:0:(llog_cat.c:744:llog_cat_cancel_arr_rec()) lustre-MDT0000-osd: fail to cancel 1 llog-records: rc = -5
[ 4809.496232] LustreError: 16969:0:(llog_cat.c:781:llog_cat_cancel_records()) lustre-MDT0000-osd: fail to cancel 1 of 1 llog-records: rc = -5
[ 4809.880666] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4810.586256] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4810.762596] LustreError: 19797:0:(llog_cat.c:744:llog_cat_cancel_arr_rec()) lustre-MDT0001-osp-MDT0002: fail to cancel 1 llog-records: rc = -116
[ 4810.765173] LustreError: 19797:0:(llog_cat.c:781:llog_cat_cancel_records()) lustre-MDT0001-osp-MDT0002: fail to cancel 1 of 1 llog-records: rc = -116
[ 4811.290016] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4811.828688] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4811.829600] Lustre: Skipped 2 previous similar messages
[ 4811.842495] LustreError: 21702:0:(llog_osd.c:248:llog_osd_read_header()) lustre-MDT0001-osp-MDT0002: error reading [0x240000402:0x3:0x0] log header size 92: rc = -14
...
[ 4851.890821] Lustre: mdt00_003: service thread pid 21702 was inactive for 40.059 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 4851.893891] Pid: 21702, comm: mdt00_003 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Fri Nov 8 17:23:54 UTC 2019
[ 4851.895471] Call Trace:
[ 4851.896004] [<ffffffffa1788ce8>] call_rwsem_down_read_failed+0x18/0x30
[ 4851.897198] [<ffffffffc0ae01de>] llog_cat_prep_log+0x33e/0x360 [obdclass]
[ 4851.898438] [<ffffffffc0ae0279>] llog_cat_declare_add_rec+0x79/0x260 [obdclass]
[ 4851.899728] [<ffffffffc0ad7178>] llog_declare_add+0x78/0x1a0 [obdclass]
[ 4851.900974] [<ffffffffc0e85cde>] top_trans_start+0x17e/0x940 [ptlrpc]
[ 4851.902438] [<ffffffffc12f32b4>] lod_trans_start+0x34/0x40 [lod]
[ 4851.903651] [<ffffffffc13aef9a>] mdd_trans_start+0x1a/0x20 [mdd]
[ 4851.904822] [<ffffffffc13932f2>] mdd_create+0xbe2/0x1630 [mdd]
[ 4851.905840] [<ffffffffc121ce84>] mdt_create+0xb54/0x10e0 [mdt]
[ 4851.906916] [<ffffffffc121d57b>] mdt_reint_create+0x16b/0x360 [mdt]
[ 4851.908040] [<ffffffffc1222ab3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 4851.909115] [<ffffffffc11fc9e0>] mdt_reint_internal+0x7b0/0xba0 [mdt]
[ 4851.910272] [<ffffffffc12086d7>] mdt_reint+0x67/0x140 [mdt]
[ 4851.911292] [<ffffffffc0e6ffda>] tgt_request_handle+0x97a/0x1620 [ptlrpc]
[ 4851.912527] [<ffffffffc0e16ab6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 4851.913969] [<ffffffffc0e1a5ec>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 4851.915083] [<ffffffffa14c2e81>] kthread+0xd1/0xe0
[ 4851.915998] [<ffffffffa1b77c37>] ret_from_fork_nospec_end+0x0/0x39
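The stack trace shows the mdt00_003 service thread blocked on a read semaphore in llog_cat_prep_log() while starting a transaction for mdd_create(), which is what eventually tripped the 140-minute suite timeout. For context, below is a minimal sketch of the kind of fault-injection load test_60g applies; it is not the exact test script, and the mount point, file count, and use of ssh to reach the MDS are assumptions, while the fail_loc value 0x8000019a is taken from the DEBUG MARKER lines above.

#!/bin/bash
# Sketch only: approximate the create-under-fault pattern visible in the dmesg above.
# Assumptions: a client mount at /mnt/lustre and passwordless ssh to the MDS node.
MDS=${MDS:-mds1}
MNT=${MNT:-/mnt/lustre}

mkdir -p "$MNT/d60g"
for i in $(seq 1 100); do
    # One-shot fault injection: 0x8000019a = 0x80000000 (one-shot flag) | 0x19a,
    # armed on the MDS as the DEBUG MARKER lines in the log show.
    ssh "$MDS" lctl set_param fail_loc=0x8000019a
    # Each create drives the MDT through llog_declare_add()/top_trans_start(),
    # where the injected failure aborts the transaction.
    touch "$MNT/d60g/f$i"
done
# Expected outcome per the test name ("transaction abort won't cause MDT hung"):
# the creates keep completing; in this run the service thread instead stayed
# blocked in llog_cat_prep_log() and the cluster had to be restarted.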
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_60g - Timeout occurred after 140 mins, last suite running was sanity, restarting cluster to continue tests