Details
- Bug
- Resolution: Unresolved
- Minor
- None
- None
- None
- 3
- 9223372036854775807
Description
This issue was created by maloo for liuying <emoly.liu@intel.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b24183b8-0ab1-11ea-bbc3-52540065bddc
test_60g failed with the following error:
Timeout occurred after 140 mins, last suite running was sanity, restarting cluster to continue tests
The dmesg log on MDS1 (https://testing.whamcloud.com/test_logs/bd780c0c-0ab1-11ea-bbc3-52540065bddc/show_text) shows:
[ 4806.548607] Lustre: DEBUG MARKER: == sanity test 60g: transaction abort won't cause MDT hung =========================================== 05:13:15 (1574140395)
[ 4806.746423] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4807.187510] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4807.774430] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4808.259070] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4808.472999] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4809.183645] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4809.491198] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4809.492745] Lustre: Skipped 1 previous similar message
[ 4809.493924] LustreError: 16969:0:(llog_cat.c:744:llog_cat_cancel_arr_rec()) lustre-MDT0000-osd: fail to cancel 1 llog-records: rc = -5
[ 4809.496232] LustreError: 16969:0:(llog_cat.c:781:llog_cat_cancel_records()) lustre-MDT0000-osd: fail to cancel 1 of 1 llog-records: rc = -5
[ 4809.880666] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4810.586256] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4810.762596] LustreError: 19797:0:(llog_cat.c:744:llog_cat_cancel_arr_rec()) lustre-MDT0001-osp-MDT0002: fail to cancel 1 llog-records: rc = -116
[ 4810.765173] LustreError: 19797:0:(llog_cat.c:781:llog_cat_cancel_records()) lustre-MDT0001-osp-MDT0002: fail to cancel 1 of 1 llog-records: rc = -116
[ 4811.290016] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
[ 4811.828688] Lustre: *** cfs_fail_loc=19a, val=0***
[ 4811.829600] Lustre: Skipped 2 previous similar messages
[ 4811.842495] LustreError: 21702:0:(llog_osd.c:248:llog_osd_read_header()) lustre-MDT0001-osp-MDT0002: error reading [0x240000402:0x3:0x0] log header size 92: rc = -14
...
[ 4851.890821] Lustre: mdt00_003: service thread pid 21702 was inactive for 40.059 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 4851.893891] Pid: 21702, comm: mdt00_003 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Fri Nov 8 17:23:54 UTC 2019
[ 4851.895471] Call Trace:
[ 4851.896004] [<ffffffffa1788ce8>] call_rwsem_down_read_failed+0x18/0x30
[ 4851.897198] [<ffffffffc0ae01de>] llog_cat_prep_log+0x33e/0x360 [obdclass]
[ 4851.898438] [<ffffffffc0ae0279>] llog_cat_declare_add_rec+0x79/0x260 [obdclass]
[ 4851.899728] [<ffffffffc0ad7178>] llog_declare_add+0x78/0x1a0 [obdclass]
[ 4851.900974] [<ffffffffc0e85cde>] top_trans_start+0x17e/0x940 [ptlrpc]
[ 4851.902438] [<ffffffffc12f32b4>] lod_trans_start+0x34/0x40 [lod]
[ 4851.903651] [<ffffffffc13aef9a>] mdd_trans_start+0x1a/0x20 [mdd]
[ 4851.904822] [<ffffffffc13932f2>] mdd_create+0xbe2/0x1630 [mdd]
[ 4851.905840] [<ffffffffc121ce84>] mdt_create+0xb54/0x10e0 [mdt]
[ 4851.906916] [<ffffffffc121d57b>] mdt_reint_create+0x16b/0x360 [mdt]
[ 4851.908040] [<ffffffffc1222ab3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 4851.909115] [<ffffffffc11fc9e0>] mdt_reint_internal+0x7b0/0xba0 [mdt]
[ 4851.910272] [<ffffffffc12086d7>] mdt_reint+0x67/0x140 [mdt]
[ 4851.911292] [<ffffffffc0e6ffda>] tgt_request_handle+0x97a/0x1620 [ptlrpc]
[ 4851.912527] [<ffffffffc0e16ab6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 4851.913969] [<ffffffffc0e1a5ec>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 4851.915083] [<ffffffffa14c2e81>] kthread+0xd1/0xe0
[ 4851.915998] [<ffffffffa1b77c37>] ret_from_fork_nospec_end+0x0/0x39
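The stack trace shows the mdt00_003 service thread blocked on a read semaphore in llog_cat_prep_log() while starting a transaction for mdd_create(), which is what eventually tripped the 140-minute suite timeout. For context, below is a minimal sketch of the kind of fault-injection load test_60g applies; it is not the exact test script, and the mount point, file count, and use of ssh to reach the MDS are assumptions, while the fail_loc value 0x8000019a is taken from the DEBUG MARKER lines above.

#!/bin/bash
# Sketch only: approximate the create-under-fault pattern visible in the dmesg above.
# Assumptions: a client mount at /mnt/lustre and passwordless ssh to the MDS node.
MDS=${MDS:-mds1}
MNT=${MNT:-/mnt/lustre}

mkdir -p "$MNT/d60g"
for i in $(seq 1 100); do
    # One-shot fault injection: 0x8000019a = 0x80000000 (one-shot flag) | 0x19a,
    # armed on the MDS as the DEBUG MARKER lines in the log show.
    ssh "$MDS" lctl set_param fail_loc=0x8000019a
    # Each create drives the MDT through llog_declare_add()/top_trans_start(),
    # where the injected failure aborts the transaction.
    touch "$MNT/d60g/f$i"
done
# Expected outcome per the test name ("transaction abort won't cause MDT hung"):
# the creates keep completing; in this run the service thread instead stayed
# blocked in llog_cat_prep_log() and the cluster had to be restarted.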
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_60g - Timeout occurred after 140 mins, last suite running was sanity, restarting cluster to continue tests