
sanity test_133g: timeout during MDT mount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run:

      Info required for matching: sanity 133g
      Info required for matching: sanity 133h
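
      For reference, the affected subtests can be run in isolation with the test framework's ONLY filter. This is a sketch only; the exact invocation and repository path depend on the local test configuration.

      # Sketch only: ONLY= is the standard subtest filter for the Lustre test
      # scripts; the path below assumes a lustre-release checkout.
      ONLY="133g 133h" bash lustre/tests/sanity.sh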


          Activity

            [LU-10401] sanity test_133g: timeout during MDT mount
            ys Yang Sheng added a comment -

            From the stack trace, this issue appears to be a duplicate of LU-11761.


            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38567
            Subject: LU-10401 test: add parameter to print entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cd538d7db07365a3edd5ef9af5c260d22a673a4

            ys Yang Sheng added a comment -

            Hi, Andreas,

            I have been working to find the root cause. The stranger thing is that a failover was triggered. It looks like some variable was exported incorrectly?

            cln..Failing mds1 on trevis-19vm4
            CMD: trevis-19vm4 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
            Stopping /mnt/lustre-mds1 (opts:) on trevis-19vm4
            CMD: trevis-19vm4 umount -d /mnt/lustre-mds1
            CMD: trevis-19vm4 lsmod | grep lnet > /dev/null &&
            lctl dl | grep ' ST ' || true
            

            Thanks,
            YangSheng
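
            One way to check that suspicion (a sketch only; which variable, if any, was mis-exported is an assumption) is to dump anything failover-related from the test environment on the node before the run:

            # Sketch only: show any failover-related variables exported in the
            # test environment, to see whether one of them could have steered
            # the cleanup path into fail()/facet_failover.
            env | grep -i fail || echo "nothing failover-related exported"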


            adilger Andreas Dilger added a comment -

            YS, can you please make a patch to increase the timeout, or speed up recovery, so this test will pass consistently.
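
            One possible shape for that, as a sketch only (not the actual patch): poll the MDT recovery status until it reports COMPLETE instead of relying on a fixed wait. do_facet and the recovery_status parameter are standard in the test framework; the helper name and the 300s cap here are assumptions.

            # Sketch only, not the actual fix: wait for MDT recovery to finish
            # (or give up after a cap) before the test proceeds.
            wait_mdt_recovery() {
                local facet=${1:-mds1} limit=${2:-300}
                local end=$((SECONDS + limit))
                while (( SECONDS < end )); do
                    do_facet "$facet" "lctl get_param -n mdt.*.recovery_status" 2>/dev/null |
                        grep -q 'status: COMPLETE' && return 0
                    sleep 2
                done
                echo "recovery did not complete within ${limit}s" >&2
                return 1
            }
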
            ys Yang Sheng added a comment -

            From the client log:

            [10864.205961] ptlrpcd_rcv     S    0  7446      2 0x80000080
            [10864.206882] Call Trace:
            [10864.207350]  ? __schedule+0x253/0x830
            [10864.208002]  schedule+0x28/0x70
            [10864.208566]  schedule_timeout+0x16b/0x390
            [10864.209283]  ? __next_timer_interrupt+0xc0/0xc0
            [10864.210087]  ptlrpc_set_wait+0x4ba/0x6e0 [ptlrpc]
            [10864.210904]  ? finish_wait+0x80/0x80
            [10864.211568]  ptlrpc_queue_wait+0x7e/0x210 [ptlrpc]
            [10864.212407]  fld_client_rpc+0x277/0x580 [fld]
            [10864.213238]  ? cfs_trace_unlock_tcd+0x2e/0x80 [libcfs]
            [10864.214125]  fld_client_lookup+0x254/0x470 [fld]
            [10864.214952]  lmv_fld_lookup+0x8c/0x420 [lmv]
            [10864.215712]  lmv_lock_match+0x7c/0x3f0 [lmv]
            [10864.216621]  ll_have_md_lock+0x169/0x3b0 [lustre]
            [10864.217447]  ? vsnprintf+0x101/0x520
            [10864.218092]  ll_md_blocking_ast+0x60d/0xbd0 [lustre]
            [10864.218976]  ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
            [10864.219868]  ? ldlm_lock_remove_from_lru_nolock+0x38/0xf0 [ptlrpc]
            [10864.220929]  ldlm_lock_cancel+0x55/0x1c0 [ptlrpc]
            [10864.221772]  ldlm_cli_cancel_list_local+0x8f/0x300 [ptlrpc]
            [10864.222740]  ldlm_replay_locks+0x662/0x850 [ptlrpc]
            [10864.223611]  ptlrpc_import_recovery_state_machine+0x868/0x970 [ptlrpc]
            [10864.224735]  ptlrpc_connect_interpret+0x11f0/0x22d0 [ptlrpc]
            [10864.225717]  ? after_reply+0x8de/0xd30 [ptlrpc]
            [10864.226519]  ptlrpc_check_set+0x50c/0x21f0 [ptlrpc]
            [10864.227366]  ? schedule_timeout+0x173/0x390
            [10864.228119]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
            [10864.228953]  ptlrpcd+0x3d0/0x4c0 [ptlrpc]
            [10864.229660]  ? finish_wait+0x80/0x80
            [10864.230329]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
            [10864.231162]  kthread+0x112/0x130
            [10864.231747]  ? kthread_flush_work_fn+0x10/0x10
            [10864.232516]  ret_from_fork+0x35/0x40
            

            Looks like recovery still has not finished.

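            A quick way to confirm that from the client side (a sketch; the field name is from the usual import dump, and the exact output varies by version):

            # Sketch: if the MDC import is still replaying locks rather than
            # fully connected, its state will not yet be FULL.
            lctl get_param mdc.*.import | grep 'state:'
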

            adilger Andreas Dilger added a comment -

            YS, can you please look into why this test is failing. It only has problems with CentOS 7.8 and RHEL 8.1, not the regular CentOS 7.7 runs that are part of review testing.

            adilger Andreas Dilger added a comment -

            Some examples of recent failures, which are specific to CentOS 7.8 and RHEL 8.0:
            https://testing.whamcloud.com/test_sets/a85aeb21-b186-4e60-990c-cf89bb26d855
            https://testing.whamcloud.com/test_sets/75f7b5a2-05d8-4e7b-8ca7-6253da6a7add

            adilger Andreas Dilger added a comment -

            It's not clear if this patch will solve the failures, but it might. I'd like to keep this issue open until I can get a passing result from e2fsprogs. Unfortunately, there is no way to test e2fsprogs in autotest against a specific patch, so this had to be landed before I could retry with e2fsprogs.

            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38444/
            Subject: LU-10401 tests: fix error from 'tr -d='
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 52b5f4a5c3fc942f2b6aef9dbed780bd2c2a6798
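
            For context on what a fix with that subject addresses (an illustration of the quoting pitfall only, not the actual patch hunk): with GNU tr, an unquoted '=' glued onto -d is parsed as part of the option cluster rather than as the character set to delete.

            # Illustration only; the parameter name is made up for the example.
            echo "llite.lustre-ffff.stats=" | tr -d=      # fails: '=' is treated as an (invalid) option
            echo "llite.lustre-ffff.stats=" | tr -d '='   # deletes the '=' as intended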

            adilger Andreas Dilger added a comment -

            I haven't been able to get RHEL 7.8 to pass at this point.

            People

              Assignee: ys Yang Sheng
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
