LU-10401: sanity test_133g: timeout during MDT mount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run:

      Info required for matching: sanity 133g
      Info required for matching: sanity 133h
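
      To reproduce only the failing subtests locally, the sanity suite can be restricted with the ONLY variable; a minimal sketch, assuming a standard Lustre test setup with the usual test configuration (e.g. cfg/local.sh) already in place:

      # Run just sanity subtests 133g and 133h; ONLY= is the standard
      # test-framework mechanism for limiting a suite to specific subtests.
      cd lustre/tests
      ONLY="133g 133h" bash sanity.sh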

    Activity

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38567/
            Subject: LU-10401 tests: add -F so list_param prints entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1c54733894f81e854363fbd2d49c141842f73ae4
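
            For context, the -F option makes "lctl list_param" append a type marker to each entry, in the style of ls -F, so a caller can tell parameter directories apart from plain parameter files. A minimal illustration, assuming a mounted Lustre client; the parameter pattern below is only an example:

            # With -F, each entry gets a trailing character indicating its type
            # (e.g. '/' for a parameter directory); without it, all names look alike.
            lctl list_param -F 'mdc.*'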

            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38679
            Subject: LU-10401 test: debug patch
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: a0a337f38866706829cd0a45b7ec63d97fc4406e
            ys Yang Sheng added a comment -

            Looks like the problem still exists. Investigating further.
            pjones Peter Jones added a comment -

            Hmm, so did the fix for LU-11761 (marked as included in 2.12.3) not work then?
            ys Yang Sheng added a comment -

            Judging from the stack trace, this issue appears to be a duplicate of LU-11761.

            gerrit Gerrit Updater added a comment -

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38567
            Subject: LU-10401 test: add parameter to print entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cd538d7db07365a3edd5ef9af5c260d22a673a4
            ys Yang Sheng added a comment -

            Hi, Andreas,

            I have been working to find the root cause. The stranger thing is that a failover was triggered. It looks like some variable was exported incorrectly?

            cln..Failing mds1 on trevis-19vm4
            CMD: trevis-19vm4 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
            Stopping /mnt/lustre-mds1 (opts:) on trevis-19vm4
            CMD: trevis-19vm4 umount -d /mnt/lustre-mds1
            CMD: trevis-19vm4 lsmod | grep lnet > /dev/null &&
            lctl dl | grep ' ST ' || true
            

            Thanks,
            YangSheng

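            A quick way to check whether an unexpectedly exported variable is steering the framework into its failover path is to dump the failover-related environment on the test node. A minimal sketch; the variable names below are assumptions based on common test-framework.sh conventions:

            # Print the variables the failover/cleanup path is likely to consult;
            # the name list is an assumption, adjust to the local configuration.
            env | grep -E '^(FAILURE_MODE|MDSCOUNT|mds[0-9]*_HOST|MDSDEV[0-9]*)=' || true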

            adilger Andreas Dilger added a comment -

            YS, can you please make a patch to increase the timeout, or speed up recovery, so this test will pass consistently.
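
            One way to act on the "speed up recovery or wait longer" suggestion is for the test to wait explicitly until MDT recovery finishes before continuing. A rough sketch, assuming the usual mdt.*.recovery_status parameter is readable on the MDS node and reports a "status:" line:

            # Poll the MDT recovery status until it reports COMPLETE, or give up
            # after five minutes; path and output format are assumed from the
            # common recovery_status layout.
            deadline=$((SECONDS + 300))
            while [ "$SECONDS" -lt "$deadline" ]; do
                status=$(lctl get_param -n 'mdt.*.recovery_status' 2>/dev/null |
                         awk '/^status:/ { print $2; exit }')
                [ "$status" = "COMPLETE" ] && break
                sleep 5
            done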
            ys Yang Sheng added a comment -

            From client log:

            [10864.205961] ptlrpcd_rcv     S    0  7446      2 0x80000080
            [10864.206882] Call Trace:
            [10864.207350]  ? __schedule+0x253/0x830
            [10864.208002]  schedule+0x28/0x70
            [10864.208566]  schedule_timeout+0x16b/0x390
            [10864.209283]  ? __next_timer_interrupt+0xc0/0xc0
            [10864.210087]  ptlrpc_set_wait+0x4ba/0x6e0 [ptlrpc]
            [10864.210904]  ? finish_wait+0x80/0x80
            [10864.211568]  ptlrpc_queue_wait+0x7e/0x210 [ptlrpc]
            [10864.212407]  fld_client_rpc+0x277/0x580 [fld]
            [10864.213238]  ? cfs_trace_unlock_tcd+0x2e/0x80 [libcfs]
            [10864.214125]  fld_client_lookup+0x254/0x470 [fld]
            [10864.214952]  lmv_fld_lookup+0x8c/0x420 [lmv]
            [10864.215712]  lmv_lock_match+0x7c/0x3f0 [lmv]
            [10864.216621]  ll_have_md_lock+0x169/0x3b0 [lustre]
            [10864.217447]  ? vsnprintf+0x101/0x520
            [10864.218092]  ll_md_blocking_ast+0x60d/0xbd0 [lustre]
            [10864.218976]  ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
            [10864.219868]  ? ldlm_lock_remove_from_lru_nolock+0x38/0xf0 [ptlrpc]
            [10864.220929]  ldlm_lock_cancel+0x55/0x1c0 [ptlrpc]
            [10864.221772]  ldlm_cli_cancel_list_local+0x8f/0x300 [ptlrpc]
            [10864.222740]  ldlm_replay_locks+0x662/0x850 [ptlrpc]
            [10864.223611]  ptlrpc_import_recovery_state_machine+0x868/0x970 [ptlrpc]
            [10864.224735]  ptlrpc_connect_interpret+0x11f0/0x22d0 [ptlrpc]
            [10864.225717]  ? after_reply+0x8de/0xd30 [ptlrpc]
            [10864.226519]  ptlrpc_check_set+0x50c/0x21f0 [ptlrpc]
            [10864.227366]  ? schedule_timeout+0x173/0x390
            [10864.228119]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
            [10864.228953]  ptlrpcd+0x3d0/0x4c0 [ptlrpc]
            [10864.229660]  ? finish_wait+0x80/0x80
            [10864.230329]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
            [10864.231162]  kthread+0x112/0x130
            [10864.231747]  ? kthread_flush_work_fn+0x10/0x10
            [10864.232516]  ret_from_fork+0x35/0x40
            

            It looks like recovery has still not finished.

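            The trace shows the client stuck in lock replay (ldlm_replay_locks called from ptlrpc_import_recovery_state_machine), i.e. import recovery had not completed. A hedged way to confirm this from the client side is to inspect the MDC import state, which should read FULL once recovery is done; a minimal sketch, assuming the standard mdc.*.import parameter:

            # Anything other than "state: FULL" here means the import is still
            # connecting, recovering, or replaying.
            lctl get_param -n 'mdc.*.import' | grep 'state:'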

            adilger Andreas Dilger added a comment -

            YS, can you please look into why this test is failing? It only has problems with CentOS 7.8 and RHEL 8.1, not the regular CentOS 7.7 runs that are part of review testing.
            adilger Andreas Dilger added a comment -

            Some examples of recent failures, which are specific to CentOS 7.8 and RHEL 8.0:
            https://testing.whamcloud.com/test_sets/a85aeb21-b186-4e60-990c-cf89bb26d855
            https://testing.whamcloud.com/test_sets/75f7b5a2-05d8-4e7b-8ca7-6253da6a7add

            People

              Assignee: Yang Sheng
              Reporter: Maloo
              Votes: 0
              Watchers: 9
