Lustre / LU-10401

sanity test_133g: timeout during MDT mount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.0
    • Component/s: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run:

      Info required for matching: sanity 133g
      Info required for matching: sanity 133h

    Activity

            ys Yang Sheng added a comment -

            It is my fault. Since awk gets its input from stdin, FILENAME is '-'. But I would like to know why this test case exists; it looks like it only checks whether the get_param output ends with a '\n' character. Does that make sense?

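            For illustration, a minimal sketch (not taken from sanity.sh) of the FILENAME behaviour described above: when awk (GNU awk here) reads from a pipe, FILENAME is '-', while passing the path as an argument keeps the real name available for error messages. The path below is only an example.

            # awk reading from stdin: no real file name is available
            printf 'x' | awk 'END { print "checked:", FILENAME }'
            # -> checked: -

            # awk reading the file directly: FILENAME holds the real path
            awk 'END { print "checked:", FILENAME }' /etc/hostname
            # -> checked: /etc/hostname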

            adilger Andreas Dilger added a comment -

            After the latest patch landed, I'm seeing a couple of places where sanity test_133h is failing:
            https://testing.whamcloud.com/test_sets/0a1bd601-7de7-4031-974e-bc138ca14637
            https://testing.whamcloud.com/test_sets/9238d945-2814-403b-8ab4-eb96075b4cd4
            https://testing.whamcloud.com/test_sets/1073e4a6-5e88-410a-8c80-f0cb15585635
            https://testing.whamcloud.com/test_sets/0f75133d-b81d-483b-bfc8-2b90699c9fb9

            files do not end with newline: -
            

            The reported filename is always '-', so I suspect something is wrong with how the test was modified: either it is reporting an error incorrectly, or it is not printing the filename properly when a real error is hit. Two of the above failures were with review-ldiskfs-arm and two were on e2fsprogs tiny sessions. It isn't yet clear whether these are permanent or intermittent failures, though I suspect they are permanent for those configurations, as I can't see how this test could be racy.

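            A hedged sketch (a shell helper written for illustration, not the actual sanity.sh code) of how such a check can report the real path: read the file by name rather than through a pipe, and rely on command substitution stripping a trailing newline, so a non-empty result means the last byte is not '\n'. The path below is only an example.

            check_trailing_newline() {
                    local f=$1
                    # $(tail -c1) is empty when the last byte is '\n'
                    [ -s "$f" ] && [ -n "$(tail -c1 "$f")" ] &&
                            echo "file does not end with newline: $f"
                    return 0
            }
            check_trailing_newline /etc/hostname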

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38567/
            Subject: LU-10401 tests: add -F so list_param prints entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1c54733894f81e854363fbd2d49c141842f73ae4

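            For context, the -F flag makes list_param append a marker showing each entry's type, similar to ls -F ('/' for a directory, '@' for a symlink, '=' for a writable file). A hedged usage sketch; the exact parameter names depend on the filesystem and targets of the test system.

            # list client-side MDC parameters with their entry types marked
            lctl list_param -F mdc.*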

            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38679
            Subject: LU-10401 test: debug patch
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: a0a337f38866706829cd0a45b7ec63d97fc4406e

            ys Yang Sheng added a comment -

            Looks like it still exists. Investigating further.

            pjones Peter Jones added a comment -

            Hmm, so did the fix for LU-11761 (marked as included in 2.12.3) not work then?

            ys Yang Sheng added a comment -

            So, from the stack trace, this issue appears to be a duplicate of LU-11761.


            Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38567
            Subject: LU-10401 test: add parameter to print entry type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cd538d7db07365a3edd5ef9af5c260d22a673a4

            ys Yang Sheng added a comment -

            Hi Andreas,

            I have been working to find the root cause. The stranger thing is that a failover was triggered. It looks like some variable was exported incorrectly?

            cln..Failing mds1 on trevis-19vm4
            CMD: trevis-19vm4 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
            Stopping /mnt/lustre-mds1 (opts:) on trevis-19vm4
            CMD: trevis-19vm4 umount -d /mnt/lustre-mds1
            CMD: trevis-19vm4 lsmod | grep lnet > /dev/null &&
            lctl dl | grep ' ST ' || true
            

            Thanks,
            YangSheng

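            One possible way to check that hypothesis (not part of the test suite; the variable names below are only examples and may differ between test-framework versions) is to dump the failover-related variables the run inherited before sanity.sh starts:

            # show any failover/selection variables exported into the environment
            env | grep -E '^(FAILURE_MODE|ONLY|EXCEPT)=' || echo "none exported"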

            adilger Andreas Dilger added a comment -

            YS, can you please make a patch to increase the timeout, or speed up recovery, so this test passes consistently?

            ys Yang Sheng added a comment -

            From client log:

            [10864.205961] ptlrpcd_rcv     S    0  7446      2 0x80000080
            [10864.206882] Call Trace:
            [10864.207350]  ? __schedule+0x253/0x830
            [10864.208002]  schedule+0x28/0x70
            [10864.208566]  schedule_timeout+0x16b/0x390
            [10864.209283]  ? __next_timer_interrupt+0xc0/0xc0
            [10864.210087]  ptlrpc_set_wait+0x4ba/0x6e0 [ptlrpc]
            [10864.210904]  ? finish_wait+0x80/0x80
            [10864.211568]  ptlrpc_queue_wait+0x7e/0x210 [ptlrpc]
            [10864.212407]  fld_client_rpc+0x277/0x580 [fld]
            [10864.213238]  ? cfs_trace_unlock_tcd+0x2e/0x80 [libcfs]
            [10864.214125]  fld_client_lookup+0x254/0x470 [fld]
            [10864.214952]  lmv_fld_lookup+0x8c/0x420 [lmv]
            [10864.215712]  lmv_lock_match+0x7c/0x3f0 [lmv]
            [10864.216621]  ll_have_md_lock+0x169/0x3b0 [lustre]
            [10864.217447]  ? vsnprintf+0x101/0x520
            [10864.218092]  ll_md_blocking_ast+0x60d/0xbd0 [lustre]
            [10864.218976]  ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
            [10864.219868]  ? ldlm_lock_remove_from_lru_nolock+0x38/0xf0 [ptlrpc]
            [10864.220929]  ldlm_lock_cancel+0x55/0x1c0 [ptlrpc]
            [10864.221772]  ldlm_cli_cancel_list_local+0x8f/0x300 [ptlrpc]
            [10864.222740]  ldlm_replay_locks+0x662/0x850 [ptlrpc]
            [10864.223611]  ptlrpc_import_recovery_state_machine+0x868/0x970 [ptlrpc]
            [10864.224735]  ptlrpc_connect_interpret+0x11f0/0x22d0 [ptlrpc]
            [10864.225717]  ? after_reply+0x8de/0xd30 [ptlrpc]
            [10864.226519]  ptlrpc_check_set+0x50c/0x21f0 [ptlrpc]
            [10864.227366]  ? schedule_timeout+0x173/0x390
            [10864.228119]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
            [10864.228953]  ptlrpcd+0x3d0/0x4c0 [ptlrpc]
            [10864.229660]  ? finish_wait+0x80/0x80
            [10864.230329]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
            [10864.231162]  kthread+0x112/0x130
            [10864.231747]  ? kthread_flush_work_fn+0x10/0x10
            [10864.232516]  ret_from_fork+0x35/0x40
            

            Looks like recovery still has not finished.

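            A hedged way to confirm that (these parameters exist in lctl, but the exact output varies by version): check the recovery status on the MDS and the import state on the client.

            # on the MDS: has MDT recovery completed?
            lctl get_param mdt.*.recovery_status
            # on the client: current state of the MDC import
            lctl get_param mdc.*.import | grep state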

            People

              Assignee: Yang Sheng
              Reporter: Maloo
              Votes: 0
              Watchers: 9
