[LU-10401] sanity test_133g: timeout during MDT mount Created: 15/Dec/17  Updated: 24/Jun/21  Resolved: 30/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-11761 blocked MDT mount and high cpu usage ... Resolved
is duplicated by LU-13613 sanity/133h fails Resolved
Related
is related to LU-9864 sanity: test_103b failed at ldlm_reso... Open
is related to LU-8130 Migrate from libcfs hash to rhashtable Open
is related to LU-13558 upgrade e2fsprogs to 1.45.5 Resolved
is related to LU-7688 sanity test_133g: test failed to resp... Resolved
is related to LU-14789 sanity 133f and 133g are no longer ef... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run:

Info required for matching: sanity 133g
Info required for matching: sanity 133h



 Comments   
Comment by Andreas Dilger [ 27/Nov/19 ]

+1 on b2_12:
https://testing.whamcloud.com/test_sets/1b84726a-10d7-11ea-a9d7-52540065bddc

Comment by Chris Horn [ 12/Dec/19 ]

+1 on master: https://testing.whamcloud.com/test_sets/1235bfca-16f4-11ea-98f1-52540065bddc

Comment by Jian Yu [ 15/Apr/20 ]

+1 on b2_12:
https://testing.whamcloud.com/test_sets/98320413-d7e1-4cb1-9c78-20b4d126575f

Comment by Gerrit Updater [ 01/May/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38444
Subject: LU-10401 tests: fix error from 'tr -d='
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b18389f3f14d2c88268ec5da15a76295cda30c67

Comment by Andreas Dilger [ 01/May/20 ]

There are error messages printed during the test run:

CMD: trevis-64vm4 /usr/sbin/lctl list_param -R '*' | grep '=' |
				tr -d= | egrep -v 'force_lbug|changelog_mask' |
				xargs badarea_io
trevis-64vm4: tr: invalid option -- '='
trevis-64vm4: Try 'tr --help' for more information.

It isn't clear whether this is causing the test to fail, or is something spurious, but running RHEL7.8 is failing sanity test_133g repeatedly.

Comment by Andreas Dilger [ 02/May/20 ]

I haven't been able to get RHEL7.8 to pass at this point.

Comment by Gerrit Updater [ 02/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38444/
Subject: LU-10401 tests: fix error from 'tr -d='
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 52b5f4a5c3fc942f2b6aef9dbed780bd2c2a6798

Comment by Peter Jones [ 02/May/20 ]

Landed for 2.14

Comment by Andreas Dilger [ 03/May/20 ]

It's not clear if this patch will solve the failures, it it might. I'd like to keep this issue open until I can get a passing result from e2fsprogs. Unfortunately, there is no way to test e2fsprogs in autotest against a specific patch, so this had to be landed before I could retry with e2fsprogs.

Comment by Andreas Dilger [ 06/May/20 ]

Some examples of recent failures, which are specific to CentOS 7.8 and RHEL 8.0:
https://testing.whamcloud.com/test_sets/a85aeb21-b186-4e60-990c-cf89bb26d855
https://testing.whamcloud.com/test_sets/75f7b5a2-05d8-4e7b-8ca7-6253da6a7add

Comment by Andreas Dilger [ 06/May/20 ]

YS, can you please look into why this test is failing. It only has problems with CentOS 7.8 and RHEL 8.1, not the regular CentOS 7.7 runs that are part of review testing.

Comment by Yang Sheng [ 07/May/20 ]

From client log:

[10864.205961] ptlrpcd_rcv     S    0  7446      2 0x80000080
[10864.206882] Call Trace:
[10864.207350]  ? __schedule+0x253/0x830
[10864.208002]  schedule+0x28/0x70
[10864.208566]  schedule_timeout+0x16b/0x390
[10864.209283]  ? __next_timer_interrupt+0xc0/0xc0
[10864.210087]  ptlrpc_set_wait+0x4ba/0x6e0 [ptlrpc]
[10864.210904]  ? finish_wait+0x80/0x80
[10864.211568]  ptlrpc_queue_wait+0x7e/0x210 [ptlrpc]
[10864.212407]  fld_client_rpc+0x277/0x580 [fld]
[10864.213238]  ? cfs_trace_unlock_tcd+0x2e/0x80 [libcfs]
[10864.214125]  fld_client_lookup+0x254/0x470 [fld]
[10864.214952]  lmv_fld_lookup+0x8c/0x420 [lmv]
[10864.215712]  lmv_lock_match+0x7c/0x3f0 [lmv]
[10864.216621]  ll_have_md_lock+0x169/0x3b0 [lustre]
[10864.217447]  ? vsnprintf+0x101/0x520
[10864.218092]  ll_md_blocking_ast+0x60d/0xbd0 [lustre]
[10864.218976]  ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
[10864.219868]  ? ldlm_lock_remove_from_lru_nolock+0x38/0xf0 [ptlrpc]
[10864.220929]  ldlm_lock_cancel+0x55/0x1c0 [ptlrpc]
[10864.221772]  ldlm_cli_cancel_list_local+0x8f/0x300 [ptlrpc]
[10864.222740]  ldlm_replay_locks+0x662/0x850 [ptlrpc]
[10864.223611]  ptlrpc_import_recovery_state_machine+0x868/0x970 [ptlrpc]
[10864.224735]  ptlrpc_connect_interpret+0x11f0/0x22d0 [ptlrpc]
[10864.225717]  ? after_reply+0x8de/0xd30 [ptlrpc]
[10864.226519]  ptlrpc_check_set+0x50c/0x21f0 [ptlrpc]
[10864.227366]  ? schedule_timeout+0x173/0x390
[10864.228119]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[10864.228953]  ptlrpcd+0x3d0/0x4c0 [ptlrpc]
[10864.229660]  ? finish_wait+0x80/0x80
[10864.230329]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[10864.231162]  kthread+0x112/0x130
[10864.231747]  ? kthread_flush_work_fn+0x10/0x10
[10864.232516]  ret_from_fork+0x35/0x40

Looks like recovery still not finished.

Comment by Andreas Dilger [ 09/May/20 ]

YS, can you please make a patch to increase the timeout, or speed up recovery, so this test will pass consistently.

Comment by Yang Sheng [ 09/May/20 ]

Hi, Andreas,

I have working to find the root cause. The more stranger thing is that failover was triggered. Looks like some var was exported wrongly?

cln..Failing mds1 on trevis-19vm4
CMD: trevis-19vm4 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Stopping /mnt/lustre-mds1 (opts:) on trevis-19vm4
CMD: trevis-19vm4 umount -d /mnt/lustre-mds1
CMD: trevis-19vm4 lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true

Thanks,
YangSheng

Comment by Gerrit Updater [ 11/May/20 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38567
Subject: LU-10401 test: add parameter to print entry type
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1cd538d7db07365a3edd5ef9af5c260d22a673a4

Comment by Yang Sheng [ 20/May/20 ]

So from stacktrace, This issue should duplicate with LU-11761.

Comment by Peter Jones [ 20/May/20 ]

Hmm so did the fix for LU-11761 (marked as included in 2.12.3) not work then?

Comment by Yang Sheng [ 20/May/20 ]

Looks like it still exist. Further investigating.

Comment by Gerrit Updater [ 20/May/20 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38679
Subject: LU-10401 test: debug patch
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: a0a337f38866706829cd0a45b7ec63d97fc4406e

Comment by Gerrit Updater [ 21/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38567/
Subject: LU-10401 tests: add -F so list_param prints entry type
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1c54733894f81e854363fbd2d49c141842f73ae4

Comment by Andreas Dilger [ 21/May/20 ]

After the latest patch landing, I'm seeing a couple of places where sanity test_133h is failing:
https://testing.whamcloud.com/test_sets/0a1bd601-7de7-4031-974e-bc138ca14637
https://testing.whamcloud.com/test_sets/9238d945-2814-403b-8ab4-eb96075b4cd4
https://testing.whamcloud.com/test_sets/1073e4a6-5e88-410a-8c80-f0cb15585635
https://testing.whamcloud.com/test_sets/0f75133d-b81d-483b-bfc8-2b90699c9fb9

files do not end with newline: -

It is always '-' as the reported filename, so I suspect there is something wrong with how the test was modified, either reporting an error incorrectly, or not printing out the filename properly when a real error is hit. Two of the above failures were with review-ldiskfs-arm and two were on e2fsprogs tiny sessions. It isn't yet clear if these are permanent or intermittent failures, though I suspect they are permanent failures for those configurations, as I can't imagine how this test would be racy.

Comment by Yang Sheng [ 21/May/20 ]

It is my fault. Since awk get input from stdin, So the FILENAME is '-'. But i want to know why this test case is exist? Looks like it only check whether output a '\n' char for get_param. Does it make sense?

Comment by Andreas Dilger [ 22/May/20 ]

YS, can you please push a patch to fix FILENAME so we can see which filename is having the problem.

Comment by Gerrit Updater [ 22/May/20 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38699
Subject: LU-10401 tests: print correct filename while failed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: beba18261bb3f7b1f0bd1b18163dc80b8b9414a5

Comment by Yang Sheng [ 22/May/20 ]

From log:

[ 8673.618423] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param osc.lustre-OST0006-osc-ffff8b2112ff4000.rpc_stats | awk -v FS=\v -v RS=\v\v 'END { if(NR>0 && $NF !~ /.*\n$/) print FILENAME}'
[ 8673.637811] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param timeout | awk -v FS=\v -v RS=\v\v 'END { if(NR>0 && $NF !~ /.*\n$/) print FILENAME}'
[ 8673.656550] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param console_backoff | awk -v FS=\v -v RS=\v\v 'END { if(NR>0 && $NF !~ /.*\n$/) print FILENAME}'
[ 8673.675441] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param console_max_delay_centisecs | awk -v FS=\v -v RS=\v\v 'END { if(NR>0 && $NF !~ /.*\n$/) print FILENAME}'
[ 8674.097330] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133h: @@@@@@ FAIL: files do not end with newline: - 
[ 8674.515181] Lustre: DEBUG MARKER: sanity test_133h: @@@@@@ FAIL: files do not end with newline: -
Comment by Arshad Hussain [ 23/May/20 ]

+1 On Master : https://testing.whamcloud.com/test_sets/619d15cb-a12b-489b-a085-cb473ad81772

Comment by Bruno Faccini (Inactive) [ 28/May/20 ]

After rebasing with current master, seems that patch-set #32 of my change @ https://review.whamcloud.com/#/c/35856/32 triggers LU-10401 almost solid !!!! And according to this weeks stats for sanity/test_133h a lot of others auto-testing sessions are also affected.
Could the priority for this issue be raised or at least sanity/test_133h been skipped ?

Comment by Peter Jones [ 28/May/20 ]

I literally just hit post on a comment in Gerrit about this. There is a patch - https://review.whamcloud.com/#/c/38699/ - that you could try rebasing on top of

Comment by Bruno Faccini (Inactive) [ 28/May/20 ]

Peter, thanks for your update. But looking at #38699, it looks as a pure debug patch. And according to the current failure rate of sanity/test_133h it should be easy for Yang Sheng to get the associated debug infos now.

Comment by Yang Sheng [ 28/May/20 ]

Hi, Bruno,

The latest 38699 is a fixing patch rather than debug patch. So please rebase your patch on top of it to overcome sanity-133h failure.

Thanks,
YangSheng

Comment by Peter Jones [ 28/May/20 ]

I am just going on that this has been used successfully for a couple of other patches that hit these failures. This patch is in master-next ATM so should land shortly.

Comment by Bruno Faccini (Inactive) [ 28/May/20 ]

Oops my bad, Peter and Yang Sheng, I had missed the fix part of #38699. Thanks for your help.

Comment by Gerrit Updater [ 30/May/20 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/38699/
Subject: LU-10401 procs: print new line based on distro
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0130d7019534611ff05a29fad657226d2eb6b79e

Comment by Andreas Dilger [ 30/May/20 ]

Patch has landed to master for 2.14. Patches will need to be rebased if they are failing because of this issue.

Generated at Sat Feb 10 02:34:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.