[LU-10401] sanity test_133g: timeout during MDT mount Created: 15/Dec/17 Updated: 24/Jun/21 Resolved: 30/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>.
This issue relates to the following test suite run:
Info required for matching: sanity 133g |
| Comments |
| Comment by Andreas Dilger [ 27/Nov/19 ] |
|
+1 on b2_12: |
| Comment by Chris Horn [ 12/Dec/19 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/1235bfca-16f4-11ea-98f1-52540065bddc |
| Comment by Jian Yu [ 15/Apr/20 ] |
|
+1 on b2_12: |
| Comment by Gerrit Updater [ 01/May/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38444 |
| Comment by Andreas Dilger [ 01/May/20 ] |
|
There are error messages printed during the test run:
CMD: trevis-64vm4 /usr/sbin/lctl list_param -R '*' | grep '=' | tr -d= | egrep -v 'force_lbug|changelog_mask' | xargs badarea_io
trevis-64vm4: tr: invalid option -- '='
trevis-64vm4: Try 'tr --help' for more information.
It isn't clear whether this is causing the test to fail or is something spurious, but runs on RHEL 7.8 are failing sanity test_133g repeatedly.
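For reference, the `tr: invalid option -- '='` message is a shell quoting problem rather than a Lustre one: GNU tr parses `-d=` as a cluster of single-letter options, so the `=` must be a separate (quoted) operand. A minimal reproduction and fix (a sketch, not the actual sanity.sh code):

```
$ echo 'a=b' | tr -d=       # '=' is parsed as an option character
tr: invalid option -- '='
$ echo 'a=b' | tr -d '='    # quoted operand: delete '=' characters
ab
```
|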
| Comment by Andreas Dilger [ 02/May/20 ] |
|
I haven't been able to get RHEL 7.8 to pass at this point. |
| Comment by Gerrit Updater [ 02/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38444/ |
| Comment by Peter Jones [ 02/May/20 ] |
|
Landed for 2.14 |
| Comment by Andreas Dilger [ 03/May/20 ] |
|
It's not clear if this patch will solve the failures, but it might. I'd like to keep this issue open until I can get a passing result from e2fsprogs. Unfortunately, there is no way to test e2fsprogs in autotest against a specific patch, so this had to be landed before I could retry with e2fsprogs. |
| Comment by Andreas Dilger [ 06/May/20 ] |
|
Some examples of recent failures, which are specific to CentOS 7.8 and RHEL 8.0: |
| Comment by Andreas Dilger [ 06/May/20 ] |
|
YS, can you please look into why this test is failing? It only has problems with CentOS 7.8 and RHEL 8.1, not the regular CentOS 7.7 runs that are part of review testing. |
| Comment by Yang Sheng [ 07/May/20 ] |
|
From the client log:
[10864.205961] ptlrpcd_rcv S 0 7446 2 0x80000080
[10864.206882] Call Trace:
[10864.207350] ? __schedule+0x253/0x830
[10864.208002] schedule+0x28/0x70
[10864.208566] schedule_timeout+0x16b/0x390
[10864.209283] ? __next_timer_interrupt+0xc0/0xc0
[10864.210087] ptlrpc_set_wait+0x4ba/0x6e0 [ptlrpc]
[10864.210904] ? finish_wait+0x80/0x80
[10864.211568] ptlrpc_queue_wait+0x7e/0x210 [ptlrpc]
[10864.212407] fld_client_rpc+0x277/0x580 [fld]
[10864.213238] ? cfs_trace_unlock_tcd+0x2e/0x80 [libcfs]
[10864.214125] fld_client_lookup+0x254/0x470 [fld]
[10864.214952] lmv_fld_lookup+0x8c/0x420 [lmv]
[10864.215712] lmv_lock_match+0x7c/0x3f0 [lmv]
[10864.216621] ll_have_md_lock+0x169/0x3b0 [lustre]
[10864.217447] ? vsnprintf+0x101/0x520
[10864.218092] ll_md_blocking_ast+0x60d/0xbd0 [lustre]
[10864.218976] ldlm_cancel_callback+0x7b/0x250 [ptlrpc]
[10864.219868] ? ldlm_lock_remove_from_lru_nolock+0x38/0xf0 [ptlrpc]
[10864.220929] ldlm_lock_cancel+0x55/0x1c0 [ptlrpc]
[10864.221772] ldlm_cli_cancel_list_local+0x8f/0x300 [ptlrpc]
[10864.222740] ldlm_replay_locks+0x662/0x850 [ptlrpc]
[10864.223611] ptlrpc_import_recovery_state_machine+0x868/0x970 [ptlrpc]
[10864.224735] ptlrpc_connect_interpret+0x11f0/0x22d0 [ptlrpc]
[10864.225717] ? after_reply+0x8de/0xd30 [ptlrpc]
[10864.226519] ptlrpc_check_set+0x50c/0x21f0 [ptlrpc]
[10864.227366] ? schedule_timeout+0x173/0x390
[10864.228119] ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[10864.228953] ptlrpcd+0x3d0/0x4c0 [ptlrpc]
[10864.229660] ? finish_wait+0x80/0x80
[10864.230329] ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[10864.231162] kthread+0x112/0x130
[10864.231747] ? kthread_flush_work_fn+0x10/0x10
[10864.232516] ret_from_fork+0x35/0x40
It looks like recovery has still not finished. |
| Comment by Andreas Dilger [ 09/May/20 ] |
|
YS, can you please make a patch to increase the timeout, or speed up recovery, so this test will pass consistently? |
| Comment by Yang Sheng [ 09/May/20 ] |
|
Hi Andreas, I am working to find the root cause. What is stranger is that a failover was triggered. It looks like some variable was exported wrongly?
cln..Failing mds1 on trevis-19vm4
CMD: trevis-19vm4 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Stopping /mnt/lustre-mds1 (opts:) on trevis-19vm4
CMD: trevis-19vm4 umount -d /mnt/lustre-mds1
CMD: trevis-19vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
Thanks, |
| Comment by Gerrit Updater [ 11/May/20 ] |
|
Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38567 |
| Comment by Yang Sheng [ 20/May/20 ] |
|
From the stack trace, this issue appears to be a duplicate of |
| Comment by Peter Jones [ 20/May/20 ] |
|
Hmm so did the fix for |
| Comment by Yang Sheng [ 20/May/20 ] |
|
It looks like it still exists. Investigating further. |
| Comment by Gerrit Updater [ 20/May/20 ] |
|
Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38679 |
| Comment by Gerrit Updater [ 21/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38567/ |
| Comment by Andreas Dilger [ 21/May/20 ] |
|
After the latest patch landed, I'm seeing a couple of places where sanity test_133h is failing with:
files do not end with newline: -
The reported filename is always '-', so I suspect there is something wrong with how the test was modified: either it reports an error incorrectly, or it does not print the filename properly when a real error is hit. Two of the above failures were with review-ldiskfs-arm and two were on e2fsprogs tiny sessions. It isn't yet clear whether these are permanent or intermittent failures, though I suspect they are permanent for those configurations, as I can't imagine how this test could be racy. |
| Comment by Yang Sheng [ 21/May/20 ] |
|
It is my fault. Since awk gets its input from stdin, FILENAME is '-'. But I want to know why this test case exists. It looks like it only checks whether the get_param output ends with a '\n' character. Does that make sense?
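For reference, a minimal illustration of that awk behavior (assuming GNU awk; the file name below is just an example):

```
# When awk reads from a pipe or stdin there is no input file,
# so the built-in FILENAME variable is set to "-":
echo foo | awk 'END { print FILENAME }'        # prints: -

# With a file passed as an argument, FILENAME is the real name:
awk 'END { print FILENAME }' /etc/hostname     # prints: /etc/hostname
```
|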
| Comment by Andreas Dilger [ 22/May/20 ] |
|
YS, can you please push a patch to fix FILENAME so we can see which file is having the problem? |
| Comment by Gerrit Updater [ 22/May/20 ] |
|
Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38699 |
| Comment by Yang Sheng [ 22/May/20 ] |
|
From the log:
[ 8673.618423] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param osc.lustre-OST0006-osc-ffff8b2112ff4000.rpc_stats | awk -v FS=\v -v RS=\v\v 'END { if(NR>0 && $NF !~ /.*\n$/) print FILENAME}'
[ 8673.637811] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param timeout | awk -v FS=\v -v RS=\v\v 'END { if(NR>0 && $NF !~ /.*\n$/) print FILENAME}'
[ 8673.656550] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param console_backoff | awk -v FS=\v -v RS=\v\v 'END { if(NR>0 && $NF !~ /.*\n$/) print FILENAME}'
[ 8673.675441] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param console_max_delay_centisecs | awk -v FS=\v -v RS=\v\v 'END { if(NR>0 && $NF !~ /.*\n$/) print FILENAME}'
[ 8674.097330] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_133h: @@@@@@ FAIL: files do not end with newline: -
[ 8674.515181] Lustre: DEBUG MARKER: sanity test_133h: @@@@@@ FAIL: files do not end with newline: -
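For reference, a sketch of how the check could report the parameter name instead of relying on awk's FILENAME (which is always '-' when reading from a pipe). This is an illustration under my own naming, not necessarily what patch 38699 does; `timeout` is just an example parameter:

```
# Slurp the whole output as one record (RS set to an unlikely string)
# and report the parameter name if the output lacks a trailing newline.
param="timeout"
lctl get_param "$param" |
    awk -v RS='\v\v' -v p="$param" 'END { if (NR > 0 && $0 !~ /\n$/) print p }'
```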
|
| Comment by Arshad Hussain [ 23/May/20 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/619d15cb-a12b-489b-a085-cb473ad81772 |
| Comment by Bruno Faccini (Inactive) [ 28/May/20 ] |
|
After rebasing onto current master, it seems that patch-set #32 of my change at https://review.whamcloud.com/#/c/35856/32 triggers |
| Comment by Peter Jones [ 28/May/20 ] |
|
I literally just hit post on a comment in Gerrit about this. There is a patch - https://review.whamcloud.com/#/c/38699/ - that you could try rebasing on top of. |
| Comment by Bruno Faccini (Inactive) [ 28/May/20 ] |
|
Peter, thanks for your update. But looking at #38699, it looks like a pure debug patch. And given the current failure rate of sanity/test_133h, it should be easy for Yang Sheng to get the associated debug info now. |
| Comment by Yang Sheng [ 28/May/20 ] |
|
Hi Bruno, the latest 38699 is a fix rather than a debug patch, so please rebase your patch on top of it to overcome the sanity/test_133h failure. Thanks, |
| Comment by Peter Jones [ 28/May/20 ] |
|
I am just going on the fact that this has been used successfully for a couple of other patches that hit these failures. The patch is in master-next at the moment, so it should land shortly. |
| Comment by Bruno Faccini (Inactive) [ 28/May/20 ] |
|
Oops my bad, Peter and Yang Sheng, I had missed the fix part of #38699. Thanks for your help. |
| Comment by Gerrit Updater [ 30/May/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/38699/ |
| Comment by Andreas Dilger [ 30/May/20 ] |
|
The patch has landed on master for 2.14. Patches that are failing because of this issue will need to be rebased. |