[LU-480] Test failure on test suite sanityn: LBUG/LASSERT detected Created: 04/Jul/11  Updated: 07/Jul/11  Resolved: 07/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6590

 Description   

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/8771913e-a451-11e0-b786-52540025f9af.

The sub-test test_36 failed with the following error:

LBUG/LASSERT detected



 Comments   
Comment by Jian Yu [ 04/Jul/11 ]

The issue occurred on many sanityn sub-tests:
https://maloo.whamcloud.com/test_sets/419609a4-a4fd-11e0-b786-52540025f9af
https://maloo.whamcloud.com/test_sets/bda31ee8-a204-11e0-aee5-52540025f9af
https://maloo.whamcloud.com/test_sets/3791c1d0-a1df-11e0-aee5-52540025f9af

Comment by Zhenyu Xu [ 04/Jul/11 ]

Chris,

What is the pdsh full command being used in the autotest?

Comment by Chris Gearing (Inactive) [ 04/Jul/11 ]

You'll have to give me a bit more info so that I can understand that
question. Which pdsh command, and when? If you are asking about the pdsh command
used for test 41g, then it will be whatever the standard scripts
create. The test environment does not change or customize it.

What other output could help you?

Comment by Zhenyu Xu [ 04/Jul/11 ]

Those errors were due to the pdsh command timing out, and the pdsh command, IMO, is set when we call auster.sh.

Yu Jian points out that there is an autotest_config.sh or similar script, which she saw in

Provisioning-1.Provision_1.autotest.client-21vm1.log:
pdsh -w client-21vm1,client-21vm2 scp client-21vm1:/root/autotest_config.sh /usr/lib64/lustre/tests/cfg/

I suggest we add the "-t 120" option to pdsh in it, like "pdsh -t 120 -w xxxxxxxxxxxxxxxx" (the host list must directly follow -w). By default pdsh uses a 10-second timeout; -t 120 extends it to 120 seconds.
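
As a rough illustration of the suggestion above, the pdsh prefix could be assembled as follows. This is only a sketch: the variable names PDSH_CONNECT_TIMEOUT, NODES and PDSH_CMD are hypothetical and do not come from the autotest scripts. Per the pdsh man page, -t sets the connect timeout (default 10 seconds), while a separate -u option limits how long the remote command may run.

```shell
# Hypothetical variable names, for illustration only.
PDSH_CONNECT_TIMEOUT=${PDSH_CONNECT_TIMEOUT:-120}   # seconds, passed to pdsh -t
NODES="client-21vm1,client-21vm2"                   # hosts taken from the provisioning log

# -w takes the host list as its argument, so -t and its value go before -w.
PDSH_CMD="pdsh -t $PDSH_CONNECT_TIMEOUT -w $NODES"

# Print the full command line instead of running it, so the sketch works without pdsh installed.
echo "$PDSH_CMD hostname -s"
```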

Comment by Jian Yu [ 04/Jul/11 ]

Per this provisioning log: https://maloo.whamcloud.com/test_logs/bcb00110-a450-11e0-b786-52540025f9af/show_text

12:00:25:(/home/autotest/lab/tools/run_test.sh:55): 
12:00:25:cat /home/autotest/lab/tools/mecturk.sh
12:00:25:((/home/autotest/lab/tools/run_test.sh:57): 
12:00:25:hostname -s
12:00:25:(/home/autotest/lab/tools/run_test.sh:57): 
12:00:25:CPYLINE='scp client-21vm1:/root/autotest_config.sh /usr/lib64/lustre/tests/cfg/'
12:00:27:(/home/autotest/lab/tools/run_test.sh:58): 
12:00:27:pdsh -w client-21vm1,client-21vm2 scp client-21vm1:/root/autotest_config.sh /usr/lib64/lustre/tests/cfg/

The PDSH variable was set in /home/autotest/lab/tools/mecturk.sh:

$ grep PDSH /home/autotest/lab/tools/mecturk.sh 
PDSH="pdsh -S -Rrsh -w"

Chris, could you please change it to "pdsh -t 120 -S -Rrsh -w" to see whether the upcoming autotest results hit those timeout issues or not?
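
One possible way to apply the requested change, shown against a temporary stand-in file rather than the real /home/autotest/lab/tools/mecturk.sh. This is a sketch that assumes the PDSH assignment sits on a single line exactly as in the grep output above; the variable names cfg and result are hypothetical.

```shell
# Temporary stand-in for mecturk.sh (assumption: one PDSH= line, as in the grep output).
cfg=$(mktemp)
echo 'PDSH="pdsh -S -Rrsh -w"' > "$cfg"

# Add "-t 120" right after "pdsh", but only if no -t option is already present.
if ! grep -q '^PDSH="pdsh .*-t ' "$cfg"; then
    sed 's/^PDSH="pdsh /PDSH="pdsh -t 120 /' "$cfg" > "$cfg.new" && mv "$cfg.new" "$cfg"
fi

# Show the resulting assignment, then clean up.
result=$(grep '^PDSH=' "$cfg")
echo "$result"
rm -f "$cfg"
```

Running the edit a second time leaves the line unchanged, because the grep guard detects the existing -t option.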

Comment by Chris Gearing (Inactive) [ 04/Jul/11 ]

OK, I've changed the pdsh setting in mecturk; let's see if it stops failing.

Comment by Peter Jones [ 07/Jul/11 ]

Closing for now. Reopen if this reoccurs with the change in place.

Generated at Sat Feb 10 01:07:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.