[LU-427] Test failure on test suite lfsck Created: 17/Jun/11 Updated: 09/May/12 Resolved: 16/Feb/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.2, Lustre 1.8.8 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 4722 | ||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/e58b60d2-98ae-11e0-9a27-52540025f9af. |
| Comments |
| Comment by Peter Jones [ 17/Jun/11 ] |
|
Yang Sheng will look into this one |
| Comment by Peter Jones [ 25/Jun/11 ] |
|
Yang Sheng, any progress to report? This came up during 2.1 testing, so we would like to understand it. Thanks, Peter |
| Comment by Yang Sheng [ 27/Jun/11 ] |
|
Hi Peter, I haven't found any obvious error in the log files. It looks hard to reproduce. |
| Comment by Peter Jones [ 13/Jul/11 ] |
|
Sarah, you have hit this same error (reported as LU-439) for the last two 2.1 builds. Are you able to provide some more information to enable Yang Sheng to move forward on fixing it? Peter |
| Comment by Sarah Liu [ 13/Jul/11 ] |
|
Hi Yang Sheng, this bug can be reproduced on both the latest rhel5 and rhel6 x86_64 (build #201) with quota enabled. Please see the attachments for more logs. If you need anything else, just let me know. |
| Comment by Yang Sheng [ 13/Jul/11 ] |
|
Many thanks for the information, Sarah. I'll try to reproduce it on a VM. |
| Comment by Yang Sheng [ 15/Jul/11 ] |
|
Hi Sarah, could you please attach '/tmp/mdsdb' here if you hit it again? TIA. |
| Comment by Sarah Liu [ 18/Jul/11 ] |
|
Please find mdsdb attached. |
| Comment by Yang Sheng [ 19/Jul/11 ] |
|
I can't find anything wrong in this mdsdb file. VERSION=3, and this file can be used for a manual check:
[root@localhost git]# e2fsck -n --mdsdb mdsdb --ostdb odb /tmp/lustre-ost1 |
| Comment by Yang Sheng [ 27/Jul/11 ] |
|
This is a test environment issue. /tmp should be a directory shared between the test nodes, but it isn't. I'll investigate that further. Thanks to Sarah for providing the test nodes. |
| Comment by Yang Sheng [ 27/Jul/11 ] |
|
Hi Sarah, you can set SHARED_DIRECTORY to a shared directory to avoid the lfsck test failure. Also, I don't think this is a regression. |
| Comment by Sarah Liu [ 27/Jul/11 ] |
|
Hi Yang Sheng, /scratch on Toro is the shared directory which I have used for all tests for a long time. I think the script should be flexible enough to use a shared directory other than /tmp.
[root@fat-intel-4 scratch]# touch aaa |
| Comment by Yang Sheng [ 02/Aug/11 ] |
|
The main problem is that lfsck invokes generate_db() to check whether a directory is shared or not. It should stop when /tmp isn't shared. |
| Comment by Sarah Liu [ 02/Aug/11 ] |
|
Hi Yang Sheng, I think the problem is in lfsck.sh: it invokes init_test_env before it reads the configuration file. init_test_env exports SHARED_DIRECTORY, MDSDB and OSTDB with their default values (/tmp, /tmp/mdsdb and /tmp/ostdb), while in the config file I explicitly set /scratch as the SHARED_DIRECTORY. That is why generate_db() doesn't quit: at that time, SHARED_DIRECTORY is set to the correct value, but MDSDB and OSTDB still have the "wrong" values. |
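The ordering problem described above can be reproduced in isolation. The following is a minimal sketch, not the actual lfsck.sh or test-framework code: `init_defaults` and `load_config` are hypothetical stand-ins for init_test_env and for sourcing a config file that sets only SHARED_DIRECTORY to /scratch.

```shell
#!/bin/bash
# Hypothetical stand-in for init_test_env(): fills in defaults only for
# variables that are still unset, using the lines quoted in this ticket.
init_defaults() {
    export SHARED_DIRECTORY=${SHARED_DIRECTORY:-"/tmp"}
    export MDSDB=${MDSDB:-$SHARED_DIRECTORY/mdsdb}
    export OSTDB=${OSTDB:-$SHARED_DIRECTORY/ostdb}
}

# Hypothetical stand-in for sourcing a site config file that only
# overrides SHARED_DIRECTORY.
load_config() {
    SHARED_DIRECTORY=/scratch
}

# Order reported for lfsck.sh: defaults first, config file second.
# Runs in a subshell so the caller's environment is left untouched.
defaults_then_config() {
    (
        unset SHARED_DIRECTORY MDSDB OSTDB
        init_defaults
        load_config
        echo "$SHARED_DIRECTORY $MDSDB $OSTDB"
    )
}

# Reversed order: config file first, defaults second.
config_then_defaults() {
    (
        unset SHARED_DIRECTORY MDSDB OSTDB
        load_config
        init_defaults
        echo "$SHARED_DIRECTORY $MDSDB $OSTDB"
    )
}

defaults_then_config   # prints: /scratch /tmp/mdsdb /tmp/ostdb
config_then_defaults   # prints: /scratch /scratch/mdsdb /scratch/ostdb
```

In the first ordering, MDSDB and OSTDB are derived from /tmp before the config file changes SHARED_DIRECTORY, which matches the mismatch described in this comment.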
| Comment by Yang Sheng [ 04/Aug/11 ] |
|
Patch uploaded to: http://review.whamcloud.com/1180 |
| Comment by Jian Yu [ 09/Aug/11 ] |
|
If LFSCK_ALWAYS="yes", then lfsck would be run after each test suite which calls check_and_cleanup_lustre(). So, re-defining MDSDB and OSTDB only in lfsck.sh is not enough. I think maybe we could move the definitions of the following variables from init_test_env() to cfg/local.sh:
# This is used by a small number of tests to share state between the client
# running the tests, or in some cases between the servers (e.g. lfsck.sh).
# It needs to be a non-lustre filesystem that is available on all the nodes.
export SHARED_DIRECTORY=${SHARED_DIRECTORY:-"/tmp"}
export MDSDB=${MDSDB:-$SHARED_DIRECTORY/mdsdb}
export OSTDB=${OSTDB:-$SHARED_DIRECTORY/ostdb}
And remove the following lines from cfg/ncli.sh since it sources cfg/local.sh:
# This is used by a small number of tests to share state between the client
# running the tests, or in some cases between the servers (e.g. lfsck.sh).
# It needs to be a non-lustre filesystem that is available on all the nodes.
SHARED_DIRECTORY=${SHARED_DIRECTORY:-""} # bug 17839 comment 65
We also need to add a function to verify that $SHARED_DIRECTORY is a real shared directory among the test nodes in the Lustre cluster, and replace the code (lfsck.sh, recovery-*-scale.sh, etc.) which simply checks whether $SHARED_DIRECTORY is empty or not. What do you think of these changes, Sarah and Yang Sheng? |
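One possible shape for such a check is sketched below. It is only an illustration under stated assumptions: `run_on_node` and `is_shared_dir` are hypothetical names, remote execution is assumed to go through plain ssh rather than the test framework's own helpers, and the node names in the usage note are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: run a command string on a node. Plain ssh is an
# assumption here; it is not the framework's remote-execution mechanism.
run_on_node() {
    local target=$1
    shift
    ssh "$target" "$@"
}

# Return 0 only if a marker file created in the directory on the first
# node is visible from every node in the list. An empty directory name
# or an empty node list is treated as "not shared".
is_shared_dir() {
    local dir=$1
    shift
    [ -n "$dir" ] && [ $# -gt 0 ] || return 1

    local first=$1
    local marker="$dir/.shared_check.$$"
    local rc=0
    local node

    # Create the marker on one node only.
    run_on_node "$first" "touch '$marker'" || return 1

    # Every node must be able to see the same file.
    for node in "$@"; do
        run_on_node "$node" "test -e '$marker'" || rc=1
    done

    # Clean up regardless of the result.
    run_on_node "$first" "rm -f '$marker'"
    return $rc
}
```

A caller could then write something like `is_shared_dir "$SHARED_DIRECTORY" node1 node2 || exit 1` instead of only testing whether the variable is empty.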
| Comment by Yang Sheng [ 09/Aug/11 ] |
Regarding moving the definitions of SHARED_DIRECTORY, MDSDB and OSTDB from init_test_env() to cfg/local.sh and removing the SHARED_DIRECTORY line from cfg/ncli.sh: if MDSDB and OSTDB are defined in cfg.sh then we don't need any other change for it. I have no objection. Regarding adding a function to verify that $SHARED_DIRECTORY is a real shared directory among the test nodes: it looks like generate_db() is already able to do this check. |
| Comment by Sarah Liu [ 09/Aug/11 ] |
|
Would it work to just change the order of init_test_env and the reading of the configuration file? I mean, read in the config file first and then call init_test_env. |
| Comment by Yang Sheng [ 09/Aug/11 ] |
Regarding whether it would work to read the config file first and then call init_test_env: of course it works. But all the scripts call them in the current order. Reading the config file first and then calling init_test_env seems more reasonable. I wonder who knows the original intent behind this order? |
| Comment by Jian Yu [ 10/Aug/11 ] |
It seems that generate_db() did not detect that /tmp was not a shared directory among the test nodes. We still need a common function to do the check, since $SHARED_DIRECTORY is not only used for storing $MDSDB and $OSTDB; it is also used by the recovery-*-scale.sh tests to catch failed clients.
As for the original intent of this order, I have no idea. After looking into init_test_env() and cfg/{local,ncli}.sh, I found that only the following two things would be affected if we ran init_test_env after sourcing the config file: So, if we prefer to change the order, we need to define those variables in cfg/local.sh and do more testing to see whether other changes are needed. |
| Comment by Andreas Dilger [ 07/Feb/12 ] |
|
As I recall from the last time I looked at switching the order of init_test_env() and the config file, I was in favor of switching it. This would allow the config file to drop the setting of default values, and only have to specify values that it wants to change from the default. Then init_test_env() can supply all of the unset values, and use those for setting its own defaults. In fact, it would be even better if init_test_env() handled reading the config file, or if the config file called init_test_env() itself to set the defaults. Reducing the per-script initialization to a single line would reduce the complexity and duplication in all of the tests. I see something similar to the following lines copied to the start of each test:
LUSTRE=${LUSTRE:-`dirname $0`/..}
That said, I'd be happier to have the lfsck.sh test failure fixed quickly in a simple patch, followed by a separate patch which only switches the order of the test initialization. That reduces the risk and complexity of the fix for the failing lfsck.sh test, ensures that tests are passing sooner instead of later, and avoids blocking other patches from landing for a longer time. |
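A rough sketch of the single-entry-point idea follows, assuming a hypothetical function name `init_test_env_sketch` and a config file path passed as its first argument. The real init_test_env() sets many more variables than the three shown in this ticket, so this only illustrates the ordering.

```shell
#!/bin/bash
# Hypothetical single entry point: source the config file first (if one
# is given and readable), then fill in defaults for anything still unset.
init_test_env_sketch() {
    local cfg=${1:-}

    # Values from the config file win because they are set before the
    # defaults below are evaluated.
    if [ -n "$cfg" ] && [ -r "$cfg" ]; then
        . "$cfg"
    fi

    # Defaults as quoted in this ticket; MDSDB and OSTDB now follow
    # whatever SHARED_DIRECTORY the config file chose.
    export SHARED_DIRECTORY=${SHARED_DIRECTORY:-"/tmp"}
    export MDSDB=${MDSDB:-$SHARED_DIRECTORY/mdsdb}
    export OSTDB=${OSTDB:-$SHARED_DIRECTORY/ostdb}
}
```

With this shape, a test script would need only one initialization call, and a config file would list only the values it wants to change from the defaults.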
| Comment by Jian Yu [ 10/Feb/12 ] |
|
Lustre Tag: v2_1_1_0_RC1 The same issue occurred: |
| Comment by Peter Jones [ 16/Feb/12 ] |
|
Landed for 2.2 |
| Comment by Build Master (Inactive) [ 16/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = FAILURE
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = ABORTED
|
| Comment by Build Master (Inactive) [ 08/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Jian Yu [ 10/Apr/12 ] |
|
Patch for b1_8 is in http://review.whamcloud.com/2498. |
| Comment by Build Master (Inactive) [ 16/Apr/12 ] |
|
Integrated in Result = SUCCESS
|