[LU-1661] Test failure on test suite posix, subtest test_1 Created: 23/Jul/12 Updated: 26/Aug/12 Resolved: 26/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4474 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/24c171c6-d2d6-11e1-8823-52540035b04c. The sub-test test_1 failed with the following error:
|
| Comments |
| Comment by Peter Jones [ 24/Jul/12 ] |
|
Minh, can you please comment? |
| Comment by Minh Diep [ 01/Aug/12 ] |
|
Unfortunately, the logs do not have any useful information. The "hang" occurs during mkfs on the loopback fs. I found a few instances of this failure and am investigating further. |
| Comment by Peter Jones [ 02/Aug/12 ] |
|
Yujian, could you please also look into this one while you are investigating? Thanks, Peter |
| Comment by Jian Yu [ 03/Aug/12 ] |
In fact, the test did not hang. It ran "setup_loop_dev" successfully and moved on to "setup_posix_users", where the following issue occurred:
Syslog of client-27vm6:
Jul 20 14:34:54 client-27vm6 kernel: Lustre: DEBUG MARKER: == posix test 1: build, install, run posix on ext4 and lustre, then compare ========================== 14:34:53 (1342820093)
<~snip~>
Jul 20 15:30:27 client-27vm6 groupadd[11144]: new group: name=supp9, GID=1012
There was a 5-minute interval related to xinetd[1915] after each new group was added. Finally, autotest terminated the test after it had run for 3600 seconds. |
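One way to spot such stalls in a syslog is to compare the timestamps of consecutive lines. The following is a minimal sketch, not part of the Lustre test framework: the function name `find_gaps` is hypothetical, it assumes the classic "Mon DD HH:MM:SS host ..." syslog layout seen above, and it handles at most one midnight rollover between two adjacent lines.

```shell
#!/bin/sh
# find_gaps LIMIT
# Read syslog lines on stdin and report every line that was logged more
# than LIMIT seconds after the line before it.
# Hypothetical helper for illustration; assumes "Mon DD HH:MM:SS ..." lines.
find_gaps() {
    awk -v limit="$1" '
    {
        # Field 3 is the HH:MM:SS timestamp; convert it to seconds of the day.
        split($3, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) {
            gap = now - prev
            if (gap < 0) gap += 86400   # the two lines straddle midnight
            if (gap > limit) printf "%d s gap before: %s\n", gap, $0
        }
        prev = now
    }'
}
```

For example, `grep groupadd /var/log/messages | find_gaps 60` would list every groupadd entry that was logged more than a minute after the previous one (the log path is an assumption and varies by distribution).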
| Comment by Jian Yu [ 06/Aug/12 ] |
|
After looking into the syslogs of other instances, I found that the 5-minute interval was related to add_group():
Syslog of client-19vm1:
Jul 15 02:21:37 client-19vm1 kernel: Lustre: DEBUG MARKER: == posix test 1: build, install, run posix on ext4 and lustre, then compare ========================== 02:21:36 (1342344096)
<~snip~> |
| Comment by Peter Jones [ 06/Aug/12 ] |
|
Yujian commented: "it seems it's related to the network issue of the test cluster. The issue has not occurred since 2012-07-21 (there are more than 30 POSIX test runs after that)." So I think that we can drop the priority of this one so that it is no longer tracked as a blocker for 2.3. |
| Comment by Jian Yu [ 20/Aug/12 ] |
|
Another instance:
Syslog of fat-intel-3vm6:
Aug 18 10:04:37 fat-intel-3vm6 rshd[2588]: root@fat-intel-3vm6.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /
The question is why it took 5 minutes to add one group on a remote node. Normally, the same operation takes less than 1 second:
Syslog of client-27vm2:
Aug 3 14:02:49 client-27vm2 rshd[32413]: root@client-27vm2.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" USE_OFD= MGSFSTYPE=ldiskfs MDSFSTYPE=ldiskfs OSTFSTYPE=ldiskfs FSTYPE=ldiskfs sh -c " error() { set +x; echo Error: \$2: \$1; echo XXRETCODE:\$1; exit \$1; } gid=\$(getent group vsxg0 | cut -d: -f3); if [ "x\$gid" != "x" ]; then [ \$gid -eq 1001 ] || \ error 1 \"inconsistent group ID: new: 1001, old: \$gid\"; else groupadd -g 1001 vsxg0 fi;");echo XXRETCODE:$?'
In addition, from the historical reports, I found that the issue only occurred on RHEL5 clients. There were also passing runs on RHEL5 clients that did not show the above network latency issue. |
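The escaped one-liner in the rshd entry above is hard to read, so here is the same check-then-create logic written out as a standalone function. This is only a sketch for readability: the name `ensure_group` is hypothetical and is not the add_group() from posix.cfg, and it assumes `getent` and `groupadd` are available and that the caller has the privileges to create groups.

```shell
#!/bin/sh
# ensure_group NAME GID
# Create group NAME with numeric id GID unless it already exists.
# Returns non-zero if the group already exists under a different id.
# Hypothetical helper for illustration, not the posix.cfg implementation.
ensure_group() {
    name=$1
    want=$2
    # getent prints "name:passwd:gid:members"; field 3 is the numeric id.
    have=$(getent group "$name" | cut -d: -f3)
    if [ -n "$have" ]; then
        if [ "$have" -ne "$want" ]; then
            echo "Error: inconsistent group ID: new: $want, old: $have" >&2
            return 1
        fi
        return 0    # group already present with the expected id
    fi
    groupadd -g "$want" "$name"
}
```

Calling `ensure_group vsxg0 1001` twice is harmless: the second call finds the group and returns without invoking groupadd. Wrapping the call in `time` on the affected node would show whether the delay comes from the lookup and creation itself or from the remote-shell round trip.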
| Comment by Jian Yu [ 24/Aug/12 ] |
|
I cannot manually reproduce the above issue on RHEL5 clients on the Toro test nodes. To simplify add_group() and add_user() in posix.cfg, I created the following patches, which change setup_posix_users() in posix.cfg to use do_rpc_nodes to add groups and users on remote nodes. Patch for b2_3 branch: http://review.whamcloud.com/3771 |
| Comment by Peter Jones [ 26/Aug/12 ] |
|
Landed for 2.3 and 2.4 |