[LU-1661] Test failure on test suite posix, subtest test_1 Created: 23/Jul/12  Updated: 26/Aug/12  Resolved: 26/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4474

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/24c171c6-d2d6-11e1-8823-52540035b04c.

The sub-test test_1 failed with the following error:

test failed to respond and timed out



 Comments   
Comment by Peter Jones [ 24/Jul/12 ]

Minh can you please comment?

Comment by Minh Diep [ 01/Aug/12 ]

Unfortunately, the logs do not contain any useful information. The hang occurs during mkfs on the loopback filesystem. I found a few instances of this failure and am investigating further.

https://maloo.whamcloud.com/test_sets/query?utf8=%E2%9C%93&test_set%5Btest_set_script_id%5D=f60ca966-8b52-11e0-aab9-52540025f9af&test_set%5Bstatus%5D=TIMEOUT&test_set%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results

Comment by Peter Jones [ 02/Aug/12 ]

Yujian

Could you please also look into this one while you are investigating LU-1580?

Thanks

Peter

Comment by Jian Yu [ 03/Aug/12 ]

Minh wrote above: "Unfortunately, the logs do not have any useful information. The hang is during mkfs on the loopback fs."

In fact, the test did not hang. It succeeded in running "setup_loop_dev" and went into "setup_posix_users", where the following issue occurred:

Syslog of client-27vm6

Jul 20 14:34:54 client-27vm6 kernel: Lustre: DEBUG MARKER: == posix test 1: build, install, run posix on ext4 and lustre, then compare ========================== 14:34:53 (1342820093)
Jul 20 14:35:01 client-27vm6 rshd[10120]: pam_unix(rsh:session): session closed for user root
Jul 20 14:35:20 client-27vm6 xinetd[1915]: EXIT: shell status=0 pid=10120 duration=0(sec)
Jul 20 14:35:21 client-27vm6 kernel: loop: loaded (max 8 devices)
Jul 20 14:35:21 client-27vm6 kernel: EXT4-fs (loop1): mounted filesystem with ordered data mode
Jul 20 14:35:21 client-27vm6 xinetd[1915]: START: shell pid=10294 from=10.10.4.159
Jul 20 14:35:21 client-27vm6 rshd[10294]: pam_rhosts_auth(rsh:auth): user root has a `+' user entry
Jul 20 14:35:21 client-27vm6 rshd[10294]: pam_rhosts_auth(rsh:auth): allowed to root@client-27vm6.lab.whamcloud.com as root
Jul 20 14:35:21 client-27vm6 rshd[10294]: pam_unix(rsh:session): session opened for user root by (uid=0)
Jul 20 14:35:21 client-27vm6 rshd[10295]: root@client-27vm6.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" USE_OFD= sh -c " error() { set +x; echo Error: \$2: \$1; echo XXRETCODE:\$1; exit \$1; } gid=\$(getent group vsxg0 | cut -d: -f3); if [ "x\$gid" != "x" ]; then [ \$gid -eq 1001 ] || \ error 1 \"inconsistent group ID: new: 1001, old: \$gid\"; else groupadd -g 1001 vsxg0 fi;");echo XXRETCODE:$?'
Jul 20 14:35:21 client-27vm6 groupadd[10332]: new group: name=vsxg0, GID=1001
Jul 20 14:35:21 client-27vm6 rshd[10294]: pam_unix(rsh:session): session closed for user root
Jul 20 14:35:22 client-27vm6 xinetd[1915]: EXIT: shell status=0 pid=10294 duration=1(sec)
Jul 20 14:40:21 client-27vm6 xinetd[1915]: START: shell pid=10367 from=10.10.4.159
Jul 20 14:40:21 client-27vm6 rshd[10367]: pam_rhosts_auth(rsh:auth): user root has a `+' user entry
Jul 20 14:40:22 client-27vm6 rshd[10367]: pam_rhosts_auth(rsh:auth): allowed to root@client-27vm6.lab.whamcloud.com as root
Jul 20 14:40:22 client-27vm6 rshd[10367]: pam_unix(rsh:session): session opened for user root by (uid=0)
Jul 20 14:40:22 client-27vm6 rshd[10368]: root@client-27vm6.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" USE_OFD= sh -c " error() { set +x; echo Error: \$2: \$1; echo XXRETCODE:\$1; exit \$1; } gid=\$(getent group vsxg1 | cut -d: -f3); if [ "x\$gid" != "x" ]; then [ \$gid -eq 1002 ] || \ error 1 \"inconsistent group ID: new: 1002, old: \$gid\"; else groupadd -g 1002 vsxg1 fi;");echo XXRETCODE:$?'
Jul 20 14:40:22 client-27vm6 groupadd[10405]: new group: name=vsxg1, GID=1002
Jul 20 14:40:22 client-27vm6 rshd[10367]: pam_unix(rsh:session): session closed for user root
Jul 20 14:40:22 client-27vm6 xinetd[1915]: EXIT: shell status=0 pid=10367 duration=1(sec)
Jul 20 14:45:21 client-27vm6 xinetd[1915]: START: shell pid=10440 from=10.10.4.159
Jul 20 14:45:22 client-27vm6 rshd[10440]: pam_rhosts_auth(rsh:auth): user root has a `+' user entry
Jul 20 14:45:22 client-27vm6 rshd[10440]: pam_rhosts_auth(rsh:auth): allowed to root@client-27vm6.lab.whamcloud.com as root
Jul 20 14:45:22 client-27vm6 rshd[10440]: pam_unix(rsh:session): session opened for user root by (uid=0)
Jul 20 14:45:22 client-27vm6 rshd[10441]: root@client-27vm6.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" USE_OFD= sh -c " error() { set +x; echo Error: \$2: \$1; echo XXRETCODE:\$1; exit \$1; } gid=\$(getent group vsxg2 | cut -d: -f3); if [ "x\$gid" != "x" ]; then [ \$gid -eq 1003 ] || \ error 1 \"inconsistent group ID: new: 1003, old: \$gid\"; else groupadd -g 1003 vsxg2 fi;");echo XXRETCODE:$?'
Jul 20 14:45:22 client-27vm6 groupadd[10478]: new group: name=vsxg2, GID=1003

<~snip~>

Jul 20 15:30:27 client-27vm6 groupadd[11144]: new group: name=supp9, GID=1012
Jul 20 15:30:27 client-27vm6 rshd[11106]: pam_unix(rsh:session): session closed for user root
Jul 20 15:30:27 client-27vm6 xinetd[1915]: EXIT: shell status=0 pid=11106 duration=1(sec)
Jul 20 15:35:28 client-27vm6 xinetd[1915]: START: shell pid=11179 from=10.10.4.159
Jul 20 15:35:29 client-27vm6 rshd[11179]: pam_rhosts_auth(rsh:auth): user root has a `+' user entry
Jul 20 15:35:30 client-27vm6 rshd[11179]: pam_rhosts_auth(rsh:auth): allowed to root@client-27vm6.lab.whamcloud.com as root
Jul 20 15:35:30 client-27vm6 rshd[11179]: pam_unix(rsh:session): session opened for user root by (uid=0)
Jul 20 15:35:31 client-27vm6 rshd[11180]: root@client-27vm6.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" USE_OFD= sh -c " error() { set +x; echo Error: \$2: \$1; echo XXRETCODE:\$1; exit \$1; } gid=\$(getent group supp10 | cut -d: -f3); if [ "x\$gid" != "x" ]; then [ \$gid -eq 1013 ] || \ error 1 \"inconsistent group ID: new: 1013, old: \$gid\"; else groupadd -g 1013 supp10 fi;");echo XXRETCODE:$?'
Jul 20 15:35:31 client-27vm6 groupadd[11217]: new group: name=supp10, GID=1013
Jul 20 15:35:32 client-27vm6 rshd[11179]: pam_unix(rsh:session): session closed for user root
Jul 20 15:35:32 client-27vm6 xinetd[1915]: EXIT: shell status=0 pid=11179 duration=2(sec)
Jul 20 15:35:38 client-27vm6 sshd[11221]: Accepted publickey for autotest from 10.10.0.13 port 50279 ssh2
Jul 20 15:35:39 client-27vm6 sshd[11221]: pam_unix(sshd:session): session opened for user autotest by (uid=0)
Jul 20 15:35:40 client-27vm6 sshd[11223]: Received disconnect from 10.10.0.13: 11: disconnected by user
Jul 20 15:35:40 client-27vm6 sshd[11221]: pam_unix(sshd:session): session closed for user autotest
Jul 20 15:35:42 client-27vm6 sshd[11226]: Accepted publickey for autotest from 10.10.0.13 port 50281 ssh2
Jul 20 15:35:42 client-27vm6 sshd[11226]: pam_unix(sshd:session): session opened for user autotest by (uid=0)
Jul 20 15:35:43 client-27vm6 sshd[11228]: Received disconnect from 10.10.0.13: 11: disconnected by user
Jul 20 15:35:44 client-27vm6 sshd[11226]: pam_unix(sshd:session): session closed for user autotest
Jul 20 15:35:48 client-27vm6 sshd[11231]: Accepted publickey for autotest from 10.10.0.13 port 50282 ssh2
Jul 20 15:35:49 client-27vm6 sshd[11231]: pam_unix(sshd:session): session opened for user autotest by (uid=0)
Jul 20 15:35:50 client-27vm6 xinetd[1915]: START: shell pid=11235 from=10.10.4.159
Jul 20 15:35:51 client-27vm6 rshd[11235]: pam_rhosts_auth(rsh:auth): user root has a `+' user entry
Jul 20 15:35:51 client-27vm6 rshd[11235]: pam_rhosts_auth(rsh:auth): allowed to autotest@client-27vm6.lab.whamcloud.com as root
Jul 20 15:35:52 client-27vm6 rshd[11235]: pam_unix(rsh:session): session opened for user root by (uid=0)
Jul 20 15:35:52 client-27vm6 rshd[11236]: autotest@client-27vm6.lab.whamcloud.com as root: cmd='PATH=$PATH:/sbin:/usr/sbin;echo t > /proc/sysrq-trigger'
Jul 20 15:35:53 client-27vm6 kernel: SysRq : Show State

There was a 5-minute xinetd[1915] interval after each new group was added. Eventually, autotest terminated the test after it had run for 3600 seconds.
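The rshd log lines above all run the same inline group-creation snippet. Reconstructed below as a readable sketch (the actual add_group() helper in posix.cfg may differ in details); the check is idempotent: if the group already exists, its GID must match, otherwise the group is created.

```shell
#!/bin/bash
# Sketch of the group-creation logic visible in the rshd log lines above.
# Reconstructed for readability; the real posix.cfg helper may differ.

error() {
    set +x
    echo "Error: $2: $1"
    echo "XXRETCODE:$1"
    exit "$1"
}

add_group() {
    local gid=$1
    local group=$2
    local old_gid
    old_gid=$(getent group "$group" | cut -d: -f3)
    if [ -n "$old_gid" ]; then
        # Group already exists: its GID must match the requested one.
        [ "$old_gid" -eq "$gid" ] ||
            error 1 "inconsistent group ID: new: $gid, old: $old_gid"
    else
        groupadd -g "$gid" "$group"
    fi
}

# Safe demonstration: the "root" group already exists with GID 0, so this
# takes the consistency-check branch and does not call groupadd.
add_group 0 root
```

In the logs the test creates vsxg0 with GID 1001, vsxg1 with 1002, and so on, one rsh session per group.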

Comment by Jian Yu [ 06/Aug/12 ]

After looking into the syslogs of other instances, I found the 5-minute interval was related to add_group():

Syslog of client-19vm1

Jul 15 02:21:37 client-19vm1 kernel: Lustre: DEBUG MARKER: == posix test 1: build, install, run posix on ext4 and lustre, then compare ========================== 02:21:36 (1342344096)
Jul 15 02:21:37 client-19vm1 rshd[8144]: pam_unix(rsh:session): session closed for user root
Jul 15 02:21:37 client-19vm1 xinetd[1921]: EXIT: shell status=0 pid=8144 duration=0(sec)
Jul 15 02:21:37 client-19vm1 kernel: loop: loaded (max 8 devices)
Jul 15 02:22:00 client-19vm1 kernel: EXT4-fs (loop1): mounted filesystem with ordered data mode
Jul 15 02:22:03 client-19vm1 xinetd[1921]: START: shell pid=8316 from=10.10.4.220
Jul 15 02:22:03 client-19vm1 rshd[8316]: pam_rhosts_auth(rsh:auth): user root has a `+' user entry
Jul 15 02:22:03 client-19vm1 rshd[8316]: pam_rhosts_auth(rsh:auth): allowed to root@client-19vm1.lab.whamcloud.com as root
Jul 15 02:22:03 client-19vm1 rshd[8316]: pam_unix(rsh:session): session opened for user root by (uid=0)
Jul 15 02:22:04 client-19vm1 rshd[8317]: root@client-19vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" USE_OFD= sh -c " error() { set +x; echo Error: \$2: \$1; echo XXRETCODE:\$1; exit \$1; } gid=\$(getent group vsxg0 | cut -d: -f3); if [ "x\$gid" != "x" ]; then [ \$gid -eq 1001 ] || \ error 1 \"inconsistent group ID: new: 1001, old: \$gid\"; else groupadd -g 1001 vsxg0 fi;");echo XXRETCODE:$?'
Jul 15 02:27:04 client-19vm1 groupadd[8354]: new group: name=vsxg0, GID=1001
Jul 15 02:27:04 client-19vm1 rshd[8316]: pam_unix(rsh:session): session closed for user root
Jul 15 02:27:04 client-19vm1 xinetd[1921]: EXIT: shell status=0 pid=8316 duration=304(sec)
Jul 15 02:27:04 client-19vm1 xinetd[1921]: START: shell pid=8389 from=10.10.4.220
Jul 15 02:27:04 client-19vm1 rshd[8389]: pam_rhosts_auth(rsh:auth): user root has a `+' user entry
Jul 15 02:27:05 client-19vm1 rshd[8389]: pam_rhosts_auth(rsh:auth): allowed to root@client-19vm1.lab.whamcloud.com as root
Jul 15 02:27:05 client-19vm1 rshd[8389]: pam_unix(rsh:session): session opened for user root by (uid=0)
Jul 15 02:27:05 client-19vm1 rshd[8390]: root@client-19vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" USE_OFD= sh -c " error() { set +x; echo Error: \$2: \$1; echo XXRETCODE:\$1; exit \$1; } gid=\$(getent group vsxg1 | cut -d: -f3); if [ "x\$gid" != "x" ]; then [ \$gid -eq 1002 ] || \ error 1 \"inconsistent group ID: new: 1002, old: \$gid\"; else groupadd -g 1002 vsxg1 fi;");echo XXRETCODE:$?'
Jul 15 02:32:04 client-19vm1 groupadd[8427]: new group: name=vsxg1, GID=1002

<~snip~>

Comment by Peter Jones [ 06/Aug/12 ]

Yujian commented: "It seems it's related to the network issue of the test cluster. The issue has not occurred since 2012-07-21 (there are more than 30 POSIX test runs after that)." So I think that we can drop the priority of this one so it is no longer tracked as a blocker for 2.3.

Comment by Jian Yu [ 20/Aug/12 ]

Another instance:
https://maloo.whamcloud.com/test_sets/a23346b8-e96d-11e1-a508-52540035b04c

Syslog of fat-intel-3vm6

Aug 18 10:04:37 fat-intel-3vm6 rshd[2588]: root@fat-intel-3vm6.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" USE_OFD= MGSFSTYPE=ldiskfs MDSFSTYPE=ldiskfs OSTFSTYPE=ldiskfs FSTYPE=ldiskfs sh -c " error() { set +x; echo Error: \$2: \$1; echo XXRETCODE:\$1; exit \$1; } gid=\$(getent group vsxg0 | cut -d: -f3); if [ "x\$gid" != "x" ]; then [ \$gid -eq 1001 ] || \ error 1 \"inconsistent group ID: new: 1001, old: \$gid\"; else groupadd -g 1001 vsxg0 fi;");echo XXRETCODE:$?'
Aug 18 10:09:37 fat-intel-3vm6 groupadd[2629]: new group: name=vsxg0, GID=1001

The question is why it took 5 minutes to add one group on a remote node. Normally, the same operation takes less than 1 second:

Syslog of client-27vm2

Aug 3 14:02:49 client-27vm2 rshd[32413]: root@client-27vm2.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" USE_OFD= MGSFSTYPE=ldiskfs MDSFSTYPE=ldiskfs OSTFSTYPE=ldiskfs FSTYPE=ldiskfs sh -c " error() { set +x; echo Error: \$2: \$1; echo XXRETCODE:\$1; exit \$1; } gid=\$(getent group vsxg0 | cut -d: -f3); if [ "x\$gid" != "x" ]; then [ \$gid -eq 1001 ] || \ error 1 \"inconsistent group ID: new: 1001, old: \$gid\"; else groupadd -g 1001 vsxg0 fi;");echo XXRETCODE:$?'
Aug 3 14:02:49 client-27vm2 kernel: [ $gid -eq 1001 ] || error 1 "inconsistent group ID: new: 1001, old:
Aug 3 14:02:49 client-27vm2 groupadd[32450]: new group: name=vsxg0, GID=1001

In addition, the historical reports show the issue only occurred on RHEL5 clients. There were also passing instances on RHEL5 clients without the network latency issue above.
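The failing and passing instances can be told apart by the gaps between consecutive syslog timestamps (304 seconds between rshd dispatch and groupadd completion in the failing runs, under a second in the passing ones). A small sketch for spotting such gaps, assuming the "Mon DD HH:MM:SS" syslog format shown in the excerpts (day rollovers ignored for simplicity):

```shell
#!/bin/bash
# find_gaps N: print each syslog line preceded by a gap of more than N
# seconds (default 60). Reads syslog text on stdin. Assumes the
# "Mon DD HH:MM:SS" timestamp format used in the excerpts above.
find_gaps() {
    awk -v limit="${1:-60}" '{
        split($3, t, ":")                    # $3 is the HH:MM:SS field
        secs = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1 && secs - prev > limit)
            printf "gap of %ds before: %s\n", secs - prev, $0
        prev = secs
    }'
}
```

Usage: `find_gaps 60 < /var/log/messages` would flag the 5-minute pauses before each groupadd line in the failing runs.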

Comment by Jian Yu [ 24/Aug/12 ]

I cannot manually reproduce the above issue on RHEL5 clients on Toro test nodes.

To simplify add_group() and add_user() in posix.cfg, I created the following patches, which change setup_posix_users() to use do_rpc_nodes to add groups and users on remote nodes.

Patch for b2_3 branch: http://review.whamcloud.com/3771
Patch for master branch: http://review.whamcloud.com/3770
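For illustration only (the actual change is in the Gerrit reviews above, and the helper names here are hypothetical): do_rpc_nodes, from the Lustre test framework, runs a named shell function on a list of remote nodes, replacing the long inline `sh -c '...'` command strings seen in the rshd logs. The loop structure might look roughly like:

```shell
#!/bin/bash
# Hedged sketch of a do_rpc_nodes-based setup_posix_users(); not the
# patch verbatim. add_posix_group is a hypothetical RPC helper that
# would create (or GID-check) one group on each node in the list.

setup_posix_users() {
    local nodes=$1              # e.g. "client-27vm6,client-27vm2"
    local i gid group

    for i in 0 1 2; do
        gid=$((1001 + i))
        group="vsxg$i"
        # One framework RPC per group instead of one raw rsh command
        # string per group; do_rpc_nodes handles the remote execution.
        do_rpc_nodes "$nodes" add_posix_group "$gid" "$group"
    done
}
```

This matches the GID/group pairs seen in the logs (vsxg0 gets GID 1001, vsxg1 gets 1002, and so on).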

Comment by Peter Jones [ 26/Aug/12 ]

Landed for 2.3 and 2.4

Generated at Sat Feb 10 01:18:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.