[LU-552] 1.8<->2.1 interop: parallel-scale connectathon test hung
Created: 29/Jul/11  Updated: 16/Aug/16  Resolved: 16/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
Fix Version/s: Lustre 2.1.0, Lustre 1.8.7

Type: Bug
Priority: Minor
Reporter: Jian Yu
Assignee: WC Triage
Resolution: Won't Fix
Votes: 0
Labels: None
Environment:

Lustre Clients:
Tag: 1.8.6-wc1
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32_131.2.1.el6)
Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)
ENABLE_QUOTA=yes

Lustre Servers:
Tag: v2_0_66_0
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.2.1.el6_lustre)
Build: http://newbuild.whamcloud.com/job/lustre-master/228/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)


Severity: 3
Rank (Obsolete): 10342

 Description   

The parallel-scale connectathon test hung after printing the following output:

Test #7 - Test parent/child mutual exclusion.
	Parent: 7.0  - F_TLOCK [             ffc,               9] PASSED.
	Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ].
	Parent: Now free child to run, should block on lock.
	Parent: Check data in file to insure child blocked.
	Parent: Read 'aaaa eh' from testfile [ 4092, 7 ].
	Parent: 7.1  - COMPARE [             ffc,               7] PASSED.
	Parent: Now unlock region so child will unblock.
	Parent: 7.2  - F_ULOCK [             ffc,               9] PASSED.
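
The parent/child sequence in the Test #7 output above can be approximated with the sketch below. It is not the connectathon source: the function name `parent_child_exclusion`, the child's payload `bbbb eh`, the one-second pause, and the timeout are assumptions made for illustration. Only the byte range (offset 0xffc, length 9) and the parent's data come from the output. The final wait on the pipe is where a hang like the one reported here would show up as a timeout.

```python
import fcntl
import os
import select
import signal
import tempfile
import time

OFFSET, LENGTH = 0xFFC, 9   # byte range shown in the Test #7 output
DATA = b"aaaa eh"           # data the parent writes at offset 4092


def parent_child_exclusion(path, timeout=10.0):
    """Return True if the child blocks while the parent holds the lock
    and obtains the lock within `timeout` seconds after the unlock."""
    go_r, go_w = os.pipe()       # parent -> child: "go ahead"
    done_r, done_w = os.pipe()   # child -> parent: "lock obtained"

    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    # Step 7.0: non-blocking exclusive lock (roughly F_TLOCK).
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, LENGTH, OFFSET, os.SEEK_SET)
    os.pwrite(fd, DATA, OFFSET)

    pid = os.fork()
    if pid == 0:
        try:
            os.read(go_r, 1)                 # wait until the parent frees us
            cfd = os.open(path, os.O_RDWR)
            # Blocking lock request: should sleep until the parent unlocks.
            fcntl.lockf(cfd, fcntl.LOCK_EX, LENGTH, OFFSET, os.SEEK_SET)
            os.pwrite(cfd, b"bbbb eh", OFFSET)   # hypothetical child payload
            os.write(done_w, b"x")
        finally:
            os._exit(0)

    os.write(go_w, b"x")         # free the child; it should block on the lock
    time.sleep(1.0)
    # Step 7.1: the data is unchanged if the child really blocked.
    blocked = os.pread(fd, len(DATA), OFFSET) == DATA
    # Step 7.2: unlock the region so the child can proceed.
    fcntl.lockf(fd, fcntl.LOCK_UN, LENGTH, OFFSET, os.SEEK_SET)

    # Wait for the child; a timeout here mirrors the reported hang.
    ready, _, _ = select.select([done_r], [], [], timeout)
    if not ready:
        os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    for f in (fd, go_r, go_w, done_r, done_w):
        os.close(f)
    return blocked and bool(ready)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(parent_child_exclusion(os.path.join(d, "testfile")))
```

To try it against a Lustre client, pass a path under the mount point (for example a file in /mnt/lustre/d0.connectathon) instead of a temporary directory.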

On client node fat-amd-3-ib:

[root@fat-amd-3-ib tests]# ps auxww
<~snip~>
root     16272  0.0  0.0 107268  2160 pts/0    S+   08:19   0:00 bash /usr/lib64/lustre/tests/parallel-scale.sh
root     16274  0.0  0.0 106020  1304 pts/0    S+   08:19   0:00 sh runtests -f
root     16281  0.0  0.0   6600   556 pts/0    S+   08:19   0:00 tlocklfs /mnt/lustre/d0.connectathon
root     16282  0.0  0.0   6436   332 pts/0    S+   08:19   0:00 tlocklfs /mnt/lustre/d0.connectathon

[root@fat-amd-3-ib tests]# echo t > /proc/sysrq-trigger
<~snip~>
tlocklfs      S 0000000000000004     0 16281  16274 0x00000080
 ffff880234c9dca8 0000000000000082 0000000000000000 0000000000000082
 ffff880234c9dc28 ffff8803d5c97cb8 0000000000000000 0000000101fbf9c8
 ffff8803190f7ab8 ffff880234c9dfd8 000000000000f598 ffff8803190f7ab8
Call Trace:
 [<ffffffff8117bf7b>] pipe_wait+0x5b/0x80
 [<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff814dbc1e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8117c9d6>] pipe_read+0x3e6/0x4e0
 [<ffffffff811723ea>] do_sync_read+0xfa/0x140
 [<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811bc395>] ? fcntl_setlk+0x75/0x320
 [<ffffffff81204ef6>] ? security_file_permission+0x16/0x20
 [<ffffffff81172e15>] vfs_read+0xb5/0x1a0
 [<ffffffff810d1ac2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81172f51>] sys_read+0x51/0x90
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

tlocklfs      S 0000000000000004     0 16282  16281 0x00000080
 ffff880105f5ba98 0000000000000086 00003f9affffff9d 0000020300003f9a
 0000000000000000 0000000000000001 ffff880105f5ba88 ffffffffa0789fb0
 ffff880104877ab8 ffff880105f5bfd8 000000000000f598 ffff880104877ab8
Call Trace:
 [<ffffffffa0789fb0>] ? ldlm_lock_dump+0x560/0x640 [ptlrpc]
 [<ffffffffa07b900d>] ldlm_flock_completion_ast+0x61d/0x9f0 [ptlrpc]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa07a7565>] ldlm_cli_enqueue_fini+0x6c5/0xba0 [ptlrpc]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa07ab074>] ldlm_cli_enqueue+0x344/0x7a0 [ptlrpc]
 [<ffffffffa09a7edd>] ll_file_flock+0x47d/0x6b0 [lustre]
 [<ffffffff81190f40>] ? mntput_no_expire+0x30/0x110
 [<ffffffffa07b89f0>] ? ldlm_flock_completion_ast+0x0/0x9f0 [ptlrpc]
 [<ffffffff8117f451>] ? path_put+0x31/0x40
 [<ffffffff811bc243>] vfs_lock_file+0x23/0x40
 [<ffffffff811bc497>] fcntl_setlk+0x177/0x320
 [<ffffffff811845f7>] sys_fcntl+0x197/0x530
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

dmesg on the MDS node fat-amd-1-ib showed:

Lustre: DEBUG MARKER: == test connectathon: connectathon == 08:17:31
Lustre: DEBUG MARKER: ./runtests -N 10 -b -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 10 -g -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 10 -s -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 10 -l -f /mnt/lustre/d0.connectathon
Lustre: Service thread pid 27106 was inactive for 0.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Lustre: Service thread pid 27106 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Pid: 27106, comm: mdt_06

Call Trace: 
 [<ffffffffa09c16fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
 [<ffffffffa0c17b89>] ptlrpc_wait_event+0x2b9/0x2c0 [ptlrpc]
 [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
 [<ffffffffa0c1f6a5>] ptlrpc_main+0x4f5/0x1900 [ptlrpc]
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffffa0c1f1b0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/52ca975a-b9a5-11e0-8bdf-52540025f9af



 Comments   
Comment by Jian Yu [ 26/Aug/11 ]

Lustre Clients:
Tag: 1.8.6-wc1
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32_131.2.1.el6)
Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)
ENABLE_QUOTA=yes

Lustre Servers:
Tag: v2_1_0_0_RC1
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.6.1.el6_lustre)
Build: http://newbuild.whamcloud.com/job/lustre-master/271/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)

Client mount options: "user_xattr,acl,flock"
connectathon also failed with the same issue: https://maloo.whamcloud.com/test_sets/e2d7fa7c-cfc1-11e0-8d02-52540025f9af

Client mount options: "user_xattr,acl"
connectathon also failed: https://maloo.whamcloud.com/test_sets/9589e5d0-d1f3-11e0-8d02-52540025f9af

Client mount options: "user_xattr,acl,localflock"
connectathon passed: https://maloo.whamcloud.com/test_sets/86258fbe-d1f2-11e0-8d02-52540025f9af
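
The three runs above differ only in the flock-related client mount option. The sketch below tabulates them. The helper `flock_mode` is hypothetical, and treating a mount with no flock option as `noflock` is an assumption about the client default for these versions, not something stated in this ticket. The outcomes in `RUNS` are copied from the results above.

```python
FLOCK_OPTIONS = ("flock", "localflock", "noflock")


def flock_mode(mount_options):
    """Return the flock-related option in a comma-separated option string.

    Assumption: a mount with none of the three options behaves as "noflock".
    """
    mode = "noflock"
    for opt in mount_options.split(","):
        opt = opt.strip()
        if opt in FLOCK_OPTIONS:
            mode = opt   # the last matching option wins
    return mode


# Outcomes as reported in this comment (1.8.6-wc1 clients, v2_1_0_0_RC1 servers).
RUNS = [
    ("user_xattr,acl,flock", "failed"),
    ("user_xattr,acl", "failed"),
    ("user_xattr,acl,localflock", "passed"),
]

if __name__ == "__main__":
    for options, outcome in RUNS:
        print(f"{flock_mode(options):<12} {outcome}")
```

Only the run with `localflock`, where locks are handled on the client node, passed, which is consistent with the child being stuck in `ldlm_flock_completion_ast` in the client stack trace.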

Comment by Jian Yu [ 15/Feb/12 ]

Lustre Clients:
Tag: 1.8.7-wc1
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.12.1.el6)
Build: http://build.whamcloud.com/job/lustre-b1_8/171/
Network: TCP (1GigE)
ENABLE_QUOTA=yes

Lustre Servers:
Tag: v2_1_1_0_RC2
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-220.el6_lustre.g4554b65)
Build: http://build.whamcloud.com/job/lustre-b2_1/41/
Network: TCP (1GigE)

Client mount options: "user_xattr,acl,flock"
connectathon failed with the same issue: https://maloo.whamcloud.com/test_sets/10fd5af6-57cf-11e1-99fa-5254004bbbd3

Comment by James A Simmons [ 16/Aug/16 ]

Old blocker for unsupported version
