Details
Type: Bug
Resolution: Won't Fix
Priority: Minor
Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
Fix Version/s: None
Environment:
Lustre Clients:
Tag: 1.8.6-wc1
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32_131.2.1.el6)
Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)
ENABLE_QUOTA=yes
Lustre Servers:
Tag: v2_0_66_0
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.2.1.el6_lustre)
Build: http://newbuild.whamcloud.com/job/lustre-master/228/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)
Severity: 3
Rank (Obsolete): 10342
Description
The parallel-scale connectathon test hung as follows:
Test #7 - Test parent/child mutual exclusion.
	Parent: 7.0 - F_TLOCK [ ffc, 9] PASSED.
	Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ].
	Parent: Now free child to run, should block on lock.
	Parent: Check data in file to insure child blocked.
	Parent: Read 'aaaa eh' from testfile [ 4092, 7 ].
	Parent: 7.1 - COMPARE [ ffc, 7] PASSED.
	Parent: Now unlock region so child will unblock.
	Parent: 7.2 - F_ULOCK [ ffc, 9] PASSED.
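For reference, the F_TLOCK/F_ULOCK steps above are ordinary POSIX record-locking calls on the shared test file: 0xffc (4092) is the byte offset and 9 is the length of the locked region. A minimal sketch of the parent's locking sequence, using a stand-in file name (this is illustrative only, not the actual connectathon tlocklfs source):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Stand-in for the test file under /mnt/lustre/d0.connectathon. */
        int fd = open("testfile", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* 7.0 - F_TLOCK: non-blocking lock of 9 bytes at offset 0xffc
         * (4092); lockf() maps to fcntl(F_SETLK) underneath. */
        lseek(fd, 0xffc, SEEK_SET);
        if (lockf(fd, F_TLOCK, 9) < 0)
            perror("F_TLOCK");

        /* ... write 'aaaa eh', let the child try to lock the same
         * range, re-read the data to verify the child blocked ... */

        /* 7.2 - F_ULOCK: release the range so the child can proceed. */
        lseek(fd, 0xffc, SEEK_SET);
        if (lockf(fd, F_ULOCK, 9) < 0)
            perror("F_ULOCK");

        close(fd);
        return 0;
    }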
On client node fat-amd-3-ib:
[root@fat-amd-3-ib tests]# ps auxww
<~snip~>
root     16272  0.0  0.0 107268  2160 pts/0  S+  08:19  0:00 bash /usr/lib64/lustre/tests/parallel-scale.sh
root     16274  0.0  0.0 106020  1304 pts/0  S+  08:19  0:00 sh runtests -f
root     16281  0.0  0.0   6600   556 pts/0  S+  08:19  0:00 tlocklfs /mnt/lustre/d0.connectathon
root     16282  0.0  0.0   6436   332 pts/0  S+  08:19  0:00 tlocklfs /mnt/lustre/d0.connectathon
[root@fat-amd-3-ib tests]# echo t > /proc/sysrq-trigger
<~snip~>
tlocklfs      S 0000000000000004     0 16281  16274 0x00000080
 ffff880234c9dca8 0000000000000082 0000000000000000 0000000000000082
 ffff880234c9dc28 ffff8803d5c97cb8 0000000000000000 0000000101fbf9c8
 ffff8803190f7ab8 ffff880234c9dfd8 000000000000f598 ffff8803190f7ab8
Call Trace:
 [<ffffffff8117bf7b>] pipe_wait+0x5b/0x80
 [<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff814dbc1e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8117c9d6>] pipe_read+0x3e6/0x4e0
 [<ffffffff811723ea>] do_sync_read+0xfa/0x140
 [<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811bc395>] ? fcntl_setlk+0x75/0x320
 [<ffffffff81204ef6>] ? security_file_permission+0x16/0x20
 [<ffffffff81172e15>] vfs_read+0xb5/0x1a0
 [<ffffffff810d1ac2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81172f51>] sys_read+0x51/0x90
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
tlocklfs      S 0000000000000004     0 16282  16281 0x00000080
 ffff880105f5ba98 0000000000000086 00003f9affffff9d 0000020300003f9a
 0000000000000000 0000000000000001 ffff880105f5ba88 ffffffffa0789fb0
 ffff880104877ab8 ffff880105f5bfd8 000000000000f598 ffff880104877ab8
Call Trace:
 [<ffffffffa0789fb0>] ? ldlm_lock_dump+0x560/0x640 [ptlrpc]
 [<ffffffffa07b900d>] ldlm_flock_completion_ast+0x61d/0x9f0 [ptlrpc]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa07a7565>] ldlm_cli_enqueue_fini+0x6c5/0xba0 [ptlrpc]
 [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
 [<ffffffffa07ab074>] ldlm_cli_enqueue+0x344/0x7a0 [ptlrpc]
 [<ffffffffa09a7edd>] ll_file_flock+0x47d/0x6b0 [lustre]
 [<ffffffff81190f40>] ? mntput_no_expire+0x30/0x110
 [<ffffffffa07b89f0>] ? ldlm_flock_completion_ast+0x0/0x9f0 [ptlrpc]
 [<ffffffff8117f451>] ? path_put+0x31/0x40
 [<ffffffff811bc243>] vfs_lock_file+0x23/0x40
 [<ffffffff811bc497>] fcntl_setlk+0x177/0x320
 [<ffffffff811845f7>] sys_fcntl+0x197/0x530
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
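The two stacks show where things stand: the parent tlocklfs (pid 16281) is asleep in pipe_read()/pipe_wait(), waiting for the child to report back over a pipe, while the child (pid 16282) is blocked inside a blocking fcntl() lock request that the Lustre client is servicing via ll_file_flock() -> ldlm_cli_enqueue() -> ldlm_flock_completion_ast(). A minimal sketch of that parent/child pattern, with made-up file and message names (again, not the actual test source):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int pfd[2];
        char buf[16];
        int fd = open("testfile", O_RDWR | O_CREAT, 0644);

        if (fd < 0 || pipe(pfd) < 0) {
            perror("setup");
            return 1;
        }

        if (fork() == 0) {
            /* Child: blocking byte-range lock.  On a Lustre mount this
             * goes through ll_file_flock() and ldlm_cli_enqueue(), which
             * is where pid 16282 is stuck in the trace above. */
            struct flock fl = {
                .l_type   = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start  = 0xffc,
                .l_len    = 9,
            };
            fcntl(fd, F_SETLKW, &fl);    /* sleeps until the lock is granted */
            write(pfd[1], "locked", 6);  /* then tells the parent */
            _exit(0);
        }

        /* Parent: waits for the child's message.  This read() is the
         * pipe_read()/pipe_wait() sleep of pid 16281; in the hang
         * reported here it never returns because the child's lock is
         * never granted. */
        read(pfd[0], buf, sizeof(buf));
        return 0;
    }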
Dmesg on the MDS node fat-amd-1-ib showed:
Lustre: DEBUG MARKER: == test connectathon: connectathon == 08:17:31
Lustre: DEBUG MARKER: ./runtests -N 10 -b -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 10 -g -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 10 -s -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 10 -l -f /mnt/lustre/d0.connectathon
Lustre: Service thread pid 27106 was inactive for 0.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Lustre: Service thread pid 27106 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Pid: 27106, comm: mdt_06
Call Trace:
 [<ffffffffa09c16fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
 [<ffffffffa0c17b89>] ptlrpc_wait_event+0x2b9/0x2c0 [ptlrpc]
 [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
 [<ffffffffa0c1f6a5>] ptlrpc_main+0x4f5/0x1900 [ptlrpc]
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffffa0c1f1b0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Maloo report: https://maloo.whamcloud.com/test_sets/52ca975a-b9a5-11e0-8bdf-52540025f9af
Attachments
Issue Links
Trackbacks
- Lustre 2.1.0 release testing tracker: Lustre 2.1.0 RC0, Tag: v2100RC0, Created Date: 20110820. The difference between RC0 and RC1 is only a date change in lustre/ChangeLog. Lustre 2.1....
- Lustre 2.1.1 release testing tracker: Lustre 2.1.1 RC2, Tag: v2110RC2, Build:
Old blocker for unsupported version