Details
Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.1.3
Labels: None
Severity: 3
Rank: 10076
Description
This is a newly deployed filesystem. We were running an initial test rsync from a client to the filesystem when the rsync process hung.
We are also seeing other odd behaviour: poor performance and network errors (example below).
Lustre: 16691:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1347583367/real 1347583367] req@ffff880664530c00 x1413043208599797/t0(0) o8->alicefs-OST0006-osc-ffff880832436000@172.20.2.125@o2ib:28/4 lens 368/512 e 0 to 1 dl 1347583422 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 16691:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: 16691:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1347583517/real 1347583517] req@ffff8808329b1800 x1413043208681668/t0(0) o8->alicefs-OST0006-osc-ffff880832436000@172.20.2.125@o2ib:28/4 lens 368/512 e 0 to 1 dl 1347583572 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 16691:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
INFO: task rsync:16826 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rsync D 0000000000000002 0 16826 1 0x00000080
ffff8803df615978 0000000000000046 0000000000000000 ffff88043343bca0
ffff8808340ecb58 ffff880833270b40 ffff88043343bd38 ffff8803df615938
ffff88043405b058 ffff8803df615fd8 000000000000fb88 ffff88043405b058
Call Trace:
[<ffffffffa04c067e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa0a6c314>] lov_delete_raid0+0x2e4/0x400 [lov]
[<ffffffffa0a6b816>] lov_object_delete+0x66/0x190 [lov]
[<ffffffffa05e2d29>] lu_object_free+0x89/0x1b0 [obdclass]
[<ffffffffa04ce894>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
[<ffffffffa04cf052>] ? cfs_hash_bd_from_key+0x42/0xe0 [libcfs]
[<ffffffffa05e300f>] lu_object_put+0x1bf/0x290 [obdclass]
[<ffffffffa0b4c312>] ? cl_inode_fini+0x52/0x270 [lustre]
[<ffffffffa05e9a1e>] cl_object_put+0xe/0x10 [obdclass]
[<ffffffffa0b4c34a>] cl_inode_fini+0x8a/0x270 [lustre]
[<ffffffffa0c1cbec>] ? lmv_change_cbdata+0x2cc/0x580 [lmv]
[<ffffffffa0b1c42f>] ll_clear_inode+0x2ff/0xdb0 [lustre]
[<ffffffff81195c0c>] clear_inode+0xac/0x140
[<ffffffffa0b1747b>] ll_delete_inode+0x18b/0x650 [lustre]
[<ffffffffa0b172f0>] ? ll_delete_inode+0x0/0x650 [lustre]
[<ffffffff8119630e>] generic_delete_inode+0xde/0x1d0
[<ffffffff81196465>] generic_drop_inode+0x65/0x80
[<ffffffff811952b2>] iput+0x62/0x70
[<ffffffffa0af38d1>] ll_d_iput+0x31/0x80 [lustre]
[<ffffffff81191dfc>] dentry_iput+0x7c/0x100
[<ffffffff81191f71>] d_kill+0x31/0x60
[<ffffffff8119399c>] dput+0x7c/0x150
[<ffffffff8117caf9>] __fput+0x189/0x210
[<ffffffff8117cba5>] fput+0x25/0x30
[<ffffffff811785cd>] filp_close+0x5d/0x90
[<ffffffff8106e3bf>] put_files_struct+0x7f/0xf0
[<ffffffff8106e483>] exit_files+0x53/0x70
[<ffffffff810704fd>] do_exit+0x18d/0x870
[<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
[<ffffffff81070c38>] do_group_exit+0x58/0xd0
[<ffffffff81070cc7>] sys_exit_group+0x17/0x20
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Comments
Doug, I recompiled the Lustre 2.1.3 source against Mellanox OFED 1.5.3-3.1.0. With the server and client sides running the same IB stack, there appear to be no problems.
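For reference, the rebuild was essentially a configure-against-OFED build. The sketch below shows the general shape of it; the kernel and OFED source paths (/usr/src/kernels/$(uname -r), /usr/src/ofa_kernel) are assumptions and will vary with the local install:

  # Rebuild Lustre against the Mellanox OFED kernel sources rather than the
  # in-kernel IB stack (paths are examples; adjust to the local install).
  cd lustre-2.1.3
  ./configure --with-linux=/usr/src/kernels/$(uname -r) \
              --with-o2ib=/usr/src/ofa_kernel
  make rpms
  # Install the resulting lustre/lustre-modules RPMs on the clients and reload the modules.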
I had seen this once before on RHEL 5.5 with Lustre 1.8.4, where the client/cluster side ran OFED, the Lustre server side ran RHEL's in-kernel IB stack, and LNet communication was dysfunctional. As we know, standards aren't really that standard.
I am going to test this on our in-house cluster when I get back to the US. I think there is a lingering incompatibility between RHEL's in-kernel IB stack and clients running OFED. I will pass my findings along to WC.
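In the meantime, a quick sanity check of LNet connectivity between a client and the suspect OSS is an lctl ping against the server NID shown in the error messages above (172.20.2.125@o2ib). This is only a sketch; the client NID below is a placeholder:

  # On the client: show local NIDs, then ping the OSS NID from the errors above
  lctl list_nids
  lctl ping 172.20.2.125@o2ib
  # On the OSS: ping back toward the client's NID to check the reverse path
  lctl ping <client-nid>@o2ib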
Since OFED is so popular on the cluster-node side of the equation, perhaps WC might consider providing an OFED variant of the precompiled Lustre packages. Just a thought.