[LU-3786] performance-sanity test_6: mkdir hung (sys_mkdir) Created: 20/Aug/13 Updated: 22/Mar/23 Resolved: 21/May/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.4.1, Lustre 2.6.0, Lustre 2.5.1, Lustre 2.5.3, Lustre 2.8.0, Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jian Yu | Assignee: | Nathaniel Clark |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | 22pl, zfs |
| Environment: | Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/33/ |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9793 |
| Description |
|
performance-sanity test_6 hung as follows:

06:15:46:Lustre: DEBUG MARKER: ===== mdsrate-lookup-10dirs.sh Test preparation: creating 10 dirs with 6904 files.
06:15:46:INFO: task mkdir:3833 blocked for more than 120 seconds.
06:15:46:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
06:15:46:mkdir D 0000000000000000 0 3833 3826 0x00000080
06:15:46: ffff88000e523dd8 0000000000000086 ffffffffa10ccf38 0000000000000000
06:15:46: ffff88000e523e48 ffff88007afe7540 ffff88007afe7540 ffff88007afe7540
06:15:46: ffff88007afe7af8 ffff88000e523fd8 000000000000fb88 ffff88007afe7af8
06:15:46:Call Trace:
06:15:46: [<ffffffff811a1c00>] ? mntput_no_expire+0x30/0x110
06:15:46: [<ffffffff8150f3de>] __mutex_lock_slowpath+0x13e/0x180
06:15:46: [<ffffffff8150f27b>] mutex_lock+0x2b/0x50
06:15:46: [<ffffffff8118e930>] lookup_create+0x30/0xd0
06:17:49: [<ffffffff8119221c>] sys_mkdirat+0x7c/0x130
06:17:49: [<ffffffff811a1c00>] ? mntput_no_expire+0x30/0x110
06:17:49: [<ffffffff810dc867>] ? audit_syscall_entry+0x1d7/0x200
06:17:49: [<ffffffff811922e8>] sys_mkdir+0x18/0x20
06:17:49: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
06:17:50:INFO: task mkdir:3835 blocked for more than 120 seconds.

Maloo report: https://maloo.whamcloud.com/test_sets/8358068c-0962-11e3-ad8a-52540035b04c |
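The trace shows mkdir sleeping uninterruptibly while taking the parent directory's i_mutex inside lookup_create(), so the open question is which task is holding that mutex. If the hang reproduces, a minimal triage sketch along the following lines should capture that missing state; the sysctls and the SysRq trigger are standard interfaces on a RHEL6-era kernel, while the PID and the ps/awk filter are only illustrative:

# Show the hung-task watchdog settings that produced the messages above.
sysctl kernel.hung_task_timeout_secs kernel.hung_task_warnings

# Dump stacks of all D-state (uninterruptible) tasks to dmesg on demand,
# including whichever task is currently holding the directory mutex.
echo w > /proc/sysrq-trigger

# Stack of one specific stuck mkdir, e.g. PID 3833 from the trace above.
cat /proc/3833/stack

# List every D-state task with its wait channel for a quick overview.
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'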
| Comments |
| Comment by Jian Yu [ 20/Aug/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/29/
The same failure: https://maloo.whamcloud.com/test_sets/20ed20e6-042a-11e3-a8e9-52540035b04c |
| Comment by Keith Mannthey (Inactive) [ 20/Aug/13 ] |
|
The autotest logs look like MPI (or something else in the test) is not configured correctly?

...
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter file.
Deprecated MCA parameters should be avoided; they may disappear in future releases.

  Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
--------------------------------------------------------------------------
[[49900,1],6]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: wtm-15vm5.rosso.whamcloud.com

Another transport will be used instead, although this may result in lower performance.
--------------------------------------------------------------------------
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
[wtm-15vm6.rosso.whamcloud.com:03820] 9 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[wtm-15vm6.rosso.whamcloud.com:03820] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Also the timestamps don't line up well on the Client:

06:15:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-lookup-10dirs.sh Test preparation: creating 10 dirs with 6904 files.
06:15:46:Lustre: DEBUG MARKER: ===== mdsrate-lookup-10dirs.sh Test preparation: creating 10 dirs with 6904 files.
06:15:46:INFO: task mkdir:3833 blocked for more than 120 seconds.   <========= How has this task run for more than 120 seconds? It just started?
06:15:46:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
06:15:46:mkdir D 0000000000000000 0 3833 3826 0x00000080
06:15:46: ffff88000e523dd8 0000000000000086 ffffffffa10ccf38 0000000000000000
06:15:46: ffff88000e523e48 ffff88007afe7540 ffff88007afe7540 ffff88007afe7540
06:15:46: ffff88007afe7af8 ffff88000e523fd8 000000000000fb88 ffff88007afe7af8
06:15:46:Call Trace:
06:15:46: [<ffffffff811a1c00>] ? mntput_no_expire+0x30/0x110
06:15:46: [<ffffffff8150f3de>] __mutex_lock_slowpath+0x13e/0x180
06:15:46: [<ffffffff8150f27b>] mutex_lock+0x2b/0x50
06:15:46: [<ffffffff8118e930>] lookup_create+0x30/0xd0
06:17:49: [<ffffffff8119221c>] sys_mkdirat+0x7c/0x130   <==== Random 2 min jump |
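If the Open MPI noise is simply because the Rosso VMs have no IB hardware (hence "no RDMA devices found"), it can presumably be quieted by steering MPI away from the openib BTL and switching to the non-deprecated rsh-agent parameter. A sketch only: the --mca names are standard Open MPI 1.x parameters, the host file name is made up, and the mdsrate arguments are left as whatever mdsrate-lookup-10dirs.sh already passes:

# Force the TCP transport so librdmacm/openib are never probed on the RDMA-less VMs,
# and use the current parameter name instead of the deprecated plm_rsh_agent.
mpirun --mca btl tcp,self \
       --mca btl_base_warn_component_unused 0 \
       --mca orte_rsh_agent ssh \
       -np 16 -machinefile hosts.txt \
       mdsrate ...   # existing mdsrate arguments from the test script, unchanged

Whether that is related to the mkdir hang is a separate question, but it would at least remove the MPI warnings from the logs.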
| Comment by Andreas Dilger [ 21/Aug/13 ] |
|
This might be helped by Alex's patch http://review.whamcloud.com/7343. |
| Comment by Andreas Dilger [ 21/Aug/13 ] |
|
Keith, it might depend on where the timestamps are generated (e.g. stuck node or console logger) and/or when the console output is being sent. The "Random 2 min jump" lines up pretty well with "blocked for more than 120 seconds"... |
| Comment by Keith Mannthey (Inactive) [ 21/Aug/13 ] |
|
We are seeing a lot of this on Rosso. I would expect to see the 2 min gap between starting the test and the "blocked for more than 120 seconds" message, not just during the printing of the backtrace. There is an open TT (TT-1615) to look into this issue; sometimes it is more than 2 min... Maybe it is just a weird console timestamp. |
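One note on the timing: if I read the watchdog behaviour right, the "blocked for more than 120 seconds" messages come from khungtaskd, which on the RHEL6 2.6.32 kernels used here only wakes up and scans D-state tasks once every kernel.hung_task_timeout_secs. So both the messages and the console timestamps attached to them arrive in roughly 2-minute bursts rather than at the moment a task first blocks, which would explain the apparent jump without any console-logger weirdness. The relevant tunables are the standard ones:

# Hung-task watchdog tunables (RHEL6-era names):
sysctl kernel.hung_task_timeout_secs    # scan/report interval, 120s by default
sysctl kernel.hung_task_warnings        # number of reports printed before the watchdog goes quiet
sysctl -w kernel.hung_task_panic=1      # optional: panic on the first hung task to force a crash dump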
| Comment by Jian Yu [ 01/Nov/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/46/
The same failure occurred: https://maloo.whamcloud.com/test_sets/b72e57a6-3061-11e3-b28a-52540035b04c |
| Comment by Jian Yu [ 04/Nov/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
FSTYPE=zfs
The same failure occurred: |
| Comment by Bob Glossman (Inactive) [ 04/Nov/13 ] |
|
Seen only on zfs and only on full tests and only on b2_4. |
| Comment by Bob Glossman (Inactive) [ 04/Nov/13 ] |
|
I think "full tests" is a red herring: performance-sanity is probably only run in the full test group, not in the other test groups. As far as I can tell, all instances of both PASS and TIMEOUT are only in full tests. |
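To confirm that, the subtest can be run on its own outside the full group. Assuming the usual Lustre test-framework conventions (ONLY= selects subtests) and the standard lustre-tests install location, something like the sketch below would exercise just test_6 repeatedly on an already-configured test cluster:

# Sketch only: paths and iteration count are illustrative.
cd /usr/lib64/lustre/tests
for i in $(seq 1 10); do
    ONLY=6 bash performance-sanity.sh || break
done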
| Comment by Jian Yu [ 18/Nov/13 ] |
|
The same failure also occurred on master branch with FSTYPE=zfs: |
| Comment by Jian Yu [ 18/Nov/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/50/ The same failure occurred: |
| Comment by Jian Yu [ 27/Nov/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-reviews/19580/
The same failure occurred: https://maloo.whamcloud.com/test_sets/0951f362-4f2f-11e3-9462-52540035b04c

09:08:01:Lustre: DEBUG MARKER: ===== mdsrate-lookup-10dirs.sh Test preparation: creating 10 dirs with 6793 files.
09:08:01:INFO: task mkdir:17358 blocked for more than 120 seconds.
09:08:01:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
09:08:01:mkdir D 0000000000000000 0 17358 17349 0x00000080
09:08:01: ffff88006e74fdd8 0000000000000082 ffffffffa0ab7260 0000000000000000
09:08:01: ffff88006e74fe48 ffff880037bb9540 ffff880037bb9540 ffff880037bb9540
09:08:01: ffff880037bb9af8 ffff88006e74ffd8 000000000000fb88 ffff880037bb9af8
09:08:01:Call Trace:
09:08:01: [<ffffffff811a2100>] ? mntput_no_expire+0x30/0x110
09:08:01: [<ffffffff8150f81e>] __mutex_lock_slowpath+0x13e/0x180
09:08:01: [<ffffffff8150f6bb>] mutex_lock+0x2b/0x50
09:08:01: [<ffffffff8118ebe0>] lookup_create+0x30/0xd0
09:08:01: [<ffffffff811924dc>] sys_mkdirat+0x7c/0x130
09:08:01: [<ffffffff811a2100>] ? mntput_no_expire+0x30/0x110
09:08:01: [<ffffffff810dc937>] ? audit_syscall_entry+0x1d7/0x200
09:08:01: [<ffffffff811925a8>] sys_mkdir+0x18/0x20
09:08:01: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comment by Jian Yu [ 16/Dec/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/67/ The same failure occurred: |
| Comment by Jian Yu [ 06/Jan/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/ The same failure occurred: |
| Comment by Jian Yu [ 10/Mar/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/40/ (2.5.1 RC2) The same failure occurred: |
| Comment by Jian Yu [ 24/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/84/ The same failure occurred: |
| Comment by Sarah Liu [ 17/Jun/15 ] |
|
lustre-master build #3071:
https://testing.hpdd.intel.com/test_sets/925acf00-13e4-11e5-9804-5254006e85c2 |
| Comment by Saurabh Tandan (Inactive) [ 18/Dec/15 ] |
|
Another instance for EL6.7 Server/EL6.7 Client - ZFS |
| Comment by Sarah Liu [ 20/Jan/16 ] |
|
https://testing.hpdd.intel.com/test_sets/9abecea0-bc84-11e5-b3b7-5254006e85c2 |
| Comment by Saurabh Tandan (Inactive) [ 04/Feb/16 ] |
|
Another instance for FULL - EL6.7 Server/EL6.7 Client - ZFS, master, build# 3314. |
| Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ] |
|
Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - ZFS, build# 3314 |
| Comment by Sarah Liu [ 10/Mar/17 ] |
|
Also seen on master tag-2.9.53, RHEL 7.3, zfs. |