[LU-3786] performance-sanity test_6: mkdir hung (sys_mkdir) Created: 20/Aug/13  Updated: 22/Mar/23  Resolved: 21/May/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.6.0, Lustre 2.5.1, Lustre 2.5.3, Lustre 2.8.0, Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Jian Yu Assignee: Nathaniel Clark
Resolution: Cannot Reproduce Votes: 0
Labels: 22pl, zfs
Environment:

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/33/
FSTYPE=zfs


Issue Links:
Related
is related to LU-16658 performance sanity test_6 mdsrate-loo... Resolved
is related to LU-3753 MDS operation rate dropping sporadically Resolved
is related to LU-4427 performance-sanity test_6: mdsrate hu... Resolved
Severity: 3
Rank (Obsolete): 9793

 Description   

performance-sanity test_6 hung as follows:

06:15:46:Lustre: DEBUG MARKER: ===== mdsrate-lookup-10dirs.sh Test preparation: creating 10 dirs with 6904 files.
06:15:46:INFO: task mkdir:3833 blocked for more than 120 seconds.
06:15:46:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
06:15:46:mkdir         D 0000000000000000     0  3833   3826 0x00000080
06:15:46: ffff88000e523dd8 0000000000000086 ffffffffa10ccf38 0000000000000000
06:15:46: ffff88000e523e48 ffff88007afe7540 ffff88007afe7540 ffff88007afe7540
06:15:46: ffff88007afe7af8 ffff88000e523fd8 000000000000fb88 ffff88007afe7af8
06:15:46:Call Trace:
06:15:46: [<ffffffff811a1c00>] ? mntput_no_expire+0x30/0x110
06:15:46: [<ffffffff8150f3de>] __mutex_lock_slowpath+0x13e/0x180
06:15:46: [<ffffffff8150f27b>] mutex_lock+0x2b/0x50
06:15:46: [<ffffffff8118e930>] lookup_create+0x30/0xd0
06:17:49: [<ffffffff8119221c>] sys_mkdirat+0x7c/0x130
06:17:49: [<ffffffff811a1c00>] ? mntput_no_expire+0x30/0x110
06:17:49: [<ffffffff810dc867>] ? audit_syscall_entry+0x1d7/0x200
06:17:49: [<ffffffff811922e8>] sys_mkdir+0x18/0x20
06:17:49: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
06:17:50:INFO: task mkdir:3835 blocked for more than 120 seconds.

Maloo report: https://maloo.whamcloud.com/test_sets/8358068c-0962-11e3-ad8a-52540035b04c
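
When this reproduces, a bit more context can be collected from the hung client than the watchdog message alone provides. A minimal sketch, assuming root/console access to the client node; these are generic debugging steps, not something autotest currently runs:

# Confirm the hung-task watchdog timeout (the message above implies the 120s default)
cat /proc/sys/kernel/hung_task_timeout_secs

# Dump stacks of all uninterruptible (D state) tasks to the console, to see
# which thread holds the parent directory mutex that mkdir is waiting on
echo w > /proc/sysrq-trigger

# Capture the Lustre kernel debug log from the client for the same window
lctl dk /tmp/lustre-debug.log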



 Comments   
Comment by Jian Yu [ 20/Aug/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/29/
FSTYPE=zfs

The same failure: https://maloo.whamcloud.com/test_sets/20ed20e6-042a-11e3-a8e9-52540035b04c

Comment by Keith Mannthey (Inactive) [ 20/Aug/13 ]

The autotest logs look as if MPI, or something else in the test setup, is not configured correctly:

...
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter
file.  Deprecated MCA parameters should be avoided; they may disappear
in future releases.

  Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
--------------------------------------------------------------------------
[[49900,1],6]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: wtm-15vm5.rosso.whamcloud.com

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
[wtm-15vm6.rosso.whamcloud.com:03820] 9 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[wtm-15vm6.rosso.whamcloud.com:03820] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
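
If the MPI setup is a factor, the warnings above can at least be cleaned up on these VMs, which have no IB hardware. A rough sketch, assuming mdsrate is launched via Open MPI's mpirun; the MCA parameter file locations below are the usual defaults, not necessarily what autotest uses:

# Find where the deprecated plm_rsh_agent setting comes from (in newer Open MPI
# releases the replacement is, I believe, orte_rsh_agent)
grep plm_rsh_agent /etc/openmpi-mca-params.conf ~/.openmpi/mca-params.conf 2>/dev/null

# Restrict the BTLs to TCP and shared memory so openib is never probed,
# avoiding the "librdmacm: Fatal: no RDMA devices found" noise
mpirun --mca btl tcp,sm,self ... mdsrate ...

# As the log itself suggests, show every help/error message instead of the
# aggregated summary
mpirun --mca orte_base_help_aggregate 0 ...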

Also, the timestamps don't line up well on the client:

06:15:46:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ===== mdsrate-lookup-10dirs.sh Test preparation: creating 10 dirs with 6904 files.
06:15:46:Lustre: DEBUG MARKER: ===== mdsrate-lookup-10dirs.sh Test preparation: creating 10 dirs with 6904 files.
06:15:46:INFO: task mkdir:3833 blocked for more than 120 seconds.                       <=========  How has this task run for more than 120 seconds?  It just started?
06:15:46:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
06:15:46:mkdir         D 0000000000000000     0  3833   3826 0x00000080
06:15:46: ffff88000e523dd8 0000000000000086 ffffffffa10ccf38 0000000000000000
06:15:46: ffff88000e523e48 ffff88007afe7540 ffff88007afe7540 ffff88007afe7540
06:15:46: ffff88007afe7af8 ffff88000e523fd8 000000000000fb88 ffff88007afe7af8
06:15:46:Call Trace:
06:15:46: [<ffffffff811a1c00>] ? mntput_no_expire+0x30/0x110
06:15:46: [<ffffffff8150f3de>] __mutex_lock_slowpath+0x13e/0x180
06:15:46: [<ffffffff8150f27b>] mutex_lock+0x2b/0x50
06:15:46: [<ffffffff8118e930>] lookup_create+0x30/0xd0
06:17:49: [<ffffffff8119221c>] sys_mkdirat+0x7c/0x130    <==== Random 2 min jump
Comment by Andreas Dilger [ 21/Aug/13 ]

This might be helped by Alex's patch http://review.whamcloud.com/7343.

Comment by Andreas Dilger [ 21/Aug/13 ]

Keith, it might depend on where the timestamps are generated (e.g. stuck node or console logger) and/or when the console output is being sent. The "Random 2 min jump" lines up pretty well with "blocked for more than 120 seconds"...
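
For what it's worth, khungtaskd only reports a task after it has sat in the uninterruptible state for the full timeout, so a roughly 2-minute offset between the test-start marker and the first INFO line is what I'd expect. The relevant knobs on a RHEL6 client (a sketch of where to look, not values we tune in autotest):

# Seconds a task must stay in D state before it is reported (120 here)
cat /proc/sys/kernel/hung_task_timeout_secs

# Number of reports emitted before the watchdog stops warning
cat /proc/sys/kernel/hung_task_warnings

# Whether to panic instead of just warning (0 = warn only)
cat /proc/sys/kernel/hung_task_panic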

Comment by Keith Mannthey (Inactive) [ 21/Aug/13 ]

We are seeing a lot of this on Rosso. I would expect to see the 2 min gap between starting the test and the "blocked for more than 120 seconds" message, not just during the printing of the backtrace. There is an open TT (TT-1615) to look into this issue; sometimes it is more than 2 min. Maybe it is just a weird console timestamp.

Comment by Jian Yu [ 01/Nov/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/46/
FSTYPE=zfs

The same failure occurred: https://maloo.whamcloud.com/test_sets/b72e57a6-3061-11e3-b28a-52540035b04c

Comment by Jian Yu [ 04/Nov/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
Distro/Arch: RHEL6.4/x86_64

FSTYPE=zfs
MDSCOUNT=1
OSTCOUNT=2

The same failure occurred:
https://maloo.whamcloud.com/test_sets/fcce24ee-4452-11e3-8472-52540035b04c

Comment by Bob Glossman (Inactive) [ 04/Nov/13 ]

Seen only on zfs, only in full tests, and only on b2_4.
Only 5 instances seen in total.

Comment by Bob Glossman (Inactive) [ 04/Nov/13 ]

I think "full tests" is a red herring. I suspect performance-sanity is only run in full tests, not in other groups of tests. As far as I can tell, all instances of both PASS and TIMEOUT are only in full tests.

Comment by Jian Yu [ 18/Nov/13 ]

The same failure also occurred on master branch with FSTYPE=zfs:
https://maloo.whamcloud.com/test_sets/0951f362-4f2f-11e3-9462-52540035b04c

Comment by Jian Yu [ 18/Nov/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/50/
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

The same failure occurred:
https://maloo.whamcloud.com/test_sets/7bb7ce82-4c9f-11e3-8ab0-52540035b04c

Comment by Jian Yu [ 27/Nov/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-reviews/19580/
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs
MDSCOUNT=4

The same failure occurred: https://maloo.whamcloud.com/test_sets/0951f362-4f2f-11e3-9462-52540035b04c
Console log on Client 2 (wtm-25vm6):

09:08:01:Lustre: DEBUG MARKER: ===== mdsrate-lookup-10dirs.sh Test preparation: creating 10 dirs with 6793 files.
09:08:01:INFO: task mkdir:17358 blocked for more than 120 seconds.
09:08:01:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
09:08:01:mkdir         D 0000000000000000     0 17358  17349 0x00000080
09:08:01: ffff88006e74fdd8 0000000000000082 ffffffffa0ab7260 0000000000000000
09:08:01: ffff88006e74fe48 ffff880037bb9540 ffff880037bb9540 ffff880037bb9540
09:08:01: ffff880037bb9af8 ffff88006e74ffd8 000000000000fb88 ffff880037bb9af8
09:08:01:Call Trace:
09:08:01: [<ffffffff811a2100>] ? mntput_no_expire+0x30/0x110
09:08:01: [<ffffffff8150f81e>] __mutex_lock_slowpath+0x13e/0x180
09:08:01: [<ffffffff8150f6bb>] mutex_lock+0x2b/0x50
09:08:01: [<ffffffff8118ebe0>] lookup_create+0x30/0xd0
09:08:01: [<ffffffff811924dc>] sys_mkdirat+0x7c/0x130
09:08:01: [<ffffffff811a2100>] ? mntput_no_expire+0x30/0x110
09:08:01: [<ffffffff810dc937>] ? audit_syscall_entry+0x1d7/0x200
09:08:01: [<ffffffff811925a8>] sys_mkdir+0x18/0x20
09:08:01: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Comment by Jian Yu [ 16/Dec/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/67/
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

The same failure occurred:
https://maloo.whamcloud.com/test_sets/9489aa9a-65d9-11e3-9f05-52540035b04c

Comment by Jian Yu [ 06/Jan/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
FSTYPE=zfs

The same failure occurred:
https://maloo.whamcloud.com/test_sets/56a48888-76b6-11e3-8c14-52540035b04c

Comment by Jian Yu [ 10/Mar/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/40/ (2.5.1 RC2)
FSTYPE=zfs

The same failure occurred:
https://maloo.whamcloud.com/test_sets/77cf0a5a-a84e-11e3-98cc-52540035b04c

Comment by Jian Yu [ 24/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/84/
Distro/Arch: RHEL6.5/x86_64
FSTYPE=zfs

The same failure occurred:
https://testing.hpdd.intel.com/test_sets/d0672d9c-2b24-11e4-bb80-5254006e85c2

Comment by Sarah Liu [ 17/Jun/15 ]

lustre-master build #3071

https://testing.hpdd.intel.com/test_sets/925acf00-13e4-11e5-9804-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 18/Dec/15 ]

Another instance for EL6.7 Server/EL6.7 Client - ZFS
Master, build# 3270
https://testing.hpdd.intel.com/test_sets/99d7b9ee-a275-11e5-bdef-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9f5fa944-a275-11e5-bdef-5254006e85c2

Comment by Sarah Liu [ 20/Jan/16 ]

https://testing.hpdd.intel.com/test_sets/9abecea0-bc84-11e5-b3b7-5254006e85c2
lustre-master build# 3305 RHEL6.7 zfs

Comment by Saurabh Tandan (Inactive) [ 04/Feb/16 ]

Another instance for FULL - EL6.7 Server/EL6.7 Client - ZFS, master, build# 3314.
https://testing.hpdd.intel.com/test_sets/a60b5630-cb47-11e5-a59a-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - ZFS, build# 3314
https://testing.hpdd.intel.com/test_sets/a60b5630-cb47-11e5-a59a-5254006e85c2

Comment by Sarah Liu [ 10/Mar/17 ]

also seen on master tag-2.9.53 RHEL7.3 zfs
https://testing.hpdd.intel.com/test_sets/18a65a20-f8f9-11e6-aac4-5254006e85c2

Generated at Sat Feb 10 01:36:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.