[LU-6941] sanity test_209: open/close requests are not freed Created: 02/Aug/15  Updated: 08/Apr/20  Resolved: 08/Apr/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Cannot Reproduce Votes: 0
Labels: p4hc

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/81d99640-38ea-11e5-8dec-5254006e85c2.

The sub-test test_209 failed with the following error:

open/close requests are not freed

Looks like LU-4270, but that is marked Closed.

Info required for matching: sanity 209



 Comments   
Comment by Bob Glossman (Inactive) [ 09/Aug/15 ]

more on master:
https://testing.hpdd.intel.com/test_sets/72f0624c-3e05-11e5-8300-5254006e85c2
https://testing.hpdd.intel.com/test_sets/450d6c92-3e39-11e5-a492-5254006e85c2

Comment by Jian Yu [ 11/Aug/15 ]

By searching on Maloo, I found that since 2015-02-13 all of the failure instances have occurred on the IB network (nodes onyx-[64-67]-ib).

Comment by Peter Jones [ 11/Aug/15 ]

Hongchao

Could you please look into this issue as a priority?

Thanks

Peter

Comment by Hongchao Zhang [ 12/Aug/15 ]

As per the logs, only about 130 ptlrpc requests (LDLM_ENQUEUE) were sent to the MDT to open the file, while about 470 LDLM_CANCEL ptlrpc requests were sent to the MDT, so the test script should be improved to account for this.

Comment by Jian Yu [ 16/Aug/15 ]

Another failure instance on the master branch:
https://testing.hpdd.intel.com/test_sets/2e32e524-428a-11e5-9941-5254006e85c2

Comment by Gerrit Updater [ 17/Aug/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16001
Subject: LU-6941 mdc: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e7004c63c85041923082a95ece5e55a6931ba557

Comment by James Nunez (Inactive) [ 18/Aug/15 ]

Another failure on master:
2015-08-17 14:31:06 - review-zfs-part-1 - https://testing.hpdd.intel.com/test_sets/23814aac-4524-11e5-a64b-5254006e85c2

Comment by James Nunez (Inactive) [ 10/Sep/15 ]

Another on master, review-zfs at
2015-09-10 01:57:56 - https://testing.hpdd.intel.com/test_sets/7d8651c8-57b7-11e5-a084-5254006e85c2

Comment by Joseph Gmitter (Inactive) [ 21/Sep/15 ]

Failure today on master:
https://testing.hpdd.intel.com/test_sessions/0a6e7516-6050-11e5-b0a3-5254006e85c2

Comment by Hongchao Zhang [ 25/Sep/15 ]

Status update:
The debug patch http://review.whamcloud.com/#/c/16001/ ran "sanity" 3 times without hitting this problem.
I have retriggered it to get more test runs.

Comment by Bruno Faccini (Inactive) [ 22/Oct/15 ]

+1 on master at https://testing.hpdd.intel.com/test_sets/c3b65a8e-7888-11e5-a98c-5254006e85c2

Comment by James Nunez (Inactive) [ 22/Oct/15 ]

Another on master at https://testing.hpdd.intel.com/test_sets/1f89e264-7829-11e5-9072-5254006e85c2

Comment by Sarah Liu [ 10/Dec/15 ]

Hit this after upgrading the system from 2.5.5 RHEL6.6 to master/3264 RHEL6.7:

onyx-28: == sanity test 209: read-only open/close requests should be freed promptly == 18:35:49 (1449714949)
onyx-28: before: 39, after: 360
onyx-28:  sanity test_209: @@@@@@ FAIL: open/close requests are not freed 
onyx-28:   Trace dump:
onyx-28:   = /usr/lib64/lustre/tests/test-framework.sh:4822:error_noexit()
onyx-28:   = /usr/lib64/lustre/tests/test-framework.sh:4853:error()
onyx-28:   = /usr/lib64/lustre/tests/sanity.sh:11943:test_209()
onyx-28:   = /usr/lib64/lustre/tests/test-framework.sh:5100:run_one()
onyx-28:   = /usr/lib64/lustre/tests/test-framework.sh:5137:run_one_logged()
onyx-28:   = /usr/lib64/lustre/tests/test-framework.sh:5002:run_test()
onyx-28:   = /usr/lib64/lustre/tests/sanity.sh:11946:main()
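
For reading logs like the one above, the "before"/"after" counters can be compared with a small helper. This is only a sketch: the function names slab_growth and check_freed are hypothetical, and THRESHOLD=300 is an assumed limit chosen for illustration; the actual limit used by test_209 is not shown in this ticket.

```shell
#!/bin/sh
# Hypothetical helper, not part of sanity.sh or test-framework.sh.
# Extracts N and M from a log line containing "before: N, after: M"
# and prints the growth M - N.
slab_growth() {
    local line="$1"
    local before after
    before=$(printf '%s\n' "$line" | sed -n 's/.*before: \([0-9][0-9]*\),.*/\1/p')
    after=$(printf '%s\n' "$line" | sed -n 's/.*after: \([0-9][0-9]*\).*/\1/p')
    # Fail if the line does not carry both counters.
    [ -n "$before" ] && [ -n "$after" ] || return 1
    echo $((after - before))
}

# Assumed threshold for illustration only.
THRESHOLD=300

# Reports whether the growth stays under the assumed threshold.
check_freed() {
    local growth
    growth=$(slab_growth "$1") || return 2
    if [ "$growth" -ge "$THRESHOLD" ]; then
        echo "FAIL: growth $growth >= $THRESHOLD"
        return 1
    fi
    echo "PASS: growth $growth < $THRESHOLD"
}
```

With the line from the log above, slab_growth 'onyx-28: before: 39, after: 360' prints 321, which check_freed would report as a failure under the assumed threshold.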

Comment by Hongchao Zhang [ 01/Jul/16 ]

Status update:
No new occurrences since Jan 12, 2016.

Comment by Gerrit Updater [ 29/Sep/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/22808
Subject: LU-6941 test: use multiop to trigger open/close
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3b718c8cf201f94d16344ae11886fa58bda9214e

Comment by Hongchao Zhang [ 29/Sep/16 ]

Investigation of the recent occurrences suggests that this issue could be triggered by the ptlrpc_request history cache of the "ldlm_cbd" service.
The corresponding patch is tracked at http://review.whamcloud.com/22808

Comment by Hongchao Zhang [ 25/Apr/18 ]

The patch has been rebased onto the latest master.

Comment by Andreas Dilger [ 08/Apr/20 ]

Close old issue that has not been seen in a long time.
