[LU-2067] ldlm_resource_complain()) Namespace MGC resource refcount nonzero after lock cleanup Created: 01/Oct/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl, shh |
| Environment: | Single node test system (MGS, MDS, OSS, client on one node), x86_64, 2GB RAM, Lustre v2_3_51_0-2-g0810df3 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 4317 |
| Description |
|
During some tests I noticed the following debug messages on the console, and I suspect they are a sign of a resource leak in some code path that should be cleaned up.

LustreError: 22089:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace MGC192.168.20.154@tcp resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 22089:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff880058917200 (126883877578100/0/0/0) (rc: 0)

On my system, the LDLM resource ID is always the same: 126883877578100 = 0x736674736574, which happens to be the ASCII (in reverse byte order) of the Lustre fsname of the filesystem being tested, "testfs". I don't know for sure when the problem started, but it is in my /var/log/messages file as far back as I have records of Lustre testing on this machine, 2012/09/02. The tests that report this include:
Note that in my older runs (2012-09-10) the list of tests is very similar, but not exactly the same. I don't know if this indicates that the failure is due to a race condition (so it only hits on a percentage of tests), or if the leak happens differently in the newer code. |
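To illustrate the resource-ID decoding above: the first component of the LDLM resource name is a 64-bit integer, and on a little-endian x86_64 machine, copying the ASCII bytes of the fsname into such an integer reproduces the logged value exactly. The stand-alone snippet below is only an illustration of the encoding, not the actual Lustre packing code (which lives in the MGC):

    /* Demonstrates why 126883877578100 decodes to "testfs": the fsname
     * bytes, copied into a zeroed 64-bit integer on a little-endian
     * machine, read "in reverse" when the value is printed as hex. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
            uint64_t res_id = 0;

            memcpy(&res_id, "testfs", 6);   /* copy the 6 fsname bytes */
            printf("%llu = 0x%llx\n",
                   (unsigned long long)res_id,
                   (unsigned long long)res_id);
            /* prints: 126883877578100 = 0x736674736574 */
            return 0;
    }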
| Comments |
| Comment by Andreas Dilger [ 01/Oct/12 ] |
|
Added Jinshan since I suspect this is related to IR (Imperative Recovery), and Liang since he added the original checking code. |
| Comment by Jinshan Xiong (Inactive) [ 01/Oct/12 ] |
|
It looks like there is a race somewhere. The first message showed the refcount as 2, and it became 1 (as expected) by the second message. |
| Comment by Prakash Surya (Inactive) [ 28/Nov/12 ] |
|
We see this constantly on our production 2.1-based systems.

# zgrep resource /var/log/lustre.log
Nov 28 03:19:12 cab1173 kernel: LustreError: 68932:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST00a9-osc-ffff880823503800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Nov 28 03:19:12 cab1173 kernel: LustreError: 68932:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88042fefb300 (13829255/0/0/0) (rc: 1)
Nov 28 03:33:49 cab994 kernel: LustreError: 22038:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST0052-osc-ffff880823f84000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Nov 28 03:33:49 cab994 kernel: LustreError: 22038:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff8802e18b3d40 (14157121/0/0/0) (rc: 1)
Nov 28 04:37:50 cab623 kernel: LustreError: 65860:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST0057-osc-ffff88040742d800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Nov 28 04:37:50 cab623 kernel: LustreError: 65860:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88035bf2ce40 (13850406/0/0/0) (rc: 1) |
| Comment by Christopher Morrone [ 15/Mar/13 ] |
|
We need to silence this message. Until the race is fixed, I am redirecting it to D_DLMTRACE in change 5736. |
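For context, "redirecting to D_DLMTRACE" means the message is still emitted, just no longer unconditionally at console error level: it only appears when the DLM-trace bit is enabled in the libcfs debug mask. The stand-alone sketch below mimics that gating with mock CDEBUG()/mask definitions; it is illustrative only, and change 5736 should be consulted for the actual diff:

    /* Mock of debug-mask gating: messages tagged with a mask bit are
     * printed only when that bit is set, which is how downgrading a
     * message from error level to D_DLMTRACE silences the console. */
    #include <stdio.h>

    #define D_ERROR    0x1
    #define D_DLMTRACE 0x2

    static unsigned int debug_mask = D_ERROR; /* default: errors only */

    #define CDEBUG(mask, fmt, ...)                                  \
            do {                                                    \
                    if (debug_mask & (mask))                        \
                            printf(fmt, __VA_ARGS__);               \
            } while (0)

    int main(void)
    {
            int rc = 1;

            /* Before: unconditional console error. */
            CDEBUG(D_ERROR, "refcount nonzero (%d) after cleanup\n", rc);
            /* After: silent unless dlmtrace is enabled... */
            CDEBUG(D_DLMTRACE, "refcount nonzero (%d) after cleanup\n", rc);
            /* ...which an operator can turn on to see it again. */
            debug_mask |= D_DLMTRACE;
            CDEBUG(D_DLMTRACE, "refcount nonzero (%d) after cleanup\n", rc);
            return 0;
    }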
| Comment by Andreas Dilger [ 20/Mar/13 ] |
|
Marking this ALWAYS_EXCEPT, since we aren't really fixing the problem, just silencing the warning message. I do NOT want this bug closed if Chris' patch is landed. |
| Comment by Niu Yawei (Inactive) [ 25/Jun/13 ] |
|
I investigated this a bit when I was working on a related ticket. It is possible that the lock reader/writer count hasn't dropped to zero when ldlm_namespace_cleanup() is called; imagine the following scenario:

■ ldlm_cli_enqueue() is called to create the lock, and increases the lock reader/writer count;
■ the namespace cleanup runs while that enqueue is still in flight, before the reader/writer count has been dropped.

That's why we can see the warning message from ldlm_namespace_cleanup(), though the lock will be released eventually. |
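The scenario above is, in miniature, a reference-count race between an in-flight enqueue and the namespace cleanup. The stand-alone user-space sketch below (not Lustre code; thread and variable names are illustrative) reproduces the timing: the "cleanup" thread can observe a nonzero refcount that the "enqueue" thread drops a moment later, so the complaint fires even though nothing is actually leaked:

    /* Two threads racing on a refcount, mirroring the description:
     * "enqueue" takes a reference before the lock reaches any queue,
     * and "cleanup" complains if it scans while that reference is
     * still held. Whether the message fires depends on timing. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int refcount = 1;          /* base resource reference */

    static void *enqueue(void *arg)
    {
            atomic_fetch_add(&refcount, 1);  /* ldlm_cli_enqueue() analogue */
            usleep(1000);                    /* reply still in flight */
            atomic_fetch_sub(&refcount, 1);  /* released "finally" */
            return NULL;
    }

    static void *cleanup(void *arg)
    {
            /* ldlm_namespace_cleanup() analogue: after cancelling all
             * granted/waiting locks, references should be gone. */
            int rc = atomic_load(&refcount) - 1;  /* ignore base ref */

            if (rc != 0)
                    printf("resource refcount nonzero (%d) after lock cleanup\n", rc);
            return NULL;
    }

    int main(void)
    {
            pthread_t e, c;

            pthread_create(&e, NULL, enqueue, NULL);
            pthread_create(&c, NULL, cleanup, NULL);
            pthread_join(e, NULL);
            pthread_join(c, NULL);
            return 0;
    }

(Compile with -pthread.)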
| Comment by Li Xi (Inactive) [ 09/Jul/14 ] |
|
Hi Yawei, your analysis is really interesting. Is it possible to make a patch to fix that race? ldlm_resource_dump() didn't print any granted/converting/waiting locks, which means the locks holding the resource were still being enqueued, right? |
| Comment by Niu Yawei (Inactive) [ 10/Jul/14 ] |
|
Yes, I suppose the lock hasn't been put onto any list yet. I haven't figured out a good solution so far. |
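In other words, there is a window in which a freshly created lock already holds a resource reference but has not yet been linked onto the resource's granted/converting/waiting queues, so ldlm_resource_dump() prints no locks even though the refcount is nonzero. The stand-alone sketch below models that state; the list-field names mirror the ldlm_resource of that era, but everything else is made up for illustration:

    /* A resource whose lock queues are all empty can still have a
     * nonzero refcount: the reference is taken at lock creation,
     * before the lock is linked onto any queue. */
    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };
    #define LIST_HEAD_INIT(name) { &(name), &(name) }
    #define list_empty(h) ((h)->next == (h))

    struct resource {
            struct list_head lr_granted;    /* granted locks */
            struct list_head lr_converting; /* converting locks */
            struct list_head lr_waiting;    /* waiting locks */
            int              lr_refcount;
    };

    int main(void)
    {
            struct resource res = {
                    .lr_granted    = LIST_HEAD_INIT(res.lr_granted),
                    .lr_converting = LIST_HEAD_INIT(res.lr_converting),
                    .lr_waiting    = LIST_HEAD_INIT(res.lr_waiting),
            };

            res.lr_refcount++;      /* lock creation takes a reference */

            /* A dump at this point shows a nonzero refcount but three
             * empty queues, matching the log output discussed above. */
            printf("rc: %d, granted/converting/waiting empty: %d/%d/%d\n",
                   res.lr_refcount,
                   list_empty(&res.lr_granted),
                   list_empty(&res.lr_converting),
                   list_empty(&res.lr_waiting));
            return 0;
    }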