[LU-2067] ldlm_resource_complain()) Namespace MGC resource refcount nonzero after lock cleanup - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels:
- llnl
- shh
Environment:
Single node test system (MGS, MDS, OSS, client on one node), x86_64, 2GB RAM, Lustre v2_3_51_0-2-g0810df3

Severity:
3
Rank (Obsolete):
4317

Description

During some tests I notice the following debug message on the console, and I suspect it is a sign of a resource leak in some code path that should be cleaned up.

LustreError: 22089:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace MGC192.168.20.154@tcp resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 22089:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff880058917200 (126883877578100/0/0/0) (rc: 0)

On my system, the LDLM resource ID is always the same - 126883877578100 = 0x736674736574, which happens to be the ASCII (in reverse order) for the Lustre fsname of the filesystem being tested, "testfs".

I don't know when the problem started for sure, but it is in my /var/log/messages file as far back as I have records of Lustre testing on this machine, 2012/09/02.

The tests that report this include:

replay-single:
- test_0c
- test_10
- test_13
- test_14
- test_15
- test_17
- test_19
- test_22
- test_24
- test_28
- test_53b
- test_59
replay-dual
- test_5
- test_6
- test_9
insanity
- test_0

Note that in my older runs (2012-09-10) the list of tests is very similar, but not exactly the same. I don't know if this indicates that the failure is due to a race condition (so it only hits on a percentage of tests), or if the leak happens differently in the newer code.

Attachments

Issue Links

is related to

LU-8792 Interop - master<->2.8 :sanity-hsm test_107: hung while umount MDT

Closed

Activity

[LU-2067] ldlm_resource_complain()) Namespace MGC resource refcount nonzero after lock cleanup

Niu Yawei (Inactive) added a comment - 25/Jun/13 5:33 AM

I investigated this a bit when I was working on ~~LU-3460~~, looks it's possible that locks have reader/writer when ldlm_namespace_cleanup() is called. following is comment from ~~LU-3460~~:

It is possible that lock reader/writer isn't dropped to zero when ldlm_namespace_cleanup() is called, imagine following scenario:

■ldlm_cli_enqueue() is called to create the lock, and increased lock reader/writer;
■before the enqueue request is added in imp_sending_list or imp_delayed_list, shutdown happened;
■shutdown procedure aborted inflight RPCs, but the enqueue request can't be aborted since it's neither on sending list nor delayed list;
■shutdown procedure moving on to obd_import_event(IMP_EVENT_ACTIVE)->ldlm_namespace_cleanup() to cleanup all locks;
■ldlm_namespace_cleanup() found that the lock just created still has 1 reader/writer, because the interpret callback for this lock enqueue hasn't been called yet (where the reader/writer is dropped;

That's why we can see the warnning message from ldlm_namespace_cleanup(), though the lock will be released finally.

Niu Yawei (Inactive) added a comment - 25/Jun/13 5:33 AM I investigated this a bit when I was working on LU-3460 , looks it's possible that locks have reader/writer when ldlm_namespace_cleanup() is called. following is comment from LU-3460 : It is possible that lock reader/writer isn't dropped to zero when ldlm_namespace_cleanup() is called, imagine following scenario: ■ldlm_cli_enqueue() is called to create the lock, and increased lock reader/writer; ■before the enqueue request is added in imp_sending_list or imp_delayed_list, shutdown happened; ■shutdown procedure aborted inflight RPCs, but the enqueue request can't be aborted since it's neither on sending list nor delayed list; ■shutdown procedure moving on to obd_import_event(IMP_EVENT_ACTIVE)->ldlm_namespace_cleanup() to cleanup all locks; ■ldlm_namespace_cleanup() found that the lock just created still has 1 reader/writer, because the interpret callback for this lock enqueue hasn't been called yet (where the reader/writer is dropped; That's why we can see the warnning message from ldlm_namespace_cleanup(), though the lock will be released finally.

Andreas Dilger added a comment - 20/Mar/13 8:25 PM

Marking this ALWAYS_EXCEPT, since we aren't really fixing the problem, just silencing the warning message. I do NOT want this bug closed if Chris' patch is landed.

Andreas Dilger added a comment - 20/Mar/13 8:25 PM Marking this ALWAYS_EXCEPT, since we aren't really fixing the problem, just silencing the warning message. I do NOT want this bug closed if Chris' patch is landed.

Christopher Morrone (Inactive) added a comment - 15/Mar/13 7:05 PM

We need to silence this message. Until the race is fixed, I am redirecting it to D_DLMTRACE in change 5736.

Christopher Morrone (Inactive) added a comment - 15/Mar/13 7:05 PM We need to silence this message. Until the race is fixed, I am redirecting it to D_DLMTRACE in change 5736 .

Prakash Surya (Inactive) added a comment - 28/Nov/12 6:26 PM

We see this constantly on our production 2.1 based systems.

# zgrep resource /var/log/lustre.log                                            
Nov 28 03:19:12 cab1173 kernel: LustreError: 68932:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST00a9-osc-ffff880823503800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Nov 28 03:19:12 cab1173 kernel: LustreError: 68932:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88042fefb300 (13829255/0/0/0) (rc: 1)
Nov 28 03:33:49 cab994 kernel: LustreError: 22038:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST0052-osc-ffff880823f84000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Nov 28 03:33:49 cab994 kernel: LustreError: 22038:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff8802e18b3d40 (14157121/0/0/0) (rc: 1)
Nov 28 04:37:50 cab623 kernel: LustreError: 65860:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST0057-osc-ffff88040742d800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Nov 28 04:37:50 cab623 kernel: LustreError: 65860:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88035bf2ce40 (13850406/0/0/0) (rc: 1)

Prakash Surya (Inactive) added a comment - 28/Nov/12 6:26 PM We see this constantly on our production 2.1 based systems. # zgrep resource /var/log/lustre.log Nov 28 03:19:12 cab1173 kernel: LustreError: 68932:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST00a9-osc-ffff880823503800 resource refcount nonzero (1) after lock cleanup; forcing cleanup. Nov 28 03:19:12 cab1173 kernel: LustreError: 68932:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88042fefb300 (13829255/0/0/0) (rc: 1) Nov 28 03:33:49 cab994 kernel: LustreError: 22038:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST0052-osc-ffff880823f84000 resource refcount nonzero (1) after lock cleanup; forcing cleanup. Nov 28 03:33:49 cab994 kernel: LustreError: 22038:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff8802e18b3d40 (14157121/0/0/0) (rc: 1) Nov 28 04:37:50 cab623 kernel: LustreError: 65860:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lsc-OST0057-osc-ffff88040742d800 resource refcount nonzero (1) after lock cleanup; forcing cleanup. Nov 28 04:37:50 cab623 kernel: LustreError: 65860:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff88035bf2ce40 (13850406/0/0/0) (rc: 1)

Jinshan Xiong (Inactive) added a comment - 01/Oct/12 6:15 PM

it looks like this is race somewhere. The first message showed the refcount is 2 and it became to 1(expected) in the 2nd message.

Jinshan Xiong (Inactive) added a comment - 01/Oct/12 6:15 PM it looks like this is race somewhere. The first message showed the refcount is 2 and it became to 1(expected) in the 2nd message.

Andreas Dilger added a comment - 01/Oct/12 4:06 PM

Added Jinshan since I suspect this is related to IR, and Liang since he added the original checking code.

Andreas Dilger added a comment - 01/Oct/12 4:06 PM Added Jinshan since I suspect this is related to IR, and Liang since he added the original checking code.

People

Assignee:: WC Triage

Reporter:: Andreas Dilger

Votes:: 0 Vote for this issue

Watchers:: 9 Start watching this issue

Dates

Created:: 01/Oct/12 4:05 PM

Updated:: 29/May/17 6:22 AM

Resolved:: 29/May/17 6:22 AM