Description
During some tests I noticed the following debug message on the console. I suspect it is a sign of a resource leak in some code path that should be cleaned up.
LustreError: 22089:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace MGC192.168.20.154@tcp resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 22089:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff880058917200 (126883877578100/0/0/0) (rc: 0)
On my system, the LDLM resource ID is always the same - 126883877578100 = 0x736674736574, which happens to be the ASCII (in reverse order) for the Lustre fsname of the filesystem being tested, "testfs".
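The decoding above can be checked with a short sketch. It assumes the resource ID is simply the fsname packed into a 64-bit integer with the first character in the least significant byte, which is consistent with the value seen here. The helper name is hypothetical and is not a Lustre API.

```python
def decode_resource_name(res_id: int) -> str:
    """Interpret a 64-bit resource ID as ASCII packed low byte first (assumed layout)."""
    raw = res_id.to_bytes(8, byteorder="little")
    # Unused high bytes are zero; strip them before decoding.
    return raw.rstrip(b"\x00").decode("ascii")


res_id = 126883877578100
print(hex(res_id))                   # 0x736674736574
print(decode_resource_name(res_id))  # testfs
```

Reading the hex digits left to right gives "sftset"; reading the bytes from the low end gives "testfs".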
I don't know when the problem started for sure, but it is in my /var/log/messages file as far back as I have records of Lustre testing on this machine, 2012/09/02.
The tests that report this include:
- replay-single:
  - test_0c
  - test_10
  - test_13
  - test_14
  - test_15
  - test_17
  - test_19
  - test_22
  - test_24
  - test_28
  - test_53b
  - test_59
- replay-dual:
  - test_5
  - test_6
  - test_9
- insanity:
  - test_0
Note that in my older runs (2012-09-10) the list of tests is very similar, but not exactly the same. I don't know if this indicates that the failure is due to a race condition (so it only hits on a percentage of tests), or if the leak happens differently in the newer code.
Issue Links
- is related to
  - LU-8792 Interop - master<->2.8 :sanity-hsm test_107: hung while umount MDT (Closed)
I investigated this a bit when I was working on LU-3460. It looks like locks can still have a reader/writer when ldlm_namespace_cleanup() is called. The following is the comment from LU-3460:
It is possible that the lock reader/writer count has not dropped to zero when ldlm_namespace_cleanup() is called. Imagine the following scenario:
- ldlm_cli_enqueue() is called to create the lock, and increases the lock reader/writer count;
- before the enqueue request is added to imp_sending_list or imp_delayed_list, shutdown happens;
- the shutdown procedure aborts in-flight RPCs, but the enqueue request can't be aborted since it is on neither the sending list nor the delayed list;
- the shutdown procedure moves on to obd_import_event(IMP_EVENT_ACTIVE)->ldlm_namespace_cleanup() to clean up all locks;
- ldlm_namespace_cleanup() finds that the lock just created still has 1 reader/writer, because the interpret callback for this lock enqueue (where the reader/writer is dropped) hasn't been called yet.
That's why we can see the warning message from ldlm_namespace_cleanup(), though the lock will eventually be released.
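The ordering in the scenario above can be illustrated with a toy, single-threaded model. This is only a sketch of the sequence of events, not Lustre code: the class names, the "MGC" namespace label, and the single sending list are all hypothetical simplifications, and the real reference counting is more involved.

```python
class ToyLock:
    def __init__(self):
        self.users = 0  # stands in for the lock's reader/writer count


class ToyNamespace:
    def __init__(self, name):
        self.name = name
        self.locks = []
        self.sending_list = []  # requests that shutdown is able to abort
        self.warnings = []

    def enqueue_begin(self):
        """Create the lock and take a reference; the request is not queued yet."""
        lock = ToyLock()
        lock.users += 1
        self.locks.append(lock)
        return lock

    def queue_request(self, lock):
        """Put the enqueue request where shutdown can find it."""
        self.sending_list.append(lock)

    def interpret(self, lock):
        """Interpret callback: the only place the reference is dropped."""
        lock.users -= 1

    def abort_inflight(self):
        """Abort queued requests, which runs their interpret callbacks."""
        for lock in self.sending_list:
            self.interpret(lock)
        self.sending_list.clear()

    def cleanup(self):
        """Record a complaint for every lock that still has users."""
        for lock in self.locks:
            if lock.users != 0:
                self.warnings.append(
                    f"{self.name}: lock still has {lock.users} reader/writer")


# Racy ordering: shutdown arrives before the request is queued.
ns = ToyNamespace("MGC")
lock = ns.enqueue_begin()
ns.abort_inflight()  # nothing to abort, the request never reached the list
ns.cleanup()
print(ns.warnings)   # one complaint
ns.interpret(lock)   # the late callback still drops the reference
print(lock.users)    # 0

# Normal ordering: the request is queued, so the abort drops the reference.
ok = ToyNamespace("MGC")
ok.queue_request(ok.enqueue_begin())
ok.abort_inflight()
ok.cleanup()
print(ok.warnings)   # []
```

In this model the complaint appears only when cleanup runs in the window between taking the reference and queueing the request, which would also fit the message showing up on only some test runs.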