[LU-4034] Cannot allocate memory on clients with 2.4.X - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.4.1
Labels:
None
Environment:
SL6.4, 2.4.1 servers and clients with some patches that landed on b2_4 after the 2.4.1 freeze.

Severity:
3
Rank (Obsolete):
10838

Description

One of our users noticed a strange problem during metadata operations; it looks like a memory allocation issue:
[root@XXX ~]# ls -l /mnt/lustre/scratch/people/YYYY/SPE.SPIN/050524/28/temp.a438
ls: cannot access /mnt/lustre/scratch/people/YYYY/SPE.SPIN/050524/28/temp.a438: Cannot allocate memory

The client log says:
Oct 1 16:20:11 zeus kernel: LustreError: 11-0: scratch-OST0013-osc-ffff8804925f1400: Communicating with 172.16.126.4@tcp, operation ldlm_enqueue failed with -12.
Oct 1 16:20:11 zeus kernel: LustreError: 23207:0:(cl_lock.c:1420:cl_unuse_try()) result = -12, this is unlikely!

OSS log has:
Oct 1 16:20:11 scratch02 kernel: LustreError: 4630:0:(ldlm_resource.c:1165:ldlm_resource_get()) scratch-OST0013: lvbo_init failed for resource 0x40d9dcf:0x0: rc = -2

Of course, both servers and clients still have plenty of memory available. I've looked for similar issues in Jira, but I wasn't able to find a ticket that matches ours exactly.

Activity

Lukasz Flis added a comment - 10/Sep/18 7:39 PM

We haven't seen this problem since upgrading to 2.5 and newer releases in Cyfronet.

I'd vote for closing it and Marek will probably confirm

James A Simmons added a comment - 10/Sep/18 4:34 PM

Can we close this?

Alex Kulyavtsev added a comment - 08/Sep/16 2:50 AM

I'm getting a similar error when trying to "ls" a file whose object failed to be created (Lustre 2.5.3). It looks like the error during "ls" is addressed in ~~LU-4524~~.
Should this ticket, ~~LU-4034~~, be closed?

Kit Westneat (Inactive) added a comment - 21/Jan/14 11:05 PM

I created ~~LU-4524~~ for the error code problem, as we have run into that too.

Marek, it looks like the object is missing on the OSS. Basically you need to figure out why the objects are missing. Was there a hard crash or data corruption on the OSS? I would look for something earlier in the logs that might explain where the objects (0x40d9dcf in this case) went.

AFAIK the lfsck in 2.4.1 does not do this kind of cleanup; that is going to be in a later phase of the lfsck rewrite. You should be able to do what the old lfsck would do and unlink the files.

Marek Magrys added a comment - 19/Dec/13 5:51 PM

Any ideas, anyone? I totally agree with the error reporting improvement idea, but I guess it's a side problem here. It still looks like some object cannot be located. I've found a bunch of files with the same problem, and I wonder if it's possible to find a solution other than taking the fs offline and running lfsck?

John Hammond added a comment - 01/Oct/13 2:31 PM

This reminds me that the error reporting could be improved here. ldlm_resource_get() returned NULL because the resource could not be found (lvbo_init failed with rc = -2 in the OSS log). Then ldlm_lock_create() returns NULL, and ldlm_handle_enqueue0() misinterprets the returned NULL as being due to an allocation failure.

        /* The lock's callback data might be set in the policy function */
        lock = ldlm_lock_create(ns, &dlm_req->lock_desc.l_resource.lr_name,
                                dlm_req->lock_desc.l_resource.lr_type,
                                dlm_req->lock_desc.l_req_mode,
                                cbs, NULL, 0, LVB_T_NONE);
        if (!lock)
                GOTO(out, rc = -ENOMEM);
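
To illustrate the point above, here is a minimal user-space sketch of how a NULL return collapses every failure into -ENOMEM, while an error-pointer convention keeps the original code. It is not Lustre code and not the actual fix: `resource_get`, `lock_create_null`, `lock_create_errptr`, `handle_enqueue_null`, `handle_enqueue_errptr`, and the `err_ptr`/`ptr_err`/`is_err` helpers are all hypothetical names, loosely modeled on the kernel's ERR_PTR idiom.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; these are not Lustre's types. */
struct resource { unsigned long id; };
struct lock { struct resource *res; };

/* User-space imitation of the kernel error-pointer idiom (assumed helpers). */
#define MAX_ERRNO 4095
static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static long ptr_err(const void *p) { return (long)(intptr_t)p; }
static int is_err(const void *p)
{
        return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Fails with -ENOENT when the backing object is missing. */
static struct resource *resource_get(unsigned long id, int object_exists)
{
        struct resource *res;

        if (!object_exists)
                return err_ptr(-ENOENT);
        res = malloc(sizeof(*res));
        if (!res)
                return err_ptr(-ENOMEM);
        res->id = id;
        return res;
}

/* Variant that passes the real error code up to the caller. */
static struct lock *lock_create_errptr(unsigned long id, int object_exists)
{
        struct resource *res = resource_get(id, object_exists);
        struct lock *lock;

        if (is_err(res))
                return err_ptr(ptr_err(res));
        lock = malloc(sizeof(*lock));
        if (!lock) {
                free(res);
                return err_ptr(-ENOMEM);
        }
        lock->res = res;
        return lock;
}

/* Variant that returns NULL on any failure, losing the reason. */
static struct lock *lock_create_null(unsigned long id, int object_exists)
{
        struct lock *lock = lock_create_errptr(id, object_exists);

        return is_err(lock) ? NULL : lock;
}

static void lock_destroy(struct lock *lock)
{
        free(lock->res);
        free(lock);
}

/* The caller can only guess at the cause, so it reports -ENOMEM. */
static int handle_enqueue_null(unsigned long id, int object_exists)
{
        struct lock *lock = lock_create_null(id, object_exists);

        if (!lock)
                return -ENOMEM;
        lock_destroy(lock);
        return 0;
}

/* The caller reports whatever the lower layer actually returned. */
static int handle_enqueue_errptr(unsigned long id, int object_exists)
{
        struct lock *lock = lock_create_errptr(id, object_exists);

        if (is_err(lock))
                return (int)ptr_err(lock);
        lock_destroy(lock);
        return 0;
}
```

In this sketch, calling `handle_enqueue_null(0x40d9dcf, 0)` for a missing object yields -ENOMEM, which corresponds to the "Cannot allocate memory" seen on the client, whereas `handle_enqueue_errptr(0x40d9dcf, 0)` yields -ENOENT, matching the rc = -2 in the OSS log.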

People

Assignee: WC Triage

Reporter: Marek Magrys

Votes: 3

Watchers: 13

Dates

Created: 01/Oct/13 2:22 PM

Updated: 21/Jan/22 1:25 AM

Resolved: 21/Jan/22 1:25 AM