[LU-4034] Cannot allocate memory on clients with 2.4.X Created: 01/Oct/13  Updated: 21/Jan/22  Resolved: 21/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Marek Magrys Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 3
Labels: None
Environment:

SL6.4, 2.4.1 servers and clients with some patches, which have landed to b2_4 after 2.4.1 freeze.


Issue Links:
Related
Severity: 3
Rank (Obsolete): 10838

 Description   

One of our users noticed a strange problem during metadata operations; it looks like a memory allocation issue:
[root@XXX ~]# ls -l /mnt/lustre/scratch/people/YYYY/SPE.SPIN/050524/28/temp.a438
ls: cannot access /mnt/lustre/scratch/people/YYYY/SPE.SPIN/050524/28/temp.a438: Cannot allocate memory

The client log says:
Oct 1 16:20:11 zeus kernel: LustreError: 11-0: scratch-OST0013-osc-ffff8804925f1400: Communicating with 172.16.126.4@tcp, operation ldlm_enqueue failed with -12.
Oct 1 16:20:11 zeus kernel: LustreError: 23207:0:(cl_lock.c:1420:cl_unuse_try()) result = -12, this is unlikely!

OSS log has:
Oct 1 16:20:11 scratch02 kernel: LustreError: 4630:0:(ldlm_resource.c:1165:ldlm_resource_get()) scratch-OST0013: lvbo_init failed for resource 0x40d9dcf:0x0: rc = -2

Of course, both servers and clients still have plenty of memory available. I looked for similar issues in Jira, but I wasn't able to find a ticket that matches our issue exactly.



 Comments   
Comment by John Hammond [ 01/Oct/13 ]

This reminds me that the error reporting could be improved here. ldlm_resource_get() returned NULL because the resource could not be found (lvbo_init failed with rc = -2). Then ldlm_lock_create() returns NULL, and ldlm_handle_enqueue0() misinterprets that NULL as an allocation failure:

        /* The lock's callback data might be set in the policy function */
        lock = ldlm_lock_create(ns, &dlm_req->lock_desc.l_resource.lr_name,
                                dlm_req->lock_desc.l_resource.lr_type,
                                dlm_req->lock_desc.l_req_mode,
                                cbs, NULL, 0, LVB_T_NONE);
        if (!lock)
                GOTO(out, rc = -ENOMEM);
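
For illustration only, here is a minimal userspace sketch of why the client sees -12 (ENOMEM) while the OSS logs -2 (ENOENT), and of one possible way to preserve the real error. This is not the Lustre code and not the patch tracked elsewhere; every `demo_*` name is hypothetical, and `err_ptr()`, `ptr_err()` and `is_err()` are local stand-ins modelled on the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers (assumption:
 * the top 4095 address values are never valid pointers). */
#define MAX_ERRNO 4095

static inline void *err_ptr(long err)
{
        return (void *)(intptr_t)err;
}

static inline int is_err(const void *p)
{
        return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static inline long ptr_err(const void *p)
{
        assert(is_err(p));      /* only meaningful for encoded errors */
        return (long)(intptr_t)p;
}

/* Hypothetical lock object; only resource id 1 "exists" in this demo. */
struct demo_lock {
        unsigned long res_id;
};

static struct demo_lock demo_table[] = { { 1 } };

/* Pattern from the ticket: every failure collapses to NULL. */
struct demo_lock *demo_lock_create_null(unsigned long res_id)
{
        if (res_id != demo_table[0].res_id)
                return NULL;    /* the real cause (-ENOENT) is lost here */
        return &demo_table[0];
}

int demo_enqueue_null(unsigned long res_id)
{
        struct demo_lock *lock = demo_lock_create_null(res_id);

        if (!lock)
                return -ENOMEM; /* caller can only guess at the cause */
        return 0;
}

/* Alternative: encode the errno in the returned pointer. */
struct demo_lock *demo_lock_create_err(unsigned long res_id)
{
        if (res_id != demo_table[0].res_id)
                return err_ptr(-ENOENT);
        return &demo_table[0];
}

int demo_enqueue_err(unsigned long res_id)
{
        struct demo_lock *lock = demo_lock_create_err(res_id);

        if (is_err(lock))
                return (int)ptr_err(lock);      /* propagate the real cause */
        return 0;
}
```

With a missing resource id such as 0x40d9dcf, `demo_enqueue_null()` returns -ENOMEM while `demo_enqueue_err()` returns -ENOENT, which is the distinction that gets lost in the enqueue path quoted above.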
Comment by Marek Magrys [ 19/Dec/13 ]

Any ideas, anyone? I totally agree with the error reporting improvement, but I guess it's a side problem here. It still looks like some object cannot be located. I've found a bunch of files with the same problem, and I wonder if there is a solution other than taking the fs offline and running lfsck?

Comment by Kit Westneat (Inactive) [ 21/Jan/14 ]

I created LU-4524 for the error code problem, as we have run into that too.

Marek, it looks like the object is missing on the OSS. Basically you need to figure out why the objects are missing. Was there a hard crash or data corruption on the OSS? I would look for something earlier in the logs that might explain where the objects (0x40d9dcf in this case) went.

AFAIK the lfsck in 2.4.1 does not do this kind of cleanup; that is going to be in a later phase of the lfsck rewrite. You should be able to do what the old lfsck would do and unlink the files.

Comment by Alex Kulyavtsev [ 08/Sep/16 ]

I'm getting a similar error when trying to "ls" a file for which object creation failed (Lustre 2.5.3). It looks like the error during "ls" is addressed in LU-4524.
Should this ticket LU-4034 be closed?

Comment by James A Simmons [ 10/Sep/18 ]

Can we close this?

Comment by Lukasz Flis [ 10/Sep/18 ]

We haven't seen this problem at Cyfronet since upgrading to 2.5 and newer releases.

I'd vote for closing it, and Marek will probably confirm.
