[LU-4034] Cannot allocate memory on clients with 2.4.X Created: 01/Oct/13 Updated: 21/Jan/22 Resolved: 21/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marek Magrys | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 3 |
| Labels: | None | ||
| Environment: |
SL6.4, 2.4.1 servers and clients with some patches that landed on b2_4 after the 2.4.1 freeze. |
||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 10838 | ||||
| Description |
|
One of our users noticed a strange problem during metadata operations; it looks like a memory allocation issue. The client log says: OSS log has: Of course, both servers and clients still have plenty of memory available. I've tried to look at similar issues in Jira, however I wasn't able to find a ticket with a 1:1 relation to our issue. |
| Comments |
| Comment by John Hammond [ 01/Oct/13 ] |
|
This reminds me that the error reporting could be improved here. ldlm_resource_get() returned NULL because the resource could not be found. Then ldlm_lock_create() returns NULL. The ldlm_handle_enqueue0() misinterprets the returned NULL as being due to an allocation failure.

        /* The lock's callback data might be set in the policy function */
        lock = ldlm_lock_create(ns, &dlm_req->lock_desc.l_resource.lr_name,
                                dlm_req->lock_desc.l_resource.lr_type,
                                dlm_req->lock_desc.l_req_mode,
                                cbs, NULL, 0, LVB_T_NONE);
        if (!lock)
                GOTO(out, rc = -ENOMEM);
|
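The misreporting described in the comment above can be illustrated with a small userspace sketch. It is not the Lustre code: every demo_* name, the err_ptr()/ptr_err()/is_err() helpers (modelled on the kernel's ERR_PTR() family) and the choice of -ENOENT for a missing resource are assumptions made for illustration only. The idea is that the creation function encodes the reason for the failure in its return value, so the caller no longer has to guess -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error)
{
        return (void *)(intptr_t)error;
}

static inline long ptr_err(const void *ptr)
{
        return (long)(intptr_t)ptr;
}

static inline int is_err(const void *ptr)
{
        /* Error codes occupy the top MAX_ERRNO values of the address space. */
        return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, simplified types; not the real Lustre structures. */
struct demo_resource {
        unsigned long name;
};

struct demo_lock {
        struct demo_resource *res;
};

static struct demo_resource demo_table[] = { { 0x1001 }, { 0x1002 } };

/* Look up a resource by name; NULL means "not found". */
static struct demo_resource *demo_resource_get(unsigned long name)
{
        size_t i;

        for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++)
                if (demo_table[i].name == name)
                        return &demo_table[i];
        return NULL;
}

/* Return a lock, or an encoded errno that says why creation failed. */
static struct demo_lock *demo_lock_create(unsigned long name)
{
        struct demo_resource *res = demo_resource_get(name);
        struct demo_lock *lock;

        if (res == NULL)
                return err_ptr(-ENOENT);   /* missing resource (assumed errno) */

        lock = malloc(sizeof(*lock));
        if (lock == NULL)
                return err_ptr(-ENOMEM);   /* genuine allocation failure */

        lock->res = res;
        return lock;
}

/* The caller propagates the real cause instead of assuming -ENOMEM. */
static int demo_handle_enqueue(unsigned long name)
{
        struct demo_lock *lock = demo_lock_create(name);

        if (is_err(lock))
                return (int)ptr_err(lock);

        free(lock);
        return 0;
}
```

With this shape, a lookup for a name that is not in demo_table surfaces as -ENOENT rather than "Cannot allocate memory", which is the distinction the client-side message in this ticket lost.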
| Comment by Marek Magrys [ 19/Dec/13 ] |
|
Any ideas, anyone? I totally agree on the error reporting improvement idea, but I guess it's a side problem here. It still looks like some object cannot be located. I've found a bunch of files with the same problem, and I wonder if it's possible to find a solution other than taking the fs offline and running lfsck? |
| Comment by Kit Westneat (Inactive) [ 21/Jan/14 ] |
|
Marek, it looks like the object is missing on the OSS. Basically you need to figure out why the objects are missing. Was there a hard crash or data corruption on the OSS? I would look for something earlier in the logs that might explain where the objects (0x40d9dcf in this case) went. AFAIK the lfsck in 2.4.1 does not do this kind of cleanup; that is going to be in a later phase of the lfsck rewrite. You should be able to do what the old lfsck would do and unlink the files. |
| Comment by Alex Kulyavtsev [ 08/Sep/16 ] |
|
I'm getting a similar error when trying to "ls" a file whose object failed to be created (Lustre 2.5.3). It looks like the error during "ls" is addressed at |
| Comment by James A Simmons [ 10/Sep/18 ] |
|
Can we close this? |
| Comment by Lukasz Flis [ 10/Sep/18 ] |
|
We haven't seen this problem at Cyfronet since upgrading to 2.5 and newer releases. I'd vote for closing it, and Marek will probably confirm.
|