[LU-4034] Cannot allocate memory on clients with 2.4.X

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.4.1
    • None
    • SL6.4, 2.4.1 servers and clients with some patches, which have landed to b2_4 after 2.4.1 freeze.
    • 3
    • 10838

    Description

      One of our users noticed a strange problem during metadata operations. It looks like a memory allocation issue:
      [root@XXX ~]# ls -l /mnt/lustre/scratch/people/YYYY/SPE.SPIN/050524/28/temp.a438
      ls: cannot access /mnt/lustre/scratch/people/YYYY/SPE.SPIN/050524/28/temp.a438: Cannot allocate memory

      The client log says:
      Oct 1 16:20:11 zeus kernel: LustreError: 11-0: scratch-OST0013-osc-ffff8804925f1400: Communicating with 172.16.126.4@tcp, operation ldlm_enqueue failed with -12.
      Oct 1 16:20:11 zeus kernel: LustreError: 23207:0:(cl_lock.c:1420:cl_unuse_try()) result = -12, this is unlikely!

      OSS log has:
      Oct 1 16:20:11 scratch02 kernel: LustreError: 4630:0:(ldlm_resource.c:1165:ldlm_resource_get()) scratch-OST0013: lvbo_init failed for resource 0x40d9dcf:0x0: rc = -2

      Of course, both servers and clients still have plenty of memory available. I've looked for similar issues in Jira, but I wasn't able to find a ticket that matches our issue exactly.


        Activity

          lflis Lukasz Flis added a comment -

          We haven't seen this problem since upgrading to 2.5 and newer releases at Cyfronet.

          I'd vote for closing it, and Marek will probably confirm.

          simmonsja James A Simmons added a comment -

          Can we close this?

          alex.ku Alex Kulyavtsev added a comment -

          I'm getting a similar error when trying to "ls" a file for which object creation failed (Lustre 2.5.3). It looks like the error returned during "ls" is addressed in LU-4524.
          Should this ticket, LU-4034, be closed?

          kitwestneat Kit Westneat (Inactive) added a comment -

          I created LU-4524 for the error code problem, as we have run into that too.

          Marek, it looks like the object is missing on the OSS. Basically, you need to figure out why the objects are missing. Was there a hard crash or data corruption on the OSS? I would look for something earlier in the logs that might explain where the objects (0x40d9dcf in this case) went.

          AFAIK the lfsck in 2.4.1 does not do this kind of cleanup; that is going to be in a later phase of the lfsck rewrite. You should be able to do what the old lfsck would do and unlink the files.
          m.magrys Marek Magrys added a comment -

          Any ideas, anyone? I totally agree with the error reporting improvement idea, but I guess it's a side problem here. It still looks like some objects cannot be located. I've found a bunch of files with the same problem, and I wonder whether there is a solution other than taking the fs offline and running lfsck?
          jhammond John Hammond added a comment -

          This reminds me that the error reporting could be improved here. ldlm_resource_get() returned NULL because the resource could not be found. Then ldlm_lock_create() returns NULL, and ldlm_handle_enqueue0() misinterprets the returned NULL as being due to an allocation failure:

                  /* The lock's callback data might be set in the policy function */
                  lock = ldlm_lock_create(ns, &dlm_req->lock_desc.l_resource.lr_name,
                                          dlm_req->lock_desc.l_resource.lr_type,
                                          dlm_req->lock_desc.l_req_mode,
                                          cbs, NULL, 0, LVB_T_NONE);
                  if (!lock)
                          GOTO(out, rc = -ENOMEM);
          

          People

            wc-triage WC Triage
            m.magrys Marek Magrys
            Votes: 3
            Watchers: 13
