  1. Lustre
  2. LU-3889

LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING )


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
    • Environment: CentOS 6.4 running fairly recent master.
    • 3
    • 10138

    Description

      This assertion is hit on a CentOS client system running master. It's also been noticed on Cray SLES clients running 2.4.

      This is fairly easy to reproduce on CentOS. I'll be attaching a log of this with debug=-1 set. (I was also running a special debug patch for this bug called rdebug, so you may see some extra output from that.)

      Two things are needed: A reproducer script, and memory pressure on the system.

      The reproducer is the following shell script. It was originally a test for a different bug, so I'm not sure every step is needed. Run it in a directory containing at least a few thousand files. (It may work with fewer files; I'm just describing how I've reproduced it.)

      for idx in $(seq 0 10000); do
          time ls -laR > /dev/null
          touch somefile
          rm -f somefiles
          echo $idx: $(date +%T) $(grep MemFree /proc/meminfo)
      done
      


      I used this small piece of C to create the memory pressure. Start the reproducer script above, then run this as well.

      Hold down Enter and watch the test script output as free memory drops. Once only a small amount is free, the free-memory figure stops dropping. Keep holding down Enter to maintain the memory pressure, and the bug will occur after a few moments.

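      #include <stdio.h>
      #include <stdlib.h>   /* malloc() */
      
      int main(void)
      {
          int i;
          char *junk;
      
          for (;;) {
              /* Allocate 50 x 1 GB; write one byte of each so a page is touched. */
              for (i = 0; i < 50; i++) {
                  printf("Malloc!\n");
                  junk = malloc((size_t)1024 * 1024 * 1024);
                  if (junk == NULL) {
                      perror("malloc");
                      return 1;
                  }
                  junk[0] = i;   /* deliberately never freed */
              }
      
              printf("Mallocced 50 GB. Press enter to malloc another 50.\n");
              printf("Note: This seems to use roughly 10 MB of real memory each time.\n");
              getchar();
          }
      }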
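
      The reproducer script prints the MemFree line from /proc/meminfo on every pass. As a minimal sketch, the same figure could be read from C, for example to log it alongside the allocations. The function names parse_memfree_kb() and read_memfree_kb() are hypothetical helpers, not part of the original test program.

```c
#include <stdio.h>

/* Hypothetical helper: parse one /proc/meminfo line.
 * Returns the MemFree value in kB, or -1 if the line is another field. */
long parse_memfree_kb(const char *line)
{
    long kb;

    if (sscanf(line, "MemFree: %ld kB", &kb) == 1)
        return kb;
    return -1;
}

/* Hypothetical helper: scan a meminfo-style file for the MemFree line.
 * Returns the value in kB, or -1 if the file cannot be read or has no
 * MemFree line. */
long read_memfree_kb(const char *path)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        kb = parse_memfree_kb(line);
        if (kb >= 0)
            break;
    }
    fclose(f);
    return kb;
}
```

      Calling read_memfree_kb("/proc/meminfo") after each batch of allocations would give the number the script shows via grep.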
      

      Rahul Deshmukh of Xyratex is looking at this with us, and these are his initial thoughts:
      As per my understanding of the code, osc_lock_enqueue() enqueues the lock
      and does not wait for the network communication to complete. After the reply
      arrives from the server, the callback registered for the enqueue by
      osc_lock_enqueue(), i.e. osc_lock_upcall(), is executed.

      In this case, after a successful enqueue and before we get the reply from
      the server (i.e. before the call to osc_lock_upcall()), I see in the log
      that we unused the lock, and hence the LBUG.

      I will investigate more and update accordingly.
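
      The ordering described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Lustre code: the toy_* names and the three-value state enum are hypothetical, and the assumption that unusing a lock moves it back to a state below "queuing" is inferred from the assertion text rather than taken from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states, ordered so "at least queuing" is a >= comparison. */
enum toy_state { TOY_NEW, TOY_QUEUING, TOY_ENQUEUED };

struct toy_lock {
    enum toy_state state;
    bool reply_pending;   /* enqueue sent, server reply not yet seen */
};

/* Asynchronous enqueue: marks the lock as queuing and returns at once. */
void toy_enqueue(struct toy_lock *l)
{
    assert(l != NULL);
    l->state = TOY_QUEUING;
    l->reply_pending = true;
}

/* Assumption: unusing the lock resets it to an earlier state. */
void toy_unuse(struct toy_lock *l)
{
    l->state = TOY_NEW;
}

/* Reply callback: returns false where the real code would hit the
 * ASSERTION( lock->cll_state >= CLS_QUEUING ) check. */
bool toy_upcall(struct toy_lock *l)
{
    l->reply_pending = false;
    if (l->state < TOY_QUEUING)
        return false;
    l->state = TOY_ENQUEUED;
    return true;
}
```

      In this model, toy_enqueue() followed directly by toy_upcall() succeeds, while toy_enqueue(), toy_unuse(), toy_upcall() returns false, which corresponds to the sequence seen in the log.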

            People

              jay Jinshan Xiong (Inactive)
              paf Patrick Farrell (Inactive)