Lustre / LU-3889

LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
    • Environment: CentOS 6.4 running fairly recent master.
    • 3
    • 10138

    Description

      This assertion is hit on a CentOS client system running master. It's also been noticed on Cray SLES clients running 2.4.

      This is fairly easy to reproduce on CentOS. I'll be attaching a log of this with debug=-1 set. (I was also running a special debug patch for this bug called rdebug, so you may see some extra output from that.)

      Two things are needed: a reproducer script and memory pressure on the system.

      The reproducer is the following shell script, run in a folder with at least a few thousand files in it. (It may work with a smaller number of files; I'm just describing how I've reproduced it.) This was originally a test for a different bug, so I'm not sure if every step is needed:

      for idx in $(seq 0 10000); do
          time ls -laR > /dev/null
          touch somefile
          rm -f somefiles
          echo $idx: $(date +%T) $(grep MemFree /proc/meminfo)
      done
      


      I used this small piece of C to create the memory pressure. Start the reproducer script above, and then run this as well.

      Hold down enter and watch the test script output as free memory drops. Once you're down to a small amount free, the total amount of free memory will stop dropping. Keep holding down enter to maintain the memory pressure, and the bug will happen after a few moments.

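      #include <stdio.h>
      #include <stdlib.h>   /* malloc() */

      int main(void)
      {
          int i;
          char *junk;

          for (;;) {
              /* Allocate 50 blocks of 1 GiB and touch the first byte of each.
               * Nothing is freed; the leak is what builds memory pressure. */
              for (i = 0; i < 50; i++) {
                  printf("Malloc!\n");
                  junk = malloc(1024 * 1024 * 1024);
                  if (junk == NULL) {
                      /* Stop this round instead of dereferencing NULL. */
                      perror("malloc");
                      break;
                  }
                  junk[0] = i;
              }

              printf("Mallocced 50 GB. Press enter to malloc another 50.\n");
              printf("Note: This seems to use roughly 10 MB of real memory each time.\n");
              getchar();
          }
      }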
      

      Rahul Deshmukh of Xyratex is looking at this with us, and these are his initial thoughts:

      As I understand the code, osc_lock_enqueue() enqueues the lock and does
      not wait for the network communication. After the reply arrives from the
      server, we execute the callback function, osc_lock_upcall(), for the lock
      that was enqueued through osc_lock_enqueue().

      In this case, after a successful enqueue and before we get the reply from
      the server (that is, before the call to osc_lock_upcall()), I see in the
      log that we unused the lock, and hence the LBUG.

      I will investigate more and update accordingly.
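
      The ordering Rahul describes can be illustrated with a toy model. This is only a sketch and not Lustre code: the toy_* names and the struct are hypothetical, and the assumption that unusing a lock drops it back to the NEW state is made purely for illustration (the real transition may differ). It shows why a reply callback that checks state >= QUEUING fails when the lock is unused between the enqueue and the reply.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock states. Only the ordering NEW < QUEUING matters here; it
 * mirrors the comparison in ASSERTION(lock->cll_state >= CLS_QUEUING). */
enum toy_lock_state {
    TOY_CLS_NEW,
    TOY_CLS_QUEUING,
    TOY_CLS_ENQUEUED
};

struct toy_lock {
    enum toy_lock_state state;
    bool reply_pending;   /* enqueue sent, reply not yet processed */
};

/* Models the asynchronous enqueue: send the request, return at once. */
void toy_enqueue(struct toy_lock *lock)
{
    assert(lock->state == TOY_CLS_NEW);
    lock->state = TOY_CLS_QUEUING;
    lock->reply_pending = true;
}

/* Models the lock being unused while the reply is still outstanding.
 * ASSUMPTION: unuse resets the state to NEW (hypothetical). */
void toy_unuse(struct toy_lock *lock)
{
    lock->state = TOY_CLS_NEW;
}

/* Models the reply callback. Returns true when the state check holds,
 * false where the real code would hit the assertion and LBUG. */
bool toy_upcall(struct toy_lock *lock)
{
    lock->reply_pending = false;
    if (lock->state < TOY_CLS_QUEUING)
        return false;
    lock->state = TOY_CLS_ENQUEUED;
    return true;
}

/* Runs one enqueue/reply cycle; unuse_before_reply picks the ordering. */
bool toy_run(bool unuse_before_reply)
{
    struct toy_lock lock = { TOY_CLS_NEW, false };

    toy_enqueue(&lock);
    if (unuse_before_reply)
        toy_unuse(&lock);   /* the problematic interleaving */
    return toy_upcall(&lock);
}
```

      In this model toy_run(false) passes the check, while toy_run(true) fails it, which corresponds to the interleaving seen in the log.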

          Activity

            utopiabound Nathaniel Clark added a comment - Patch 8234 never made it into b2_5 while 7778 with a typo did: http://review.whamcloud.com/10731
            niu Niu Yawei (Inactive) added a comment - for b2_4: http://review.whamcloud.com/9194

            simmonsja James A Simmons added a comment - Also hit this bug for 2.4
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment - Landed for 2.5.1 and 2.6
            pjones Peter Jones made changes -
            Labels Original: 11i HB mq114 New: HB mn4
            bogl Bob Glossman (Inactive) added a comment - in b2_5: http://review.whamcloud.com/8717
            yujian Jian Yu made changes -
            Labels Original: 11i HB New: 11i HB mq114
            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5
            Distro/Arch: RHEL6.4/x86_64
            MDSCOUNT=1

            racer test hit the same failure:
            https://maloo.whamcloud.com/test_sets/c3410662-7362-11e3-8412-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
            Distro/Arch: RHEL6.4/x86_64
            MDSCOUNT=1

            racer test hit the same failure:
            https://maloo.whamcloud.com/test_sets/d15a7052-68ff-11e3-ab68-52540035b04c
            https://maloo.whamcloud.com/test_sets/1cb24ab0-691f-11e3-8dc5-52540035b04c

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: paf Patrick Farrell (Inactive)
              Votes: 1
              Watchers: 26
