
LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
    • Environment: CentOS 6.4 running a fairly recent master.
    • Severity: 3
    • 10138

    Description

      This assertion is hit on a CentOS client system running master. It's also been noticed on Cray SLES clients running 2.4.

      This is fairly easy to reproduce on CentOS. I'll be attaching a log of this with debug=-1 set. (I was also running a special debug patch for this bug called rdebug, so you may see some extra output from that.)

      Two things are needed: a reproducer script, and memory pressure on the system.

      The reproducer is the following shell script, run in a directory with at least a few thousand files in it. (This was originally a test for a different bug, so I'm not sure if every step is needed; it may also work with a smaller number of files, I'm just describing how I've reproduced it.)
      —

      for idx in $(seq 0 10000); do
          time ls -laR > /dev/null
          touch somefile
          rm -f somefile
          echo $idx: $(date +%T) $(grep MemFree /proc/meminfo)
      done
      

      —
      I used the small piece of C below to create the memory pressure. Start the reproducer script above, and then run this as well.

      Hold down Enter and watch the test script output as free memory drops. Once you're down to a small amount free, the total amount of free memory will stop dropping. Keep holding down Enter to maintain the memory pressure, and the bug will happen after a few moments.

      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      int main()
      {
          int i;
          char* junk;
      
      start: i = 0;
      
          while(i < 50) { 
              printf("Malloc!\n"); 
              /* Allocate 1 GB of address space; touch the first byte so at
               * least one page is actually faulted in. */
              junk = malloc(1024*1024*1024); 
              if (junk != NULL)
                  junk[0] = i; 
              i++; 
          }
      
          printf("Mallocced 50 GB. Press enter to malloc another 50.\n");
          printf("Note: This seems to use roughly 10 MB of real memory each time.\n");
          getchar();
          goto start;
      }
      

      Rahul Deshmukh of Xyratex is looking at this with us, and these are his initial thoughts:
      As per my understanding of the code, the osc_lock_enqueue() function enqueues the
      lock and does not wait for the network communication. After the reply from the
      server, we execute the callback function, i.e. osc_lock_upcall(), for the lock
      enqueued through osc_lock_enqueue().

      In this case, after the successful enqueue and before we get the reply from the
      server (i.e. before the call to osc_lock_upcall()), I see in the log that the lock
      is put back into an unused state, and hence the LBUG.

      I will investigate more and update accordingly.
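
      For illustration only, here is a minimal userspace sketch of the suspected sequence. This is not Lustre code: the state names are borrowed from the assertion and the cl_lock_state enum, the RPC round trip is faked with a sleep, and the real reference counting and locking are omitted. It only shows how an asynchronous upcall can observe a lock that has already dropped back below CLS_QUEUING.

      /* race_sketch.c - build with: gcc -pthread race_sketch.c */
      #include <assert.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      /* Simplified stand-in for the client lock state machine. */
      enum cl_lock_state { CLS_NEW, CLS_QUEUING, CLS_ENQUEUED, CLS_HELD };

      struct fake_lock {
          pthread_mutex_t mutex;
          enum cl_lock_state state;
      };

      /* Simulated server reply: the enqueue upcall runs and expects the
       * lock to still be at least CLS_QUEUING. */
      static void *upcall_thread(void *arg)
      {
          struct fake_lock *lock = arg;

          sleep(1);                             /* simulated network latency */
          pthread_mutex_lock(&lock->mutex);
          printf("upcall sees state %d\n", lock->state);
          assert(lock->state >= CLS_QUEUING);   /* the LBUG condition */
          lock->state = CLS_ENQUEUED;
          pthread_mutex_unlock(&lock->mutex);
          return NULL;
      }

      int main(void)
      {
          struct fake_lock lock = { PTHREAD_MUTEX_INITIALIZER, CLS_NEW };
          pthread_t tid;

          /* "osc_lock_enqueue": mark the lock queuing, fire the request
           * asynchronously, and return without waiting for the reply. */
          lock.state = CLS_QUEUING;
          pthread_create(&tid, NULL, upcall_thread, &lock);

          /* Meanwhile, the memory-pressure path releases ("unuses") the
           * lock before the reply arrives, dropping it below CLS_QUEUING. */
          pthread_mutex_lock(&lock.mutex);
          lock.state = CLS_NEW;
          pthread_mutex_unlock(&lock.mutex);

          pthread_join(tid, NULL);              /* the upcall now trips the assert */
          return 0;
      }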

      Attachments

        Issue Links

          Activity

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5
            Distro/Arch: RHEL6.4/x86_64
            MDSCOUNT=1

            racer test hit the same failure:
            https://maloo.whamcloud.com/test_sets/c3410662-7362-11e3-8412-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
            Distro/Arch: RHEL6.4/x86_64
            MDSCOUNT=1

            racer test hit the same failure:
            https://maloo.whamcloud.com/test_sets/d15a7052-68ff-11e3-ab68-52540035b04c
            https://maloo.whamcloud.com/test_sets/1cb24ab0-691f-11e3-8dc5-52540035b04c
            yujian Jian Yu added a comment - edited

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
            MDSCOUNT=4

            racer test hit the same failure:
            https://maloo.whamcloud.com/test_sets/06c7507e-6875-11e3-a9a3-52540035b04c

            parallel-scale-nfsv4 test iorssf hit the same failure:
            https://maloo.whamcloud.com/test_sets/d1809a06-6879-11e3-a9a3-52540035b04c

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Boyko, I merged your patch into patch 8405, please take a look and thank you for your work.

            Jinshan

            aboyko Alexander Boyko added a comment -

            I have added a regression test for this issue: http://review.whamcloud.com/8463. I think it should be included in the patch, but the test is based on the ASSERT at osc_lock_upcall().

            paf Patrick Farrell (Inactive) added a comment -

            Unfortunately, our test run failed due to unrelated reasons. We're going to do another run later this week; I'll update with results as I have them.

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Patrick, it's fine to take the OSC changes out.

            paf Patrick Farrell (Inactive) added a comment -

            We did about 3 hours of testing on the version of the patch without the OSC changes. No problems were seen.

            We have a 24-hour general test run scheduled on one of our systems for this weekend. We're currently planning to test the version without the OSC changes, as suggested by Shadow, but if a new patch is generated, I could change the test run to test that instead.

            paf Patrick Farrell (Inactive) added a comment -

            Jinshan,

            We tested your version of the patch yesterday on a system which had network problems that were causing EBUSY over and over. While testing on that system, we hit the assertion I described in this comment: https://jira.hpdd.intel.com/browse/LU-3889?focusedCommentId=72356&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-72356

            This morning, we tested for four hours with your version of the patch on a different system (without network problems), and have had no problems.

            We are now testing the version of the patch without the OSC changes. We've been testing for about an hour and have had no problems yet. We're going to test for a few more hours today; I will let you know if we see anything.

            jay Jinshan Xiong (Inactive) added a comment -

            Jinshan - We just tried for about 4 hours on ~20 clients with your full version of the patch and didn't hit any problems at all.

            We're going to try without the OSC changes next.

            Sorry, I'm confused: with which version did you not see the problem, and which version caused the assertion?

            shadow Alexey Lyashkov added a comment -

            Jay,

            can you explain why you introduced OBD_FAIL_LOCK_STATE_WAIT_INTR when no test uses it?

            People

              Assignee:
              jay Jinshan Xiong (Inactive)
              Reporter:
              paf Patrick Farrell (Inactive)
              Votes:
              1
              Watchers:
              26

              Dates

                Created:
                Updated:
                Resolved: