
LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
    • Environment: CentOS 6.4 running a fairly recent master.
    • Severity: 3
    • 10138

    Description

      This assertion is hit on a CentOS client system running master. It's also been noticed on Cray SLES clients running 2.4.

      This is fairly easy to reproduce on CentOS. I'll be attaching a log of this with debug=-1 set. (I was also running a special debug patch for this bug called rdebug, so you may see some extra output from that.)

      Two things are needed: a reproducer script, and memory pressure on the system.

      The reproducer is the following shell script, run in a directory with at least a few thousand files in it. (This was originally a test for a different bug, so I'm not sure if every step is needed; it may also work with a smaller number of files, I'm just describing how I've reproduced it.)
      —

      for idx in $(seq 0 10000); do
          time ls -laR > /dev/null
          touch somefile
          rm -f somefile
          echo $idx: $(date +%T) $(grep MemFree /proc/meminfo)
      done
      

      —
      I used the small piece of C below to create the memory pressure. Start the reproducer script above, and then run this as well.

      Hold down Enter and watch the test script output as free memory drops. Once you're down to a small amount free, the total amount of free memory will stop dropping. Keep holding down Enter to maintain the memory pressure, and the bug will happen after a few moments.

      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      int main()
      {
          int i;
          char* junk;
      
      start: i = 0;
      
          while(i < 50) { 
              printf("Malloc!\n"); 
              /* Allocate 1 GB of address space; touch the first byte so at
               * least one page is actually faulted in. */
              junk = malloc(1024*1024*1024); 
              if (junk != NULL)
                  junk[0] = i; 
              i++; 
          }
      
          printf("Mallocced 50 GB. Press enter to malloc another 50.\n");
          printf("Note: This seems to use roughly 10 MB of real memory each time.\n");
          getchar();
          goto start;
      }
      

      Rahul Deshmukh of Xyratex is looking at this with us, and these are his initial thoughts:
      As per my understanding of the code, the osc_lock_enqueue() function enqueues the
      lock and does not wait for the network communication. After the reply from the
      server, we execute the callback function, i.e. osc_lock_upcall(), for the lock
      enqueued through osc_lock_enqueue().

      In this case, after the successful enqueue and before we get the reply from the
      server (i.e. before the call to osc_lock_upcall()), I see in the log that the lock
      is put back into an unused state, and hence the LBUG.

      I will investigate more and update accordingly.
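
      For illustration only, here is a minimal userspace sketch of the suspected sequence. This is not Lustre code: the state names are borrowed from the assertion and the cl_lock_state enum, the RPC round trip is faked with a sleep, and the real reference counting and locking are omitted. It only shows how an asynchronous upcall can observe a lock that has already dropped back below CLS_QUEUING.

      /* race_sketch.c - build with: gcc -pthread race_sketch.c */
      #include <assert.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      /* Simplified stand-in for the client lock state machine. */
      enum cl_lock_state { CLS_NEW, CLS_QUEUING, CLS_ENQUEUED, CLS_HELD };

      struct fake_lock {
          pthread_mutex_t mutex;
          enum cl_lock_state state;
      };

      /* Simulated server reply: the enqueue upcall runs and expects the
       * lock to still be at least CLS_QUEUING. */
      static void *upcall_thread(void *arg)
      {
          struct fake_lock *lock = arg;

          sleep(1);                             /* simulated network latency */
          pthread_mutex_lock(&lock->mutex);
          printf("upcall sees state %d\n", lock->state);
          assert(lock->state >= CLS_QUEUING);   /* the LBUG condition */
          lock->state = CLS_ENQUEUED;
          pthread_mutex_unlock(&lock->mutex);
          return NULL;
      }

      int main(void)
      {
          struct fake_lock lock = { PTHREAD_MUTEX_INITIALIZER, CLS_NEW };
          pthread_t tid;

          /* "osc_lock_enqueue": mark the lock queuing, fire the request
           * asynchronously, and return without waiting for the reply. */
          lock.state = CLS_QUEUING;
          pthread_create(&tid, NULL, upcall_thread, &lock);

          /* Meanwhile, the memory-pressure path releases ("unuses") the
           * lock before the reply arrives, dropping it below CLS_QUEUING. */
          pthread_mutex_lock(&lock.mutex);
          lock.state = CLS_NEW;
          pthread_mutex_unlock(&lock.mutex);

          pthread_join(tid, NULL);              /* the upcall now trips the assert */
          return 0;
      }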

      Attachments

        Issue Links

          Activity

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5
            Distro/Arch: RHEL6.4/x86_64
            MDSCOUNT=1

            racer test hit the same failure:
            https://maloo.whamcloud.com/test_sets/c3410662-7362-11e3-8412-52540035b04c

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
            Distro/Arch: RHEL6.4/x86_64
            MDSCOUNT=1

            racer test hit the same failure:
            https://maloo.whamcloud.com/test_sets/d15a7052-68ff-11e3-ab68-52540035b04c
            https://maloo.whamcloud.com/test_sets/1cb24ab0-691f-11e3-8dc5-52540035b04c
            yujian Jian Yu added a comment - edited

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
            MDSCOUNT=4

            racer test hit the same failure:
            https://maloo.whamcloud.com/test_sets/06c7507e-6875-11e3-a9a3-52540035b04c

            parallel-scale-nfsv4 test iorssf hit the same failure:
            https://maloo.whamcloud.com/test_sets/d1809a06-6879-11e3-a9a3-52540035b04c

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Boyko, I merged your patch into patch 8405, please take a look and thank you for your work.

            Jinshan

            aboyko Alexander Boyko added a comment -

            I have added a regression test for this issue: http://review.whamcloud.com/8463. I think it should be included in the patch, but the test is based on the ASSERT at osc_lock_upcall().

            paf Patrick Farrell (Inactive) added a comment -

            Unfortunately, our test run failed due to unrelated reasons. We're going to do another run later this week; I'll update with results as I have them.

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Patrick, it's fine to take the OSC changes out.

            paf Patrick Farrell (Inactive) added a comment -

            We did about 3 hours of testing on the version of the patch without the OSC changes. No problems were seen.

            We have a 24-hour general test run scheduled on one of our systems for this weekend. We're currently planning to test the version without the OSC changes, as suggested by Shadow, but if a new patch is generated, I could change the test run to test that instead.

            paf Patrick Farrell (Inactive) added a comment -

            Jinshan,

            We tested your version of the patch yesterday on a system which had network problems that were causing EBUSY over and over. While testing on that system, we hit the assertion I described in this comment: https://jira.hpdd.intel.com/browse/LU-3889?focusedCommentId=72356&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-72356

            This morning, we tested for four hours with your version of the patch on a different system (without network problems), and have had no problems.

            We are now testing the version of the patch without the OSC changes. We've been testing for about an hour and have had no problems yet. We're going to test for a few more hours today; I will let you know if we see anything.

            jay Jinshan Xiong (Inactive) added a comment -

            Jinshan - We just tried for about 4 hours on ~20 clients with your full version of the patch and didn't hit any problems at all.

            We're going to try without the OSC changes next.

            Sorry, I'm confused: with which version did you not see the problem, and which version caused the assertion?

            shadow Alexey Lyashkov added a comment -

            Jay,

            can you explain why you introduced OBD_FAIL_LOCK_STATE_WAIT_INTR when no test uses it?

            People

              Assignee:
              jay Jinshan Xiong (Inactive)
              Reporter:
              paf Patrick Farrell (Inactive)
              Votes:
              1
              Watchers:
              26

              Dates

                Created:
                Updated:
                Resolved: