Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
-
CentOS 6.4 running fairly recent master.
-
3
-
10138
Description
This assertion is hit on a CentOS client system running master. It's also been noticed on Cray SLES clients running 2.4.
This is fairly easy to reproduce on CentOS. I'll be attaching a log of this with debug=-1 set. (I was also running a special debug patch for this bug called rdebug, so you may see some extra output from that.)
Two things are needed: A reproducer script, and memory pressure on the system.
The reproducer is the following shell script, run in a directory containing at least a few thousand files. (It may work with a smaller number of files; I'm just describing how I've reproduced it.) This was originally a test for a different bug, so I'm not sure if every step is needed:
—
for idx in $(seq 0 10000); do
    time ls -laR > /dev/null
    touch somefile
    rm -f somefiles
    echo $idx: $(date +%T) $(grep MemFree /proc/meminfo)
done
—
I used this small piece of C to create the memory pressure. Start the reproducer script above, then run this alongside it.
Hold down enter and watch the free memory reported by the test script drop. Once only a small amount is free, the amount of free memory will stop dropping. Keep holding down enter to maintain the memory pressure, and the bug will happen after a few moments.
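#include <stdio.h>
#include <stdlib.h>   /* malloc() */
#include <unistd.h>

int main()
{
    int i;
    char *junk;

start:
    i = 0;
    while (i < 50) {
        printf("Malloc!\n");
        /* Allocate 1 GB and never free it; this is deliberate. */
        junk = malloc(1024*1024*1024);
        /* Touch the first byte so at least one page is really backed. */
        if (junk != NULL)
            junk[0] = i;
        i++;
    }
    printf("Mallocced 50 GB. Press enter to malloc another 50.\n");
    printf("Note: This seems to use roughly 10 MB of real memory each time.\n");
    getchar();
    goto start;
}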
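Free memory can also be read programmatically rather than watched in the script output. The following is a minimal sketch, assuming the usual "MemFree: <n> kB" line format in /proc/meminfo. The helper names parse_memfree_kb() and read_memfree_kb() are hypothetical and not part of the reproducer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse one /proc/meminfo line.
 * Returns the MemFree value in kB, or -1 if the line is not a MemFree line. */
long parse_memfree_kb(const char *line)
{
    long kb;

    assert(line != NULL);
    if (sscanf(line, "MemFree: %ld", &kb) == 1)
        return kb;
    return -1;
}

/* Hypothetical helper: scan /proc/meminfo for the MemFree line.
 * Returns the value in kB, or -1 if the file or the line is unavailable. */
long read_memfree_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/meminfo", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        kb = parse_memfree_kb(line);
        if (kb >= 0)
            break;
    }
    fclose(f);
    return kb;
}
```

The memory-pressure program could call read_memfree_kb() after each batch of allocations and print the result, instead of relying on the grep in the shell loop.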
Rahul Deshmukh of Xyratex is looking at this with us, and these are his initial thoughts:
As per my understanding of the code, the osc_lock_enqueue() function enqueues the
lock and does not wait for network communication. After the reply from the server, we
execute the callback function, i.e. osc_lock_upcall(), for the lock enqueued
through osc_lock_enqueue().
In this case, after a successful enqueue and before we get the reply from the server
(or the call to osc_lock_upcall()), I see in the log that we unused the lock,
hence the LBUG.
I will investigate more and update accordingly.
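The ordering Rahul describes can be illustrated with a toy model. This is only a sketch and not Lustre code: the toy_* names are hypothetical, the state names merely echo the assertion text (cll_state >= CLS_QUEUING), and the assumption that an unuse drops the lock back to a "new" state is ours, made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock states. The names echo the assertion text; the set of states
 * and their ordering here are assumptions made for illustration. */
enum toy_lock_state {
    TOY_CLS_NEW = 0,
    TOY_CLS_QUEUING,
    TOY_CLS_ENQUEUED
};

struct toy_lock {
    enum toy_lock_state state;
    bool reply_pending;   /* enqueue sent, server reply not yet seen */
};

/* Asynchronous enqueue: mark the lock as queuing and return at once,
 * without waiting for the server. */
void toy_enqueue(struct toy_lock *l)
{
    assert(l != NULL);
    l->state = TOY_CLS_QUEUING;
    l->reply_pending = true;
}

/* Assumption: an unuse resets the lock to the "new" state, even if a
 * reply is still pending. */
void toy_unuse(struct toy_lock *l)
{
    assert(l != NULL);
    l->state = TOY_CLS_NEW;
}

/* Reply callback. Returns false where the real code would hit the
 * assertion (state below QUEUING), true otherwise. */
bool toy_upcall(struct toy_lock *l)
{
    assert(l != NULL);
    l->reply_pending = false;
    if (l->state < TOY_CLS_QUEUING)
        return false;
    l->state = TOY_CLS_ENQUEUED;
    return true;
}
```

In this model, enqueue followed by upcall succeeds, while enqueue, unuse, then upcall returns false, which corresponds to the point where the LBUG would fire.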
Attachments
Issue Links
- is duplicated by
-
LU-4394 LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed
- Resolved
- is related to
-
LU-3027 Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8
- Resolved
-
LU-3433 Encountered a assertion for the ols_state being set to a impossible state
- Resolved