[LU-5700] async IO LBUG obj->cob_transient_pages == 0 Created: 02/Oct/14 Updated: 20/Oct/14 Resolved: 20/Oct/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephen Champion | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 15957 |
| Description |
|
LustreError: 58185:0:(rw26.c:474:ll_direct_IO_26()) ASSERTION( obj->cob_transient_pages == 0 ) failed:
Call Trace:
[<ffffffff81058bd3>] ? __wake_up+0x53/0x70
[<ffffffffa06f14da>] cl_io_start+0x6a/0x140 [obdclass]
Kernel panic - not syncing: LBUG |
| Comments |
| Comment by Stephen Champion [ 02/Oct/14 ] |
|
This is easy to reproduce using sio on an 8-socket UV system with 160 threads and 4 TB of memory. I have been unable to reproduce it on a smaller configuration thus far.
root@cy024-4-sys:/mnt/cy024/schamp # ./sio -tDcw -b 2048 -A 16 -s 83865632 -g /mnt/cy024/schamp/sio.1
3 OSSes with five OSTs each, stripe size -1 :
We have cores if they are useful.
Olaf Weber's notes:
There are four threads of interest, with stack traces like this showing they're doing direct IO.
PID: 58183 TASK: ffff893de6382080 CPU: 151 COMMAND: "sio"
Approximately they're here: lustre/llite/rw26.c
They are all referencing the same obj
crash> ccc_object.cob_transient_pages ffff88bdd31dd078
Looking at how cob_transient_pages is manipulated: lustre/include/lclient.h:
Protected by i_sem? A very bad sign, as i_sem has been replaced by i_mutex for
Which points to commit ed5ebb87bfc2b684958daac90c4369f395482a16, part of which is this:
diff --git a/lustre/llite/rw26.c b/lustre/llite/rw26.c
if (tot_bytes > 0) { |
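To illustrate why an unprotected counter can trip an `== 0` assertion, here is a minimal userspace sketch. It is not Lustre code: `struct demo_object`, `transient_pages`, and `replay_lost_update()` are hypothetical names, and it assumes the counter is updated with plain read-modify-write operations that no lock serialises. The function replays by hand one interleaving in which two direct-IO threads both load the counter before either stores its result.

```c
/* Hypothetical stand-in for the object carrying the counter (not the real ccc_object). */
struct demo_object {
    int transient_pages;   /* plain int, no lock held around updates */
};

/*
 * Replay one possible interleaving of two threads, step by step.
 * An unlocked "counter += n" is really load, add, store; here the two
 * loads happen before either store, so one addition is lost.
 * Returns the counter value after both threads have finished.
 */
int replay_lost_update(int pages_a, int pages_b)
{
    struct demo_object obj = { 0 };

    /* Both threads account for their pages; the loads overlap. */
    int seen_by_a = obj.transient_pages;        /* A loads 0 */
    int seen_by_b = obj.transient_pages;        /* B loads 0 */
    obj.transient_pages = seen_by_a + pages_a;  /* A stores pages_a */
    obj.transient_pages = seen_by_b + pages_b;  /* B stores pages_b; A's add is lost */

    /* Both threads release their pages, this time without overlapping. */
    obj.transient_pages -= pages_a;
    obj.transient_pages -= pages_b;

    /* Ends at -pages_a instead of 0. */
    return obj.transient_pages;
}
```

With this interleaving, `replay_lost_update(4, 8)` leaves the counter at -4 rather than 0, so a later check that the counter is zero would fail even though every thread balanced its own updates.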
| Comment by Olaf Weber [ 02/Oct/14 ] |
|
A bit more info from the commit header: the change was commit ed5ebb87bfc2b684958daac90c4369f395482a16 |
| Comment by Peter Jones [ 02/Oct/14 ] |
|
Lai, could you please advise on this issue? Thanks, Peter |
| Comment by Stephen Champion [ 03/Oct/14 ] |
|
Our test system had another crash with normal (not async) DIO last night. This led to http://review.whamcloud.com/#/c/12179/ which lets us get on with the performance evaluation we were doing. The patch removes the assertions and makes the counter atomic. We are not convinced that this section is safe for multiple threads; it needs to be considered carefully. The patch should not introduce any new bugs, but may expose something by not crashing. |
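As a rough illustration of what making a counter atomic buys, the following userspace sketch uses C11 `<stdatomic.h>` and pthreads. It is not the patch itself: the patch's code is not shown in this ticket, and kernel code would typically use `atomic_t` rather than C11 atomics. The names `demo_transient_pages`, `demo_worker()`, and `run_atomic_demo()` are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>

#define DEMO_THREADS    4
#define DEMO_ITERATIONS 100000

/* Hypothetical counter standing in for the per-object page count. */
static atomic_int demo_transient_pages;

/* Each worker repeatedly takes and releases one "page". */
static void *demo_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < DEMO_ITERATIONS; i++) {
        atomic_fetch_add(&demo_transient_pages, 1);  /* indivisible increment */
        atomic_fetch_sub(&demo_transient_pages, 1);  /* indivisible decrement */
    }
    return NULL;
}

/* Run the workers concurrently and return the final counter value. */
int run_atomic_demo(void)
{
    pthread_t threads[DEMO_THREADS];
    int started = 0;

    atomic_store(&demo_transient_pages, 0);

    for (int i = 0; i < DEMO_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, demo_worker, NULL) != 0)
            break;                      /* join only the threads that started */
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return atomic_load(&demo_transient_pages);
}
```

Because every increment and decrement is indivisible, the counter returns to 0 once all workers have been joined, however their operations interleave. This only keeps the count itself consistent; it does not establish that the surrounding code path is safe for concurrent callers.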
| Comment by John Fuchs-Chesney (Inactive) [ 20/Oct/14 ] |
|
Stephen, Do you need any more work on this ticket, or can I mark it as resolved, fixed? Thanks, |
| Comment by Stephen Champion [ 20/Oct/14 ] |
|
Works for now. We're not entirely convinced that this section of code is safe for reentry, but this clearly fixed one problem, and we can open a new ticket if we identify another. |
| Comment by Peter Jones [ 20/Oct/14 ] |
|
OK, thanks Steve |