[LU-3379] osd_attr_get() ASSERTION( dt_object_exists(dt) ) failed Created: 22/May/13 Updated: 16/Apr/20 Resolved: 16/Apr/20 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Story | Priority: | Major |
| Reporter: | Ned Bass | Assignee: | Alex Zhuravlev |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
server: lustre-2.3.63-6chaos_2.6.32_358.6z2.1.1chaos.ch5.1.ch5.1.x86_64 |
||
| Attachments: |
|
| Rank (Obsolete): | 8365 |
| Description |
|
Lustre debug log and console log attached. Note this MDS hosts two filesystems. One was formatted with Lustre 2.3.63 and is mounted rw by clients. The other was formatted with pre-2.3.58 and is mounted ro by clients to allow data migration to the newer filesystem. The older filesystem underwent a manual procedure to correct on-disk-format incompatibilities related to

PID: 21576  TASK: ffff880bc7416ae0  CPU: 13  COMMAND: "mdt03_066"
 #0 [ffff880be2a6b2f8] machine_kexec at ffffffff81035bab
 #1 [ffff880be2a6b358] crash_kexec at ffffffff810c0932
 #2 [ffff880be2a6b428] panic at ffffffff8150d723
 #3 [ffff880be2a6b4a8] lbug_with_loc at ffffffffa06fbf4b [libcfs]
 #4 [ffff880be2a6b4c8] osd_attr_get at ffffffffa0e5a292 [osd_zfs]
 #5 [ffff880be2a6b4f8] llog_osd_write_rec at ffffffffa0859330 [obdclass]
 #6 [ffff880be2a6b588] llog_write_rec at ffffffffa0827658 [obdclass]
 #7 [ffff880be2a6b5e8] llog_write at ffffffffa0828e55 [obdclass]
 #8 [ffff880be2a6b658] llog_cancel_rec at ffffffffa082905c [obdclass]
 #9 [ffff880be2a6b6a8] llog_cat_cancel_records at ffffffffa082e41e [obdclass]
#10 [ffff880be2a6b708] llog_changelog_cancel_cb at ffffffffa0f1f661 [mdd]
#11 [ffff880be2a6b768] llog_process_thread at ffffffffa0829dfb [obdclass]
#12 [ffff880be2a6b818] llog_process_or_fork at ffffffffa082bb0d [obdclass]
#13 [ffff880be2a6b868] llog_cat_process_cb at ffffffffa082e272 [obdclass]
#14 [ffff880be2a6b8b8] llog_process_thread at ffffffffa0829dfb [obdclass]
#15 [ffff880be2a6b968] llog_process_or_fork at ffffffffa082bb0d [obdclass]
#16 [ffff880be2a6b9b8] llog_cat_process_or_fork at ffffffffa082c5b9 [obdclass]
#17 [ffff880be2a6ba48] llog_cat_process at ffffffffa082c7c9 [obdclass]
#18 [ffff880be2a6ba68] llog_changelog_cancel at ffffffffa0f1f3bf [mdd]
#19 [ffff880be2a6baa8] llog_cancel at ffffffffa082fe58 [obdclass]
#20 [ffff880be2a6bae8] mdd_changelog_llog_cancel at ffffffffa0f2440e [mdd]
#21 [ffff880be2a6bb28] mdd_changelog_user_purge at ffffffffa0f24880 [mdd]
#22 [ffff880be2a6bb78] mdd_iocontrol at ffffffffa0f24d03 [mdd]
#23 [ffff880be2a6bbd8] mdt_ioc_child at ffffffffa0f782c9 [mdt]
#24 [ffff880be2a6bc58] mdt_iocontrol at ffffffffa0f7c418 [mdt]
#25 [ffff880be2a6bcf8] mdt_set_info at ffffffffa0f7cab6 [mdt]
#26 [ffff880be2a6bd48] mdt_handle_common at ffffffffa0f80078 [mdt]
#27 [ffff880be2a6bd98] mds_regular_handle at ffffffffa0fbc235 [mdt]
#28 [ffff880be2a6bda8] ptlrpc_server_handle_request at ffffffffa0a1250c [ptlrpc]
#29 [ffff880be2a6bea8] ptlrpc_main at ffffffffa0a13a55 [ptlrpc]
#30 [ffff880be2a6bf48] kernel_thread at ffffffff8100c0ca

LLNL-bug-id: TOSS-2074 |
| Comments |
| Comment by Peter Jones [ 22/May/13 ] |
|
Alex Are you able to comment on this one? Peter |
| Comment by Ned Bass [ 22/May/13 ] |
|
A note from Marc about the client load.
|
| Comment by Alex Zhuravlev [ 22/May/13 ] |
|
Ned, can you tell how changelogs are used in that setup? At first glance it looks like there were two processes trying to cancel the records. |
| Comment by Ned Bass [ 22/May/13 ] |
|
Alex, it is RobinHood consuming the Changelogs. I'm not very familiar with it; I just know that it uses Changelogs to monitor and store some metadata for all files in the filesystem. |
| Comment by Ned Bass [ 22/May/13 ] |
|
I believe RobinHood is using multiple threads to process Changelogs. I'm not sure how the threads coordinate with each other, but perhaps it's possible that more than one thread may try to cancel the same record. |
| Comment by Ned Bass [ 23/May/13 ] |
|
Alex,
Is that speculative, or do you see something in the logs to suggest this? The "Rewriting changelog user 1 endrec to 302374274" debug log entries all contain unique record numbers (usually 100 apart), so I'm curious what makes you say this. Any idea why that would not be handled safely? Thanks |
| Comment by Alex Zhuravlev [ 24/May/13 ] |
|
So:

00000004:00000080:15.0:1369025236.132203:0:21576:0:(mdd_device.c:1577:mdd_changelog_user_purge()) Purge request: id=1, endrec=302369333

So, 21576 started to purge up to 302369333 and got stuck (blocked, scheduled out) at some point. Then we see that 19361 got to purge up to 302374174 and completed, as there was another request 19361 started to handle after that. In the meantime 21576 was idling/blocked/etc., and by the time it came back the original llog might have been destroyed already. Though I'm still missing a couple of pieces in the picture.. |
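[Editor's note] A minimal userspace sketch of the race Alex describes, under the assumption that the late purge thread resumes against a plain llog the earlier purge has already cancelled and destroyed. All names here (struct llog, purge_records, etc.) are hypothetical illustrations, not Lustre code; the final assert() stands in for the ASSERTION( dt_object_exists(dt) ) in osd_attr_get().

/* race_sketch.c - compile with: cc race_sketch.c -lpthread */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct llog {
    pthread_mutex_t lock;
    bool exists;        /* stands in for dt_object_exists(dt) */
    long last_rec;      /* highest changelog record still present */
};

static struct llog g_log = {
    .lock = PTHREAD_MUTEX_INITIALIZER, .exists = true, .last_rec = 302374174,
};

/* Cancel records up to 'endrec'; destroy the llog once it is emptied. */
static void purge_records(const char *who, long endrec, unsigned delay_us)
{
    usleep(delay_us);                 /* models 21576 being blocked ~927s */

    pthread_mutex_lock(&g_log.lock);
    /* The real code asserts on the stale object instead of rechecking: */
    assert(g_log.exists && "late purge hit an already-destroyed llog");
    if (endrec >= g_log.last_rec) {
        g_log.exists = false;         /* llog fully cancelled: destroy it */
        printf("%s destroyed llog at endrec=%ld\n", who, endrec);
    } else {
        printf("%s cancelled records up to %ld\n", who, endrec);
    }
    pthread_mutex_unlock(&g_log.lock);
}

static void *slow_purge(void *arg) { purge_records("21576", 302369333, 200000); return NULL; }
static void *fast_purge(void *arg) { purge_records("19361", 302374174, 0); return NULL; }

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, slow_purge, NULL);
    pthread_create(&b, NULL, fast_purge, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

Run as-is, the fast thread completes its purge and destroys the llog while the slow thread is delayed, so the slow thread's assert fires, mirroring the reported LBUG.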
| Comment by Alex Zhuravlev [ 24/May/13 ] |
|
hard to believe, but according to the log 21576 was blocked for ~927s. |
| Comment by Ned Bass [ 24/May/13 ] |
|
Hi Alex, For some reason many threads were blocking in txg_wait_open(), as you can see in the console log. Unfortunately we don't log /proc/spl/zfs/kstat/dmu_tx, which would tell us why. But 21576 must have blocked for some other reason, since many txgs must have opened and closed during the intervening llog cancels. Or maybe it somehow missed a wake-up call when its txg opened? |
| Comment by Alex Zhuravlev [ 24/May/13 ] |
|
I wish I knew |
| Comment by Alex Zhuravlev [ 27/May/13 ] |
|
Mike, do you think it's OK to just hold a semaphore over the whole cancellation process? Given it's going up to a high mark, I'd think it's fine and probably even better, due to less contention. |
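[Editor's note] A sketch of the serialization Alex is proposing, under the assumption that one lock held across the entire purge keeps a second mdd_changelog_user_purge() call from interleaving with the first. Function names below (changelog_user_purge, cancel_records_up_to) are hypothetical stand-ins, not the actual Lustre patch.

/* purge_serialize.c - compile with: cc purge_serialize.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t purge_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stub standing in for the llog_cat_process()/llog_cancel_rec() loop. */
static int cancel_records_up_to(long endrec)
{
    printf("cancelling changelog records up to %ld\n", endrec);
    return 0;
}

/* The lock is held over the whole cancellation, so a second purge cannot
 * start mid-way and find an llog the first purge has already destroyed. */
static int changelog_user_purge(long endrec)
{
    int rc;

    pthread_mutex_lock(&purge_lock);
    rc = cancel_records_up_to(endrec);
    pthread_mutex_unlock(&purge_lock);
    return rc;
}

int main(void)
{
    changelog_user_purge(302369333);
    changelog_user_purge(302374174);
    return 0;
}

The trade-off is the one Alex notes: purges become fully serialized, which should be acceptable (and arguably cheaper) since each purge already advances to a high-water record anyway.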
| Comment by Alex Zhuravlev [ 16/Apr/20 ] |
|
The llog code has received a lot of changes in this area, so this issue has likely been fixed.. |