[LU-6918] Deadlock on transaction with iget()/clear_inode() Created: 28/Jul/15 Updated: 19/Dec/17 Resolved: 19/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Andriy Skulysh | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Thread 1:
  schedule
  start_this_handle
  jbd2_journal_start
  ldiskfs_journal_start_sb
  ldiskfs_dquot_drop
  vfs_dq_drop
  clear_inode
  dispose_list
  shrink_icache_memory
  shrink_slab
  zone_reclaim
  get_page_from_freelist
  __alloc_pages_nodemask
  alloc_pages_vma
  do_huge_pmd_anonymous_page
  handle_mm_fault
  __do_page_fault
  do_page_fault
  page_fault

Thread 2:
  __wait_on_freeing_inode
  find_inode_fast
  ifind_fast
  iget_locked
  ldiskfs_iget
  osd_iget
  osd_index_ea_delete
  out_obj_index_delete
  out_tx_index_delete_exec
  out_tx_end
  out_handle
  tgt_request_handle
  ptlrpc_main
  kthread
  kernel_thread |
| Comments |
| Comment by Andriy Skulysh [ 28/Jul/15 ] |
|
iget() waits for the I_FREEING flag to be cleared, but clear_inode()/ldiskfs_dquot_drop() wants to start a transaction first and clears the flag only after that. The following pattern is common in most of the target code: it starts a transaction first and then locates the object by means of iget(). |
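The ordering described above can be modelled as a two-task wait-for cycle. The following is a minimal sketch in plain userspace C, not Lustre or kernel code: `struct task_model`, `holder_of()` and `in_deadlock()` are hypothetical names introduced for illustration only. It also assumes that the journal start in Thread 1 cannot proceed while Thread 2 keeps its handle open; the ticket itself does not spell that out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two things each thread holds or waits for. */
enum resource {
    RES_JOURNAL_HANDLE, /* an open transaction handle */
    RES_I_FREEING,      /* the inode while I_FREEING is set */
    RES_COUNT
};

struct task_model {
    int holds;     /* resource currently held, or -1 for none */
    int waits_for; /* resource being waited for, or -1 for none */
};

/* Return the index of the task holding res, or -1 if nobody holds it. */
static int holder_of(const struct task_model *t, size_t n, int res)
{
    for (size_t i = 0; i < n; i++)
        if (t[i].holds == res)
            return (int)i;
    return -1;
}

/* Follow wait-for edges from task start; true if the chain leads back to it. */
static bool in_deadlock(const struct task_model *t, size_t n, size_t start)
{
    size_t cur = start;

    assert(start < n);
    for (size_t steps = 0; steps < n; steps++) {
        if (t[cur].waits_for < 0)
            return false; /* this task is not blocked */
        int next = holder_of(t, n, t[cur].waits_for);
        if (next < 0)
            return false; /* the resource is free */
        if ((size_t)next == start)
            return true;  /* cycle closed */
        cur = (size_t)next;
    }
    return false;
}
```

In this model, Thread 1 is `{ RES_I_FREEING, RES_JOURNAL_HANDLE }` (holds the freeing inode, waits for the journal) and Thread 2 is `{ RES_JOURNAL_HANDLE, RES_I_FREEING }`; `in_deadlock()` reports a cycle for either of them.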
| Comment by Alex Zhuravlev [ 28/Jul/15 ] |
|
No, we shouldn't do this in the target code, because the target has no idea of agent inodes and can't address such an inode, given that an agent inode has no FID assigned. I think the only solution is to postpone the inode destroy. This can be done in different ways. The most trivial is to keep a list of inode numbers in memory. This can lead to an orphan, but given that the number of agent inodes is very small, they don't occupy much space, and at some point they will be discovered by LFSCK, which is probably good enough. If not, then we can do something similar to ext4_orphan_add(). |
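The "list of inode numbers in memory" idea might look roughly like the sketch below. It is plain userspace C under stated assumptions, not the actual fix: `struct deferred_inodes`, `defer_inode_destroy()`, `pop_deferred()` and the `DEFERRED_MAX` capacity are all hypothetical, and the locking a real implementation would need is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEFERRED_MAX 64 /* hypothetical capacity, chosen for illustration */

/* In-memory list of inode numbers whose destroy has been postponed. */
struct deferred_inodes {
    unsigned long ino[DEFERRED_MAX];
    size_t count;
};

/*
 * Record an inode number instead of destroying the inode right away.
 * Intended for the path that must not start a transaction.
 * Returns false if the list is full.
 */
static bool defer_inode_destroy(struct deferred_inodes *d, unsigned long ino)
{
    if (d->count == DEFERRED_MAX)
        return false;
    d->ino[d->count++] = ino;
    return true;
}

/*
 * Take one postponed inode number off the list so the caller can destroy
 * it from a context where starting a transaction is safe.
 * Returns false when the list is empty. Order is last-in, first-out.
 */
static bool pop_deferred(struct deferred_inodes *d, unsigned long *ino)
{
    if (d->count == 0)
        return false;
    *ino = d->ino[--d->count];
    return true;
}
```

Because the list lives only in memory, entries that have not been popped are lost on a crash, which is the orphan case mentioned in the comment above.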
| Comment by Andreas Dilger [ 06/Aug/15 ] |
|
Andriy, how easily can this deadlock be hit, and what is the workload to trigger it? |
| Comment by Andriy Skulysh [ 12/Aug/15 ] |
|
We have several reports from different sites. It isn't easily reproducible. |
| Comment by Alex Zhuravlev [ 12/Aug/15 ] |
| Comment by Peter Jones [ 19/Dec/17 ] |
|
Duplicate of |