Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version/s: Lustre 2.12.7, Lustre 2.12.8
Description
Write and truncate IO are serialized on ll_trunc_sem::ll_trunc_{readers|waiters}. If one process quits abruptly (is killed), the other will keep waiting for the semaphore (its task state is set to TASK_INTERRUPTIBLE):
INFO: task a.out:109684 blocked for more than 120 seconds.
      Tainted: G IOE --------- -  - 4.18.0-240.15.1.el8_3.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Call Trace:
 __schedule+0x2a6/0x700
 schedule+0x38/0xa0
 trunc_sem_down_read+0xa6/0xb0 [lustre]
 vvp_io_write_start+0x107/0xb80 [lustre]
 cl_io_start+0x59/0x110 [obdclass]
 cl_io_loop+0x9a/0x1e0 [obdclass]
 ll_file_io_generic+0x380/0xb10 [lustre]
 ll_file_write_iter+0x136/0x5a0 [lustre]
 new_sync_write+0x124/0x170
 vfs_write+0xa5/0x1a0
 ksys_write+0x4f/0xb0
 do_syscall_64+0x5b/0x1a0
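The hang mechanism above can be sketched with a hypothetical userspace model of the ll_trunc_sem counter protocol (the field and function names mirror Lustre's, but the encoding and the `model_` helpers are simplifications for illustration): a writer (truncate) that exits without running the release-and-wakeup step leaves readers (write IOs) retrying and sleeping forever.

```c
#include <assert.h>

/* Hypothetical userspace model of the ll_trunc_sem protocol:
 *   readers > 0  -> write IOs in progress
 *   readers == 0 -> free
 *   readers < 0  -> a truncate holds the semaphore for write
 */
struct trunc_sem_model {
	int readers;	/* models atomic ll_trunc_readers */
	int waiters;	/* models atomic ll_trunc_waiters */
};

/* A write IO tries to enter; returns 1 on success, 0 if it must sleep
 * and wait for a wakeup from the truncate side. */
static int model_down_read_try(struct trunc_sem_model *s)
{
	if (s->readers >= 0 && s->waiters == 0) {
		s->readers++;
		return 1;
	}
	return 0;
}

/* The truncate side releases the semaphore; in the kernel this is
 * atomic_set() followed by wake_up_var(). If the truncating process is
 * killed before this runs, sleepers in down_read never wake up. */
static void model_up_write(struct trunc_sem_model *s)
{
	s->readers = 0;
}
```

If `model_up_write()` never runs, every subsequent `model_down_read_try()` keeps returning 0, which is the userspace analogue of the blocked `trunc_sem_down_read()` in the trace.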
Issue Links
- is related to LU-15397 "LustreError: 4585:0:(llite_mmap.c:61:our_vma()) ASSERTION( !down_write_trylock(&mm->mmap_sem) ) failed" (Open)
Activity
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47404
Subject: LU-14713 llite: tighten condition for fault not drop mmap_sem
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: cdf66f48b5100b0b209be3f1741e6163000299e2
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43844/
Subject: LU-14713 llite: mend the trunc_sem_up_write()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 39745c8b5493159bbca62add54ca9be7cac6564f
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44715/
Subject: LU-14713 llite: tighten condition for fault not drop mmap_sem
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 81aec05103558f57adb10fd319847304cdf44aa7
As ll_fault0() shows, if ll_filemap_fault() returns VM_FAULT_RETRY, it presumes mmap_sem has not been dropped and continues with a normal fault under the DLM lock, which will eventually drop mmap_sem.
	if (ll_sbi_has_fast_read(ll_i2sbi(file_inode(vma->vm_file)))) {
		/* do fast fault */
		bool has_retry = vmf->flags & FAULT_FLAG_RETRY_NOWAIT;

		/* To avoid loops, instruct downstream to not drop mmap_sem */
		vmf->flags |= FAULT_FLAG_RETRY_NOWAIT;
		ll_cl_add(vma->vm_file, env, NULL, LCC_MMAP);
		fault_ret = ll_filemap_fault(vma, vmf);
		ll_cl_remove(vma->vm_file, env);
		if (!has_retry)
			vmf->flags &= ~FAULT_FLAG_RETRY_NOWAIT;

		/* - If there is no error, then the page was found in cache and
		 *   uptodate;
		 * - If VM_FAULT_RETRY is set, the page existed but failed to
		 *   lock. We will try slow path to avoid loops.
		 * - Otherwise, it should try normal fault under DLM lock. */
		if (!(fault_ret & VM_FAULT_RETRY) &&
		    !(fault_ret & VM_FAULT_ERROR))
			GOTO(out, result = 0);

		fault_ret = 0;
	}
But filemap_fault() shows that returning VM_FAULT_RETRY without dropping mmap_sem requires vmf->flags to contain both FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT; otherwise it can return VM_FAULT_RETRY with mmap_sem already released.
	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
		page_cache_release(page);
		return ret | VM_FAULT_RETRY;
	}
static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
				     unsigned int flags)
{
	might_sleep();
	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
}
int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
			 unsigned int flags)
{
	if (flags & FAULT_FLAG_ALLOW_RETRY) {
		/*
		 * CAUTION! In this case, mmap_sem is not released
		 * even though return 0.
		 */
		if (flags & FAULT_FLAG_RETRY_NOWAIT)
			return 0;

		up_read(&mm->mmap_sem);
		if (flags & FAULT_FLAG_KILLABLE)
			wait_on_page_locked_killable(page);
		else
			wait_on_page_locked(page);
		return 0;
	} else {
		if (flags & FAULT_FLAG_KILLABLE) {
			int ret;

			ret = __lock_page_killable(page);
			if (ret) {
				up_read(&mm->mmap_sem);
				return 0;
			}
		} else
			__lock_page(page);
		return 1;
	}
}
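The flag combinations above can be condensed into a small decision model (a userspace sketch; the flag values and the `lock_page_or_retry_model` helper are illustrative, not kernel API). It covers the case where trylock_page() already failed and ignores FAULT_FLAG_KILLABLE, and makes the dangerous case explicit: with FAULT_FLAG_ALLOW_RETRY alone, the caller gets a retry result but mmap_sem has already been released.

```c
#include <assert.h>

/* Illustrative flag values; the kernel defines the real ones in
 * include/linux/mm.h. */
#define MODEL_FAULT_FLAG_ALLOW_RETRY	0x04
#define MODEL_FAULT_FLAG_RETRY_NOWAIT	0x08

struct lock_page_outcome {
	int locked;		/* 1: page locked, caller proceeds */
	int mmap_sem_held;	/* 1: mmap_sem still held on return */
};

/* Models the control flow of __lock_page_or_retry() for the case where
 * trylock_page() already failed (FAULT_FLAG_KILLABLE is ignored). */
static struct lock_page_outcome lock_page_or_retry_model(unsigned int flags)
{
	struct lock_page_outcome out = { .locked = 0, .mmap_sem_held = 1 };

	if (flags & MODEL_FAULT_FLAG_ALLOW_RETRY) {
		if (flags & MODEL_FAULT_FLAG_RETRY_NOWAIT)
			return out;	/* retry, mmap_sem NOT released */
		out.mmap_sem_held = 0;	/* up_read(&mm->mmap_sem) */
		return out;		/* retry, mmap_sem released */
	}
	out.locked = 1;			/* blocks until page lock taken */
	return out;
}
```

Only the "both flags" row gives ll_fault0() the behavior it presumes; this is what the "tighten condition" patches address.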
"Bobi Jam <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/44715
Subject: LU-14713 llite: tighten condition for fault not drop mmap_sem
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0cccc61c4bd48cc6942963394cd2ea70e30b8563
Neil Brown (neilb@suse.de) uploaded a new patch: https://review.whamcloud.com/44372
Subject: LU-14713 llite: add memory barriier to the trunc_sem.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1b20297bce772534dd557abaa11859447da8651b
It seems I was wrong about atomic_set implying a barrier, though I'm slightly less convinced that x86 doesn't require them.
So I'd still be very surprised if barriers were causing the problem, but there is something worth fixing there.
The atomic_set() in trunc_sem_up_write() should be atomic_set_release(), and the atomic_read()s should be atomic_read_acquire().
I would still like to see the exact source code for the kernel modules which produced the hang, if that is possible.
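The suggested change can be sketched with C11 atomics in userspace (the kernel equivalents would be atomic_set_release() and atomic_read_acquire(); the `sketch_` names and struct here are illustrative, not the real Lustre code):

```c
#include <stdatomic.h>
#include <assert.h>

/* Userspace C11 sketch of the suggested release/acquire pairing.
 * Mirrors Lustre's ll_trunc_sem but is not the real code. */
struct trunc_sem_sketch {
	atomic_int ll_trunc_readers;
};

static void sketch_trunc_sem_up_write(struct trunc_sem_sketch *sem)
{
	/* release store: everything written while holding the semaphore
	 * is visible to a thread whose acquire load observes the 0 */
	atomic_store_explicit(&sem->ll_trunc_readers, 0,
			      memory_order_release);
	/* the kernel would follow with wake_up_var(&sem->ll_trunc_readers) */
}

static int sketch_trunc_sem_readers(struct trunc_sem_sketch *sem)
{
	return atomic_load_explicit(&sem->ll_trunc_readers,
				    memory_order_acquire);
}
```

The release/acquire pair guarantees that a waiter which sees ll_trunc_readers == 0 also sees all of the truncate's prior writes, without needing a full barrier.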
Neil,
I'm not familiar with memory barriers, but I noticed a kernel comment about wake_up_var()->__wake_up_bit()->waitqueue_active():
/**
 * waitqueue_active -- locklessly test for waiters on the queue
 * @wq_head: the waitqueue to test for waiters
 *
 * returns true if the wait list is not empty
 *
 * NOTE: this function is lockless and requires care, incorrect usage _will_
 * lead to sporadic and non-obvious failure.
 *
 * Use either while holding wait_queue_head::lock or when used for wakeups
 * with an extra smp_mb() like::
 *
 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))    if (@cond)
 *        wake_up(wq_head);                 break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);
 *
 * Because without the explicit smp_mb() it's possible for the
 * waitqueue_active() load to get hoisted over the @cond store such that we'll
 * observe an empty wait list while the waiter might not observe @cond.
 */
Does that mean there should be smp_mb() in trunc_sem_up_write() before wake_up_var()?
static inline void trunc_sem_up_write(struct ll_trunc_sem *sem)
{
	atomic_set(&sem->ll_trunc_readers, 0);
	wake_up_var(&sem->ll_trunc_readers);
}
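The waker-side pattern the kernel comment asks for can be modeled in userspace C11 (all names here are illustrative stand-ins, not kernel API; `atomic_thread_fence(memory_order_seq_cst)` plays the role of smp_mb()):

```c
#include <stdatomic.h>
#include <assert.h>

/* Userspace sketch of the waker side: publish the condition, full
 * barrier, then test for waiters before waking. */
static atomic_int model_trunc_readers;	/* the @cond */
static atomic_int model_waiter_count;	/* stands in for the wait list */

static int model_up_write_and_check(void)
{
	/* publish "semaphore free" */
	atomic_store_explicit(&model_trunc_readers, 0,
			      memory_order_relaxed);
	/* the smp_mb() in question: keep the store above ordered before
	 * the waitqueue_active()-style load below, so the waker cannot
	 * observe an empty wait list while the waiter misses @cond */
	atomic_thread_fence(memory_order_seq_cst);
	/* waitqueue_active(): only wake if someone is queued */
	return atomic_load_explicit(&model_waiter_count,
				    memory_order_relaxed) > 0;
}
```

Without the fence, the load of the waiter count may be hoisted above the store to the condition, which is exactly the lost-wakeup window the kernel comment describes.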
We likely hit this issue at the CEA in production.
After upgrading from 2.12.6 to 2.12.7, we observed regular hangs on the clients when canceling a job.
The hang is observed on mmap_sem when closing the /dev/infiniband/uverb* device:
MOFED: 5.4
kernel: 3.10
lustre: client 2.12.7
So I am not 100% sure, but I think we hit the regression from 21dc165991 "LU-13182 llite: Avoid eternel retry loops with MAP_POPULATE" (landed in 2.12.7). This patch could cause mmap_sem to be released (up_read) twice (only for kernels < 5.1).
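Why a double up_read is fatal can be shown with a toy counting model of an rwsem reader count (the real rwsem encoding is more elaborate; this `model_` sketch only illustrates the underflow):

```c
#include <assert.h>

/* Toy model of an rwsem reader count: a second up_read() on the same
 * acquisition underflows the count, so later lockers see a bogus
 * "held" state and block forever. Not the real rwsem encoding. */
struct rwsem_count_model {
	long count;	/* > 0: readers hold it; 0: free; < 0: corrupted */
};

static void model_down_read(struct rwsem_count_model *s) { s->count++; }
static void model_up_read(struct rwsem_count_model *s)   { s->count--; }
```

After one down_read and two up_reads the count is -1, which a real rwsem would interpret as a writer holding the lock, consistent with the mmap_sem hangs seen when closing the uverbs device.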