
LU-14713: Process hung waiting for mmap_sem

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.7, Lustre 2.12.8
    • Labels: None

    Description

      Write and truncate I/O are serialized on ll_trunc_sem::ll_trunc_{readers|waiters}. If one process quits abruptly (e.g. it is killed), the other will keep waiting for the semaphore (its task state set to TASK_INTERRUPTIBLE):

       INFO: task a.out:109684 blocked for more than 120 seconds.
            Tainted: G          IOE    --------- -  - 4.18.0-240.15.1.el8_3.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       Call Trace:
       __schedule+0x2a6/0x700
       schedule+0x38/0xa0
       trunc_sem_down_read+0xa6/0xb0 [lustre]
       vvp_io_write_start+0x107/0xb80 [lustre]
       cl_io_start+0x59/0x110 [obdclass]
       cl_io_loop+0x9a/0x1e0 [obdclass]
       ll_file_io_generic+0x380/0xb10 [lustre]
       ll_file_write_iter+0x136/0x5a0 [lustre]
       new_sync_write+0x124/0x170
       vfs_write+0xa5/0x1a0
       ksys_write+0x4f/0xb0
       do_syscall_64+0x5b/0x1a0
      

          Activity

            [LU-14713] Process hung waiting for mmap_sem

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44715/
            Subject: LU-14713 llite: tighten condition for fault not drop mmap_sem
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 81aec05103558f57adb10fd319847304cdf44aa7

            bobijam Zhenyu Xu added a comment -

            As ll_fault0() shows, if ll_filemap_fault() returns VM_FAULT_RETRY, the caller presumes mmap_sem has not been dropped and continues to the normal fault path under the DLM lock, which will eventually drop mmap_sem.

            ll_fault0():

	if (ll_sbi_has_fast_read(ll_i2sbi(file_inode(vma->vm_file)))) {
		/* do fast fault */
		bool has_retry = vmf->flags & FAULT_FLAG_RETRY_NOWAIT;

		/* To avoid loops, instruct downstream to not drop mmap_sem */
		vmf->flags |= FAULT_FLAG_RETRY_NOWAIT;
		ll_cl_add(vma->vm_file, env, NULL, LCC_MMAP);
		fault_ret = ll_filemap_fault(vma, vmf);
		ll_cl_remove(vma->vm_file, env);
		if (!has_retry)
			vmf->flags &= ~FAULT_FLAG_RETRY_NOWAIT;

		/* - If there is no error, then the page was found in cache and
		 *   uptodate;
		 * - If VM_FAULT_RETRY is set, the page existed but failed to
		 *   lock. We will try slow path to avoid loops.
		 * - Otherwise, it should try normal fault under DLM lock. */
		if (!(fault_ret & VM_FAULT_RETRY) &&
		    !(fault_ret & VM_FAULT_ERROR))
			GOTO(out, result = 0);

		fault_ret = 0;
	}

            But filemap_fault() shows that returning VM_FAULT_RETRY without dropping mmap_sem requires vmf->flags to contain both FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT; otherwise it can return VM_FAULT_RETRY with mmap_sem already released.

            filemap_fault():

	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
		page_cache_release(page);
		return ret | VM_FAULT_RETRY;
	}

static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
				     unsigned int flags)
{
	might_sleep();
	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
}

int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
			 unsigned int flags)
{
	if (flags & FAULT_FLAG_ALLOW_RETRY) {
		/*
		 * CAUTION! In this case, mmap_sem is not released
		 * even though return 0.
		 */
		if (flags & FAULT_FLAG_RETRY_NOWAIT)
			return 0;

		up_read(&mm->mmap_sem);
		if (flags & FAULT_FLAG_KILLABLE)
			wait_on_page_locked_killable(page);
		else
			wait_on_page_locked(page);
		return 0;
	} else {
		if (flags & FAULT_FLAG_KILLABLE) {
			int ret;

			ret = __lock_page_killable(page);
			if (ret) {
				up_read(&mm->mmap_sem);
				return 0;
			}
		} else
			__lock_page(page);
		return 1;
	}
}
             

            "Bobi Jam <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/44715
            Subject: LU-14713 llite: tighten condition for fault not drop mmap_sem
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0cccc61c4bd48cc6942963394cd2ea70e30b8563


            Neil Brown (neilb@suse.de) uploaded a new patch: https://review.whamcloud.com/44372
            Subject: LU-14713 llite: add memory barriier to the trunc_sem.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1b20297bce772534dd557abaa11859447da8651b

            neilb Neil Brown added a comment -

            It seems I was wrong about atomic_set implying a barrier, but I'm slightly less convinced that x86 doesn't require them.

            So I'd still be very surprised if barriers were causing the problem, but there is something worth fixing there.


            The atomic_set() in trunc_sem_up_write() should be atomic_set_release(), and the atomic_read()s should be atomic_read_acquire().

            I would still like to see the exact source code for the kernel modules which produced the hang, if that is possible.


            bobijam Zhenyu Xu added a comment - - edited

            Neil,

            I'm not familiar with memory barriers, but I noticed a kernel comment about wake_up_var() -> __wake_up_bit() -> waitqueue_active():

/**
 * waitqueue_active -- locklessly test for waiters on the queue
 * @wq_head: the waitqueue to test for waiters
 *
 * returns true if the wait list is not empty
 *
 * NOTE: this function is lockless and requires care, incorrect usage _will_
 * lead to sporadic and non-obvious failure.
 *
 * Use either while holding wait_queue_head::lock or when used for wakeups
 * with an extra smp_mb() like::
 *
 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);
 *
 * Because without the explicit smp_mb() it's possible for the
 * waitqueue_active() load to get hoisted over the @cond store such that we'll
 * observe an empty wait list while the waiter might not observe @cond.
 */

            Does that mean there should be smp_mb() in trunc_sem_up_write() before wake_up_var()?

static inline void trunc_sem_up_write(struct ll_trunc_sem *sem)
{
	atomic_set(&sem->ll_trunc_readers, 0);
	wake_up_var(&sem->ll_trunc_readers);
}
            neilb Neil Brown added a comment -

            I'm confident that memory barriers aren't a problem.  atomic_set() and atomic_read() provide any memory barriers they need, and none are needed on x86.

            So I'm stumped - for today at least.

            ll_trunc_readers is definitely 0 and whenever it is set to zero a wake_up happens.  But the waiter didn't wake up.

            I'd still like to see the exact lustre source code for these modules, but I doubt it will show anything.


            neilb Neil Brown added a comment -

            Looking at the vmcore, "bt -FF 4718" shows:


            #1 [ffffbe8540f3bb88] schedule at ffffffff8f6d38a8
            ffffbe8540f3bb90: [ffff9f9b3d8c37a0:lustre_inode_cache] trunc_sem_down_read+166
            #2 [ffffbe8540f3bb98] trunc_sem_down_read at ffffffffc12523c6 [lustre]
            ffffbe8540f3bba0: [ffff9f9b3d8c37a0:lustre_inode_cache] 00000000ffffffff


            So ffffbe8540f3bba0 must be the address of the ll_trunc_sem.

            crash> x/2wx 0xffff9f9b3d8c37a0
            0xffff9f9b3d8c37a0: 0x00000000 0x00000000

            So both ll_trunc_waiters and ll_trunc_readers are zero, which means trunc_sem_down_read() shouldn't block.

            This suggests a missed wake-up.  It could only be the wakeup from trunc_sem_up_write().  Maybe a barrier is needed after the atomic_set, and before the atomic_add_unless_negative.

            But I thought barriers like that weren't needed on x86.

            I'll read up about memory ordering again.


            bobijam Zhenyu Xu added a comment -

            I can open it, but I'm having difficulty finding 'struct ll_trunc_sem'. vmcore is available from:

            same here.

            crash> struct ll_trunc_sem
            struct: invalid data structure reference: ll_trunc_sem

            It seems the lustre module debug info is missing.

            rpm -qpl lustre-client-debuginfo-2.12.6_ddn19-1.el8.x86_64.rpm

            shows that it does not contain lustre module info, only some tools.

            neilb Neil Brown added a comment -

            > The reported version is 2.12.6-ddn19

            The public "lustre-release" git tree doesn't have any branch or tag with a name like that, and the public "2.12.6" doesn't contain trunc_sem_down_read() at all.

            Is there somewhere I can get the source code for that release?

            bobijam Zhenyu Xu added a comment -

            The reported version is 2.12.6-ddn19 (both client and server). And unfortunately I failed to load the vmcore with crash:

            # crash ./120035_YITP_client_ioerror/usr/lib/debug/lib/modules/4.18.0-240.15.1.el8_3.x86_64/vmlinux 120035_YITP_client_ioerror/vmcore 
            
            
            crash 7.2.3-11.el7_9.1
            Copyright (C) 2002-2017  Red Hat, Inc.
            Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
            Copyright (C) 1999-2006  Hewlett-Packard Co
            Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
            Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
            Copyright (C) 2005, 2011  NEC Corporation
            Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
            Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
            This program is free software, covered by the GNU General Public License,
            and you are welcome to change it and/or distribute copies of it under
            certain conditions.  Enter "help copying" to see the conditions.
            This program has absolutely no warranty.  Enter "help warranty" for details.
             
            GNU gdb (GDB) 7.6
            Copyright (C) 2013 Free Software Foundation, Inc.
            License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
            This is free software: you are free to change and redistribute it.
            There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
            and "show warranty" for details.
            This GDB was configured as "x86_64-unknown-linux-gnu"...
            
            
            WARNING: kernel relocated [222MB]: patching 97897 gdb minimal_symbol values
            
            
            please wait... (gathering task table data)                             
            crash: radix trees do not exist or have changed their format
            
             

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: bobijam Zhenyu Xu
              Votes: 0
              Watchers: 11
