Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2492

MDT thread stuck: mdd_object_find -> lu_object_find_at

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 5846

    Description

      I think this is a duplicate of LU-1640, but just in case it's not I'm opening this as a separate ticket.

      I've found the following message on the console of our Grove MDS today after upgrading to 2.3.57-1chaos-1surya1 (2.3.57-1chaos + a couple changes for LU-2109) and rebooting a few times.

                                                                               
      2012-12-13 13:25:25 LNet: Service thread pid 33078 was inactive for 450.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2012-12-13 13:25:25 Pid: 33078, comm: mdt01_020                                    
      2012-12-13 13:25:25                                                                
      2012-12-13 13:25:25 Call Trace:                                                    
      2012-12-13 13:25:25  [<ffffffffa05b07de>] cfs_waitq_wait+0xe/0x10 [libcfs]         
      2012-12-13 13:25:25  [<ffffffffa0764fe3>] lu_object_find_at+0xb3/0x460 [obdclass]
      2012-12-13 13:25:25  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20         
      2012-12-13 13:25:25  [<ffffffff8126c735>] ? _atomic_dec_and_lock+0x55/0x80         
      2012-12-13 13:25:25  [<ffffffffa07653cf>] lu_object_find_slice+0x1f/0x80 [obdclass]
      2012-12-13 13:25:25  [<ffffffffa0c27310>] mdd_object_find+0x10/0x70 [mdd]          
      2012-12-13 13:25:25  [<ffffffffa0c2b89f>] mdd_path+0x35f/0x1060 [mdd]              
      2012-12-13 13:25:25  [<ffffffffa08f30f1>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc]
      2012-12-13 13:25:25  [<ffffffffa0c8dfd7>] mdt_get_info+0x567/0xbb0 [mdt]           
      2012-12-13 13:25:25  [<ffffffffa0c8941d>] ? mdt_unpack_req_pack_rep+0x4d/0x4c0 [mdt]
      2012-12-13 13:25:25  [<ffffffffa0c91782>] mdt_handle_common+0x932/0x1760 [mdt]  
      2012-12-13 13:25:25  [<ffffffffa0c92685>] mdt_regular_handle+0x15/0x20 [mdt]       
      2012-12-13 13:25:25  [<ffffffffa09036ac>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
      2012-12-13 13:25:25  [<ffffffffa05b06be>] ? cfs_timer_arm+0xe/0x10 [libcfs]        
      2012-12-13 13:25:25  [<ffffffffa05c214f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
      2012-12-13 13:25:25  [<ffffffffa08faa79>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2012-12-13 13:25:25  [<ffffffff81051ba3>] ? __wake_up+0x53/0x70                    
      2012-12-13 13:25:25  [<ffffffffa0904c45>] ptlrpc_main+0xbb5/0x1970 [ptlrpc]        
      2012-12-13 13:25:25  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 13:25:25  [<ffffffff8100c14a>] child_rip+0xa/0x20                       
      2012-12-13 13:25:25  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 13:25:25  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 13:25:25  [<ffffffff8100c140>] ? child_rip+0x0/0x20                     
      2012-12-13 13:25:25                                                                
      2012-12-13 13:25:25 LustreError: dumping log to /tmp/lustre-log.1355433925.33078
      

      Even nearly an hour later, it looks like the thread is still stuck. I continually see the following message every few minutes:

                                                                               
      2012-12-13 14:22:12 INFO: task mdt01_020:33078 blocked for more than 120 seconds.
      2012-12-13 14:22:12 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      2012-12-13 14:22:12 mdt01_020     D 0000000000000004     0 33078      2 0x00000000
      2012-12-13 14:22:12  ffff880f444d7ae0 0000000000000046 0000000000000000 ffff880f444d7b60
      2012-12-13 14:22:12  ffff880f444d7a90 ffffc9007e22e02c 0000000000000246 0000000000000246
      2012-12-13 14:22:12  ffff880f444d5ab8 ffff880f444d7fd8 000000000000f4e8 ffff880f444d5ab8
      2012-12-13 14:22:12 Call Trace:                                                    
      2012-12-13 14:22:12  [<ffffffffa05b07de>] cfs_waitq_wait+0xe/0x10 [libcfs]         
      2012-12-13 14:22:12  [<ffffffffa0764fe3>] lu_object_find_at+0xb3/0x460 [obdclass]
      2012-12-13 14:22:12  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20         
      2012-12-13 14:22:12  [<ffffffff8126c735>] ? _atomic_dec_and_lock+0x55/0x80         
      2012-12-13 14:22:12  [<ffffffffa07653cf>] lu_object_find_slice+0x1f/0x80 [obdclass]
      2012-12-13 14:22:12  [<ffffffffa0c27310>] mdd_object_find+0x10/0x70 [mdd]          
      2012-12-13 14:22:12  [<ffffffffa0c2b89f>] mdd_path+0x35f/0x1060 [mdd]              
      2012-12-13 14:22:12  [<ffffffffa08f30f1>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc]
      2012-12-13 14:22:12  [<ffffffffa0c8dfd7>] mdt_get_info+0x567/0xbb0 [mdt]           
      2012-12-13 14:22:12  [<ffffffffa0c8941d>] ? mdt_unpack_req_pack_rep+0x4d/0x4c0 [mdt]
      2012-12-13 14:22:12  [<ffffffffa0c91782>] mdt_handle_common+0x932/0x1760 [mdt]  
      2012-12-13 14:22:12  [<ffffffffa0c92685>] mdt_regular_handle+0x15/0x20 [mdt]       
      2012-12-13 14:22:12  [<ffffffffa09036ac>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
      2012-12-13 14:22:12  [<ffffffffa05b06be>] ? cfs_timer_arm+0xe/0x10 [libcfs]        
      2012-12-13 14:22:12  [<ffffffffa05c214f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
      2012-12-13 14:22:12  [<ffffffffa08faa79>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2012-12-13 14:22:12  [<ffffffff81051ba3>] ? __wake_up+0x53/0x70                    
      2012-12-13 14:22:12  [<ffffffffa0904c45>] ptlrpc_main+0xbb5/0x1970 [ptlrpc]        
      2012-12-13 14:22:12  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 14:22:12  [<ffffffff8100c14a>] child_rip+0xa/0x20                       
      2012-12-13 14:22:12  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 14:22:12  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 14:22:12  [<ffffffff8100c140>] ? child_rip+0x0/0x20                     
      

      I've attached the lustre log file dumped by the LNet message above:

                                                                               
      2012-12-13 13:25:25 LustreError: dumping log to /tmp/lustre-log.1355433925.33078
      

      Attachments

        Issue Links

          Activity

            People

              bobijam Zhenyu Xu
              prakash Prakash Surya (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: