
MDT thread stuck: mdd_object_find -> lu_object_find_at

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • Severity: 3
    • 5846

    Description

      I think this is a duplicate of LU-1640, but just in case it's not, I'm opening this as a separate ticket.

      I found the following message on the console of our Grove MDS today, after upgrading to 2.3.57-1chaos-1surya1 (2.3.57-1chaos plus a couple of changes for LU-2109) and rebooting a few times.

                                                                               
      2012-12-13 13:25:25 LNet: Service thread pid 33078 was inactive for 450.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2012-12-13 13:25:25 Pid: 33078, comm: mdt01_020                                    
      2012-12-13 13:25:25                                                                
      2012-12-13 13:25:25 Call Trace:                                                    
      2012-12-13 13:25:25  [<ffffffffa05b07de>] cfs_waitq_wait+0xe/0x10 [libcfs]         
      2012-12-13 13:25:25  [<ffffffffa0764fe3>] lu_object_find_at+0xb3/0x460 [obdclass]
      2012-12-13 13:25:25  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20         
      2012-12-13 13:25:25  [<ffffffff8126c735>] ? _atomic_dec_and_lock+0x55/0x80         
      2012-12-13 13:25:25  [<ffffffffa07653cf>] lu_object_find_slice+0x1f/0x80 [obdclass]
      2012-12-13 13:25:25  [<ffffffffa0c27310>] mdd_object_find+0x10/0x70 [mdd]          
      2012-12-13 13:25:25  [<ffffffffa0c2b89f>] mdd_path+0x35f/0x1060 [mdd]              
      2012-12-13 13:25:25  [<ffffffffa08f30f1>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc]
      2012-12-13 13:25:25  [<ffffffffa0c8dfd7>] mdt_get_info+0x567/0xbb0 [mdt]           
      2012-12-13 13:25:25  [<ffffffffa0c8941d>] ? mdt_unpack_req_pack_rep+0x4d/0x4c0 [mdt]
      2012-12-13 13:25:25  [<ffffffffa0c91782>] mdt_handle_common+0x932/0x1760 [mdt]  
      2012-12-13 13:25:25  [<ffffffffa0c92685>] mdt_regular_handle+0x15/0x20 [mdt]       
      2012-12-13 13:25:25  [<ffffffffa09036ac>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
      2012-12-13 13:25:25  [<ffffffffa05b06be>] ? cfs_timer_arm+0xe/0x10 [libcfs]        
      2012-12-13 13:25:25  [<ffffffffa05c214f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
      2012-12-13 13:25:25  [<ffffffffa08faa79>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2012-12-13 13:25:25  [<ffffffff81051ba3>] ? __wake_up+0x53/0x70                    
      2012-12-13 13:25:25  [<ffffffffa0904c45>] ptlrpc_main+0xbb5/0x1970 [ptlrpc]        
      2012-12-13 13:25:25  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 13:25:25  [<ffffffff8100c14a>] child_rip+0xa/0x20                       
      2012-12-13 13:25:25  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 13:25:25  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 13:25:25  [<ffffffff8100c140>] ? child_rip+0x0/0x20                     
      2012-12-13 13:25:25                                                                
      2012-12-13 13:25:25 LustreError: dumping log to /tmp/lustre-log.1355433925.33078
      

      Even nearly an hour later, the thread still appears to be stuck; I continue to see the following message every few minutes:

                                                                               
      2012-12-13 14:22:12 INFO: task mdt01_020:33078 blocked for more than 120 seconds.
      2012-12-13 14:22:12 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      2012-12-13 14:22:12 mdt01_020     D 0000000000000004     0 33078      2 0x00000000
      2012-12-13 14:22:12  ffff880f444d7ae0 0000000000000046 0000000000000000 ffff880f444d7b60
      2012-12-13 14:22:12  ffff880f444d7a90 ffffc9007e22e02c 0000000000000246 0000000000000246
      2012-12-13 14:22:12  ffff880f444d5ab8 ffff880f444d7fd8 000000000000f4e8 ffff880f444d5ab8
      2012-12-13 14:22:12 Call Trace:                                                    
      2012-12-13 14:22:12  [<ffffffffa05b07de>] cfs_waitq_wait+0xe/0x10 [libcfs]         
      2012-12-13 14:22:12  [<ffffffffa0764fe3>] lu_object_find_at+0xb3/0x460 [obdclass]
      2012-12-13 14:22:12  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20         
      2012-12-13 14:22:12  [<ffffffff8126c735>] ? _atomic_dec_and_lock+0x55/0x80         
      2012-12-13 14:22:12  [<ffffffffa07653cf>] lu_object_find_slice+0x1f/0x80 [obdclass]
      2012-12-13 14:22:12  [<ffffffffa0c27310>] mdd_object_find+0x10/0x70 [mdd]          
      2012-12-13 14:22:12  [<ffffffffa0c2b89f>] mdd_path+0x35f/0x1060 [mdd]              
      2012-12-13 14:22:12  [<ffffffffa08f30f1>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc]
      2012-12-13 14:22:12  [<ffffffffa0c8dfd7>] mdt_get_info+0x567/0xbb0 [mdt]           
      2012-12-13 14:22:12  [<ffffffffa0c8941d>] ? mdt_unpack_req_pack_rep+0x4d/0x4c0 [mdt]
      2012-12-13 14:22:12  [<ffffffffa0c91782>] mdt_handle_common+0x932/0x1760 [mdt]  
      2012-12-13 14:22:12  [<ffffffffa0c92685>] mdt_regular_handle+0x15/0x20 [mdt]       
      2012-12-13 14:22:12  [<ffffffffa09036ac>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
      2012-12-13 14:22:12  [<ffffffffa05b06be>] ? cfs_timer_arm+0xe/0x10 [libcfs]        
      2012-12-13 14:22:12  [<ffffffffa05c214f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
      2012-12-13 14:22:12  [<ffffffffa08faa79>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      2012-12-13 14:22:12  [<ffffffff81051ba3>] ? __wake_up+0x53/0x70                    
      2012-12-13 14:22:12  [<ffffffffa0904c45>] ptlrpc_main+0xbb5/0x1970 [ptlrpc]        
      2012-12-13 14:22:12  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 14:22:12  [<ffffffff8100c14a>] child_rip+0xa/0x20                       
      2012-12-13 14:22:12  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 14:22:12  [<ffffffffa0904090>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]        
      2012-12-13 14:22:12  [<ffffffff8100c140>] ? child_rip+0x0/0x20                     
      

      I've attached the lustre log file dumped by the LNet message above:

                                                                               
      2012-12-13 13:25:25 LustreError: dumping log to /tmp/lustre-log.1355433925.33078
      

      Attachments

        Issue Links

          Activity

            [LU-2492] MDT thread stuck: mdd_object_find -> lu_object_find_at

            prakash Prakash Surya (Inactive) added a comment -

            Awesome! Thanks Bobijam.
            bobijam Zhenyu Xu added a comment -

            landed on master for 2.4.0

            bobijam Zhenyu Xu added a comment - - edited

            patch tracking at http://review.whamcloud.com/3439

            commit message
            LU-2492 obdclass: lu_object_find_at() waits forever
            
            In lu_object_put(), cfs_hash_bd_dec_and_lock() can observe a refcount
            that includes the reference taken by a concurrent htable_lookup(). If
            the thread on this code path is the last real holder of the object,
            it should go on to free it, but because htable_lookup() has bumped
            the refcount, that last holder misses its chance to do so. From then
            on, htable_lookup() always finds the zero-referenced object in the
            hash table, nobody ever removes it from the table or frees it, and
            lu_object_find_at() keeps waiting forever.
            
            In this case, we do the unhash and free in htable_lookup() itself.
            
            
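            [Editor's note] To make the failure mode above concrete, here is a minimal sketch of the pattern the commit message describes, not the actual lu_object/cfs_hash code: all names (struct cobj, struct bucket, cache_lookup) are hypothetical, and the locking is simplified to a plain pthread mutex. The point it illustrates is that when a lookup finds a zero-referenced entry still sitting in the hash table, the lookup itself unhashes and frees it and reports a miss, rather than waiting for a holder that no longer exists.

            /*
             * Conceptual sketch only -- hypothetical names, simplified locking.
             * A reference-counted object cache with per-bucket chains: if a
             * lookup finds an entry whose refcount has already dropped to zero
             * (its last holder is gone), the lookup unhashes and frees it under
             * the bucket lock and treats it as a miss, instead of waiting for a
             * free that no remaining thread will ever perform.
             */
            #include <pthread.h>
            #include <stdlib.h>

            struct cobj {
                    struct cobj *next;      /* hash-bucket chain */
                    int          key;
                    int          refcount;  /* protected by the bucket lock */
                    /* ... payload ... */
            };

            struct bucket {
                    pthread_mutex_t lock;
                    struct cobj    *head;
            };

            /* Return the matching object with an extra reference, or NULL on a
             * miss.  Each key has at most one entry in the bucket. */
            static struct cobj *cache_lookup(struct bucket *b, int key)
            {
                    struct cobj **pp, *o;

                    pthread_mutex_lock(&b->lock);
                    for (pp = &b->head; (o = *pp) != NULL; pp = &o->next) {
                            if (o->key != key)
                                    continue;
                            if (o->refcount == 0) {
                                    /* Stale entry: unhash and free it here,
                                     * then report a miss so the caller
                                     * allocates a fresh object. */
                                    *pp = o->next;
                                    pthread_mutex_unlock(&b->lock);
                                    free(o);
                                    return NULL;
                            }
                            o->refcount++;  /* hand a reference to the caller */
                            pthread_mutex_unlock(&b->lock);
                            return o;
                    }
                    pthread_mutex_unlock(&b->lock);
                    return NULL;
            }

            The actual change lives in obdclass's htable_lookup() (see the Gerrit review linked above); the real code works with the cfs_hash bucket locks rather than a single mutex, but the overall shape is the same.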
            pjones Peter Jones added a comment -

            Bobijam

            Could you please comment as to whether this issue is a duplicate of LU-1640?

            Thanks

            Peter


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: prakash Prakash Surya (Inactive)
              Votes: 0
              Watchers: 7
