[LU-4216] lockup in mdt_intent_layout -> lu_object_find_at Created: 06/Nov/13  Updated: 27/Nov/13  Resolved: 27/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Attachments: Text File log.21501.gz     Text File log.30740.gz    
Issue Links:
Duplicate
duplicates LU-4106 racer test hang Resolved
Severity: 3
Rank (Obsolete): 11468

 Description   

I've been hitting this lately in racer.
Unmount is not able to finish and things hang, but I suspect the lockup is not necessarily related to shutdown.
I also have a crashdump taken about 30 minutes after the condition was detected, if desired.
This is a fairly recent master too.

[ 9489.412732] LNet: Service thread pid 21501 was inactive for 62.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 9489.414125] Pid: 21501, comm: mdt00_006
[ 9489.414387] 
[ 9489.414387] Call Trace:
[ 9489.414848]  [<ffffffffa056d834>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
[ 9489.415165]  [<ffffffffa056de4b>] lu_object_find_at+0xab/0x360 [obdclass]
[ 9489.415522]  [<ffffffffa0695976>] ? lustre_msg_string+0x96/0x290 [ptlrpc]
[ 9489.415870]  [<ffffffff8105ad30>] ? default_wake_function+0x0/0x20
[ 9489.416214]  [<ffffffffa06958d5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[ 9489.416615]  [<ffffffffa056e116>] lu_object_find+0x16/0x20 [obdclass]
[ 9489.416928]  [<ffffffffa0b1bad6>] mdt_object_find+0x56/0x170 [mdt]
[ 9489.417224]  [<ffffffffa0b2bac4>] mdt_getattr_name_lock+0x804/0x19a0 [mdt]
[ 9489.417564]  [<ffffffffa06958d5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[ 9489.418207]  [<ffffffffa06bc336>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
[ 9489.418555]  [<ffffffffa0697b84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
[ 9489.418870]  [<ffffffffa0b2cef9>] mdt_intent_getattr+0x299/0x480 [mdt]
[ 9489.419172]  [<ffffffffa0b1d5b9>] mdt_intent_policy+0x499/0xca0 [mdt]
[ 9489.419492]  [<ffffffffa064e32a>] ldlm_lock_enqueue+0x2ea/0x860 [ptlrpc]
[ 9489.419814]  [<ffffffffa0676c4f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
[ 9489.420147]  [<ffffffffa06ea772>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
[ 9489.420462]  [<ffffffffa06e8cbf>] tgt_request_handle+0x5ff/0x1200 [ptlrpc]
[ 9489.420813]  [<ffffffffa06a63d5>] ptlrpc_server_handle_request+0x395/0xc20 [ptlrpc]
[ 9489.421314]  [<ffffffffa0ec540f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[ 9489.422763]  [<ffffffffa069dd41>] ? ptlrpc_wait_event+0xc1/0x2e0 [ptlrpc]
[ 9489.423102]  [<ffffffffa06a76ba>] ptlrpc_main+0xa5a/0x1690 [ptlrpc]
[ 9489.423445]  [<ffffffffa06a6c60>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
[ 9489.423801]  [<ffffffff81094726>] kthread+0x96/0xa0
[ 9489.424090]  [<ffffffff8100c10a>] child_rip+0xa/0x20
[ 9489.424395]  [<ffffffff81094690>] ? kthread+0x0/0xa0
[ 9489.424686]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
[ 9489.424955] 
[ 9489.425179] LustreError: dumping log to /tmp/lustre-log.1383705973.21501
[ 9497.613530] LNet: Service thread pid 30740 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 9497.614414] Pid: 30740, comm: mdt01_003
[ 9497.614680] 
[ 9497.614681] Call Trace:
[ 9497.615135]  [<ffffffffa056d834>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
[ 9497.615456]  [<ffffffffa056de4b>] lu_object_find_at+0xab/0x360 [obdclass]
[ 9497.615759]  [<ffffffff8105ad30>] ? default_wake_function+0x0/0x20
[ 9497.616061]  [<ffffffffa056e116>] lu_object_find+0x16/0x20 [obdclass]
[ 9497.616375]  [<ffffffffa0b1bad6>] mdt_object_find+0x56/0x170 [mdt]
[ 9497.616676]  [<ffffffffa0b2478d>] mdt_intent_layout+0x12d/0x640 [mdt]
[ 9497.616975]  [<ffffffffa0b1d5b9>] mdt_intent_policy+0x499/0xca0 [mdt]
[ 9497.617299]  [<ffffffffa064e32a>] ldlm_lock_enqueue+0x2ea/0x860 [ptlrpc]
[ 9497.617684]  [<ffffffffa0676c4f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
[ 9497.618028]  [<ffffffffa06ea772>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
[ 9497.618340]  [<ffffffffa06e8cbf>] tgt_request_handle+0x5ff/0x1200 [ptlrpc]
[ 9497.618672]  [<ffffffffa06a63d5>] ptlrpc_server_handle_request+0x395/0xc20 [ptlrpc]
[ 9497.619169]  [<ffffffffa0ec540f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[ 9497.619496]  [<ffffffffa069dd41>] ? ptlrpc_wait_event+0xc1/0x2e0 [ptlrpc]
[ 9497.619819]  [<ffffffffa06a76ba>] ptlrpc_main+0xa5a/0x1690 [ptlrpc]
[ 9497.620131]  [<ffffffffa06a6c60>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
[ 9497.620422]  [<ffffffff81094726>] kthread+0x96/0xa0
[ 9497.620692]  [<ffffffff8100c10a>] child_rip+0xa/0x20
[ 9497.620955]  [<ffffffff81094690>] ? kthread+0x0/0xa0
[ 9497.621219]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
[ 9497.621489] 
[ 9497.622494] LustreError: dumping log to /tmp/lustre-log.1383705981.30740
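
For comparing the two watchdog dumps above, a small script can drop the frames marked `?` (which the kernel unwinder reports as unreliable) and show where the two threads share a call path. This is only a sketch for reading traces in this dmesg format; the helper names `reliable_frames` and `common_prefix` are made up for illustration, and the sample input is a few lines copied from the traces in this ticket.

```python
import re

# Matches one dmesg stack frame such as:
#   [ 9489.415165]  [<ffffffffa056de4b>] lu_object_find_at+0xab/0x360 [obdclass]
# A leading "? " before the symbol marks a frame the unwinder is unsure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace_text):
    """Return symbols of frames not marked '?', innermost frame first."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m and not m.group("unreliable"):
            frames.append(m.group("sym"))
    return frames


def common_prefix(a, b):
    """Return the leading frames shared by two innermost-first frame lists."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


# Hypothetical sample input: the innermost frames of each trace in this ticket.
TRACE_21501 = """\
[ 9489.414848]  [<ffffffffa056d834>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
[ 9489.415165]  [<ffffffffa056de4b>] lu_object_find_at+0xab/0x360 [obdclass]
[ 9489.415522]  [<ffffffffa0695976>] ? lustre_msg_string+0x96/0x290 [ptlrpc]
[ 9489.416615]  [<ffffffffa056e116>] lu_object_find+0x16/0x20 [obdclass]
[ 9489.416928]  [<ffffffffa0b1bad6>] mdt_object_find+0x56/0x170 [mdt]
[ 9489.417224]  [<ffffffffa0b2bac4>] mdt_getattr_name_lock+0x804/0x19a0 [mdt]
"""

TRACE_30740 = """\
[ 9497.615135]  [<ffffffffa056d834>] ? htable_lookup+0x1c4/0x1e0 [obdclass]
[ 9497.615456]  [<ffffffffa056de4b>] lu_object_find_at+0xab/0x360 [obdclass]
[ 9497.615759]  [<ffffffff8105ad30>] ? default_wake_function+0x0/0x20
[ 9497.616061]  [<ffffffffa056e116>] lu_object_find+0x16/0x20 [obdclass]
[ 9497.616375]  [<ffffffffa0b1bad6>] mdt_object_find+0x56/0x170 [mdt]
[ 9497.616676]  [<ffffffffa0b2478d>] mdt_intent_layout+0x12d/0x640 [mdt]
"""

if __name__ == "__main__":
    a = reliable_frames(TRACE_21501)
    b = reliable_frames(TRACE_30740)
    print("shared innermost frames:", common_prefix(a, b))
    print("first differing callers:", a[3], "vs", b[3])
```

On these excerpts both threads share the same innermost path through lu_object_find_at, lu_object_find and mdt_object_find, and differ only in the caller (mdt_getattr_name_lock vs mdt_intent_layout). That matches the traces as posted but does not by itself say what the threads are waiting on.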


 Comments   
Comment by Oleg Drokin [ 06/Nov/13 ]

I guess this might be similar to LU-2492 too.

Comment by Oleg Drokin [ 06/Nov/13 ]

Dumped logfiles from watchdog triggers.

Comment by John Hammond [ 06/Nov/13 ]

Isn't this just LU-4106?

Comment by Andreas Dilger [ 27/Nov/13 ]

Closing as a duplicate of LU-4106, which has patches that should fix this problem.
