[LU-5432] bogus FIDs cause endless loops in fld_client_rpc()
Created: 30/Jul/14  Updated: 06/Oct/14  Resolved: 06/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug
Priority: Minor
Reporter: John Hammond
Assignee: John Hammond
Resolution: Fixed
Votes: 0
Labels: fault, fld

Severity: 3
Rank (Obsolete): 15132

 Description   

If a secondary MDT tries to use a FID with a bogus sequence, the handler will loop forever in fld_client_rpc():

[ 9667.196083] LustreError: 28865:0:(fld_handler.c:261:fld_server_lookup()) srv-lustre-MDT0000: Cannot find sequence 0x4000002c0000401: rc = -2
[ 9667.198478] LustreError: 28865:0:(fld_handler.c:261:fld_server_lookup()) Skipped 167399 previous similar messages
[ 9699.057201] LNet: 4534:0:(watchdog.c:200:lcw_dump_stack()) Service thread pid 17054 was inactive for 62.00s. The thread might be hung, or it might only be slow and will resume  later. Dumping the stack trace for debugging purposes:
[ 9699.061160] Pid: 17054, comm: mdt01_011

17054 mdt01_011
[<ffffffffa06821fa>] ptlrpc_set_wait+0x2ea/0x830 [ptlrpc]
[<ffffffffa06827c7>] ptlrpc_queue_wait+0x87/0x220 [ptlrpc]
[<ffffffffa087455b>] fld_client_rpc+0x15b/0x4b0 [fld]
[<ffffffffa0879c81>] fld_server_lookup+0x151/0x340 [fld]
[<ffffffffa0d6f567>] lod_fld_lookup+0x1e7/0x350 [lod]
[<ffffffffa0d81b63>] lod_object_init+0x103/0x3c0 [lod]
[<ffffffffa0455b98>] lu_object_alloc+0xd8/0x320 [obdclass]
[<ffffffffa045718f>] lu_object_find_at+0x2bf/0x410 [obdclass]
[<ffffffffa04572f6>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0c95f56>] mdt_object_find+0x56/0x170 [mdt]
[<ffffffffa0ccbe71>] mdt_reint_open+0x2e1/0x2180 [mdt]
[<ffffffffa0cb2811>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0c9cdb3>] mdt_reint_internal+0x4d3/0x7b0 [mdt]
[<ffffffffa0c9d286>] mdt_intent_reint+0x1f6/0x520 [mdt]
[<ffffffffa0c9b929>] mdt_intent_policy+0x499/0xcf0 [mdt]
[<ffffffffa0644342>] ldlm_lock_enqueue+0x302/0x880 [ptlrpc]
[<ffffffffa066c343>] ldlm_handle_enqueue0+0x373/0x1130 [ptlrpc]
[<ffffffffa06eb592>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
[<ffffffffa06eacbe>] tgt_request_handle+0x71e/0xb10 [ptlrpc]
[<ffffffffa069d847>] ptlrpc_main+0xd47/0x1860 [ptlrpc]
[<ffffffff8109eab6>] kthread+0x96/0xa0
[<ffffffff8100c30a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

This was found through RPC corruption.
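
The failure mode described above can be illustrated with a minimal sketch. This is not the Lustre code or the landed patch: mock_lookup_rpc(), is_retryable() and lookup_with_retry() are hypothetical names, and the choice of which error codes count as transient is an assumption made for illustration. The idea is that a lookup loop should retry only on errors that may clear up by themselves, and should stop at once on a permanent error such as the -ENOENT (rc = -2) logged for the unknown sequence.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for a remote sequence lookup. Nothing here is
 * Lustre API; the behaviour is invented for this illustration.
 */
int mock_lookup_rpc(unsigned long long seq, int *calls)
{
	(*calls)++;
	if (seq >= 0x1000ULL)
		return -ENOENT;	/* unknown sequence: a permanent failure */
	if (*calls < 3)
		return -EAGAIN;	/* pretend the server is busy at first */
	return 0;
}

/* Assumption: only these errors are worth retrying. */
int is_retryable(int rc)
{
	return rc == -EAGAIN || rc == -ENOTCONN;
}

/*
 * Retry at most max_tries times, and give up immediately on a
 * non-retryable error instead of looping on it.
 */
int lookup_with_retry(unsigned long long seq, int max_tries, int *calls)
{
	int rc = -EAGAIN;
	int i;

	for (i = 0; i < max_tries; i++) {
		rc = mock_lookup_rpc(seq, calls);
		if (rc == 0 || !is_retryable(rc))
			break;
	}
	return rc;
}
```

With the bogus sequence from the log, lookup_with_retry(0x4000002c0000401ULL, 5, &calls) returns -ENOENT after a single call. A loop that treated every nonzero rc as retryable would keep issuing the same lookup indefinitely.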



 Comments   
Comment by John Hammond [ 26/Aug/14 ]

Please see http://review.whamcloud.com/11605, unless someone can suggest a cleaner way to return a non-retryable failure.

Comment by Peter Jones [ 06/Oct/14 ]

Landed for 2.7
