[LU-5489] ll_ost thread stuck at lu_object_find_at Created: 14/Aug/14 Updated: 02/Oct/14 Resolved: 03/Sep/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
source at https://github.com/jlan/lustre-nas |
||
| Attachments: | service164.gz |
| Severity: | 3 |
| Rank (Obsolete): | 15312 |
| Description |
|
Attached file (service164.gz) has a complete trace of all threads.

LNet: Service thread pid 8805 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 8805, comm: ll_ost03_040

Call Trace:
[<ffffffffa04cf6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa062c6b3>] lu_object_find_at+0xb3/0x360 [obdclass]
[<ffffffff81063be0>] ? default_wake_function+0x0/0x20
[<ffffffffa0e74cb9>] ? ofd_key_init+0x59/0x1a0 [ofd]
[<ffffffffa062c976>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0e886c5>] ofd_object_find+0x35/0xf0 [ofd]
[<ffffffffa062d57e>] ? lu_env_init+0x1e/0x30 [obdclass]
[<ffffffffa0e98649>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
[<ffffffffa0e7df77>] ofd_setattr+0x7e7/0xb80 [ofd]
[<ffffffffa0e4ec1c>] ost_setattr+0x31c/0x990 [ost]
[<ffffffffa0e52746>] ost_handle+0x21e6/0x48e0 [ost]
[<ffffffffa04db124>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[<ffffffffa07c53b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa04cf5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa04e0d6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa07bc719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffff81063be0>] ? default_wake_function+0x0/0x20
[<ffffffffa07c674e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
[<ffffffffa07c5c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa07c5c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffffa07c5c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
| Comments |
| Comment by Oleg Drokin [ 14/Aug/14 ] |
|
I guess this is similar in nature to LU-4725, only this time in the ofd code. This should be a somewhat rare race: ofd_setattr() does ofd_object_find() and pins an object. Then some other thread destroys that object, and ldlm_res_lvbo_update()->ofd_lvbo_update() does ofd_object_find() again, finds the now-destroyed object, and starts to wait until the references go away, but they never will because it is this same thread that is holding the reference. Technically the object should never be deleted, because we are supposed to hold an LDLM lock on it, but I imagine that if the lock was somehow lost (it is held by a client, so for example if the client was evicted) this could happen. For a fix we need to add some sort of non-racy check that the object is still alive before going into the lvbo update. Looking at the logs we can see there was indeed a bunch of evictions, so this is a plausible scenario. |
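For illustration only, a small userspace model of the self-deadlock described above; the struct and function names below (object_get, find_would_block_forever) are invented for the example and are not Lustre APIs. It sketches only the reference-counting logic that makes lu_object_find_at() wait forever when the waiting thread itself holds the last reference to a dying object.

/* Userspace model of the self-deadlock; not Lustre code. */
#include <stdbool.h>
#include <stdio.h>

struct obj {
	int  refs;   /* references pinning the object              */
	bool dying;  /* object was destroyed by another thread     */
};

/* ofd_setattr analogue: pin the object before working on it */
static void object_get(struct obj *o)
{
	o->refs++;
}

/*
 * lu_object_find_at() analogue: a dying object may only be re-found
 * once every reference is gone, so the finder waits for refs == 0.
 * If the waiter itself holds the last reference, it waits forever.
 */
static bool find_would_block_forever(const struct obj *o, int refs_held_by_caller)
{
	return o->dying && o->refs > 0 && o->refs == refs_held_by_caller;
}

int main(void)
{
	struct obj o = { .refs = 0, .dying = false };

	object_get(&o);        /* ofd_setattr: ofd_object_find() pins the object   */
	o.dying = true;        /* another thread destroys it (e.g. after eviction) */

	/* ofd_lvbo_update -> lu_object_find_at(): wait for refs to drop */
	if (find_would_block_forever(&o, 1))
		printf("self-deadlock: waiter holds the only reference\n");

	return 0;
}

A non-racy aliveness check before entering the lvbo update, as suggested above, would break this cycle by declining to re-find an object that is already being destroyed.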
| Comment by Peter Jones [ 15/Aug/14 ] |
|
Hongchao, could you please look into the feasibility of reworking this code in the manner Oleg suggests? Thanks, Peter |
| Comment by Oleg Drokin [ 20/Aug/14 ] |
|
After some additional digging I found that this is actually a dup of |