[LU-5261] user process is unkillable in wait_for_completion() Created: 26/Jun/14  Updated: 30/Jan/22

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.2, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Andreas Dilger Assignee: Emoly Liu
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5446 Test timeout lustre-rsync-test test_4... Resolved
Severity: 3
Rank (Obsolete): 14680

 Description   

User processes blocked in wait_for_completion() (called from osc_io_setattr_end() and osc_io_fsync_end()) cannot be killed, and the node must be rebooted to clear them if the server is unavailable:

LustreError: 13775:0:(ofd_obd.c:873:ofd_setattr()) testfs-OST0001: can't find object [0x100000000:0x5:0x0]
Lustre: testfs-OST0001-o: trigger OI scrub by RPC for [0x100000000:0x5:0x0], rc = 0 [1]
INFO: task touch:15134 blocked for more than 120 seconds.
touch         D 0000000000000001     0 15134  15113 0x00000000
Call Trace:
 [<ffffffff8150f475>] schedule_timeout+0x215/0x2e0
 [<ffffffff8150f0f3>] wait_for_common+0x123/0x180
 [<ffffffff8150f20d>] wait_for_completion+0x1d/0x20
 [<ffffffffa0cdba7c>] osc_io_setattr_end+0xbc/0x190 [osc]
 [<ffffffffa08bd100>] cl_io_end+0x60/0x150 [obdclass]
 [<ffffffffa0d554b1>] lov_io_end_wrapper+0xf1/0x100 [lov]
 [<ffffffffa0d551fe>] lov_io_call+0x8e/0x130 [lov]
 [<ffffffffa0d56f8c>] lov_io_end+0x4c/0xf0 [lov]
 [<ffffffffa08bd100>] cl_io_end+0x60/0x150 [obdclass]
 [<ffffffffa08c1e82>] cl_io_loop+0xc2/0x1b0 [obdclass]
 [<ffffffffa11838d8>] cl_setattr_ost+0x218/0x2f0 [lustre]
 [<ffffffffa11501cc>] ll_setattr_raw+0xa2c/0x1080 [lustre]
 [<ffffffffa115087d>] ll_setattr+0x5d/0xf0 [lustre]
 [<ffffffff8119ead8>] notify_change+0x168/0x340
 [<ffffffff811b2b7c>] utimes_common+0xdc/0x1b0
 [<ffffffff811b2ce9>] do_utimes+0x99/0xf0
 [<ffffffff811b2e42>] sys_utimensat+0x32/0x90

The problem being hit on the OST is somewhat irrelevant for the purposes of this bug. It would be ideal if the client actually handled this error properly and didn't hang at all, but there will always be some other case where the OST is inactive and the client doesn't get any reply at all.

Instead of using wait_for_completion(), this code could use l_wait_event() or wait_for_completion_killable() so that the user process can be killed if there is a problem on the OST.
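
To illustrate the difference a killable wait makes, here is a minimal userspace sketch. It is not Lustre or kernel code: struct model_completion and every model_* function are hypothetical stand-ins built on pthreads, and the "killed" flag only models a pending fatal signal. In the kernel, wait_for_completion_killable() returns -ERESTARTSYS when interrupted; the model returns -EINTR because -ERESTARTSYS is not exposed to userspace.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct model_completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;   /* set when the (modelled) RPC reply arrives */
	bool killed; /* models a fatal signal pending on the waiter */
};

static void model_init(struct model_completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
	c->killed = false;
}

/* Reply side: mark the work finished and wake any waiter. */
static void model_complete(struct model_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Signal side: mark the waiter as killed and wake it. */
static void model_kill(struct model_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->killed = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/*
 * Killable wait: returns 0 once completed, or -EINTR if the waiter is
 * killed first. A non-killable wait would loop on "done" alone and so
 * could never return if the reply never arrives.
 */
static int model_wait_killable(struct model_completion *c)
{
	int rc = 0;

	pthread_mutex_lock(&c->lock);
	while (!c->done) {
		if (c->killed) {
			rc = -EINTR;
			break;
		}
		pthread_cond_wait(&c->cond, &c->lock);
	}
	pthread_mutex_unlock(&c->lock);
	return rc;
}

/*
 * Hypothetical caller shaped like an "io end" step: report the wait
 * error if the wait was interrupted, otherwise the RPC result.
 */
static int model_io_end(struct model_completion *c, int rpc_rc)
{
	int rc = model_wait_killable(c);

	return rc ? rc : rpc_rc;
}
```

Note that a real change along these lines would also have to make sure the reply callback cannot touch state that the interrupted waiter has already released; this sketch does not model that.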



 Comments   
Comment by Emoly Liu [ 09/Jul/14 ]

Here is the patch: http://review.whamcloud.com/11021

Comment by Emoly Liu [ 31/Jul/14 ]

Patch landed to 2.6

Comment by Jodi Levi (Inactive) [ 02/Oct/14 ]

Reopening as the patch has been reverted and needs to be fixed and landed to Master.

Comment by nasf (Inactive) [ 07/Mar/15 ]

I hit similar trouble in ost-pools test_20 on master:
https://testing.hpdd.intel.com/test_sets/122e8ea2-c3f8-11e4-94d2-5254006e85c2

Comment by Li Xi (Inactive) [ 21/May/15 ]

Is there any chance that the patch can be reworked into a version that can land? We are seeing this problem frequently, especially when running rsync.
