Details
- Bug
- Resolution: Unresolved
- Major
- None
- Lustre 2.6.0, Lustre 2.5.2, Lustre 2.15.0
- None
- 3
- 14680
Description
User processes blocked in wait_for_completion() (called from osc_io_setattr_end() and osc_io_fsync_end()) cannot be killed, so the node must be rebooted if the server is unavailable:
LustreError: 13775:0:(ofd_obd.c:873:ofd_setattr()) testfs-OST0001: can't find object [0x100000000:0x5:0x0]
Lustre: testfs-OST0001-o: trigger OI scrub by RPC for [0x100000000:0x5:0x0], rc = 0 [1]
INFO: task touch:15134 blocked for more than 120 seconds.
touch D 0000000000000001 0 15134 15113 0x00000000
Call Trace:
[<ffffffff8150f475>] schedule_timeout+0x215/0x2e0
[<ffffffff8150f0f3>] wait_for_common+0x123/0x180
[<ffffffff8150f20d>] wait_for_completion+0x1d/0x20
[<ffffffffa0cdba7c>] osc_io_setattr_end+0xbc/0x190 [osc]
[<ffffffffa08bd100>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffa0d554b1>] lov_io_end_wrapper+0xf1/0x100 [lov]
[<ffffffffa0d551fe>] lov_io_call+0x8e/0x130 [lov]
[<ffffffffa0d56f8c>] lov_io_end+0x4c/0xf0 [lov]
[<ffffffffa08bd100>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffa08c1e82>] cl_io_loop+0xc2/0x1b0 [obdclass]
[<ffffffffa11838d8>] cl_setattr_ost+0x218/0x2f0 [lustre]
[<ffffffffa11501cc>] ll_setattr_raw+0xa2c/0x1080 [lustre]
[<ffffffffa115087d>] ll_setattr+0x5d/0xf0 [lustre]
[<ffffffff8119ead8>] notify_change+0x168/0x340
[<ffffffff811b2b7c>] utimes_common+0xdc/0x1b0
[<ffffffff811b2ce9>] do_utimes+0x99/0xf0
[<ffffffff811b2e42>] sys_utimensat+0x32/0x90
The specific problem hit on the OST is largely irrelevant to this bug. Ideally the client would handle this error properly and not hang at all, but there will always be some other case where the OST is inactive and the client gets no reply.
Instead of using wait_for_completion(), this code could use l_wait_event() or wait_for_completion_killable(), so that the user process can be killed if there is a problem on the OST.
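To illustrate the difference in behaviour, here is a minimal userspace sketch of the return convention of a killable wait. It is not Lustre or kernel code: struct completion_model and the model_*() helpers are hypothetical names invented for this example, the wait polls instead of sleeping on a wait queue, and -EINTR stands in for the kernel's -ERESTARTSYS. The point is only that the waiter gets a way out when no reply ever arrives.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Userspace stand-in for a kernel completion. All names are hypothetical. */
struct completion_model {
	atomic_bool done;    /* set when the reply (or an error) arrives */
	atomic_bool killed;  /* set when a fatal signal targets the waiter */
};

static void model_init(struct completion_model *c)
{
	atomic_init(&c->done, false);
	atomic_init(&c->killed, false);
}

/* Called by whoever delivers the reply, e.g. an RPC completion callback. */
static void model_complete(struct completion_model *c)
{
	atomic_store(&c->done, true);
}

/* Models delivery of a fatal signal (such as SIGKILL) to the waiter. */
static void model_kill(struct completion_model *c)
{
	atomic_store(&c->killed, true);
}

/*
 * Models the return convention of wait_for_completion_killable():
 * 0 once the event completes, a negative errno if a fatal signal
 * arrives first. The kernel returns -ERESTARTSYS; -EINTR is used here
 * because ERESTARTSYS is not exposed to userspace.
 */
static int model_wait_killable(struct completion_model *c)
{
	const struct timespec tick = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */

	assert(c != NULL);
	for (;;) {
		if (atomic_load(&c->done))
			return 0;
		if (atomic_load(&c->killed))
			return -EINTR;
		nanosleep(&tick, NULL); /* poll; a real wait would sleep on a queue */
	}
}
```

With a plain wait_for_completion() there is no equivalent of the killed branch, which is why the touch process in the trace above stays in D state. In the kernel, a caller that returns early on a fatal signal would presumably also have to make sure the completion callback never touches state the waiter has already released; that is outside the scope of this sketch.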
Issue Links
- is related to LU-5446 Test timeout lustre-rsync-test test_4: NULL deref osc_sync_interpret+0x147 (Resolved)