[LU-8010] lfs hsm command hangs up after lfs hsm_cancel Created: 12/Apr/16  Updated: 27/Sep/16  Resolved: 16/May/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Tatsushi Takamura Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Issue Links:
Related
Severity: 3
Project: HSM
Rank (Obsolete): 9223372036854775807

 Description   

When an hsm_restore is cancelled, subsequent lfs hsm commands hang.
To reproduce, set up HSM and run the following steps
(we found 2 patterns; they appear to hang at different points):

(1)

# lfs hsm_state /lustre/file
/lustre/file: (0x0000000d) released exists archived, archive_id:1
# lfs hsm_restore /lustre/file
# lfs hsm_cancel /lustre/file
# lfs hsm_restore /lustre/file (quickly after lfs hsm_cancel)
# lfs hsm_restore /lustre/file  *hang up

The call trace is as follows:

PID: 9550   TASK: ffff880c33f86040  CPU: 13  COMMAND: "lfs"
 #0 [ffff880c1ca8f568] schedule at ffffffff81539170
 #1 [ffff880c1ca8f640] schedule_timeout at ffffffff8153a042
 #2 [ffff880c1ca8f6f0] ldlm_completion_ast at ffffffffa0847fc9 [ptlrpc]
 #3 [ffff880c1ca8f7a0] ldlm_cli_enqueue_fini at ffffffffa08420e6 [ptlrpc]
 #4 [ffff880c1ca8f840] ldlm_cli_enqueue at ffffffffa08429a1 [ptlrpc]
 #5 [ffff880c1ca8f8f0] mdc_enqueue at ffffffffa0a6e8aa [mdc]
 #6 [ffff880c1ca8fa40] lmv_enqueue at ffffffffa0a25bfb [lmv]
 #7 [ffff880c1ca8fac0] ll_layout_refresh_locked at ffffffffa0b084f6 [lustre]
 #8 [ffff880c1ca8fc00] ll_layout_refresh at ffffffffa0b09159 [lustre]
 #9 [ffff880c1ca8fc50] vvp_io_init at ffffffffa0b5227f [lustre]
#10 [ffff880c1ca8fcc0] cl_io_init0 at ffffffffa06a8e78 [obdclass]
#11 [ffff880c1ca8fd00] cl_io_init at ffffffffa06abdf4 [obdclass]
#12 [ffff880c1ca8fd40] cl_glimpse_size0 at ffffffffa0b4bc05 [lustre]
#13 [ffff880c1ca8fda0] ll_getattr at ffffffffa0b07cb8 [lustre]
#14 [ffff880c1ca8fe40] vfs_getattr at ffffffff81197c61
#15 [ffff880c1ca8fe80] vfs_fstatat at ffffffff81197cf4
#16 [ffff880c1ca8fee0] vfs_lstat at ffffffff81197d9e
#17 [ffff880c1ca8fef0] sys_newlstat at ffffffff81197dc4
#18 [ffff880c1ca8ff80] tracesys at ffffffff8100b2e8 (via system_call)
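
A minimal shell sketch of pattern (1), assuming an HSM-enabled mount at /lustre, a running copytool, and a file that is already archived and released (paths are illustrative):

#!/bin/bash
FILE=/lustre/file

lfs hsm_state   "$FILE"    # expect: released exists archived
lfs hsm_restore "$FILE"    # start a restore
lfs hsm_cancel  "$FILE"    # cancel it
lfs hsm_restore "$FILE"    # re-issue quickly after the cancel
lfs hsm_restore "$FILE"    # this restore is the one that hangs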

(2)

# lfs hsm_state /lustre/file
/lustre/file: (0x0000000d) released exists archived, archive_id:1
# lfs hsm_restore /lustre/file
<MDS Immediate reset>
<MDS recover and HSM setup>
# lfs hsm_action /lustre/file
/lustre/file: RESTORE waiting (from 0 to EOF)
# lfs hsm_state /lustre/file
/lustre/file: (0x0000000d) released exists archived, archive_id:1
# lfs hsm_cancel /lustre/file
# lfs hsm_restore /lustre/file   *hang up

The call trace is as follows:

PID: 3731   TASK: ffff880c35087520  CPU: 13  COMMAND: "lfs"
 #0 [ffff880c327df598] schedule at ffffffff81539170
 #1 [ffff880c327df670] schedule_timeout at ffffffff8153a042
 #2 [ffff880c327df720] ptlrpc_set_wait at ffffffffa0860c41 [ptlrpc]
 #3 [ffff880c327df7e0] ptlrpc_queue_wait at ffffffffa0861301 [ptlrpc]
 #4 [ffff880c327df800] mdc_ioc_hsm_request at ffffffffa0a65138 [mdc]
 #5 [ffff880c327df830] mdc_iocontrol at ffffffffa0a66df9 [mdc]
 #6 [ffff880c327df950] obd_iocontrol at ffffffffa0a16aa5 [lmv]
 #7 [ffff880c327df9a0] lmv_iocontrol at ffffffffa0a2d9dc [lmv]
 #8 [ffff880c327dfb90] obd_iocontrol at ffffffffa0af2d15 [lustre]
 #9 [ffff880c327dfbe0] ll_dir_ioctl at ffffffffa0af9ae8 [lustre]
#10 [ffff880c327dfe60] vfs_ioctl at ffffffff811a7972
#11 [ffff880c327dfea0] do_vfs_ioctl at ffffffff811a7b14
#12 [ffff880c327dff30] sys_ioctl at ffffffff811a8091
#13 [ffff880c327dff80] tracesys at ffffffff8100b2e8 (via system_call)
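
A minimal shell sketch of pattern (2), under the same assumptions as above; the MDS reset step is site-specific and left as a placeholder:

#!/bin/bash
FILE=/lustre/file

lfs hsm_state   "$FILE"    # expect: released exists archived
lfs hsm_restore "$FILE"    # start a restore
# <reset the MDS immediately here, wait for recovery, redo the HSM setup>
lfs hsm_action  "$FILE"    # expect: RESTORE waiting (from 0 to EOF)
lfs hsm_state   "$FILE"
lfs hsm_cancel  "$FILE"    # cancel the stale restore
lfs hsm_restore "$FILE"    # this restore is the one that hangs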


 Comments   
Comment by Bruno Faccini (Inactive) [ 13/Apr/16 ]

Thanks for this report and the reproducer. I am presently working on a similar platform and trying to reproduce.
Did you notice any possibly related messages in the server/MDS, client, or agent syslogs?

Comment by Tatsushi Takamura [ 14/Apr/16 ]

The syslog messages differ between the 2 patterns.

(1)
The following are possibly related messages in the MDS syslog.

Apr 14 13:31:04 mds kernel: Lustre: lustre-MDT0000: Client 4e2ff398-da98-562d-99e7-cd38a8709ba2 (at 192.168.128.85@o2ib) reconnecting
Apr 14 13:31:04 mds kernel: Lustre: lustre-MDT0000: Connection restored to ce03607c-8712-5fed-7d04-6d296439e868 (at 192.168.128.85@o2ib)

The following are possibly related messages in the client syslog (the node where the lfs command is run).

Apr 14 13:31:04 clinet kernel: Lustre: lustre-MDT0000-mdc-ffff880c35677c00: Connection to lustre-MDT0000 (at 192.168.128.81@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Apr 14 13:31:04 clinet kernel: Lustre: Skipped 1 previous similar message
Apr 14 13:31:04 clinet kernel: LustreError: 6021:0:(ldlm_request.c:125:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1460607964, 300s ago), entering recovery for lustre-MDT0000_UUID@192.168.128.81@o2ib ns: lustre-MDT0000-mdc-ffff880c35677c00 lock: ffff880c3471dac0/0x241d687a0e4c12e9 lrc: 4/1,0 mode: --/CR res: [0x200000bd1:0x1:0x0].0x0 bits 0x8 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x936186c7472a77bf expref: -99 pid: 6021 timeout: 0 lvb_type: 3
Apr 14 13:31:04 clinet kernel: Lustre: lustre-MDT0000-mdc-ffff880c35677c00: Connection restored to 192.168.128.81@o2ib (at 192.168.128.81@o2ib)

(2)
The following are possibly related messages in the MDS syslog.

Apr 14 14:16:09 mds kernel: LNet: Service thread pid 3763 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Apr 14 14:16:09 mds kernel: Pid: 3763, comm: mdt02_004
Apr 14 14:16:09 mds kernel:
Apr 14 14:16:09 mds kernel: Call Trace:
Apr 14 14:16:09 mds kernel: [<ffffffff8153a042>] schedule_timeout+0x192/0x2e0
Apr 14 14:16:09 mds kernel: [<ffffffff81089be0>] ? process_timeout+0x0/0x10
Apr 14 14:16:09 mds kernel: [<ffffffffa08b9e30>] ? ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
Apr 14 14:16:09 mds kernel: [<ffffffffa08be989>] ldlm_completion_ast+0x609/0x9b0 [ptlrpc]
Apr 14 14:16:09 mds kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
Apr 14 14:16:09 mds kernel: [<ffffffffa08bdd8e>] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
Apr 14 14:16:09 mds kernel: [<ffffffffa08be380>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
Apr 14 14:16:09 mds kernel: [<ffffffffa1025130>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
Apr 14 14:16:09 mds kernel: [<ffffffffa10318e4>] mdt_object_local_lock+0x3a4/0xb00 [mdt]
Apr 14 14:16:09 mds kernel: [<ffffffffa1025130>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
Apr 14 14:16:09 mds kernel: [<ffffffffa08be380>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
Apr 14 14:16:09 mds kernel: [<ffffffffa06cf791>] ? lu_object_find_at+0xb1/0xe0 [obdclass]
Apr 14 14:16:09 mds kernel: [<ffffffffa10323b3>] mdt_object_lock_internal+0x63/0x310 [mdt]
Apr 14 14:16:09 mds kernel: [<ffffffffa1032874>] mdt_object_lock+0x14/0x20 [mdt]
Apr 14 14:16:09 mds kernel: [<ffffffffa1032a2e>] mdt_object_find_lock+0x5e/0x170 [mdt]
Apr 14 14:16:09 mds kernel: [<ffffffffa106a683>] mdt_hsm_add_actions+0x973/0x12b0 [mdt]
Apr 14 14:16:09 mds kernel: [<ffffffff8129957a>] ? strlcpy+0x4a/0x60
Apr 14 14:16:09 mds kernel: [<ffffffffa103d50f>] ? ucred_set_jobid+0x5f/0x70 [mdt]
Apr 14 14:16:09 mds kernel: [<ffffffffa106116c>] mdt_hsm_request+0x60c/0x9b0 [mdt]
Apr 14 14:16:09 mds kernel: [<ffffffffa094ec2c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Apr 14 14:16:09 mds kernel: [<ffffffffa08fbc61>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
Apr 14 14:16:09 mds kernel: [<ffffffff815391be>] ? thread_return+0x4e/0x7d0
Apr 14 14:16:09 mds kernel: [<ffffffffa08faf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
Apr 14 14:16:09 mds kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
Apr 14 14:16:09 mds kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Apr 14 14:16:09 mds kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
Apr 14 14:16:09 mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Apr 14 14:16:09 mds kernel:
Apr 14 14:16:09 mds kernel: LustreError: dumping log to /tmp/lustre-log.1460610969.3763
Apr 14 14:17:49 mds kernel: LustreError: 3763:0:(ldlm_request.c:106:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1460610768, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre-MDT0000_UUID lock: ffff880c18f31980/0x605de39e648b48a7 lrc: 3/0,1 mode: --/EX res: [0x200000402:0x1:0x0].0x0 bits 0x8 rrc: 3 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 3763 timeout: 0 lvb_type: 0
Apr 14 14:17:49 mds kernel: LustreError: dumping log to /tmp/lustre-log.1460611069.3763

There are no related messages in the client syslog (the node where the lfs command is run).

Comment by Bruno Faccini (Inactive) [ 21/Apr/16 ]

In both reproducer scenarios, the root cause is the layout lock not being released, leading to hung threads on both the client and MDS sides while handling HSM requests on the same file.

I think I have found a hole in the HSM code that can leave a layout lock taken for a restore unreleased. I will push a tentative patch to fix it.
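
A hedged diagnostic sketch for observing the stuck state described above; the parameter names are assumptions based on common Lustre tunables and may vary between versions:

# Per-file view of the pending HSM action (client side):
lfs hsm_action /lustre/file

# Coordinator action list on the MDS; the cancelled/stale restore should appear here:
lctl get_param mdt.*.hsm.actions

# Per-namespace lock counts; an orphan layout lock keeps these from draining:
lctl get_param ldlm.namespaces.*.lock_count

# Hung-thread stack traces end up in the kernel log, as in the syslog excerpts above:
dmesg | grep -A 2 'was inactive for'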

Comment by Gerrit Updater [ 21/Apr/16 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/19710
Subject: LU-8010 mdt: unlock layout when failing restore
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 656abdd8a342902e67dc77a86e30fb0e0a30c8da

Comment by Bruno Faccini (Inactive) [ 22/Apr/16 ]

Hello Tatsushi,
In my local testing with the patch, the reproducers you detailed no longer trigger a hang.
Can you give it a try at your site?

Comment by Tatsushi Takamura [ 26/Apr/16 ]

Hello Bruno,

I have confirmed that the command hang-up from reproducer (1) is resolved, but the restore command still hangs with reproducer (2).
Could you check again?

Comment by Bruno Faccini (Inactive) [ 29/Apr/16 ]

Right, I double-checked, and reproducer #2 is a different condition that leads to the same consequence (an orphan layout lock) and must be fixed separately. In short, my understanding is that a restore that was in progress when an MDT failure occurred must be set back to the WAITING state so it can be replayed (if legal) upon CDT restart. This allows it to be cancelled without leaving the associated layout locked.
I have pushed a new patch set; let's see how it works.
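
A hedged verification sketch for this second scenario once the new patch set is applied; it assumes the same setup as the reproducers and that the MDS has been reset and recovered:

FILE=/lustre/file

# After CDT restart the interrupted restore should be back in WAITING state:
lfs hsm_action  "$FILE"    # expect: RESTORE waiting

# Cancelling it should also release the layout lock, so a subsequent restore
# no longer hangs:
lfs hsm_cancel  "$FILE"
lfs hsm_restore "$FILE"
lfs hsm_action  "$FILE"    # should return promptly instead of blocking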

Comment by Tatsushi Takamura [ 08/May/16 ]

Sorry for the late reply. I have tested with your new patch and neither reproducer triggered a hang-up.
Thanks

Comment by Gerrit Updater [ 16/May/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19710/
Subject: LU-8010 mdt: fix orphan layout_lock cases for restore
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c13ddec8e1cd3a63c16e08f28749771200b92f1b

Comment by Peter Jones [ 16/May/16 ]

Landed for 2.9
