[LU-10423] sanity-hsm test_223b: lctl hung on cdt_llog_lock Created: 21/Dec/17 Updated: 22/Dec/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:

The sub-test test_223b failed with the following error:

Timeout occurred after 182 mins, last suite running was sanity-hsm, restarting cluster to continue tests

Please provide additional information about the failure here.

Info required for matching: sanity-hsm 223b

lctl hangs on cdt_llog_lock in mdt_hsm_actions_proc_show():

Dec 16 04:53:58 trevis-3vm8 mrshd[21754]: root@trevis-3vm5 as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "/usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000402:0x25a:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=");echo XXRETCODE:$?'
Dec 16 04:56:57 trevis-3vm8 kernel: INFO: task lctl:21787 blocked for more than 120 seconds.
Dec 16 04:56:57 trevis-3vm8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 16 04:56:57 trevis-3vm8 kernel: lctl D ffff88007b1dbd70 0 21787 21786 0x00000080
Dec 16 04:56:57 trevis-3vm8 kernel: ffff88007b1dbd48 0000000000000086 ffff88005de73f40 ffff88007b1dbfd8
Dec 16 04:56:57 trevis-3vm8 kernel: ffff88007b1dbfd8 ffff88007b1dbfd8 ffff88005de73f40 ffff88005de73f40
Dec 16 04:56:57 trevis-3vm8 kernel: ffff8800566ef9a8 fffffffeffffffff ffff8800566ef9b0 ffff88007b1dbd70
Dec 16 04:56:57 trevis-3vm8 kernel: Call Trace:
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff816a9589>] schedule+0x29/0x70
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff816aabbd>] rwsem_down_read_failed+0x10d/0x1a0
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff81331ed8>] call_rwsem_down_read_failed+0x18/0x30
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff81225f17>] ? seq_buf_alloc+0x17/0x40
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff816a8820>] down_read+0x20/0x40
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffffc12656e7>] mdt_hsm_actions_proc_show+0xf7/0x250 [mdt]
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff8122641a>] seq_read+0x10a/0x3b0
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff812703cd>] proc_reg_read+0x3d/0x80
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff81200b1c>] vfs_read+0x9c/0x170
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff812019df>] SyS_read+0x7f/0xe0
Dec 16 04:56:57 trevis-3vm8 kernel: [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
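For illustration, a minimal userspace sketch of the blocking pattern visible in the trace above: the lctl read path sits in down_read() on cdt_llog_lock because another task holds the semaphore for write (presumably coordinator-side llog processing). This is not Lustre code; pthread_rwlock_t stands in for the kernel rw_semaphore, and the reader/writer roles here are illustrative assumptions only.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the kernel rw_semaphore cdt_llog_lock. */
static pthread_rwlock_t llog_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Analogue of the lctl get_param read path (mdt_hsm_actions_proc_show):
 * it only needs the lock for read, but blocks behind any write holder. */
static void *reader(void *arg)
{
    (void)arg;
    printf("reader: waiting in rdlock (cf. down_read in the trace)\n");
    pthread_rwlock_rdlock(&llog_lock);
    printf("reader: got the lock\n");
    pthread_rwlock_unlock(&llog_lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* Writer side takes the lock and sits on it; while it is held,
     * the reader below cannot make progress -- the lctl hang. */
    pthread_rwlock_wrlock(&llog_lock);

    pthread_create(&tid, NULL, reader, NULL);
    sleep(2);                            /* reader is now blocked */

    pthread_rwlock_unlock(&llog_lock);   /* release unblocks the reader */
    pthread_join(tid, NULL);
    return 0;
}

If the write holder never drops the lock (or is itself stuck while holding it), the proc read, and therefore lctl get_param, never completes, which would account for the 182-minute test timeout.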
|
| Comments |
| Comment by Nathaniel Clark [ 21/Dec/17 ] |
|
Base failure is probably |
| Comment by John Hammond [ 21/Dec/17 ] |
|
As I understand it, when this is encountered all IO against the pool just hangs. Is that right? |
| Comment by Nathaniel Clark [ 22/Dec/17 ] |
|
Yes. This happens after |