[LU-6039] sanity-hsm test 228: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds. Created: 17/Dec/14 Updated: 18/Dec/14 Resolved: 18/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Vinayak Hariharmath (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | HSM | ||
| Environment: |
single node setup on local vm |
||
| Attachments: |
|
| Epic/Theme: | test |
| Severity: | 3 |
| Epic: | client, test |
| Project: | HSM |
| Rank (Obsolete): | 16831 |
| Description |
|
cmd : # PDSH="pdsh -R ssh -S -w" ONLY=228 MDS_MOUNT_OPTS="-o rw,user_xattr" OST_MOUNT_OPTS="-o user_xattr" AGTCOUNT=1 agt1_HOST=vinayak lustre/tests/sanity-hsm.sh Dec 17 11:37:58 vinayak_centos kernel: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds. Dec 17 11:37:58 vinayak_centos kernel: Not tainted 2.6.32.431.20.3.el6_lustre #1 Dec 17 11:37:58 vinayak_centos kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 17 11:37:58 vinayak_centos kernel: lhsmtool_posi D 0000000000000001 0 26565 1 0x00000080 Dec 17 11:37:58 vinayak_centos kernel: ffff880000f4ba98 0000000000000082 0000000000000000 ffff88000df84000 Dec 17 11:37:58 vinayak_centos kernel: ffff880000f4ba98 ffffffff8128d1f6 0000000200000010 0000000affffffff Dec 17 11:37:58 vinayak_centos kernel: ffff88002d703098 ffff880000f4bfd8 000000000000fbc8 ffff88002d703098 Dec 17 11:37:58 vinayak_centos kernel: Call Trace: Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8128d1f6>] ? vsnprintf+0x336/0x5e0 Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8152a40e>] __mutex_lock_slowpath+0x13e/0x180 Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8152a2ab>] mutex_lock+0x2b/0x50 Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f3219a>] ll_layout_refresh+0x1da/0xcd0 [lustre] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa07c4ab3>] ? ldlm_lock_add_to_lru+0x43/0x120 [ptlrpc] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f5cd80>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa07e87a0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0606011>] ? cl_io_slice_add+0xc1/0x190 [obdclass] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f82d70>] vvp_io_init+0x340/0x490 [lustre] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0605088>] cl_io_init0+0x88/0x150 [obdclass] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa05fe505>] ? cl_env_get+0x195/0x350 [obdclass] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0608004>] cl_io_init+0x64/0xe0 [obdclass] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f78ab1>] cl_glimpse_size0+0x91/0x1d0 [lustre] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f27966>] ll_getattr+0x2e6/0x800 [lustre] Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8119b442>] ? user_path_at+0x62/0xa0 Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e851>] vfs_getattr+0x51/0x80 Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e8e4>] vfs_fstatat+0x64/0xa0 Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e944>] sys_newfstatat+0x24/0x50 Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff810e1e87>] ? audit_syscall_entry+0x1d7/0x200 Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 17/Dec/14 ] |
|
If you can reproduce this problem, please use the following command: echo t > /proc/sysrq-trigger dmesg > log.txt and then attach the log.txt file here. I believe there is another process holding the mutex and waiting for response from the MDT. |
| Comment by Andreas Dilger [ 17/Dec/14 ] |
|
This bug is filed against 2.5.0. Are you really running that version, or is it 2.5.3 (which you probably should use, or even the tip of b2_5)? |
| Comment by Vinayak Hariharmath (Inactive) [ 18/Dec/14 ] |
|
I ran it on master. Wrongly updated affect/versions. The test_228 is doing hsm_release 2 times. I commented out 2nd release statement and test case got passed. I mimicked the steps manually if we skip step 6, then everything goes fine. Even I checked with other operation (copy) instead of step 8 After this I got few of the questions in my mind Jinshan Xiong : Yes, you are right. layout lock has been held by some one and getting stuck there. I will update the logs. |
| Comment by Vinayak Hariharmath (Inactive) [ 18/Dec/14 ] |
|
logs file attached |
| Comment by Vinayak Hariharmath (Inactive) [ 18/Dec/14 ] |
|
Got answer for Below record proves that # lfs hsm_state test/tfile
test/tfile: (0x00000009) exists archived, archive_id:1
# lfs hsm_release test/tfile
# lfs hsm_state test/tfile
test/tfile: (0x0000000d) released exists archived, archive_id:1
# cp test/tfile test/tfile.2
lhsmtool_posix[13264]: copytool fs=lustre archive#=1 item_count=1
lhsmtool_posix[13264]: waiting for message from kernel
lhsmtool_posix[13296]: '[0x200000400:0x2:0x0]' action RESTORE reclen 72, cookie=0x549282f9
lhsmtool_posix[13296]: processing file 'test/tfile'
lhsmtool_posix[13296]: reading stripe rules from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0.lov' for '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0'
lhsmtool_posix[13296]: restoring data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]'
lhsmtool_posix[13296]: going to copy data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]'
lhsmtool_posix[13296]: data restore from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' done
lhsmtool_posix[13296]: Action completed, notifying coordinator cookie=0x549282f9, FID=[0x200000400:0x2:0x0], hp_flags=0 err=0
lhsmtool_posix[13296]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000400:0x2:0x0' ok (rc=0)
# lfs hsm_state test/tfile
test/tfile: (0x00000009) exists archived, archive_id:1
# lfs hsm_release test/tfile
# lfs hsm_state test/tfile
test/tfile: (0x0000000d) released exists archived, archive_id:1
|
| Comment by Vinayak Hariharmath (Inactive) [ 18/Dec/14 ] |
|
is it related to |
| Comment by Jinshan Xiong (Inactive) [ 18/Dec/14 ] |
|
It looks like the same issue of |
| Comment by Jinshan Xiong (Inactive) [ 18/Dec/14 ] |
|
duplication of |