[LU-6039] sanity-hsm test 228: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds. Created: 17/Dec/14  Updated: 18/Dec/14  Resolved: 18/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Vinayak Hariharmath (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: HSM
Environment:

single-node setup on a local VM


Attachments: HTML File LU-6039_dmesg    
Epic/Theme: test
Severity: 3
Epic: client, test
Project: HSM
Rank (Obsolete): 16831

 Description   

Command:

 # PDSH="pdsh -R ssh -S -w" ONLY=228 MDS_MOUNT_OPTS="-o rw,user_xattr" OST_MOUNT_OPTS="-o user_xattr" AGTCOUNT=1 agt1_HOST=vinayak lustre/tests/sanity-hsm.sh
Dec 17 11:37:58 vinayak_centos kernel: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds.
Dec 17 11:37:58 vinayak_centos kernel:      Not tainted 2.6.32.431.20.3.el6_lustre #1
Dec 17 11:37:58 vinayak_centos kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 17 11:37:58 vinayak_centos kernel: lhsmtool_posi D 0000000000000001     0 26565      1 0x00000080
Dec 17 11:37:58 vinayak_centos kernel: ffff880000f4ba98 0000000000000082 0000000000000000 ffff88000df84000
Dec 17 11:37:58 vinayak_centos kernel: ffff880000f4ba98 ffffffff8128d1f6 0000000200000010 0000000affffffff
Dec 17 11:37:58 vinayak_centos kernel: ffff88002d703098 ffff880000f4bfd8 000000000000fbc8 ffff88002d703098
Dec 17 11:37:58 vinayak_centos kernel: Call Trace:
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8128d1f6>] ? vsnprintf+0x336/0x5e0
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8152a40e>] __mutex_lock_slowpath+0x13e/0x180
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8152a2ab>] mutex_lock+0x2b/0x50
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f3219a>] ll_layout_refresh+0x1da/0xcd0 [lustre]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa07c4ab3>] ? ldlm_lock_add_to_lru+0x43/0x120 [ptlrpc]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f5cd80>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa07e87a0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0606011>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f82d70>] vvp_io_init+0x340/0x490 [lustre]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0605088>] cl_io_init0+0x88/0x150 [obdclass]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa05fe505>] ? cl_env_get+0x195/0x350 [obdclass]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0608004>] cl_io_init+0x64/0xe0 [obdclass]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f78ab1>] cl_glimpse_size0+0x91/0x1d0 [lustre]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f27966>] ll_getattr+0x2e6/0x800 [lustre]
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8119b442>] ? user_path_at+0x62/0xa0
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e851>] vfs_getattr+0x51/0x80
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e8e4>] vfs_fstatat+0x64/0xa0
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e944>] sys_newfstatat+0x24/0x50
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff810e1e87>] ? audit_syscall_entry+0x1d7/0x200
Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Jinshan Xiong (Inactive) [ 17/Dec/14 ]

If you can reproduce this problem, please use the following commands:

echo t > /proc/sysrq-trigger
dmesg > log.txt

and then attach the log.txt file here. I believe there is another process holding the mutex while waiting for a response from the MDT.
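
As a small follow-up sketch (assuming the dump above has been captured to log.txt), the copytool's stack and its surrounding context can be located quickly with:

grep -B 2 -A 25 'lhsmtool_posi' log.txt

The pattern matches the truncated task name as it appears in the kernel log; the full log.txt is still what should be attached, since the process holding the mutex may be elsewhere in the dump.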

Comment by Andreas Dilger [ 17/Dec/14 ]

This bug is filed against 2.5.0. Are you really running that version, or is it 2.5.3 (which you probably should use, or even the tip of b2_5)?

Comment by Vinayak Hariharmath (Inactive) [ 18/Dec/14 ]

I ran it on master; I set the Affects Version/s field incorrectly.

test_228 performs hsm_release twice. When I commented out the second release statement, the test passed.

I mimicked the steps manually:
1. cd /mnt/lustre
2. create the file: dd if=/dev/zero of=test/tfile bs=1k count=1k
3. lfs hsm_archive --archive=1 test/tfile
4. lfs hsm_release test/tfile
5. cp --sparse=auto test/tfile test/tfile.2
6. lfs hsm_release test/tfile
7. mkdir test/tdir
8. tar cf - --sparse test/tfile | tar xvf - -C test/tdir ==> process hangs on the layout lock

If we skip step 6, everything works fine.

I also tried another operation (copy) instead of step 8:
cp --sparse=auto test/tfile test/tdir/tfile.2 ==> this also hangs
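
For convenience, the whole sequence as one script — a minimal sketch, assuming a client mounted at /mnt/lustre and an lhsmtool_posix copytool already running against archive 1; the sleep is only a crude stand-in for waiting until the archive request completes:

#!/bin/bash
set -e
cd /mnt/lustre
mkdir -p test
dd if=/dev/zero of=test/tfile bs=1k count=1k            # step 2: create a 1 MiB file
lfs hsm_archive --archive=1 test/tfile                  # step 3: archive it
sleep 5                                                 # crude wait for the archive to complete
lfs hsm_release test/tfile                              # step 4: first release
cp --sparse=auto test/tfile test/tfile.2                # step 5: implicit restore, then copy
lfs hsm_release test/tfile                              # step 6: second release (the trigger)
mkdir -p test/tdir
tar cf - --sparse test/tfile | tar xvf - -C test/tdir   # step 8: hangs on the layout lock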

This raised a few questions:
1. Why is the file released twice?
2. Even if the file is released twice, why does the operation hang?
Releasing an archived file any number of times should have no effect.

Jinshan Xiong: Yes, you are right. The layout lock is held by another process and the operation gets stuck there. I will upload the logs.

Comment by Vinayak Hariharmath (Inactive) [ 18/Dec/14 ]

Log file attached.

Comment by Vinayak Hariharmath (Inactive) [ 18/Dec/14 ]

I found the answer to question 1 (why is the file released twice?):
To perform step 5 (cp --sparse=auto test/tfile test/tfile.2), the file is first restored and then copied.
After that step the hsm_state reverts from "released" to "exists", which is why the file is released a second time.

The record below demonstrates this:

# lfs hsm_state test/tfile 
test/tfile: (0x00000009) exists archived, archive_id:1

# lfs hsm_release test/tfile 

# lfs hsm_state test/tfile 
test/tfile: (0x0000000d) released exists archived, archive_id:1

# cp test/tfile test/tfile.2
lhsmtool_posix[13264]: copytool fs=lustre archive#=1 item_count=1
lhsmtool_posix[13264]: waiting for message from kernel
lhsmtool_posix[13296]: '[0x200000400:0x2:0x0]' action RESTORE reclen 72, cookie=0x549282f9
lhsmtool_posix[13296]: processing file 'test/tfile'
lhsmtool_posix[13296]: reading stripe rules from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0.lov' for '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0'
lhsmtool_posix[13296]: restoring data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]'
lhsmtool_posix[13296]: going to copy data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]'
lhsmtool_posix[13296]: data restore from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' done
lhsmtool_posix[13296]: Action completed, notifying coordinator cookie=0x549282f9, FID=[0x200000400:0x2:0x0], hp_flags=0 err=0
lhsmtool_posix[13296]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000400:0x2:0x0' ok (rc=0)

# lfs hsm_state test/tfile 
test/tfile: (0x00000009) exists archived, archive_id:1

# lfs hsm_release test/tfile

# lfs hsm_state test/tfile 
test/tfile: (0x0000000d) released exists archived, archive_id:1

Comment by Vinayak Hariharmath (Inactive) [ 18/Dec/14 ]

Is it related to https://jira.hpdd.intel.com/browse/LU-4727?

Comment by Jinshan Xiong (Inactive) [ 18/Dec/14 ]

It looks like the same issue as LU-4727.

Comment by Jinshan Xiong (Inactive) [ 18/Dec/14 ]

Duplicate of LU-4727.
