[LU-4004] CLONE - flow control of HSM requests Created: 24/Sep/13 Updated: 17/Sep/14 Resolved: 13/Nov/13 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | HSM | ||
| Severity: | 3 |
| Rank (Obsolete): | 10716 |
| Description |
|
In a stress test I did today, I created 40K files and archive them with 2 clients. The requests were queued into MDT successfully but it caused other problems. the first problem is the lprocfs implementation of agent_action. The symptom is: [root@mds01 ~]# lctl get_param mdt.*.hsm.agent_actions
error: get_param: read('/proc/fs/lustre/mdt/hsm-MDT0000/hsm/agent_actions') failed: Cannot allocate memory
Though I didn't look at it yet, I think the root cause is that the llog is too long so it ran into a problem for some reason. I think the more severe problem is flow control. It's not good to keep the requests in queue so much long, at least we should have a parameter to control how long the maximum length of queue will be. Another problem I saw in the test is that: LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) hsm-MDT0000: Cannot find running request for cookie 0x5226bb27 on fid=[0x200000400:0xee5:0x0] LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) Skipped 74 previous similar messages There were a huge number of this warning. I will dig it tomorrow |
| Comments |
| Comment by Jodi Levi (Inactive) [ 24/Sep/13 ] |
|
Jinshan Xiong added a comment - 09/Sep/13 6:07 PM - edited |
| Comment by Jodi Levi (Inactive) [ 24/Sep/13 ] |
|
This ticket is being created for the follow on work to be completed in the 2.6 Release. |
| Comment by Jodi Levi (Inactive) [ 13/Nov/13 ] |
|
Determined that no follow on work is needed. |
| Comment by James Nunez (Inactive) [ 06/May/14 ] |
|
During scale testing on IEEL build #26, I see the second problem reported above, seen on the MDT dmesg: LustreError: 23286:0:(mdt_coordinator.c:1465:mdt_hsm_update_request_state()) scratch-MDT0000: Cannot find running request for cookie 0x53682fe2 on fid=[0x200000405:0x1fc8:0x0] On the client node running the test, I see the following many times on the console: Cannot send HSM request (use of /lustre/scratch/tdir/e.00456): Operation not permitted The script being run has 4 processes writing a file and computing the checksum. Then the file is archived and released. At the same time there are three other processes checking if files are released and, if so, restore the file by checking the MD5 checksum. This is running on the OpenSFS cluster with a Lustre file system storing the data and a Lustre file system for the archive and a single agent. I don't have enough information to say what operation was not permitted, but is was either hsm_state, hsm_archive or hsm_release. Should I open a new ticket for this? |
| Comment by Michael Kukat [ 17/Sep/14 ] |
|
I didn't find a new ticket for this and this one looks very related to a problem we have seen yesterday. I hope, i'm not wrong here. Environment: 1 MDS, 2 OSSs, 2 clients, RHEL 6.5, Lustre 2.6.0, all VMs hosted on KVM. Start copytool on client 2: lhsmtool_posix -A 3 -A 4 -p /tmp /lustre Start test script on client 1: arch=3 rm -f /lustre/mytest echo thisfile >/lustre/mytest lfs hsm_archive -a $arch /lustre/mytest sleep 3 lfs hsm_state /lustre/mytest sleep 3 lfs hsm_release /lustre/mytest sleep 3 lfs hsm_state /lustre/mytest sleep 3 cat /etc/passwd >>/lustre/mytest sleep 3 lfs hsm_state /lustre/mytest sleep 3 lfs hsm_archive -a $arch /lustre/mytest sleep 3 lfs hsm_state /lustre/mytest sleep 5 lfs hsm_state /lustre/mytest Run test script /lustre/mytest: (0x00000009) exists archived, archive_id:3 /lustre/mytest: (0x0000000d) released exists archived, archive_id:3 /lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3 /lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3 /lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3 while watching the MDS syslog: Sep 17 10:39:06 lustre-mds1 kernel: LustreError: 1396:0:(mdt_coordinator.c:1463:mdt_hsm_update_request_state()) lsttest-MDT0000: Cannot find running request for cookie 0x54194821 on fid=[0x200003ab0:0x1:0x0] and the copytool: lhsmtool_posix[1230]: action=0 src=(null) dst=(null) mount_point=/lustre
lhsmtool_posix[1230]: waiting for message from kernel
lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1
lhsmtool_posix[1230]: waiting for message from kernel
lhsmtool_posix[1231]: '[0x200003ab0:0x1:0x0]' action ARCHIVE reclen 72, cookie=0x5419481f
lhsmtool_posix[1231]: processing file 'mytest'
lhsmtool_posix[1231]: archiving '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
lhsmtool_posix[1231]: saving stripe info of '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' in /tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp.lov
lhsmtool_posix[1231]: going to copy data from '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
lhsmtool_posix[1231]: data archiving for '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' done
lhsmtool_posix[1231]: attr file for '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' saved to archive '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
lhsmtool_posix[1231]: fsetxattr of 'trusted.hsm' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success)
lhsmtool_posix[1231]: fsetxattr of 'trusted.link' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success)
lhsmtool_posix[1231]: fsetxattr of 'trusted.lov' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success)
lhsmtool_posix[1231]: fsetxattr of 'trusted.lma' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success)
lhsmtool_posix[1231]: fsetxattr of 'lustre.lov' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=-1 (Operation not supported)
lhsmtool_posix[1231]: xattr file for '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' saved to archive '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
lhsmtool_posix[1231]: symlink '/tmp/shadow/mytest' to '../0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' done
lhsmtool_posix[1231]: Action completed, notifying coordinator cookie=0x5419481f, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=0
lhsmtool_posix[1231]: llapi_hsm_action_end() on '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' ok (rc=0)
lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1
lhsmtool_posix[1230]: waiting for message from kernel
lhsmtool_posix[1232]: '[0x200003ab0:0x1:0x0]' action RESTORE reclen 72, cookie=0x54194820
lhsmtool_posix[1232]: processing file 'mytest'
lhsmtool_posix[1232]: reading stripe rules from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0.lov' for '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0'
lhsmtool_posix[1232]: restoring data from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' to '{VOLATILE}=[0x200003ab1:0x1:0x0]'
lhsmtool_posix[1232]: going to copy data from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' to '{VOLATILE}=[0x200003ab1:0x1:0x0]'
lhsmtool_posix[1232]: data restore from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' to '{VOLATILE}=[0x200003ab1:0x1:0x0]' done
lhsmtool_posix[1232]: Action completed, notifying coordinator cookie=0x54194820, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=0
lhsmtool_posix[1232]: llapi_hsm_action_end() on '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' ok (rc=0)
lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1
lhsmtool_posix[1230]: waiting for message from kernel
lhsmtool_posix[1233]: '[0x200003ab0:0x1:0x0]' action ARCHIVE reclen 72, cookie=0x54194821
lhsmtool_posix[1233]: processing file 'mytest'
lhsmtool_posix[1233]: archiving '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
lhsmtool_posix[1233]: saving stripe info of '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' in /tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp.lov
lhsmtool_posix[1233]: going to copy data from '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp'
lhsmtool_posix[1233]: progress ioctl for copy '/lustre/.lustre/fid/0x200003ab0:0x1:0x0'->'/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' failed: No such file or directory (2)
lhsmtool_posix[1233]: data copy failed from '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp': No such file or directory (2)
lhsmtool_posix[1233]: Action completed, notifying coordinator cookie=0x54194821, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=2
lhsmtool_posix[1233]: llapi_hsm_action_end() on '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' failed: No such file or directory (2)
Immediately after the test, mdt/*/hsm/actions shows the following: lrh=[type=10680000 len=136 idx=11/3] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x5419481f/0x5419481f action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[] lrh=[type=10680000 len=136 idx=11/6] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x54194820/0x54194820 action=RESTORE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[] lrh=[type=10680000 len=136 idx=11/9] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x54194821/0x54194821 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=FAILED data=[] A second, later retry of the archive request works fine, so there are ways to work around this issue, but it still would be nice if archival of a changed file would work on the first try. |