Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.6.0
-
Scientific Linux 6.5, Lutre 2.6.0, 1 MDS, 2 OSSs, 2 clients, all VMs hosted on KVM.
Also reproducible with CentOS 6.5 in other environment.
-
3
-
15920
Description
During some tests, an issue has been found that when a file is archived, released, changed and re-archived, the re-archival fails under some circumstances. The problem could be reproduced in a very simple way using the lhsmd_posix:
Boot machines, start Lustre services (MDS, OSS1, OSS2, mount on both clients)
Start copytool on client 2:
lhsmtool_posix -A 3 -A 4 -p /tmp /lustre
Start test script on client 1:
arch=3 rm -f /lustre/mytest echo thisfile >/lustre/mytest lfs hsm_archive -a $arch /lustre/mytest sleep 3 lfs hsm_state /lustre/mytest sleep 3 lfs hsm_release /lustre/mytest sleep 3 lfs hsm_state /lustre/mytest sleep 3 cat /etc/passwd >>/lustre/mytest sleep 3 lfs hsm_state /lustre/mytest sleep 3 lfs hsm_archive -a $arch /lustre/mytest sleep 3 lfs hsm_state /lustre/mytest sleep 5 lfs hsm_state /lustre/mytest
Test script output
/lustre/mytest: (0x00000009) exists archived, archive_id:3 /lustre/mytest: (0x0000000d) released exists archived, archive_id:3 /lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3 /lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3 /lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3
Observations while watching the MDS syslog:
Sep 17 10:39:06 lustre-mds1 kernel: LustreError: 1396:0:(mdt_coordinator.c:1463:mdt_hsm_update_request_state()) lsttest-MDT0000: Cannot find running request for cookie 0x54194821 on fid=[0x200003ab0:0x1:0x0]
Observations on the copytool side:
lhsmtool_posix[1230]: action=0 src=(null) dst=(null) mount_point=/lustre lhsmtool_posix[1230]: waiting for message from kernel lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1 lhsmtool_posix[1230]: waiting for message from kernel lhsmtool_posix[1231]: '[0x200003ab0:0x1:0x0]' action ARCHIVE reclen 72, cookie=0x5419481f lhsmtool_posix[1231]: processing file 'mytest' lhsmtool_posix[1231]: archiving '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' lhsmtool_posix[1231]: saving stripe info of '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' in /tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp.lov lhsmtool_posix[1231]: going to copy data from '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' lhsmtool_posix[1231]: data archiving for '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' done lhsmtool_posix[1231]: attr file for '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' saved to archive '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' lhsmtool_posix[1231]: fsetxattr of 'trusted.hsm' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success) lhsmtool_posix[1231]: fsetxattr of 'trusted.link' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success) lhsmtool_posix[1231]: fsetxattr of 'trusted.lov' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success) lhsmtool_posix[1231]: fsetxattr of 'trusted.lma' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=0 (Success) lhsmtool_posix[1231]: fsetxattr of 'lustre.lov' on '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[1231]: xattr file for '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' saved to archive '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' lhsmtool_posix[1231]: symlink '/tmp/shadow/mytest' to '../0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' done lhsmtool_posix[1231]: Action completed, notifying coordinator cookie=0x5419481f, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=0 lhsmtool_posix[1231]: llapi_hsm_action_end() on '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' ok (rc=0) lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1 lhsmtool_posix[1230]: waiting for message from kernel lhsmtool_posix[1232]: '[0x200003ab0:0x1:0x0]' action RESTORE reclen 72, cookie=0x54194820 lhsmtool_posix[1232]: processing file 'mytest' lhsmtool_posix[1232]: reading stripe rules from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0.lov' for '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' lhsmtool_posix[1232]: restoring data from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' to '{VOLATILE}=[0x200003ab1:0x1:0x0]' lhsmtool_posix[1232]: going to copy data from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' to '{VOLATILE}=[0x200003ab1:0x1:0x0]' lhsmtool_posix[1232]: data restore from '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0' to '{VOLATILE}=[0x200003ab1:0x1:0x0]' done lhsmtool_posix[1232]: Action completed, notifying coordinator cookie=0x54194820, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=0 lhsmtool_posix[1232]: llapi_hsm_action_end() on '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' ok (rc=0) lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1 lhsmtool_posix[1230]: waiting for message from kernel lhsmtool_posix[1233]: '[0x200003ab0:0x1:0x0]' action ARCHIVE reclen 72, cookie=0x54194821 lhsmtool_posix[1233]: processing file 'mytest' lhsmtool_posix[1233]: archiving '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' lhsmtool_posix[1233]: saving stripe info of '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' in /tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp.lov lhsmtool_posix[1233]: going to copy data from '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' lhsmtool_posix[1233]: progress ioctl for copy '/lustre/.lustre/fid/0x200003ab0:0x1:0x0'->'/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp' failed: No such file or directory (2) lhsmtool_posix[1233]: data copy failed from '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' to '/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp': No such file or directory (2) lhsmtool_posix[1233]: Action completed, notifying coordinator cookie=0x54194821, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=2 lhsmtool_posix[1233]: llapi_hsm_action_end() on '/lustre/.lustre/fid/0x200003ab0:0x1:0x0' failed: No such file or directory (2) Immediately after the test, mdt/*/hsm/actions shows the following: lrh=[type=10680000 len=136 idx=11/3] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x5419481f/0x5419481f action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[] lrh=[type=10680000 len=136 idx=11/6] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x54194820/0x54194820 action=RESTORE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[] lrh=[type=10680000 len=136 idx=11/9] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x54194821/0x54194821 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=FAILED data=[]
A second, later retry of the archive request sometimes works fine and it also makes a difference if the copytool and the command line actions run on the same client node or on different nodes.
I already added this as a comment to the possibly related LU-4004, but as this is closed, it doesn't seem to get attention any longer.
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14377/
Subject:
LU-5683llite: Inform copytool of dataversion changesProject: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f99c0178e6d90bc6674f9d6a4e724b2c113b4a43