[LU-7986] HSM: Duplicated cookies in "actions" Created: 05/Apr/16 Updated: 10/Apr/17 Resolved: 21/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Frank Zago (Inactive) | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Some entries can end up being duplicated in "actions". For instance: lrh=[type=10680000 len=136 idx=1/14] fid=[0x200000401:0x10:0x0] dfid=[0x200000401:0x10:0x0] compound/cookie=0x5703ea60/0x5703ea6c action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[] lrh=[type=10680000 len=136 idx=1/51633] fid=[0x200000401:0x10:0x0] dfid=[0x200000401:0x10:0x0] compound/cookie=0x5703ea60/0x5703ea6c action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[] a As cdt_llog_lock is taken in mdt_hsm_actions_proc_show(), I think that could be a race between the coordinator/client and the proc_show() function. |
| Comments |
| Comment by Frank Zago (Inactive) [ 05/Apr/16 ] |
|
For my test setup, I create a new fs (llmount.sh), and then create 30000 files and hsm_archive them, without a copytool. #!/bin/sh
mkdir /mnt/lustre/foo
echo Start $(date "+%s")
for x in {1..3000}; do
echo loop $x $(date "+%s")
mkdir /mnt/lustre/foo/d$x
touch /mnt/lustre/foo/d$x/file{1..10}
lfs hsm_archive /mnt/lustre/foo/d$x/file*
done
echo Done $(date "+%s")
I then start the following command to watch "actions": while [ 1 ]; do date "+%s" ; cat actions > /root/copytool_test/actions.$(date "+%s") ; time wc -l actions ; sleep 1 ; done Then I start the copytool that will eventually succeed without any error or warnings. 30000 actions real 0m0.199s user 0m0.002s sys 0m0.197s 1459874583 30000 actions real 0m0.221s user 0m0.003s sys 0m0.218s 1459874584 30190 actions real 0m0.957s user 0m0.002s 30190 actions? But only 30000 were posted. The worst I've seen is 34267 actions instead of 30000 max. |
| Comment by Frank Zago (Inactive) [ 05/Apr/16 ] |
|
Some FIDs/cookie even appear 3 times: lrh=[type=10680000 len=136 idx=1/1415] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[] lrh=[type=10680000 len=136 idx=1/48820] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[] lrh=[type=10680000 len=136 idx=1/51221] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[] |
| Comment by Frank Zago (Inactive) [ 05/Apr/16 ] |
|
actions.1459874574 is "actions" before starting the copytool The list of duplicated FIDs can be seen with the following command: sed -e "s/.*fid=\[\(.*\)\] dfid=.*/\1/" actions.1459874610 | sort | uniq -c | sort -r | less |
| Comment by jacques-charles lafoucriere [ 06/Apr/16 ] |
|
This comes from the way the llog record is updated. We could not update in place so we add a new one and remove the old one. This is done in mdt_agent_record_update_cb(): add a new record and ask the callback caller to invalid the old one. |
| Comment by Frank Zago (Inactive) [ 06/Apr/16 ] |
|
May be that could explain why we're sometimes seeing replay of cookies. From our copytool traces, just after the hsm_recv: Apr 5 19:20:23 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0) Apr 5 20:34:33 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0) Apr 5 20:37:31 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0) Apr 5 20:37:31 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0) |
| Comment by John Hammond [ 07/Apr/16 ] |
|
Hi Frank, I'll push a patch that should address this. I'll push it after returning from LUG. |
| Comment by Frank Zago (Inactive) [ 07/Apr/16 ] |
|
Thanks. |
| Comment by Gerrit Updater [ 11/Apr/16 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/19447 |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19577 |
| Comment by Gerrit Updater [ 21/Apr/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19447/ |
| Comment by Joseph Gmitter (Inactive) [ 21/Apr/16 ] |
|
Landed to master for 2.9.0 |