[LU-7986] HSM: Duplicated cookies in "actions" Created: 05/Apr/16  Updated: 10/Apr/17  Resolved: 21/Apr/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Frank Zago (Inactive) Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: None

Attachments: File actions.tar.xz    
Issue Links:
Related
is related to LU-9312 HSM coordinator llog scalability Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Some entries can end up being duplicated in "actions". For instance:

lrh=[type=10680000 len=136 idx=1/14] fid=[0x200000401:0x10:0x0] dfid=[0x200000401:0x10:0x0] compound/cookie=0x5703ea60/0x5703ea6c action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
lrh=[type=10680000 len=136 idx=1/51633] fid=[0x200000401:0x10:0x0] dfid=[0x200000401:0x10:0x0] compound/cookie=0x5703ea60/0x5703ea6c action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
a

As cdt_llog_lock is taken in mdt_hsm_actions_proc_show(), I think that could be a race between the coordinator/client and the proc_show() function.



 Comments   
Comment by Frank Zago (Inactive) [ 05/Apr/16 ]

For my test setup, I create a new fs (llmount.sh), and then create 30000 files and hsm_archive them, without a copytool.

#!/bin/sh

mkdir /mnt/lustre/foo

echo Start $(date "+%s")

for x in {1..3000}; do
	echo loop $x $(date "+%s")

	mkdir /mnt/lustre/foo/d$x

	touch /mnt/lustre/foo/d$x/file{1..10}

	lfs hsm_archive /mnt/lustre/foo/d$x/file*
done

echo Done $(date "+%s")

I then start the following command to watch "actions":

while [ 1 ]; do date "+%s" ; cat actions > /root/copytool_test/actions.$(date "+%s") ; time wc -l actions ; sleep 1 ; done

Then I start the copytool that will eventually succeed without any error or warnings.
But I get puzzling output such as:

30000 actions

real	0m0.199s
user	0m0.002s
sys	0m0.197s
1459874583
30000 actions

real	0m0.221s
user	0m0.003s
sys	0m0.218s
1459874584
30190 actions

real	0m0.957s
user	0m0.002s

30190 actions? But only 30000 were posted.

The worst I've seen is 34267 actions instead of 30000 max.

Comment by Frank Zago (Inactive) [ 05/Apr/16 ]

Some FIDs/cookie even appear 3 times:

lrh=[type=10680000 len=136 idx=1/1415] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
lrh=[type=10680000 len=136 idx=1/48820] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=1/51221] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
Comment by Frank Zago (Inactive) [ 05/Apr/16 ]

actions.1459874574 is "actions" before starting the copytool
actions.1459874610 is "actions" while the copytool is running

The list of duplicated FIDs can be seen with the following command:

sed -e "s/.*fid=\[\(.*\)\] dfid=.*/\1/" actions.1459874610 | sort | uniq -c | sort -r | less
Comment by jacques-charles lafoucriere [ 06/Apr/16 ]

This comes from the way the llog record is updated. We could not update in place so we add a new one and remove the old one. This is done in mdt_agent_record_update_cb(): add a new record and ask the callback caller to invalid the old one.
So if llog cleanning is not fast enough we can have multiple records for the same request.
At HSM dev time, I did not find a way to update in place.

Comment by Frank Zago (Inactive) [ 06/Apr/16 ]

May be that could explain why we're sometimes seeing replay of cookies. From our copytool traces, just after the hsm_recv:

Apr  5 19:20:23 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0)
Apr  5 20:34:33 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0)
Apr  5 20:37:31 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0)
Apr  5 20:37:31 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0)
Comment by John Hammond [ 07/Apr/16 ]

Hi Frank,

I'll push a patch that should address this. I'll push it after returning from LUG.

Comment by Frank Zago (Inactive) [ 07/Apr/16 ]

Thanks.

Comment by Gerrit Updater [ 11/Apr/16 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/19447
Subject: LU-7986 hsm: update actions llog in place
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 54a07389e965a0cba7d246f8a6179591fb4b8798

Comment by Gerrit Updater [ 14/Apr/16 ]

Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19577
Subject: LU-7986 hsm: update actions llog in place
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4d96fb6b56a0d40aa721a67351bea6a62db97b9

Comment by Gerrit Updater [ 21/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19447/
Subject: LU-7986 hsm: update actions llog in place
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 91144acb3dc1120c00797269afa621c94cb64e1e

Comment by Joseph Gmitter (Inactive) [ 21/Apr/16 ]

Landed to master for 2.9.0

Generated at Sat Feb 10 02:13:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.