Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7986

HSM: Duplicated cookies in "actions"

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      Some entries can end up being duplicated in "actions". For instance:

      lrh=[type=10680000 len=136 idx=1/14] fid=[0x200000401:0x10:0x0] dfid=[0x200000401:0x10:0x0] compound/cookie=0x5703ea60/0x5703ea6c action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
      lrh=[type=10680000 len=136 idx=1/51633] fid=[0x200000401:0x10:0x0] dfid=[0x200000401:0x10:0x0] compound/cookie=0x5703ea60/0x5703ea6c action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
      a
      

      As cdt_llog_lock is taken in mdt_hsm_actions_proc_show(), I think that could be a race between the coordinator/client and the proc_show() function.

      Attachments

        Issue Links

          Activity

            [LU-7986] HSM: Duplicated cookies in "actions"

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/19447
            Subject: LU-7986 hsm: update actions llog in place
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 54a07389e965a0cba7d246f8a6179591fb4b8798

            gerrit Gerrit Updater added a comment - John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/19447 Subject: LU-7986 hsm: update actions llog in place Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 54a07389e965a0cba7d246f8a6179591fb4b8798

            Thanks.

            fzago Frank Zago (Inactive) added a comment - Thanks.
            jhammond John Hammond added a comment -

            Hi Frank,

            I'll push a patch that should address this. I'll push it after returning from LUG.

            jhammond John Hammond added a comment - Hi Frank, I'll push a patch that should address this. I'll push it after returning from LUG.

            May be that could explain why we're sometimes seeing replay of cookies. From our copytool traces, just after the hsm_recv:

            Apr  5 19:20:23 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0)
            Apr  5 20:34:33 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0)
            Apr  5 20:37:31 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0)
            Apr  5 20:37:31 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0)
            
            fzago Frank Zago (Inactive) added a comment - May be that could explain why we're sometimes seeing replay of cookies. From our copytool traces, just after the hsm_recv: Apr 5 19:20:23 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0) Apr 5 20:34:33 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0) Apr 5 20:37:31 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0) Apr 5 20:37:31 [lustre] new request (cookie=57ac3d15, FID=0x2000d0ebb:0x4cd:0x0, DFID=0x2000d0ebb:0x4cd:0x0, len=72, action=20, extent=-1/0, gid=0, item=0)

            This comes from the way the llog record is updated. We could not update in place so we add a new one and remove the old one. This is done in mdt_agent_record_update_cb(): add a new record and ask the callback caller to invalid the old one.
            So if llog cleanning is not fast enough we can have multiple records for the same request.
            At HSM dev time, I did not find a way to update in place.

            jcl jacques-charles lafoucriere added a comment - This comes from the way the llog record is updated. We could not update in place so we add a new one and remove the old one. This is done in mdt_agent_record_update_cb(): add a new record and ask the callback caller to invalid the old one. So if llog cleanning is not fast enough we can have multiple records for the same request. At HSM dev time, I did not find a way to update in place.

            actions.1459874574 is "actions" before starting the copytool
            actions.1459874610 is "actions" while the copytool is running

            The list of duplicated FIDs can be seen with the following command:

            sed -e "s/.*fid=\[\(.*\)\] dfid=.*/\1/" actions.1459874610 | sort | uniq -c | sort -r | less
            
            fzago Frank Zago (Inactive) added a comment - actions.1459874574 is "actions" before starting the copytool actions.1459874610 is "actions" while the copytool is running The list of duplicated FIDs can be seen with the following command: sed -e "s/.*fid=\[\(.*\)\] dfid=.*/\1/" actions.1459874610 | sort | uniq -c | sort -r | less

            Some FIDs/cookie even appear 3 times:

            lrh=[type=10680000 len=136 idx=1/1415] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
            lrh=[type=10680000 len=136 idx=1/48820] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
            lrh=[type=10680000 len=136 idx=1/51221] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
            
            fzago Frank Zago (Inactive) added a comment - Some FIDs/cookie even appear 3 times: lrh=[type=10680000 len=136 idx=1/1415] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[] lrh=[type=10680000 len=136 idx=1/48820] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[] lrh=[type=10680000 len=136 idx=1/51221] fid=[0x200000401:0x605:0x0] dfid=[0x200000401:0x605:0x0] compound/cookie=0x5703eaea/0x5703efce action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]

            For my test setup, I create a new fs (llmount.sh), and then create 30000 files and hsm_archive them, without a copytool.

            #!/bin/sh
            
            mkdir /mnt/lustre/foo
            
            echo Start $(date "+%s")
            
            for x in {1..3000}; do
            	echo loop $x $(date "+%s")
            
            	mkdir /mnt/lustre/foo/d$x
            
            	touch /mnt/lustre/foo/d$x/file{1..10}
            
            	lfs hsm_archive /mnt/lustre/foo/d$x/file*
            done
            
            echo Done $(date "+%s")
            

            I then start the following command to watch "actions":

            while [ 1 ]; do date "+%s" ; cat actions > /root/copytool_test/actions.$(date "+%s") ; time wc -l actions ; sleep 1 ; done
            

            Then I start the copytool that will eventually succeed without any error or warnings.
            But I get puzzling output such as:

            30000 actions
            
            real	0m0.199s
            user	0m0.002s
            sys	0m0.197s
            1459874583
            30000 actions
            
            real	0m0.221s
            user	0m0.003s
            sys	0m0.218s
            1459874584
            30190 actions
            
            real	0m0.957s
            user	0m0.002s
            

            30190 actions? But only 30000 were posted.

            The worst I've seen is 34267 actions instead of 30000 max.

            fzago Frank Zago (Inactive) added a comment - For my test setup, I create a new fs (llmount.sh), and then create 30000 files and hsm_archive them, without a copytool. #!/bin/sh mkdir /mnt/lustre/foo echo Start $(date "+%s") for x in {1..3000}; do echo loop $x $(date "+%s") mkdir /mnt/lustre/foo/d$x touch /mnt/lustre/foo/d$x/file{1..10} lfs hsm_archive /mnt/lustre/foo/d$x/file* done echo Done $(date "+%s") I then start the following command to watch "actions": while [ 1 ]; do date "+%s" ; cat actions > /root/copytool_test/actions.$(date "+%s") ; time wc -l actions ; sleep 1 ; done Then I start the copytool that will eventually succeed without any error or warnings. But I get puzzling output such as: 30000 actions real 0m0.199s user 0m0.002s sys 0m0.197s 1459874583 30000 actions real 0m0.221s user 0m0.003s sys 0m0.218s 1459874584 30190 actions real 0m0.957s user 0m0.002s 30190 actions? But only 30000 were posted. The worst I've seen is 34267 actions instead of 30000 max.

            People

              jhammond John Hammond
              fzago Frank Zago (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: