[LU-5655] lhsmtool_posix (copytool agent) does not provide a facility to unregister Created: 24/Sep/14  Updated: 25/Feb/15  Resolved: 25/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.7.0
Fix Version/s: None

Type: Improvement Priority: Critical
Reporter: Vinayak Hariharmath (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: HSM
Environment:

centos 6.5


Attachments: HTML File lhs_posix_messages    
Epic: client
Project: HSM
Rank (Obsolete): 15853

 Description   

Steps:
1. Mounted Lustre
2. Enabled HSM

    # lctl set_param mdt.lustre-MDT0000.hsm_control=enabled

3. Started the copytool daemon (only one copytool agent can run on a client)

    # lhsmtool_posix --daemon --hsm-root /tmp/HSM --archive=1 /mnt/lustre/

4. According to the Lustre manual, the only way to stop the agent is to send it a TERM signal, so I killed it (I wanted to run a modified copytool agent).

    # ps -ef | grep lhs
       root      4017     1  0 16:54 ?        00:00:00 lhsmtool_posix --daemon --hsm-root /tmp/HSM --archive=1 /mnt/lustre/
       root      4045  2110  0 16:55 pts/1    00:00:00 grep lhs

5. Tried to start a new copytool agent

    # lhsmtool_posix --daemon --hsm-root /tmp/HSM --archive=1 /mnt/lustre/

But got the following message from the kernel:

    Sep 11 16:55:34 localhost kernel: Lustre: HSM agent b150c068-22f2-83cd-21b0-2b4e76a3082a already registered.
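
One way to check whether the killed agent is still registered before restarting it is to list the agents known to the MDT. A minimal sketch, assuming it runs on the MDS node, that the `hsm.agents` parameter is available in the Lustre version in use, and that the MDT name matches the one from step 2. The helper name `agent_registered` is illustrative only and is not part of Lustre:

```sh
#!/bin/sh
# Sketch only: report whether a copytool UUID still appears in the MDT's
# list of registered HSM agents. Intended to run on the MDS node.
# agent_registered is a hypothetical helper name, not a Lustre command.
agent_registered() {
    uuid=$1
    mdt=${2:-lustre-MDT0000}   # MDT name as used in step 2 of the description
    # -n prints only the parameter value; match the UUID as a fixed string
    lctl get_param -n "mdt.${mdt}.hsm.agents" | grep -qF -- "$uuid"
}
```

For example, `agent_registered b150c068-22f2-83cd-21b0-2b4e76a3082a && echo "still registered"` would test the UUID from the kernel message above.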


 Comments   
Comment by Vinayak Hariharmath (Inactive) [ 24/Sep/14 ]

I feel that once the daemon is killed, it should be unregistered.

Comment by Andreas Dilger [ 26/Sep/14 ]

Is this a new problem in 2.6.0 or does the same problem exist in 2.5?

Comment by Robert Read (Inactive) [ 29/Sep/14 ]

I'm not able to reproduce this on 2.5.3. Whenever I kill the daemon, it stops running and is no longer registered.

It's not clear from the description whether the ps in step 4 was run before or after you killed the copytool. If it was after, then clearly the daemon is still running for some reason and so it is still registered.

Comment by Vinayak Hariharmath (Inactive) [ 14/Oct/14 ]

The ps in step 4 was run before I killed the copytool; I killed it afterwards. Later I tried to start the copytool again but got the message below. I am using the b2_6 branch.

Oct 14 17:53:59 localhost kernel: Lustre: HSM agent 0f4d5578-4a43-df67-a8b1-ce31a2c2cd3a already registered

Comment by Robert Read (Inactive) [ 30/Oct/14 ]

I'm seeing this on b2_6 as well, so this appears to be new in 2.6.0.

Comment by Jodi Levi (Inactive) [ 02/Dec/14 ]

Bruno,
Can you please have a look at this one?
Thank you!

Comment by Bruno Faccini (Inactive) [ 03/Dec/14 ]

OK. I wonder whether copytool death and unregistration could take longer in b2_6 than before, and whether this ticket could therefore be related to LU-5622. But then I also wonder why this does not currently affect the sanity-hsm auto-test results.
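
If the question raised above holds, that is, if the copytool needs some time to exit and unregister after receiving TERM, a restart issued immediately after the kill could race with the shutdown. The ticket does not confirm this. A minimal sketch of waiting for the old process to exit before starting a new one follows; `stop_and_wait` and the 30-second default are illustrative assumptions, and process exit alone does not prove that the MDT has dropped the registration:

```sh
#!/bin/sh
# Sketch only: send TERM to a process and wait until it has exited, so that
# a new copytool is not started while the old one is still shutting down.
# stop_and_wait is a hypothetical helper name.
stop_and_wait() {
    pid=$1
    timeout=${2:-30}                            # seconds to wait before giving up
    kill -TERM "$pid" 2>/dev/null || return 0   # signal not delivered: treat as already gone
    waited=0
    # kill -0 sends no signal; it only tests whether the PID still exists
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

With the PID from the ps output in step 4, usage could look like `stop_and_wait 4017 && lhsmtool_posix --daemon --hsm-root /tmp/HSM --archive=1 /mnt/lustre/`.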

Comment by Bruno Faccini (Inactive) [ 08/Dec/14 ]

BTW, I am unable to reproduce it with master/b2_6 builds.
Robert, can you help me by detailing the exact versions and configuration you used to reproduce it?

Comment by Robert Read (Inactive) [ 08/Dec/14 ]

I just tried again on a recent master build and wasn't able to reproduce it either. I don't recall the exact version of 2.6 I had been using, but the configuration would have been a single-node setup using llmount.sh plus an additional mountpoint for the copytool.

Comment by Bruno Faccini (Inactive) [ 11/Dec/14 ]

Vinayak, are you still able to reproduce the problem at your site?
If so, can you provide more information? I am unable to reproduce it for the moment.
Before running your test/reproducer again, it would be helpful to enable the full Lustre debug-log mask on both the client/agent and MDS nodes, and then provide the resulting logs. Can you do this?
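
The log capture requested above could be sketched as follows. This assumes the standard `lctl` debug commands (`set_param debug=-1`, `clear`, `dk`) and would be run on both the client/agent node and the MDS node. The function names and the default output path are illustrative only:

```sh
#!/bin/sh
# Sketch only: capture a full Lustre debug log around one reproducer run.
# start_debug_capture and dump_debug_log are hypothetical helper names.
start_debug_capture() {
    lctl set_param debug=-1    # enable every debug flag
    lctl clear                 # empty the kernel debug buffer before the test
}

dump_debug_log() {
    # Write the kernel debug buffer to a file; the default path is an example.
    lctl dk "${1:-/tmp/lustre-debug.log}"
}
```

Call `start_debug_capture` before the reproducer and `dump_debug_log` immediately after it, so the relevant entries are less likely to be overwritten in the debug buffer.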

Comment by Vinayak Hariharmath (Inactive) [ 11/Dec/14 ]

Sure. I will verify it on my side and update the bug. Sorry, I was tied up with other work and was not able to update it.

Comment by Bruno Faccini (Inactive) [ 20/Jan/15 ]

Vinayak, any update? Have you been able to spend some time on further testing of this issue? Could it be, as I suspected above, that the copytool unregister takes longer?

Comment by Vinayak Hariharmath (Inactive) [ 21/Jan/15 ]

Hi

Yes, I have spent a bit of time verifying it.
The issue is not observed on master, but dmesg still gives

Lustre: HSM agent 277f9613-7a7d-1caa-5849-0b5ffe11fdbb already registered

but there is no problem with the copytool, which works fine (earlier, with the above error message, the copytool failed to start after being killed). I guess it is only a grammar correction. Dmesg logs attached.

Comment by Bruno Faccini (Inactive) [ 24/Feb/15 ]

Hello Vinayak,
Do you agree to closing this ticket as "Cannot reproduce"?

Comment by Vinayak Hariharmath (Inactive) [ 25/Feb/15 ]

Hi Bruno,

We can close this as "Cannot reproduce", but there is one more ticket, LU-5216, that looks a bit similar to this issue. I am trying to reproduce it and work out how it relates to this one.

Thanks

Comment by Bruno Faccini (Inactive) [ 25/Feb/15 ]

Going by the description in LU-5216, if the two tickets are related, this one would most likely be only a consequence of LU-5216, caused by an orphan copytool thread still being present or by deferred or missing cleanup of a related data structure.
So I am closing this one.

Generated at Sat Feb 10 01:53:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.