HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3616] HSM restore for execute allows writes to file Created: 22/Jul/13  Updated: 31/Dec/13  Resolved: 25/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Technical task Priority: Major
Reporter: John Hammond Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Rank (Obsolete): 9297

 Description   

Using the Jul 22, 2013 HSM stack, executing a released file (and thereby triggering a restore) leaves the file writable while it's being executed.

# cd /mnt/lustre
# cp /bin/sleep SLEEP
# lfs hsm_archive SLEEP
# sleep 1
# lfs hsm_release SLEEP
# ./SLEEP 10 && echo DONE &
[1] 4243
# sleep 1
# pgrep -l SLEEP
4244 SLEEP
# cd /mnt/lustre2
# echo 'Hi!' > SLEEP
# cat SLEEP
Hi!
# -bash: line 238:  4244 Bus error               (core dumped) ./SLEEP 10

[1]+  Exit 135                ./SLEEP 10 && echo DONE  (wd: /mnt/lustre)
(wd now: /mnt/lustre2)


 Comments   
Comment by Bruno Faccini (Inactive) [ 05/Sep/13 ]

The normal behavior (without any HSM actions/commands) would be for "echo 'Hi!' > SLEEP" to fail with "Text file busy"/ETXTBSY.
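
For comparison, the expected non-HSM behavior can be sketched on a local Linux filesystem. This is only an illustration under stated assumptions: check_write_denied is a hypothetical helper (it is not part of Lustre or sanity-hsm), the kernel is assumed to refuse writes to a running executable with ETXTBSY, and the temporary directory is assumed to allow exec.

```shell
#!/bin/sh
# Hypothetical helper (not from Lustre): copy a binary, run the copy,
# then try to overwrite it while it is still executing.
check_write_denied() {
    dir=$(mktemp -d) || return 2
    # Keep the name "sleep" so multi-call binaries still pick the right applet.
    cp /bin/sleep "$dir/sleep" || { rm -rf "$dir"; return 2; }

    "$dir/sleep" 5 &
    pid=$!
    sleep 1    # give the child time to exec() the copied binary

    # The redirection opens the file for writing; on Linux this is
    # expected to fail with "Text file busy" while the copy is running.
    if (echo 'Hi!' > "$dir/sleep") 2>/dev/null; then
        result="write allowed"
    else
        result="write denied"
    fi

    kill "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null || true
    rm -rf "$dir"
    echo "$result"
}
```

Calling check_write_denied and reading its output gives a quick local baseline; the bug described here is that the equivalent write through a second Lustre mount was allowed instead.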

And access through the second mount (/mnt/lustre2) is the key ...

I am walking through the code to see where we missed something during hsm_release.

Comment by Bruno Faccini (Inactive) [ 11/Sep/13 ]

This behavior was introduced in both the mdt_mfd_open() and mdt_object_open_lock() routines (in lustre/mdt/mdt_open.c) by commit c42b426c87c3d3b1dc9eda612cc831293dc80d68, from Gerrit Change-Id Ic8f82ddc9a56206307c2e5be2523fb7ce42b8638 (at http://review.whamcloud.com/3035), for ticket LU-1338 (now HSM-5).

And Oleg already warned about this in his comment on that change!

I wonder if I can simply revert these changes to get the correct behavior, and I would like to get Aurelien's feedback on this, since he is the original author of the change.

Comment by Aurelien Degremont (Inactive) [ 12/Sep/13 ]

I did not write this part of the patch, but it seems it could be changed. I'm trusting Oleg regarding this.
If this fixes the case in the code snippet you've posted, I'm fine with it. Just ensure that restore at exec is still working.

Comment by Bruno Faccini (Inactive) [ 25/Sep/13 ]

A first patch attempt is at http://review.whamcloud.com/7636. The build is OK, but the auto-tests never started, so I just re-triggered them.

Comment by Bruno Faccini (Inactive) [ 02/Oct/13 ]

Patch-set #1 of http://review.whamcloud.com/7636 passed auto-tests, and the original problem no longer occurs when running John's reproducer.

I will submit patch-set #2 with the same code plus a new, dedicated sub-test in sanity-hsm, based on John's reproducer.

Comment by Bruno Faccini (Inactive) [ 11/Oct/13 ]

Patch-set #2 of Change #7636 passed auto-tests, including its new sanity-hsm/test_30c sub-test.

Restore on exec() continues to work, but any write attempted while the file is being executed is now refused and fails.

BTW, reading the code of sub-tests test_30[a,b], which cover the same area (exec() on released files), I was surprised by the following comment:

# restore at exec cannot work on agent node (because of Linux kernel
# protection of executables)
needclients 2 || return 0
...

at their beginning.
Are the comment and the "needclients 2" check still relevant? In my latest tests, restore at exec() also works on the agent node (I tested on a single node running everything).

Comment by Peter Jones [ 25/Oct/13 ]

Landed for 2.6

Comment by Aurelien Degremont (Inactive) [ 12/Nov/13 ]

This should also be considered for 2.5.1

Comment by Peter Jones [ 12/Nov/13 ]

Yes, it is being tracked for 2.5.1.
