[LU-7988] HSM: high lock contention for cdt_llog_lock Created: 05/Apr/16 Updated: 21/Jan/19 Resolved: 06/Nov/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.4, Lustre 2.10.7 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Frank Zago (Inactive) | Assignee: | Frank Zago (Inactive) |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | cea |
| Attachments: | mdt-cdt-250000.svg |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
There is a severe locking issue around cdt_llog_lock when adding new HSM requests:

# time wc -l /proc/fs/lustre/mdt/snx11133-MDT0000/hsm/actions
219759 /proc/fs/lustre/mdt/snx11133-MDT0000/hsm/actions

real    11m45.068s
user    0m0.020s
sys     0m21.372s

Eleven minutes to cat the list is far too long; such an operation should take a couple of seconds at most. The contention appears to come from the coordinator: every time a new request is posted, the whole list of requests is walked under that lock. That is not a problem when there are only a handful of requests, but it does not scale when there are hundreds of thousands of them. I recompiled a CentOS 7 kernel with CONFIG_LOCK_STAT on a VM and ran a test creating 10000 files and archiving them without a copytool present. Total time was 146 seconds. Lock contention result:

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
&cdt->cdt_llog_lock: 6296 6296 15.45 23074.17 43436574.06 17791 27134 25.09 37745.03 138558199.24
-------------------
&cdt->cdt_llog_lock 6296 [<ffffffffa0fb096d>] cdt_llog_process+0x9d/0x3a0 [mdt]
-------------------
&cdt->cdt_llog_lock 6296 [<ffffffffa0fb096d>] cdt_llog_process+0x9d/0x3a0 [mdt]
[...]
(Time units are microseconds.) With waittime-total = 43 seconds and holdtime-total = 138 seconds, this is a very contentious lock, far above the other locks in Lustre or the whole system. AFAICS, the contention comes from several coordinator mechanisms all walking the llog under this same lock.
The net result is that when there are a lot of requests, they trickle down to the copytool slowly, which exacerbates the problem by keeping the list large.
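For readers less familiar with the pattern being described, here is a minimal userspace model of it in C (pthreads instead of kernel locks, a plain list instead of the llog catalog; all names are hypothetical, this is not Lustre code). Each enqueue walks the entire existing list while holding the lock, so hold time, and therefore contention, grows linearly with queue depth and total work grows quadratically:

/* Model of the contention: every new request is added under a global
 * lock, and the whole existing list is scanned under that same lock
 * (as cdt_llog_process does, per the lock_stat output above). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct request {
	long cookie;
	struct request *next;
};

static pthread_mutex_t llog_lock = PTHREAD_MUTEX_INITIALIZER;
static struct request *head;

static void enqueue_request(long cookie)
{
	struct request *req = malloc(sizeof(*req));
	struct request *it;

	req->cookie = cookie;
	pthread_mutex_lock(&llog_lock);
	for (it = head; it; it = it->next)
		;	/* "process" each existing record: O(n) under the lock */
	req->next = head;
	head = req;
	pthread_mutex_unlock(&llog_lock);
}

int main(void)
{
	/* 20000 requests already cost ~2e8 scanned nodes: quadratic blow-up */
	for (long i = 0; i < 20000; i++)
		enqueue_request(i);
	printf("queued 20000 requests, each scanning all before it\n");
	return 0;
}

With hundreds of thousands of entries, every reader (including a cat of hsm/actions) queues behind these long hold times, which matches the 11-minute wc above. |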
| Comments |
| Comment by Gerrit Updater [ 05/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19341 |
| Comment by Frank Zago (Inactive) [ 05/Apr/16 ] |
|
The above patch is only a partial fix. Executing the loop once per second may still be too frequent, and the interval should probably be configurable too. |
| Comment by Frank Zago (Inactive) [ 06/Apr/16 ] |
|
250000 empty files are created on a filesystem and hsm_archive is requested on them. A copytool is then started to perform these archive requests. The MDT has the patch above. The attached mdt-cdt-250000.svg shows the number of records completed per second over the duration of the archival. The maximum was 389, reached at the end, and the total duration was 1h29min39s. It shows that record processing gets faster over time as the backlog diminishes. |
| Comment by Frank Zago (Inactive) [ 06/Apr/16 ] |
|
As a side note, creating these 250000 files and issuing hsm_archive in batches of 10 took 3160 seconds with this patch. Without the patch, it took 11822 seconds. |
| Comment by Frank Zago (Inactive) [ 06/Apr/16 ] |
|
I think the graph shows that things could be improved further. For instance, the MDT shouldn't need to parse the whole backlog of HSM requests, as only the first ones (up to /proc/.../hsm/max_requests?) have been seen by a client. The other ones still need to be checked regularly for timeout or cancellation, but that check does not need to run nearly as often. In my test setup we should be able to sustain at least 200 requests per second, as seen during the last minute, for the whole duration of the archival (1250 seconds total), instead of the average of 46 per second (5379 seconds total). A sketch of the idea follows.
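A hedged sketch of that idea (hypothetical names and layout, not the actual coordinator code): the hot path stops scanning once max_requests started requests have been seen, while a separate slow pass, run on a much longer period, walks the full backlog for timeouts and cancellations:

#include <stdio.h>

struct hsm_request {
	int started;	/* already sent to a copytool */
	int timed_out;	/* set by the slow pass */
	struct hsm_request *next;
};

/* Hot path: runs on every event, cost bounded by max_requests. */
static int scan_active(struct hsm_request *head, int max_requests)
{
	int seen = 0;

	for (struct hsm_request *r = head; r && seen < max_requests; r = r->next)
		if (r->started)
			seen++;
	return seen;
}

/* Slow path: full walk, scheduled on a long timer. */
static void scan_backlog(struct hsm_request *head, long now, long deadline)
{
	for (struct hsm_request *r = head; r; r = r->next)
		if (r->started && now > deadline)
			r->timed_out = 1;
}

int main(void)
{
	struct hsm_request reqs[5] = {{0}};

	for (int i = 0; i < 4; i++)
		reqs[i].next = &reqs[i + 1];
	reqs[0].started = reqs[1].started = 1;
	printf("hot path saw %d active requests\n", scan_active(reqs, 3));
	scan_backlog(reqs, 100, 50);	/* slow pass flags timeouts */
	return 0;
}

This keeps the per-event cost bounded regardless of backlog size. |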
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19585 |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19584 |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19578 |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19579 |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19580 |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19581 |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19582 |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19583 |
| Comment by Frank Zago (Inactive) [ 14/Apr/16 ] |
|
The above series tries to fix the issue by deferring everything that can be deferred: instead of calling cdt_llog_process for everything every time some new work has to be done, requests are queued and the coordinator processes them once per second. This vastly improves performance with the patches applied. A rough model of the scheme is sketched below.
Tested on a VM with Cray's copytool. The last patch ("hsm: defer all new hals") changes the error path and may introduce some different behaviour when errors happen during restore.
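As a rough userspace model of the deferral scheme (pthreads in place of kernel threads; all names hypothetical): producers only append to a staging list under a short critical section, and a coordinator thread drains the whole list once per second, so the expensive llog processing happens in one batch instead of once per request:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct hal_item {
	int id;
	struct hal_item *next;
};

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static struct hal_item *pending;

/* Producer side: replaces the per-request cdt_llog_process() call. */
static void queue_hal(int id)
{
	struct hal_item *item = malloc(sizeof(*item));

	item->id = id;
	pthread_mutex_lock(&queue_lock);
	item->next = pending;		/* O(1) under the lock */
	pending = item;
	pthread_mutex_unlock(&queue_lock);
}

/* Coordinator side: wakes once per second and drains the batch. */
static void *coordinator(void *arg)
{
	(void)arg;
	for (;;) {
		struct hal_item *batch;

		sleep(1);
		pthread_mutex_lock(&queue_lock);
		batch = pending;	/* steal the whole list at once */
		pending = NULL;
		pthread_mutex_unlock(&queue_lock);

		while (batch) {
			struct hal_item *next = batch->next;
			/* the expensive llog work happens here, unlocked */
			free(batch);
			batch = next;
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, coordinator, NULL);
	for (int i = 0; i < 10000; i++)
		queue_hal(i);
	sleep(3);			/* let the coordinator drain */
	printf("done\n");
	return 0;
}

The lock is now held only for pointer updates, so producers no longer serialize behind a full list walk. |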
| Comment by Andreas Dilger [ 14/Apr/16 ] |
|
Frank, I was wondering what you were looking for when submitting this patch series? All of the patches are submitted with Test-Parameters: forbuildonly, which means that they will only be built but never tested. I'd have thought that at least the last patch wouldn't have this keyword, so that you could get some testing on the patch series. Alternately, you could use Test-Parameters: fortestonly testlist=sanity-hsm to run only the sanity-hsm test on each of the patches. Also, the patch series was submitted twice, which clogged up the build queue a bit. I killed the remaining builds for the first version once I noticed, and all but the last ones for the second series, since they will never be tested. At least you can verify that the full patch series builds on all the architectures. |
| Comment by Frank Zago (Inactive) [ 15/Apr/16 ] |
|
Thanks Andreas. I wanted some reviews before starting the testing on these patches, which may clog your test machines. I've added testlist=sanity-hsm to the last one as you suggested. |
| Comment by Frank Zago (Inactive) [ 18/Apr/16 ] |
|
Dropping http://review.whamcloud.com/19585 as it cannot work efficiently within the current constraints. It's mostly related to restoring files: a lot of checking is done when the request is submitted and the layout is locked. If we delay that, a program accessing a released file will see an ENODATA failure since the layout is not locked. This is why sanity-hsm tests 10c and 11b failed. A smaller optimization might be to queue only the archive operations, which are what an HSM system mostly sees in the real world. Every other operation would act as a barrier.
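A hedged sketch of that archive-only deferral with a barrier (hypothetical names, not an actual patch): archive requests are batched, and any other request type first flushes the batch so ordering is preserved, which keeps the layout-lock checks for restore synchronous:

#include <stdio.h>

enum hsm_op { OP_ARCHIVE, OP_RESTORE, OP_REMOVE, OP_CANCEL };

static long pending[1024];
static int npending;

static void flush_pending_archives(void)
{
	for (int i = 0; i < npending; i++)
		printf("batch-processing archive cookie %ld\n", pending[i]);
	npending = 0;
}

static void submit_request(enum hsm_op op, long cookie)
{
	if (op == OP_ARCHIVE && npending < 1024) {
		pending[npending++] = cookie;	/* defer, O(1) */
		return;
	}
	/* any other op is a barrier: flush, then handle synchronously */
	flush_pending_archives();
	printf("processing op %d, cookie %ld, immediately\n", op, cookie);
}

int main(void)
{
	submit_request(OP_ARCHIVE, 1);
	submit_request(OP_ARCHIVE, 2);
	submit_request(OP_RESTORE, 3);	/* forces the flush first */
	return 0;
}

Restores keep their synchronous semantics while the dominant archive traffic is batched. |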
| Comment by Gerrit Updater [ 29/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19886 |
| Comment by Frank Zago (Inactive) [ 29/Apr/16 ] |
|
The more I look at the coordinator code, the more I feel that using HALs internally is the wrong way to go. They should only be used for communication between nodes. When the MDT receives a message from a client, it builds a HAL out of it, which contains one or more HAIs. Eventually most HAIs are allocated again in mdt_agent_record_add() as a struct llog_agent_req_rec carrying the same information, so we might as well start there.
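A simplified illustration of that suggestion, with hypothetical struct layouts standing in for the real hsm_action_item and llog_agent_req_rec definitions: build the persistent record directly from each incoming action item, instead of copying it into an intermediate in-memory HAL first:

#include <stdlib.h>
#include <string.h>

struct action_item {			/* stand-in for one HAI */
	unsigned int action;
	unsigned long long cookie;
};

struct agent_req_rec {			/* stand-in for llog_agent_req_rec */
	unsigned int status;
	struct action_item hai;		/* the HAI is embedded in the record */
};

/* One allocation per item, straight into the record that will be
 * written to the llog; no intermediate HAL allocation or copy. */
static struct agent_req_rec *record_from_item(const struct action_item *hai)
{
	struct agent_req_rec *rec = calloc(1, sizeof(*rec));

	if (rec)
		memcpy(&rec->hai, hai, sizeof(rec->hai));
	return rec;
}

int main(void)
{
	struct action_item hai = { .action = 1, .cookie = 42 };
	struct agent_req_rec *rec = record_from_item(&hai);

	free(rec);
	return 0;
}

This avoids one full allocate-and-copy cycle per HAI on the MDT. |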
| Comment by Frank Zago (Inactive) [ 03/May/16 ] |
|
The current series can perform 237 archive operations per second. The setup is a single VM with Lustre plus Cray's copytool; zero-sized files are created, hsm_archive'd, and archived immediately by Cray's copytool. The bottleneck is still the processing of the requests in the coordinator, and very likely the root issue is the catalog. |
| Comment by Vinayak (Inactive) [ 09/May/16 ] |
|
Hello Frank, Andreas,
http://review.whamcloud.com/#/c/19578/: tagged build-only.
http://review.whamcloud.com/#/c/19886/: does not contain the build-only tag.
1. The final patch depends on all the patches mentioned above that are tagged build-only, except the abandoned one. Am I right?
2. Also, please let me know if I can help with verification or anything else that would speed up the review process.
3. Would it not be a good idea to submit a single aggregate patch of all of them, to avoid a series of conflicts?
Thanks, |
| Comment by Frank Zago (Inactive) [ 09/May/16 ] |
|
Hi Vinayak,
1. Yes.
2. As this series changes the way HSM requests are processed, care should be taken to ensure the changes make sense and that no undesired behaviour is introduced.
3. It is much easier to review a series of patches that each do one thing than a big one that does many things. |
| Comment by Vinayak (Inactive) [ 10/May/16 ] |
Okay Frank. I will port the changes to our local branch, create dependency, run the hsm tests and report you. |
| Comment by Gerrit Updater [ 17/May/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/20272 |
| Comment by Vinayak (Inactive) [ 07/Jun/16 ] |
|
Hello Frank, Sorry, I am heavily occupied with other work. I did not get time to test the patch set. Thanks, |
| Comment by Vinayak (Inactive) [ 24/Jun/16 ] |
|
Hello Frank, I have tried to reproduce this issue on master (without the patch) using the following script, creating 1000 files (I thought 1000 files would take a long time), but it did not take much time for me: it finished in a fraction of a second. By the way, I am trying it on a single-node setup with 1 MDS and 2 OSTs. Please let me know if I am missing anything. (The script uses sanity-hsm test-framework helpers such as make_small, path2fid and wait_request_state, so it runs inside a test function.)
numfiles=1000
echo "Creating $numfiles files"
for i in $(seq 1 $numfiles); do
make_small $DIR/f501.sanity-hsm$i
done
local fid=$(path2fid $DIR/f501.sanity-hsm$numfiles)
echo "fid of f501.sanity-hsm$numfiles = $fid"
echo "Archiving the files"
for i in $(seq 1 $numfiles); do
$LFS hsm_archive $DIR/f501.sanity-hsm$i
done
sleep 5
# note: 'time' writes its timing to stderr, so capturing it in a variable does not work
time wc -l /proc/fs/lustre/mdt/lustre-MDT0000/hsm/actions
wait_request_state $fid ARCHIVE SUCCEED
Thanks, |
| Comment by Frank Zago (Inactive) [ 24/Jun/16 ] |
|
Hi Vinayak, you need a lot more files: go with at least 100000. I usually test with 250000, and have gone up to 1M files. Let the queue fill up, then start the copytool; you'll then see how slowly the files are being processed. Use small files (e.g. 1 byte) so the copytool isn't the bottleneck. Here's a script I use (6250 iterations x 40 files = 250000 archive requests):
#!/bin/sh
mkdir /mnt/lustre/foo
start=$(date "+%s")
echo Start $start
for x in {1..6250}; do
# print elapsed seconds and the average archive rate (files/second) so far
echo loop $x $(($(date "+%s") - start)) seconds since start $((x * 40 / ($(date "+%s") - start + 1) ))
mkdir /mnt/lustre/foo/d$x
cd /mnt/lustre/foo/d$x
touch f-{1..40}
lfs hsm_archive f-{1..40}
done
echo Done $start to $(date "+%s")
With that kind of setup, you may run into LU-5228 too. |
| Comment by Vinayak (Inactive) [ 27/Jun/16 ] |
|
Thanks Frank. I will try that. |
| Comment by Vinayak (Inactive) [ 29/Jun/16 ] |
|
Hello Frank, update: with the reproducer I observed a delay of 2min 33.750s with 250,000 files.
I tried the same reproducer with patch1 and patch2 applied and observed a delay of 56.946s, which is a much better value.
Thanks, |
| Comment by Frank Zago (Inactive) [ 29/Jun/16 ] |
|
Hi Vinayak, |
| Comment by Vinayak (Inactive) [ 30/Jun/16 ] |
|
Hello Frank, sorry for the wrong link in my last comment. Yes, I am using the same patch you indicated there, i.e. http://review.whamcloud.com/#/c/19886. Just a thought: would it be better to combine these two patches (patch1, patch2) into one, as these two mainly address the cause, and the remaining patches could be rebased on top? Copied from your comment:
I agree, but merge conflicts would cause the other patches to be rebased as well, since they depend on each other. IMO we can reduce the number of patches, maybe to 3. Thanks, |
| Comment by Gerrit Updater [ 06/Jul/16 ] |
|
Jean-Baptiste Riaux (riaux.jb@intel.com) uploaded a new patch: http://review.whamcloud.com/21173 |
| Comment by Gerrit Updater [ 06/Jul/16 ] |
|
Jean-Baptiste Riaux (riaux.jb@intel.com) uploaded a new patch: http://review.whamcloud.com/21174 |
| Comment by Jean-Baptiste Riaux (Inactive) [ 07/Jul/16 ] |
|
Sorry, I made a mistake while merging patches. Patches http://review.whamcloud.com/21173 and http://review.whamcloud.com/21174 have been abandoned. |
| Comment by Gerrit Updater [ 01/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19578/ |
| Comment by Gerrit Updater [ 11/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19579/ |
| Comment by Gerrit Updater [ 11/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19581/ |
| Comment by Gerrit Updater [ 11/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19583/ |
| Comment by Frank Zago (Inactive) [ 11/Aug/16 ] |
|
Series rebased. Current patches, in order, are: http://review.whamcloud.com/19341 |
| Comment by Frank Zago (Inactive) [ 21/Sep/16 ] |
|
Adding a new patch to the series, as the remaining issue seems to be around the shutdown of the coordinator, which the new code appears to trigger. |
| Comment by Gerrit Updater [ 21/Sep/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/22667 |
| Comment by Bruno Faccini (Inactive) [ 22/Sep/16 ] |
|
Frank, shutdown of the CDT is a long-known issue (particularly the fact that CDT struct cleanup can occur from both the MDT umount thread and the CDT thread itself; I never found the time to fix it myself), and your latest patch #22667 seems like the right way to fix it. My only additional concern is whether the cdt_state tests/updates need to be protected to fully avoid racy cases.
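For illustration, one common way to close that kind of race, sketched with hypothetical names (this is not the actual fix that landed): make every read and transition of the state field go through one lock, so the umount path and the CDT thread cannot both win the stop transition and both run cleanup:

#include <pthread.h>
#include <stdbool.h>

enum cdt_state { CDT_RUNNING, CDT_STOPPING, CDT_STOPPED };

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static enum cdt_state state = CDT_RUNNING;

/* Returns true for exactly one caller, which then owns the cleanup. */
static bool cdt_try_stop(void)
{
	bool won = false;

	pthread_mutex_lock(&state_lock);
	if (state == CDT_RUNNING) {
		state = CDT_STOPPING;	/* claim the transition */
		won = true;
	}
	pthread_mutex_unlock(&state_lock);
	return won;
}

static void cdt_mark_stopped(void)
{
	pthread_mutex_lock(&state_lock);
	state = CDT_STOPPED;
	pthread_mutex_unlock(&state_lock);
}

int main(void)
{
	if (cdt_try_stop())	/* first caller wins and does cleanup */
		cdt_mark_stopped();
	return 0;
}

Whichever of the two threads loses cdt_try_stop() simply skips cleanup. |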
| Comment by Gerrit Updater [ 23/Sep/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/22704 |
| Comment by Gerrit Updater [ 27/Sep/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/22761 |
| Comment by Gerrit Updater [ 27/Sep/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/22762 |
| Comment by Gerrit Updater [ 28/Sep/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/22789 |
| Comment by Gerrit Updater [ 29/Sep/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/22823 |
| Comment by Frank Zago (Inactive) [ 04/Oct/16 ] |
|
This series now passes the tests again. |
| Comment by Gerrit Updater [ 08/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22704/ |
| Comment by Gerrit Updater [ 23/Dec/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/22761/ |
| Comment by Cory Spitz [ 24/Apr/17 ] |
|
Some stuff has landed for 2.10, other stuff remains pending. What's the current list of changes that is still pending? |
| Comment by Gerrit Updater [ 17/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/22667/ |
| Comment by Peter Jones [ 19/May/17 ] |
|
As discussed in the LWG this week, work in this area is ongoing. Those patches ready before the code freeze will still land, but the prospects of all of them landing in the remaining time look unlikely, so the rest will end up in a subsequent maintenance release. |
| Comment by Gerrit Updater [ 24/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/22789/ |
| Comment by Gerrit Updater [ 19/Jul/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/22762/ |
| Comment by Gerrit Updater [ 26/Jul/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28219 |
| Comment by Gerrit Updater [ 01/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/19341/ |
| Comment by Gerrit Updater [ 03/Aug/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28219/ |
| Comment by Gerrit Updater [ 04/Aug/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28368 |
| Comment by Gerrit Updater [ 10/Aug/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28368/ |
| Comment by Gerrit Updater [ 31/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/19582/ |
| Comment by Gerrit Updater [ 08/Sep/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28908 |
| Comment by Gerrit Updater [ 13/Sep/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/19584/ |
| Comment by Gerrit Updater [ 13/Sep/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28973 |
| Comment by Gerrit Updater [ 09/Oct/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/20272/ |
| Comment by Gerrit Updater [ 12/Oct/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29583 |
| Comment by Gerrit Updater [ 24/Oct/17 ] |
|
Ben Evans (bevans@cray.com) uploaded a new patch: https://review.whamcloud.com/29742 |
| Comment by Gerrit Updater [ 26/Oct/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28973/ |
| Comment by Gerrit Updater [ 06/Nov/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29742/ |
| Comment by Peter Jones [ 06/Nov/17 ] |
|
As we finally seem to be at the end of the patches queued up for this ticket, let's close it and open a new ticket to track any new fixes identified in this area of code in the future. |
| Comment by Gerrit Updater [ 14/Dec/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30538 |
| Comment by Gerrit Updater [ 19/Dec/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30538/ |
| Comment by Gerrit Updater [ 26/Feb/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29583/ |
| Comment by Aurelien Degremont (Inactive) [ 14/Dec/18 ] |
|
It seems this ticket was wrongly flagged for 2.10.6: the last patch from this series landed on b2_10 was for 2.10.4. There is still one backport which has not landed and is interesting: https://review.whamcloud.com/#/c/28908/2 |
| Comment by Gerrit Updater [ 19/Jan/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/28908/ |