[LU-12335] mb_prealloc_table table read/write code is racy - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.13.0
Affects Version/s: None
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

The preallocation table read/write code is racy. It is possible to access memory outside of the allocated table.
This issue is easy to reproduce. I am not sure whether I should upload a test that causes the test system to crash, so I am putting the reproducer here:

dd if=/dev/zero of=<path_to_ldiskfs_partition> bs=1048576 count=1024 conv=fsync
echo "32 64 128 256" > /proc/fs/ldiskfs/<dev>/prealloc_table
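
The two commands above can be combined into a concurrent sketch. This assumes the race is hit when the table is rewritten while block allocations are in flight, which the description implies but does not state. The helper name `rewrite_table_loop`, its arguments, and the second table string are hypothetical and are not taken from the ticket or the patch.

```shell
#!/bin/sh
# Hypothetical helper: alternately write two table strings to TABLE_FILE.
# On a real server TABLE_FILE would be /proc/fs/ldiskfs/<dev>/prealloc_table.
# The second table ("4 8 16 32 64 128") is an assumption chosen only so that
# the table length changes between writes.
rewrite_table_loop() {
    table_file=$1
    iterations=$2
    i=0
    while [ "$i" -lt "$iterations" ]; do
        if [ $((i % 2)) -eq 0 ]; then
            echo "32 64 128 256" > "$table_file"
        else
            echo "4 8 16 32 64 128" > "$table_file"
        fi
        i=$((i + 1))
    done
}

# Possible use on a disposable test node (an unpatched kernel may crash):
#   dd if=/dev/zero of=<path_to_ldiskfs_partition> bs=1048576 count=1024 conv=fsync &
#   rewrite_table_loop /proc/fs/ldiskfs/<dev>/prealloc_table 1000
#   wait
```

Passing the table path as an argument keeps the loop usable against an ordinary file for a dry run before pointing it at the proc entry.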

Issue Links

is related to

LU-12511 Prepare lustre for adoption into the linux kernel

Open

is related to

LU-12103 Improve block allocation for large partitions

Resolved

Activity

Artem Blagodarenko (Inactive) added a comment - 10/Sep/19 2:00 PM

There is no need to send this patch to ext4 upstream because there is no such bug there. The bug was introduced in our ldiskfs patches.

James A Simmons added a comment - 10/Sep/19 1:58 PM

Please make sure this is pushed to the ext4 maintainers.

Peter Jones added a comment - 10/Sep/19 1:45 PM

As per recent LWG discussion this ticket should be marked as RESOLVED and anyone wanting to keep SLES/Ubuntu servers in sync should do that under a separate ticket

James A Simmons added a comment - 16/Jun/19 2:44 PM - edited

This fix was only ever applied to RHEL platforms. SLES and Ubuntu lack this fix.

Peter Jones added a comment - 29/May/19 12:58 PM

Landed for 2.13

Gerrit Updater added a comment - 29/May/19 4:24 AM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34950/
Subject: LU-12335 ldiskfs: fixed size preallocation table
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f15995b8e52bafabe55506ad2e12c8a64a373948

Artem Blagodarenko (Inactive) added a comment - 25/May/19 5:31 AM

The algorithm is not difficult, as you can see in the script, so it can be added to the kernel. The most difficult decision is when to reconfigure the preallocation table. With the script, the administrator decides and then changes the configuration.

> (hard for most users to configure, can die if there are problems (e.g. OOM),

My suggestion is to add it to the cluster scripts and adjust automatically.

> become CPU starved if the server is busy

Changing the preallocation table is quite a fast operation and, with patch https://review.whamcloud.com/34950, safe and lockless.

> needs extra scanning to learn current filesystem state and may become out of sync with the kernel).

The scanning is done by the kernel. The script uses the "/proc/fs/ldiskfs/loop1/mb_groups" output. These statistics are ideal data for such a decision. In any case, even in the kernel we would need to use these statistics.
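
To illustrate how the mb_groups output could feed such a decision, here is a minimal sketch that totals the free-chunk counters per buddy order across all groups. It is not the script discussed in this ticket. The function name `sum_buddy_orders` is hypothetical, and the line layout (`#<group>: free frags first [ c0 c1 ... ]`, with a header line starting with `#group`) is an assumption about the mb_groups format.

```shell
#!/bin/sh
# Sum the per-order free-chunk counters found between "[" and "]" on each
# group line of an mb_groups dump and print one "order total" pair per line.
# Reads the files given as arguments, or standard input if there are none.
sum_buddy_orders() {
    awk '
        /^#group/ { next }                  # skip the header line
        {
            start = index($0, "[")
            end = index($0, "]")
            if (start == 0 || end <= start) next
            # Counters are whitespace-separated inside the brackets.
            n = split(substr($0, start + 1, end - start - 1), c, " ")
            for (i = 1; i <= n; i++) total[i - 1] += c[i]
            if (n > max) max = n
        }
        END {
            for (i = 0; i < max; i++) printf "%d %d\n", i, total[i]
        }
    ' "$@"
}
```

On a server this might be run as `sum_buddy_orders /proc/fs/ldiskfs/loop1/mb_groups`. The per-order totals give a rough picture of how many free chunks of each size remain, which is the kind of input a table-tuning policy would need.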

Andreas Dilger added a comment - 24/May/19 8:38 PM

A novel script has been developed to dynamically adjust the block device pre-allocation table. This controls the number of pre-allocated blocks that are created for the request size in logarithmic increments starting at 4. As file systems fragment and become filled, some free block groups will simply not be available. Because of this, the block allocator should be tuned to address this on a regular basis. 

How hard would it be to include this into the mballoc code in the kernel directly? Having a userspace tool is OK, but suffers from a number of limitations (hard for most users to configure, can die if there are problems (e.g. OOM), become CPU starved if the server is busy, needs extra scanning to learn current filesystem state and may become out of sync with the kernel).
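
As a small illustration of "logarithmic increments starting at 4", the sketch below builds a table string of doubling values. Reading "logarithmic" as powers of two is an assumption, and the function name `gen_prealloc_table` and its upper-bound argument are hypothetical.

```shell
#!/bin/sh
# Print a space-separated list of powers of two from 4 up to MAX_BLOCKS
# (inclusive). Prints an empty line if MAX_BLOCKS is below 4.
gen_prealloc_table() {
    max_blocks=$1
    size=4
    table=""
    while [ "$size" -le "$max_blocks" ]; do
        # Append with a separating space, except for the first entry.
        table="${table:+$table }$size"
        size=$((size * 2))
    done
    echo "$table"
}
```

The resulting string has the same shape as the value written to `prealloc_table` in the description, so a tuning script could shorten or extend the table by changing only the upper bound.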

People

Assignee: Artem Blagodarenko (Inactive)

Reporter: Artem Blagodarenko (Inactive)

Votes: 0

Watchers: 5

Dates

Created: 24/May/19 7:25 AM

Updated: 10/Sep/19 2:00 PM

Resolved: 10/Sep/19 1:45 PM