Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.17.0
Affects Version/s: Lustre 2.14.0, Lustre 2.16.1
Labels: None
Severity: 3

Description

Running "lfs migrate" or "lfs mirror resync" on a file that is much larger than RAM can livelock the calling thread in the kernel under memory pressure. "lfs" then spins at 100% CPU usage while trying to access the buffer pages used for the migrate/mirror copy:

[<0>] do_swap_page+0xaf/0x790
[<0>] __handle_mm_fault+0x552/0x6d0
[<0>] handle_mm_fault+0xca/0x2a0
[<0>] __get_user_pages+0x250/0x830
[<0>] get_user_pages_unlocked+0xd5/0x2a0
[<0>] internal_get_user_pages_fast+0x193/0x2c0
[<0>] iov_iter_get_pages_alloc+0x110/0x4c0
[<0>] ll_direct_IO_impl+0x30f/0xc50 [lustre]
[<0>] generic_file_read_iter+0x8f/0x150
[<0>] vvp_io_read_start+0x597/0x840 [lustre]
[<0>] cl_io_start+0x5d/0x110 [obdclass]
[<0>] cl_io_loop+0x9a/0x200 [obdclass]
[<0>] ll_file_io_generic+0xa83/0xf90 [lustre]
[<0>] ll_file_read_iter+0x9de/0xd20 [lustre]
[<0>] new_sync_read+0x10f/0x160
[<0>] vfs_read+0x91/0x150
[<0>] ksys_pread64+0x65/0xa0
[<0>] do_syscall_64+0x5b/0x1a0

This can happen when the command is run on the OSS where an object being read or written is loading pages into cache, or when other processes are reading into the client page cache (e.g. "lfs mirror extend", which calls mirror_extend_file() and does not open files with O_DIRECT).

It appears that the buffer pages used by migrate_copy_data() for both migrate and resync get swapped out under pressure and cannot be faulted in by the kernel.
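For readers unfamiliar with this kind of copy loop, the following is a minimal Python sketch of the pattern, not the actual lfs code. It assumes a single page-aligned user buffer that is reused for every read and write. The function name, buffer size, and the extra_flags parameter are hypothetical. Passing os.O_DIRECT on Linux approximates the direct IO case, where the kernel must pin (and, if they were swapped out, fault in) the buffer pages on every iteration.

```python
import mmap
import os


def copy_with_aligned_buffer(src_path, dst_path, buf_size=1 << 20, extra_flags=0):
    """Copy src_path to dst_path through one reusable page-aligned buffer.

    Hypothetical illustration only. Pass extra_flags=os.O_DIRECT (Linux) to
    bypass the page cache. Handling of an unaligned final block, which
    O_DIRECT usually requires, is deliberately omitted from this sketch.
    """
    total = 0
    # An anonymous mmap is page-aligned, which direct IO requires of its buffer.
    with mmap.mmap(-1, buf_size) as buf, memoryview(buf) as view:
        src = os.open(src_path, os.O_RDONLY | extra_flags)
        try:
            dst = os.open(dst_path,
                          os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags,
                          0o644)
            try:
                while True:
                    # The same user pages are the IO target on every pass; if
                    # they were swapped out, the kernel has to fault them in.
                    n = os.readv(src, [view])
                    if n == 0:
                        break
                    written = 0
                    while written < n:
                        written += os.write(dst, view[written:n])
                    total += n
            finally:
                os.close(dst)
        finally:
            os.close(src)
    return total
```

With the default extra_flags=0 the copy goes through the page cache, which is the buffered IO case described above for "lfs mirror extend".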

This was easily and repeatedly reproduced on a client-on-OSS node with 4GB RAM running the el8.10 4.18.0-553.50.1 kernel while migrating a 30GB file. It was also reproduced on a standalone client with 128GB RAM running 40 copies of "lfs mirror extend" (buffered IO) and "lfs mirror resync" (direct IO), each on a separate file, with some of the files over 2TB.
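When reproducing, it can help to check whether a spinning "lfs" thread matches the stack shown in the description. The following is a small sketch that parses stack text in the "[<0>] symbol+0xoff/0xlen [module]" format and reports whether do_swap_page appears above ll_direct_IO_impl. The helper names are hypothetical, and the text is assumed to have been captured separately (for example from /proc/PID/stack, which normally requires root).

```python
import re

# One frame per line, e.g. "[<0>] ll_direct_IO_impl+0x30f/0xc50 [lustre]"
FRAME_RE = re.compile(
    r"^\[<[0-9a-fx]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_frames(text):
    """Return a list of (symbol, module_or_None), innermost frame first."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((m.group(1), m.group(2)))
    return frames


def looks_like_swapin_in_direct_io(text):
    """True if a swap-in fault is being handled inside the Lustre direct IO path."""
    symbols = [sym for sym, _ in parse_frames(text)]
    if "do_swap_page" not in symbols or "ll_direct_IO_impl" not in symbols:
        return False
    # do_swap_page must be the inner (earlier listed) frame.
    return symbols.index("do_swap_page") < symbols.index("ll_direct_IO_impl")
```

A single matching sample only shows that a fault was in progress at that moment. Repeated samples showing the same frames while "lfs" stays at 100% CPU are what would suggest the livelock described here.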

Issue Links

is cloned by: LU-19267 OSS readcache page allocations can deadlock during local IO (Open)

is related to: LU-14043 lfs mirror extend need not use O_DIRECT on source (Resolved)

is related to: LU-19147 don't try to cache large objects (In Progress)

People

Assignee: Andreas Dilger

Reporter: Andreas Dilger

Votes: 0

Watchers: 10

Dates

Created: 25/Jun/25 4:45 PM

Updated: 25/Sep/25 3:02 PM

Resolved: 25/Sep/25 3:02 PM