Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.14.0
Affects Version/s: None
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807

While launching Slurm jobs on our cluster, some of the jobs hung quite early, in a "dd" command that copies a dataset to a local ext4 filesystem:

ll_file_read_iter+0xa1/0x290 [lustre]
new_sync_read+0x122/0x1b0
__vfs_read+0x29/0x40
vfs_read+0x8e/0x130
ksys_read+0xa7/0xe0

The hang occurred on multiple nodes of the cluster, while other nodes worked fine. It seems to be correlated with a kernel "divide error" (division by zero?) in the kernel log of the affected nodes:

[ 2171.682001] divide error: 0000 1 SMP NOPTI
[ 2171.686888] CPU: 133 PID: 35015 Comm: python Tainted: P OE 5.3.0-24-generic #26~18.04.2-Ubuntu
[ 2171.706858] RIP: 0010:ll_readpage+0x25d/0x730 [lustre]
[ 2171.802801] Call Trace:
[ 2171.805539] filemap_fault+0x9be/0x9f0
[ 2171.830810] ll_fault+0xdb/0x710 [lustre]
[ 2171.839869] __do_fault+0x57/0x117
[ 2171.843668] __handle_mm_fault+0xda0/0x1230
[ 2171.848344] handle_mm_fault+0xcb/0x210
[ 2171.852634] __do_page_fault+0x2a1/0x4d0
[ 2171.857018] do_page_fault+0x2c/0xe0
[ 2171.861014] page_fault+0x34/0x40

Judging by the timestamps, some of these errors were caused by these jobs, while others were not (probably caused by another, unrelated job). Either way, the bad state lingers and blocks anyone trying to access this particular file. Other files seem fine, but this file is now poisoned.
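To check which nodes logged the error and when, the kernel log can be scanned for "divide error" entries and the faulting function that follows them. The following is only a minimal sketch, assuming the log lines look like the excerpt above; the function name `find_divide_errors` and both regular expressions are illustrative and not part of Lustre or Slurm.

```python
import re

# Matches a kernel log line such as "[ 2171.682001] divide error: 0000 ..."
DIVIDE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+divide error:")
# Matches the faulting-instruction line, e.g.
# "[ 2171.706858] RIP: 0010:ll_readpage+0x25d/0x730 [lustre]"
RIP_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+RIP:\s+\S+:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def find_divide_errors(log_text):
    """Return (timestamp_seconds, function, module) for each divide error.

    function and module are None when no RIP line follows the error
    before the next divide error or the end of the log.
    """
    hits = []
    pending = None  # timestamp of a divide error still waiting for its RIP line
    for line in log_text.splitlines():
        m = DIVIDE_RE.match(line)
        if m:
            if pending is not None:
                hits.append((pending, None, None))
            pending = float(m.group(1))
            continue
        m = RIP_RE.match(line)
        if m and pending is not None:
            hits.append((pending, m.group(1), m.group(2)))
            pending = None
    if pending is not None:
        hits.append((pending, None, None))
    return hits
```

Feeding it the saved `dmesg` output of each node gives a list of timestamps (seconds since boot) that can then be compared against the job start times; entries whose function is `ll_readpage` in module `lustre` correspond to the trace shown above.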

is related to: LU-12644 correct fast read & strided readahead interaction (Resolved)

Assignee: Wang Shilong (Inactive)
Reporter: Andreas Dilger
Votes: 0
Watchers: 4
Created: 08/May/20 9:48 PM
Updated: 21/Aug/20 8:40 PM
Resolved: 16/May/20 1:55 PM
