Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
Description
While launching Slurm jobs on our cluster, some of the jobs hung quite early, in a "dd" command that copies a dataset to a local ext4 filesystem:
ll_file_read_iter+0xa1/0x290 [lustre]
new_sync_read+0x122/0x1b0
__vfs_read+0x29/0x40
vfs_read+0x8e/0x130
ksys_read+0xa7/0xe0
This happened on multiple nodes of the cluster, while on other nodes the copy works fine. The hang seems to be correlated with a kernel "divide error" (division by zero?) in the kernel log of the affected nodes:
[ 2171.682001] divide error: 0000 1 SMP NOPTI
[ 2171.686888] CPU: 133 PID: 35015 Comm: python Tainted: P OE 5.3.0-24-generic #26~18.04.2-Ubuntu
[ 2171.706858] RIP: 0010:ll_readpage+0x25d/0x730 [lustre]
[ 2171.802801] Call Trace:
[ 2171.805539] filemap_fault+0x9be/0x9f0
[ 2171.830810] ll_fault+0xdb/0x710 [lustre]
[ 2171.839869] __do_fault+0x57/0x117
[ 2171.843668] __handle_mm_fault+0xda0/0x1230
[ 2171.848344] handle_mm_fault+0xcb/0x210
[ 2171.852634] __do_page_fault+0x2a1/0x4d0
[ 2171.857018] do_page_fault+0x2c/0xe0
[ 2171.861014] page_fault+0x34/0x40
Judging by the timestamps, some of these errors were caused by these jobs, while others were not (probably triggered by another, unrelated job). Either way, the bad state lingers and blocks anyone who tries to access this particular file. Other files seem fine, but this file is now poisoned.
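For anyone wanting to check which nodes logged such an error and when, here is a minimal sketch in Python. It assumes the kernel log of each node has been saved as text with one message per line (for example from dmesg) and that the lines look like the excerpt above. The function name find_divide_errors and the regular expressions are illustrative, not part of Lustre or any existing tool.

```python
import re

# "[ 2171.682001] divide error: ..." -> captures the timestamp (seconds since boot)
DIVIDE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+divide error:")
# "[ ...] RIP: 0010:ll_readpage+0x25d/0x730 [lustre]" -> function and optional module
RIP_RE = re.compile(
    r"RIP:\s+[0-9a-f]+:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def find_divide_errors(log_text):
    """Return (timestamp, function, module) for each divide error in the log.

    The module is None when the RIP line names no kernel module.
    """
    results = []
    pending = None  # timestamp of a divide error still waiting for its RIP line
    for line in log_text.splitlines():
        m = DIVIDE_RE.search(line)
        if m:
            pending = float(m.group(1))
            continue
        m = RIP_RE.search(line)
        if m and pending is not None:
            results.append((pending, m.group(1), m.group(2)))
            pending = None
    return results


if __name__ == "__main__":
    sample = "\n".join([
        "[ 2171.682001] divide error: 0000 1 SMP NOPTI",
        "[ 2171.706858] RIP: 0010:ll_readpage+0x25d/0x730 [lustre]",
    ])
    for ts, func, module in find_divide_errors(sample):
        print(ts, func, module)
```

The timestamps are seconds since boot, so comparing them with Slurm job start times requires the boot time of each node. Filtering the results on module == "lustre" separates these errors from unrelated divide errors.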
Issue Links
- is related to
- LU-12644 correct fast read & strided readahead interaction (Resolved)