[LU-13541] Application hang and kernel "divide error" in ll_readpage Created: 08/May/20  Updated: 21/Aug/20  Resolved: 16/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Wang Shilong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-12644 correct fast read & strided readahead... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While launching Slurm jobs on our cluster, some of the jobs hung quite early, in a "dd" command that copies a dataset to a local ext4 filesystem:

ll_file_read_iter+0xa1/0x290 [lustre]
new_sync_read+0x122/0x1b0
__vfs_read+0x29/0x40
vfs_read+0x8e/0x130
ksys_read+0xa7/0xe0

This happened on multiple nodes in the cluster, while other nodes worked fine. It seems to be correlated with a kernel "divide error" (division by zero?) in the kernel log of the affected nodes:

[ 2171.682001] divide error: 0000 [#1] SMP NOPTI
[ 2171.686888] CPU: 133 PID: 35015 Comm: python Tainted: P OE 5.3.0-24-generic #26~18.04.2-Ubuntu
[ 2171.706858] RIP: 0010:ll_readpage+0x25d/0x730 [lustre]
[ 2171.802801] Call Trace:
[ 2171.805539] filemap_fault+0x9be/0x9f0
[ 2171.830810] ll_fault+0xdb/0x710 [lustre]
[ 2171.839869] __do_fault+0x57/0x117
[ 2171.843668] __handle_mm_fault+0xda0/0x1230
[ 2171.848344] handle_mm_fault+0xcb/0x210
[ 2171.852634] __do_page_fault+0x2a1/0x4d0
[ 2171.857018] do_page_fault+0x2c/0xe0
[ 2171.861014] page_fault+0x34/0x40

Judging by the timestamps, some of these errors were caused by these jobs, while others were not (probably triggered by another, unrelated job). Either way, the bad state lingers and blocks anyone who tries to access this particular file. Other files seem fine, but this file is now poisoned.



 Comments   
Comment by Gerrit Updater [ 08/May/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38545
Subject: LU-13541 llite: fix possible divide zero in ll_use_fast_io()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bb54a8ac674f2ed4de3cbc1f5183bab2b43d5dd6

Comment by Andreas Dilger [ 08/May/20 ]
(gdb) x/i ll_readpage+0x25d
0x38a2d <ll_readpage+605>: idiv %rcx

(gdb) disas /s ll_readpage

1524 skip_pages = (ras->ras_stride_length +
0x0000000000038a20 <+592>: mov 0x50(%r10),%rax
0x0000000000038a24 <+596>: add %rcx,%rax

1525 ras->ras_stride_bytes - 1) / ras->ras_stride_bytes;
0x0000000000038a27 <+599>: sub $0x1,%rax

1524 skip_pages = (ras->ras_stride_length +
0x0000000000038a2b <+603>: cqto
0x0000000000038a2d <+605>: idiv %rcx

So, I guess "ll_use_fast_io" was inlined and ras->ras_stride_bytes == 0?
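
To illustrate the failure mode in the disassembly above: the idiv uses %rcx as the divisor, and %rcx holds ras->ras_stride_bytes, so a zero value there raises the divide error. A minimal sketch of a guarded version of the same round-up division follows. This is not the code from the landed patch; the struct ras_sketch type, its field names, and the skip_pages_guarded() helper are hypothetical and model only the two fields visible in the listing, and returning 0 for an unset stride is an assumption made for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for the readahead state. Only the two fields
 * that appear in the disassembled expression are modelled. */
struct ras_sketch {
	int64_t stride_length;	/* models ras->ras_stride_length */
	int64_t stride_bytes;	/* models ras->ras_stride_bytes */
};

/* Round stride_length up to a whole number of stride_bytes units.
 * Assumption for this sketch: when no stride has been set up
 * (stride_bytes <= 0) return 0 rather than dividing by zero. */
static int64_t skip_pages_guarded(const struct ras_sketch *ras)
{
	if (ras->stride_bytes <= 0)
		return 0;

	return (ras->stride_length + ras->stride_bytes - 1) /
	       ras->stride_bytes;
}
```

With stride_bytes == 0 the unguarded expression would fault exactly as in the trace, whereas the guarded helper simply reports that there is nothing to skip.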

Comment by Gerrit Updater [ 16/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38545/
Subject: LU-13541 llite: fix possible divide zero in ll_use_fast_io()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7cd0afe583211a11cfe3c1041e5b982e65769f37

Comment by Peter Jones [ 16/May/20 ]

Landed for 2.14

Generated at Sat Feb 10 03:02:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.