[LU-13541] Application hang and kernel "divide error" in ll_readpage Created: 08/May/20 Updated: 21/Aug/20 Resolved: 16/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Wang Shilong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
While launching Slurm jobs on our cluster, some of the jobs hung quite early, in a "dd" command where a dataset is copied to a local ext4 filesystem:

ll_file_read_iter+0xa1/0x290 [lustre]
new_sync_read+0x122/0x1b0
__vfs_read+0x29/0x40
vfs_read+0x8e/0x130
ksys_read+0xa7/0xe0

It happened on multiple nodes in the cluster, but on other nodes it works fine. It seems to be correlated with a kernel "divide error" (division by zero?) in the kernel log of the affected nodes:

[ 2171.682001] divide error: 0000 1 SMP NOPTI
[ 2171.686888] CPU: 133 PID: 35015 Comm: python Tainted: P OE 5.3.0-24-generic #26~18.04.2-Ubuntu
[ 2171.706858] RIP: 0010:ll_readpage+0x25d/0x730 [lustre]
[ 2171.802801] Call Trace:
[ 2171.805539] filemap_fault+0x9be/0x9f0
[ 2171.830810] ll_fault+0xdb/0x710 [lustre]
[ 2171.839869] __do_fault+0x57/0x117
[ 2171.843668] __handle_mm_fault+0xda0/0x1230
[ 2171.848344] handle_mm_fault+0xcb/0x210
[ 2171.852634] __do_page_fault+0x2a1/0x4d0
[ 2171.857018] do_page_fault+0x2c/0xe0
[ 2171.861014] page_fault+0x34/0x40

Judging by the timestamps, some of these errors were caused by these jobs. Others were not (probably caused by another, unrelated job), but the bad state lingers and blocks anyone who wants to access this particular file. Other files seem fine, but this file is now poisoned. |
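To check which nodes show the same correlation, the kernel log can be scanned for "divide error" entries and the faulting symbol on the following RIP line. The script below is only a minimal sketch, not part of the original report: the function name `find_divide_errors` and the regular expressions are illustrative assumptions based on the log format quoted above, and other kernel versions may format these lines differently.

```python
import re

# Kernel-log timestamp prefix such as "[ 2171.682001] " (format assumed from the report).
TS = r"\[\s*(\d+\.\d+)\]\s+"
DIVIDE_RE = re.compile(TS + r"divide error:")
# RIP line such as "RIP: 0010:ll_readpage+0x25d/0x730 [lustre]"; the module tag is optional.
RIP_RE = re.compile(TS + r"RIP: \d+:(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")


def find_divide_errors(log_lines):
    """Return (timestamp, symbol, module) for each divide error followed by a RIP line.

    Hypothetical helper: module is None when the RIP line carries no "[module]" tag.
    """
    hits = []
    pending = None  # timestamp of a divide error still waiting for its RIP line
    for line in log_lines:
        m = DIVIDE_RE.search(line)
        if m:
            pending = float(m.group(1))
            continue
        m = RIP_RE.search(line)
        if m and pending is not None:
            hits.append((pending, m.group(2), m.group(3)))
            pending = None
    return hits


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for ts, symbol, module in find_divide_errors(sys.stdin):
        print(f"{ts:.6f} divide error in {symbol} [{module}]")
```

Running this against the kernel log of each node, and filtering for hits in `ll_readpage`, would give a quick list of nodes where the file may be in the bad state.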
| Comments |
| Comment by Gerrit Updater [ 08/May/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38545 |
| Comment by Andreas Dilger [ 08/May/20 ] |
|
| Comment by Gerrit Updater [ 16/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38545/ |
| Comment by Peter Jones [ 16/May/20 ] |
|
Landed for 2.14 |