[LU-8057] o2iblnd driver is causing memory corruption due to improper handling of scatter list. Created: 22/Apr/16 Updated: 19/Mar/19 Resolved: 15/Jun/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James A Simmons | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
Any installation running Lustre on top of an InfiniBand stack. |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
A bug was discovered in the upstream kernel in the o2iblnd driver's handling of the scatter list, tx->tx_frag. The fix, which walks the list with sg_next, exposed a more serious bug: when all 256 pages allocated for fragments are in use and the data starts at a non-zero offset, an extra random page of memory is stomped on. |
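For illustration only, the page arithmetic behind the description can be sketched in plain C. This is not code from the patch: the 4 KiB page size, the helper names and the `SKETCH_` constants are assumptions made for the example. It shows that a 1 MiB transfer fits in 256 entries when it starts on a page boundary, but touches 257 pages once it starts at a non-zero offset.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration: 4 KiB pages, 256-entry fragment table. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_MAX_FRAGS 256u

/*
 * Number of pages touched by `nob` bytes that begin `page_offset`
 * bytes into the first page (hypothetical helper, not a Lustre API).
 */
static size_t sketch_frags_needed(size_t page_offset, size_t nob)
{
    assert(page_offset < SKETCH_PAGE_SIZE);
    if (nob == 0)
        return 0;
    /* Round the end of the transfer up to a whole page. */
    return (page_offset + nob + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Returns 1 when the transfer needs more entries than the table holds. */
static int sketch_overruns_table(size_t page_offset, size_t nob)
{
    return sketch_frags_needed(page_offset, nob) > SKETCH_MAX_FRAGS;
}
```

With these assumed values, `sketch_frags_needed(0, 256 * 4096)` is 256, while `sketch_frags_needed(512, 256 * 4096)` is 257, one entry more than the table holds.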
| Comments |
| Comment by Oleg Drokin [ 22/Apr/16 ] |
|
Can you please link to the actual details of the problems here? |
| Comment by James A Simmons [ 22/Apr/16 ] |
|
I created a patch that contains the backported patch as well as a possible fix. It's at http://review.whamcloud.com/#/c/19342. The original patch was sent to Greg directly, so it's not in the driver-devel mailing list archives. In the upstream staging tree you can inspect this change at git commit 3d1477309806459d39e13d8c3206ba35d183c34a |
| Comment by Andreas Dilger [ 25/Apr/16 ] |
|
James, could you clarify if the memory corruption issue is related to the use of sg_next(), or if it also existed with the sg++ implementation but was just harder to detect? That will affect which Lustre versions this patch needs to be backported to. |
| Comment by Peter Jones [ 25/Apr/16 ] |
|
Doug, could you please review this patch? Thanks, Peter |
| Comment by James A Simmons [ 25/Apr/16 ] |
|
The memory corruption exists with the current implementation using sg++. If the offset is not zero and all 256 pages for the fragments are used, then an extra random page gets stomped on. The current code made this harder to detect because the corruption was silent. In the upstream client, once sg++ was replaced by sg = sg_next(), there was no extra random page to stomp on, so I was seeing failures in my testing. Let me reproduce the failures and post them here. |
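A userspace sketch of why the two traversal styles behave differently, for illustration only. `struct mock_sg`, `mock_sg_next` and the fill helpers are hypothetical stand-ins, not the kernel's `struct scatterlist` or the driver's code. The table has one extra guard slot standing in for whatever memory follows the real table. Pointer increment writes into the guard slot without noticing, whereas a traversal that honours the end marker runs out of entries and reports it. How that failure surfaces in the real driver is not shown here.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_NENTS 4  /* valid entries; one guard slot follows them */

/* Hypothetical stand-in for a scatterlist entry. */
struct mock_sg {
    int is_last;  /* end marker on the final valid entry */
    int written;  /* set once the entry has been filled */
};

/* Clear every slot, guard included, and mark the last valid entry. */
static void mock_init(struct mock_sg tbl[MOCK_NENTS + 1])
{
    for (size_t i = 0; i < MOCK_NENTS + 1; i++) {
        tbl[i].is_last = 0;
        tbl[i].written = 0;
    }
    tbl[MOCK_NENTS - 1].is_last = 1;
}

/* Mimics the idea of sg_next(): NULL once the end marker is reached. */
static struct mock_sg *mock_sg_next(struct mock_sg *sg)
{
    return sg->is_last ? NULL : sg + 1;
}

/* Fill `want` entries with plain pointer increment; never notices the end. */
static int fill_with_increment(struct mock_sg *tbl, size_t want)
{
    struct mock_sg *sg = tbl;

    assert(want <= MOCK_NENTS + 1);  /* stay inside the mock storage */
    for (size_t i = 0; i < want; i++) {
        sg->written = 1;
        sg++;
    }
    return 0;
}

/* Fill `want` entries via the end marker; returns -1 if the table runs out. */
static int fill_with_next(struct mock_sg *tbl, size_t want)
{
    struct mock_sg *sg = tbl;

    for (size_t i = 0; i < want; i++) {
        if (sg == NULL)
            return -1;
        sg->written = 1;
        sg = mock_sg_next(sg);
    }
    return 0;
}
```

Asking either helper for `MOCK_NENTS + 1` entries models the case above: `fill_with_increment` returns 0 and leaves the guard slot marked as written, while `fill_with_next` returns -1 and leaves it untouched.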
| Comment by Gerrit Updater [ 14/Jun/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19342/ |
| Comment by Joseph Gmitter (Inactive) [ 15/Jun/16 ] |
|
Patch has landed on master for 2.9.0. |