[LU-8057] o2iblnd driver is causing memory corruption due to improper handling of scatter list. Created: 22/Apr/16  Updated: 19/Mar/19  Resolved: 15/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

Any installation running Lustre on top of an InfiniBand stack.


Issue Links:
Related
is related to LU-4423 Tracking of patches from upstream ker... Resolved
is related to LU-8715 Regression from LU-8057 causes loadin... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A bug was discovered in the upstream kernel in the handling of the scatter list, tx->tx_frag, in the o2iblnd driver. A fix that uses sg_next() was introduced, but it revealed a serious bug: when all 256 pages allocated for fragments are used and the data starts at a non-zero offset, an extra random page of memory is stomped on.
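
As a rough illustration of why a non-zero offset matters, the sketch below counts how many pages a buffer touches. It is not code from the driver: the names toy_pages_spanned and toy_overruns_frags are hypothetical, the 4096-byte page size is an assumption, and the 256-slot limit is taken from the description above. It shows one plausible reading of the arithmetic: a transfer that fills 256 pages exactly when page-aligned touches 257 pages once it starts part-way into the first page.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size, for illustration only */
#define TOY_MAX_FRAGS 256u   /* fragment slots, per the description above */

/*
 * Number of pages touched by `len` bytes of data that begin `offset`
 * bytes into the first page. `offset` must be smaller than a page.
 */
static unsigned int toy_pages_spanned(unsigned int offset, unsigned int len)
{
    assert(offset < TOY_PAGE_SIZE);
    if (len == 0)
        return 0;
    /* Round (offset + len) up to a whole number of pages. */
    return (offset + len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}

/* Nonzero when the transfer needs more fragment slots than exist. */
static int toy_overruns_frags(unsigned int offset, unsigned int len)
{
    return toy_pages_spanned(offset, len) > TOY_MAX_FRAGS;
}
```

Under these assumptions, toy_pages_spanned(0, 256 * 4096) is 256, while toy_pages_spanned(512, 256 * 4096) is 257, one more than the number of fragment slots.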



 Comments   
Comment by Oleg Drokin [ 22/Apr/16 ]

Can you please link to the actual details of the problems here?

Comment by James A Simmons [ 22/Apr/16 ]

I created a patch that contains the backported patch as well as a possible fix. It's at http://review.whamcloud.com/#/c/19342. The original patch was sent to Greg directly, so it's not in the driver-devel mailing archives. Using the upstream staging tree, you can inspect this change at git commit 3d1477309806459d39e13d8c3206ba35d183c34a

Comment by Andreas Dilger [ 25/Apr/16 ]

James, could you clarify if the memory corruption issue is related to the use of sg_next(), or if it also existed with the sg++ implementation but was just harder to detect? That will affect which Lustre versions this patch needs to be backported to.

Comment by Peter Jones [ 25/Apr/16 ]

Doug

Could you please review this patch?

Thanks

Peter

Comment by James A Simmons [ 25/Apr/16 ]

The memory corruption exists with the current implementation using sg++. If the offset is not zero and all 256 pages for the fragments are used, then an extra random page gets stomped on. The current code makes this harder to detect because the corruption is silent. In the upstream client, when sg++ was replaced by sg = sg_next(sg), there was no extra random page to stomp on, so I was seeing failures in my testing. Let me duplicate the failures and post them here.
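
The difference between the two ways of walking the list can be sketched with a toy model. This is not the kernel scatterlist API: struct toy_sg, toy_sg_next, toy_fill_increment and toy_fill_next are hypothetical names, and the model only mimics the fact that the kernel's sg_next() returns NULL once the entry marked as last has been passed, while a plain pointer increment keeps going.

```c
#include <stddef.h>

/* Toy stand-in for a scatter list entry; not the kernel's struct scatterlist. */
struct toy_sg {
    unsigned int length;  /* bytes described by this entry */
    int is_last;          /* nonzero on the final entry of the list */
};

/* Mimics sg_next(): returns NULL after the entry marked as last. */
static struct toy_sg *toy_sg_next(struct toy_sg *sg)
{
    return sg->is_last ? NULL : sg + 1;
}

/*
 * "sg++" style: writes `want` entries no matter where the list ends.
 * The caller must guarantee that `want` entries of storage exist.
 */
static unsigned int toy_fill_increment(struct toy_sg *sg, unsigned int want,
                                       unsigned int len)
{
    unsigned int n;

    for (n = 0; n < want; n++) {
        sg->length = len;
        sg++;
    }
    return n;
}

/*
 * "sg = sg_next(sg)" style: stops when the list runs out and returns
 * how many entries were actually filled, so a shortfall is visible.
 */
static unsigned int toy_fill_next(struct toy_sg *sg, unsigned int want,
                                  unsigned int len)
{
    unsigned int n = 0;

    while (n < want && sg != NULL) {
        sg->length = len;
        n++;
        sg = toy_sg_next(sg);
    }
    return n;
}
```

With a three-entry list followed in memory by an unrelated fourth entry, asking either helper for four fragments gives different outcomes: toy_fill_increment silently overwrites the fourth entry, while toy_fill_next fills three and reports the shortfall. That mirrors the contrast described above between silent corruption and a visible failure.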

Comment by Gerrit Updater [ 14/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19342/
Subject: LU-8057 ko2iblnd: Replace sg++ with sg = sg_next(sg)
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d226464acaacccd240da43dcc22372fbf8cb04a6

Comment by Joseph Gmitter (Inactive) [ 15/Jun/16 ]

Patch has landed to master for 2.9.0.

Generated at Sat Feb 10 02:14:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.