
Regression from LU-8057 causes loading of fld.ko to hang in 2.7.2

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Affects Version/s: Lustre 2.7.0
    • Environment: lustre server nas-2.7.2-3nasS running on CentOS 6.7
    • Severity: 3

    Description

      Since our nas-2.7.2-2nas was rebased against b2_7_fe to produce nas-2.7.2-3nas, we have found that loading the lustre module fld.ko hangs. Modprobe took 100% CPU time and could not be killed.

      I identified the culprit of the problem using git bisect:
      commit f23e22da88f07e95071ec76807aaa42ecd39e8ca
      Author: Amitoj Kaur Chawla <amitoj1606@gmail.com>
      Date: Thu Jun 16 23:12:03 2016 +0800

      LU-8057 ko2iblnd: Replace sg++ with sg = sg_next(sg)

      It was a b2_7_fe backport of the following one:
      Lustre-commit: d226464acaacccd240da43dcc22372fbf8cb04a6
      Lustre-change: http://review.whamcloud.com/19342
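
      For context, the kind of change the bisected commit makes looks roughly like the sketch below (illustrative only, not the ko2iblnd source; the helper name is made up): sg_next() follows chained scatterlist tables, whereas the old sg++ only steps through a flat array of entries.

      #include <linux/scatterlist.h>

      /* Illustrative sketch only, not the ko2iblnd source: sums the byte
       * length of a scatterlist while walking it the way the patched code
       * does (sg_next()/for_each_sg() instead of sg++). */
      static unsigned int total_sg_bytes(struct scatterlist *sgl, int nents)
      {
              struct scatterlist *sg;
              unsigned int total = 0;
              int i;

              /* Old pattern: "for (...; i++, sg++)" assumes one flat array
               * of entries and walks off the end of a chained table.
               * New pattern: sg_next() (via for_each_sg) follows chain links. */
              for_each_sg(sgl, sg, nents, i)
                      total += sg->length;

              return total;
      }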

      Attachments

        Issue Links

          Activity


            Doug,

            You wrote in a previous comment:
            "If this needs to be fixed quickly, then removing LU-8085 from your build will be the best approach. As ORNL needs LU-8085, we cannot remove it from master."

            Did you actually mean to write "LU-8057"? We do not have LU-8057 in our git repo...

            jaylan Jay Lan (Inactive) added a comment

            As best we can figure out, the change in LU-8057 causes a little more memory to be used per connection. That is pushing your system over the edge.

            As James has indicated, a proper fix will be to change how memory is allocated in LNet. That is going to take some time to get right, as the potential to break all of LNet is pretty good.

            I don't believe the fix for LU-8057 is critical for your setup. If this needs to be fixed quickly, then removing LU-8085 from your build will be the best approach. As ORNL needs LU-8085, we cannot remove it from master.

            doug Doug Oucharek (Inactive) added a comment

            Any updates?

            mhanafi Mahmoud Hanafi added a comment

            Why not. The problem is the LIBCFS_ALLOC and FREE macros. Looking at the macros gave me a headache, so no patch from me. I need to get into the right mental state to tackle it.

            simmonsja James A Simmons added a comment

            James, can we do that fix under this ticket?

            doug Doug Oucharek (Inactive) added a comment

            I know exactly what your problem is. We saw this problem in the lustre core some time ago and changed the OBD_ALLOC macros. The libcfs/LNet layer uses its own LIBCFS_ALLOC macros, which means that when allocations are more than 2 pages in size they hit the vmalloc spinlock serialization issue. We need a fix for libcfs much like the one lustre had.

            simmonsja James A Simmons added a comment
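
            For reference, a rough sketch of the allocation pattern being described (illustrative only; these are not the actual LIBCFS_ALLOC/OBD_ALLOC macros, and the helper names are made up). The old behaviour sends any request larger than a couple of pages straight to vmalloc(), which serializes on a global lock on older kernels; the lustre-side fix tries kmalloc() first and falls back to vmalloc() only when that fails.

            #include <linux/slab.h>
            #include <linux/vmalloc.h>
            #include <linux/mm.h>

            /* Old behaviour: anything bigger than two pages goes to vmalloc. */
            static void *alloc_old_style(size_t size)
            {
                    if (size > 2 * PAGE_SIZE)
                            return vzalloc(size);
                    return kzalloc(size, GFP_NOFS);
            }

            /* kvmalloc-style fix: prefer kmalloc, fall back to vmalloc on failure. */
            static void *alloc_kmalloc_first(size_t size)
            {
                    void *ptr = kzalloc(size,
                                        GFP_NOFS | __GFP_NOWARN | __GFP_NORETRY);

                    return ptr ? ptr : vzalloc(size);
            }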

            perf top showed that during module load nearly all the CPU time is spent in __vmalloc_node.

            Samples: 748K of event 'cycles', Event count (approx.): 53812402443
            Overhead  Shared Object            Symbol
              96.21%  [kernel]                 [k] __vmalloc_node
               0.91%  [kernel]                 [k] read_hpet
               0.28%  [kernel]                 [k] get_vmalloc_info
               0.26%  [kernel]                 [k] __write_lock_failed
               0.25%  [kernel]                 [k] __read_lock_failed
               0.05%  [kernel]                 [k] apic_timer_interrupt
               0.05%  [kernel]                 [k] _spin_lock
               0.04%  perf                     [.] dso__find_symbol
               0.03%  [kernel]                 [k] find_busiest_group
               0.03%  [kernel]                 [k] clear_page_c
               0.03%  [kernel]                 [k] page_fault
               0.03%  [kernel]                 [k] memset
               0.02%  [kernel]                 [k] rcu_process_gp_end
               0.02%  perf                     [.] perf_evsel__parse_sample
               0.02%  [kernel]                 [k] sha_transform
               0.02%  [kernel]                 [k] native_write_msr_safe
            
            mhanafi Mahmoud Hanafi added a comment

            We have >12,000 clients. We do see some servers consume all the credits.

            mhanafi Mahmoud Hanafi added a comment

            @Bruno Faccini: Yes, I can reproduce the problem on our freshly rebooted lustre servers by doing 'modprobe fld'.

            jaylan Jay Lan (Inactive) added a comment

            Hi Doug,

            Can you please have a look into the issue since it relates to the LU-8057 change?

            Thanks.
            Joe

            jgmitter Joseph Gmitter (Inactive) added a comment

            The fix is correct and it fixes a real bug. What this change did was expose another problem in the ko2iblnd driver. I have to ask: is your system really consuming all those credits? I don't think the IB driver queue pair depth is big enough to handle all those credits.

            simmonsja James A Simmons added a comment (edited)

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: jaylan Jay Lan (Inactive)
              Votes: 0
              Watchers: 9
