Lustre / LU-12194

clients getting soft lockups on 2.10.7

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7
    • Labels: None
    • Environment: EL 7.4.1708
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      Getting occasional soft lockups on 2.10.7 clients

      kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [ptlrpcd_01_08:11711]

          Activity
            ys Yang Sheng added a comment -

            Hi, Campbell,

            Please add "options libcfs cpu_npartitions=6" as a NEW line. Alternatively, you can run 'modprobe libcfs cpu_npartitions=6'
            before mounting Lustre, which avoids changing any files.

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment - edited

            Hi YangSheng,

            I added the modprobe line and reloaded the Lustre modules, but it is not working:

            May 23 18:39:20 bravo2 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2
            May 23 18:46:16 bravo2 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2

            cpu_partition_table=
            0 : 0 2 4 6 8 10 12 14 16 18 20 22
            1 : 1 3 5 7 9 11 13 15 17 19 21 23

            /etc/modprobe.d/ko2iblnd.conf:

            alias ko2iblnd-opa ko2iblnd
            options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4 libcfs cpu_npartitions=6
            install ko2iblnd /usr/sbin/ko2iblnd-probe

            Having a look at what I'm doing wrong.
            ys Yang Sheng added a comment -

            Hi, Campbell,

            Please add this line into /etc/modprobe.d/ko2iblnd.conf.

            options libcfs cpu_npartitions=6
            
            

            Then reload the Lustre modules and verify whether the lockup is still hit. Please confirm the setting took effect with 'lctl get_param cpu_partition_table'.

            Thanks,
            YangSheng
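
            For reference, with the line added as instructed, the /etc/modprobe.d/ko2iblnd.conf posted earlier would end up looking roughly like this (a sketch based on the file contents quoted above; the key point is that the libcfs option must be its own `options` line, not appended to the ko2iblnd-opa options):

            ```
            alias ko2iblnd-opa ko2iblnd
            options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
            options libcfs cpu_npartitions=6
            install ko2iblnd /usr/sbin/ko2iblnd-probe
            ```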


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            All clients have:

            cpu_partition_table=
            0 : 0 2 4 6 8 10 12 14 16 18 20 22
            1 : 1 3 5 7 9 11 13 15 17 19 21 23

            Regards,
            Campbell
            ys Yang Sheng added a comment -

            Hi, Campbell,

            Could you please collect data as below:

            # lctl get_param cpu_partition_table
            

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Latest one (only one CPU core locked up):

            LNetEQAlloc:000:0,0,0,0,0,0,0,0,0,0:3:
            LNetEQAlloc:001:0,0,0,0,0,0,0,0,0,0:0:
            LNetEQAlloc:002:0,0,0,0,0,0,0,0,0,0:0:
            LNetMEAttach:000:0,0,0,0,0,0,0,0,0,0:0:
            LNetMEAttach:001:1863,90,7,0,0,0,0,0,0,0:94076607:
            LNetMEAttach:002:8331,102,44,0,0,0,0,0,0,0:455990518:
            LNetMDAttach:000:0,0,0,0,0,0,0,0,0,0:0:
            LNetMDAttach:001:2874,67,14,0,0,0,0,0,0,0:94076607:
            LNetMDAttach:002:1490,93,62,0,0,0,0,0,0,0:455990518:
            LNetSetLazyPortal:000:0,0,0,0,0,0,0,0,0,0:1:
            LNetSetLazyPortal:001:0,0,0,0,0,0,0,0,0,0:0:
            LNetSetLazyPortal:002:0,0,0,0,0,0,0,0,0,0:0:
            lnet_res_lock_current:000:0,0,0,0,0,0,0,0,0,0:0:
            lnet_res_lock_current:001:0,0,0,0,0,0,0,0,0,0:208589267:
            lnet_res_lock_current:002:0,0,0,0,0,0,0,0,0,0:337491708:
            LNetPut:000:0,0,0,0,0,0,0,0,0,0:0:
            LNetPut:001:903,904,904,904,213,32,61,62,1,0:208589267:
            LNetPut:002:2042,3861,3862,16575,483,40,56,56,8,0:337491708:
            lnet_finalize:000:0,0,0,0,0,0,0,0,0,0:0:
            lnet_finalize:001:17604,1020,23,0,0,0,0,0,0,0:301706190:
            lnet_finalize:002:18562,110,55,0,0,0,0,0,0,0:789653414:
            lnet_ptl_match_md:000:0,0,0,0,0,0,0,0,0,0:0:
            lnet_ptl_match_md:001:113272,1277,1160,707,0,0,0,0,0,0:94578970:
            lnet_ptl_match_md:002:24143,1081,799,52,0,0,0,0,0,0:459115756:
            LNetMDUnlink:000:0,0,0,0,0,0,0,0,0,0:0:
            LNetMDUnlink:001:6202,11217,15551,15551,319,67,68,69,69,46:93967533:
            LNetMDUnlink:002:212311,212312,212314,212314,173,100,101,101,77,80:446618837:
            lnet_ptl_match_delay:000:0,0,0,0,0,0,0,0,0,0:0:
            lnet_ptl_match_delay:001:62,1,0,0,0,0,0,0,0,0:89980:
            lnet_ptl_match_delay:002:36,20,0,0,0,0,0,0,0,0:47777:
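
            As an aside, the dump above is a regular colon-separated format, so it can be parsed for trending across lockups with a few lines of Python. The field meanings assumed here (function name, CPT index, comma-separated histogram buckets, total call count) are my reading of the thread, not documented fact:

            ```python
            def parse_lock_stat(line):
                """Parse one line like 'LNetPut:001:903,...:208589267:' into
                (name, cpt, buckets, total). Field meanings are inferred from
                the discussion in this ticket, not from Lustre documentation."""
                name, cpt, buckets, total = line.strip().rstrip(':').split(':')
                return name, int(cpt), [int(b) for b in buckets.split(',')], int(total)

            # Example: flag entries with non-zero counts in the upper buckets,
            # i.e. the call sites that saw the longest lock hold times.
            sample = [
                "LNetEQAlloc:000:0,0,0,0,0,0,0,0,0,0:3:",
                "LNetMDUnlink:001:6202,11217,15551,15551,319,67,68,69,69,46:93967533:",
            ]
            for name, cpt, buckets, total in map(parse_lock_stat, sample):
                if any(buckets[5:]):
                    print(name, cpt)  # → LNetMDUnlink 1
            ```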
            ys Yang Sheng added a comment -

            Hi, Campbell,

            -- Should I use 'lustre_rmmod' and then 'modprobe lustre' after a lockup is detected?
            No, please collect the data after a lockup without removing the modules.

            -- Should I keep sending you data after lockups, or do you have enough to work with for now?
            Yes, please send the data after every lockup.

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Should I use 'lustre_rmmod' and then 'modprobe lustre' after a lockup is detected? And should I keep sending you data after lockups, or do you have enough to work with for now?

            Kind regards,

            Campbell
            ys Yang Sheng added a comment -

            Hi, Campbell,

            The latest data is enough, unless you reload the lnet module after a lockup. From the log, it looks like the delay is not that high.

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            As it is collecting cpt table data when there are lockups, I assume that it is showing the maximum hold time of the lock on the CPU - or have I got that wrong? Should I just gather the data every minute? Please let me know what periods you will need.

            Kind regards,

            Campbell
            ys Yang Sheng added a comment -

            Hi, Campbell,

            The patch gathers the maximum hold time of the cpt lock, so the later the data, the better.

            Thanks,
            Yangsheng


            People

              ys Yang Sheng
              cmcl Campbell Mcleay (Inactive)
              Votes: 0
              Watchers: 8