Lustre / LU-12194

clients getting soft lockups on 2.10.7

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7
    • Labels: None
    • Environment: EL 7.4.1708
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      Getting occasional soft lockups on 2.10.7 clients

      kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [ptlrpcd_01_08:11711]

          Activity
            ys Yang Sheng added a comment -

            Hi, Campbell,

            Please add "options libcfs cpu_npartitions=6" as a NEW line. Alternatively, you can run 'modprobe libcfs cpu_npartitions=6'
            before mounting Lustre, which avoids changing any files.

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment - edited

            Hi YangSheng,

            I added the modprobe line and reloaded the Lustre modules, but it is not working:

            May 23 18:39:20 bravo2 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2
            May 23 18:46:16 bravo2 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2

            cpu_partition_table=
            0 : 0 2 4 6 8 10 12 14 16 18 20 22
            1 : 1 3 5 7 9 11 13 15 17 19 21 23

            /etc/modprobe.d/ko2iblnd.conf:

            alias ko2iblnd-opa ko2iblnd
            options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4 libcfs cpu_npartitions=6
            install ko2iblnd /usr/sbin/ko2iblnd-probe

            Having a look at what I'm doing wrong.
            ys Yang Sheng added a comment -

            Hi, Campbell,

            Please add this line into /etc/modprobe.d/ko2iblnd.conf.

            options libcfs cpu_npartitions=6
            
            

            Then reload the Lustre modules and verify whether the lockup is still hit. Please confirm the setting took effect with 'lctl get_param cpu_partition_table'.

            Thanks,
            YangSheng
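
            For reference, with the line added as instructed, the /etc/modprobe.d/ko2iblnd.conf posted earlier would end up looking roughly like this (a sketch based on the file contents quoted above; the key point is that the libcfs option must be its own `options` line, not appended to the ko2iblnd-opa options):

            ```
            alias ko2iblnd-opa ko2iblnd
            options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
            options libcfs cpu_npartitions=6
            install ko2iblnd /usr/sbin/ko2iblnd-probe
            ```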


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            All clients have:

            cpu_partition_table=
            0 : 0 2 4 6 8 10 12 14 16 18 20 22
            1 : 1 3 5 7 9 11 13 15 17 19 21 23

            Regards,
            Campbell
            ys Yang Sheng added a comment -

            Hi, Campbell,

            Could you please collect data as below:

            # lctl get_param cpu_partition_table
            

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Latest one (only one CPU core locked up):

            LNetEQAlloc:000:0,0,0,0,0,0,0,0,0,0:3:
            LNetEQAlloc:001:0,0,0,0,0,0,0,0,0,0:0:
            LNetEQAlloc:002:0,0,0,0,0,0,0,0,0,0:0:
            LNetMEAttach:000:0,0,0,0,0,0,0,0,0,0:0:
            LNetMEAttach:001:1863,90,7,0,0,0,0,0,0,0:94076607:
            LNetMEAttach:002:8331,102,44,0,0,0,0,0,0,0:455990518:
            LNetMDAttach:000:0,0,0,0,0,0,0,0,0,0:0:
            LNetMDAttach:001:2874,67,14,0,0,0,0,0,0,0:94076607:
            LNetMDAttach:002:1490,93,62,0,0,0,0,0,0,0:455990518:
            LNetSetLazyPortal:000:0,0,0,0,0,0,0,0,0,0:1:
            LNetSetLazyPortal:001:0,0,0,0,0,0,0,0,0,0:0:
            LNetSetLazyPortal:002:0,0,0,0,0,0,0,0,0,0:0:
            lnet_res_lock_current:000:0,0,0,0,0,0,0,0,0,0:0:
            lnet_res_lock_current:001:0,0,0,0,0,0,0,0,0,0:208589267:
            lnet_res_lock_current:002:0,0,0,0,0,0,0,0,0,0:337491708:
            LNetPut:000:0,0,0,0,0,0,0,0,0,0:0:
            LNetPut:001:903,904,904,904,213,32,61,62,1,0:208589267:
            LNetPut:002:2042,3861,3862,16575,483,40,56,56,8,0:337491708:
            lnet_finalize:000:0,0,0,0,0,0,0,0,0,0:0:
            lnet_finalize:001:17604,1020,23,0,0,0,0,0,0,0:301706190:
            lnet_finalize:002:18562,110,55,0,0,0,0,0,0,0:789653414:
            lnet_ptl_match_md:000:0,0,0,0,0,0,0,0,0,0:0:
            lnet_ptl_match_md:001:113272,1277,1160,707,0,0,0,0,0,0:94578970:
            lnet_ptl_match_md:002:24143,1081,799,52,0,0,0,0,0,0:459115756:
            LNetMDUnlink:000:0,0,0,0,0,0,0,0,0,0:0:
            LNetMDUnlink:001:6202,11217,15551,15551,319,67,68,69,69,46:93967533:
            LNetMDUnlink:002:212311,212312,212314,212314,173,100,101,101,77,80:446618837:
            lnet_ptl_match_delay:000:0,0,0,0,0,0,0,0,0,0:0:
            lnet_ptl_match_delay:001:62,1,0,0,0,0,0,0,0,0:89980:
            lnet_ptl_match_delay:002:36,20,0,0,0,0,0,0,0,0:47777:
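
            As an aside, the dump above is a regular colon-separated format, so it can be parsed for trending across lockups with a few lines of Python. The field meanings assumed here (function name, CPT index, comma-separated histogram buckets, total call count) are my reading of the thread, not documented fact:

            ```python
            def parse_lock_stat(line):
                """Parse one line like 'LNetPut:001:903,...:208589267:' into
                (name, cpt, buckets, total). Field meanings are inferred from
                the discussion in this ticket, not from Lustre documentation."""
                name, cpt, buckets, total = line.strip().rstrip(':').split(':')
                return name, int(cpt), [int(b) for b in buckets.split(',')], int(total)

            # Example: flag entries with non-zero counts in the upper buckets,
            # i.e. the call sites that saw the longest lock hold times.
            sample = [
                "LNetEQAlloc:000:0,0,0,0,0,0,0,0,0,0:3:",
                "LNetMDUnlink:001:6202,11217,15551,15551,319,67,68,69,69,46:93967533:",
            ]
            for name, cpt, buckets, total in map(parse_lock_stat, sample):
                if any(buckets[5:]):
                    print(name, cpt)  # → LNetMDUnlink 1
            ```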
            ys Yang Sheng added a comment -

            Hi, Campbell,

            -- Should I use 'lustre_rmmod' and then 'modprobe lustre' after a lockup is detected?
            No, please collect the data after a lockup without removing the modules.

            -- Should I keep sending you data after lockups, or do you have enough to work with for now?
            Yes, please send the data after every lockup.

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Should I use 'lustre_rmmod' and then 'modprobe lustre' after a lockup is detected? And should I keep sending you data after lockups, or do you have enough to work with for now?

            Kind regards,

            Campbell
            ys Yang Sheng added a comment -

            Hi, Campbell,

            The latest data is enough, unless you reload the lnet module after a lockup. From the log, it looks like the delay is not that high.

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            As it is collecting cpt table data when there are lockups, I assume that it is showing the maximum hold time of the lock on the CPU - or have I got that wrong? Should I just gather the data every minute? Please let me know what periods you will need.

            Kind regards,

            Campbell
            ys Yang Sheng added a comment -

            Hi, Campbell,

            The patch gathers the maximum hold time of the cpt lock, so the later the data, the better.

            Thanks,
            Yangsheng


            People

              ys Yang Sheng
              cmcl Campbell Mcleay (Inactive)
              Votes: 0
              Watchers: 8