  Lustre / LU-12194

clients getting soft lockups on 2.10.7

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7
    • Components: None
    • Environment: EL 7.4.1708
    • Severity: 3

    Description

      Getting occasional soft lockups on 2.10.7 clients

      kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [ptlrpcd_01_08:11711]

          Activity

            ys Yang Sheng added a comment -

            Hi, Campbell,

            I think you can apply this change to all of the clients that might be impacted by this issue. I'll try to push a patch to make this change easier, but I think it could take a long time. So can we close this one first?

            BTW: you can go back to your original Lustre version to remove the debug patch.

            Thanks,
            Yangsheng
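
Applying the change to every client can be made idempotent, so it is safe to re-run from whatever rollout tooling is in use. A minimal sketch (the local filename stands in for /etc/modprobe.d/ko2iblnd.conf used in this ticket; distributing it to hosts is left to your own tooling):

```shell
# Append the libcfs option only if it is not already present, so the
# snippet can be re-run safely on every client. 'conf' points at a local
# file for illustration; on a real client it would be
# /etc/modprobe.d/ko2iblnd.conf.
conf=ko2iblnd.conf
line='options libcfs cpu_npartitions=6'
touch "$conf"
grep -qxF "$line" "$conf" || echo "$line" >> "$conf"
# A second run is a no-op:
grep -qxF "$line" "$conf" || echo "$line" >> "$conf"
```

The `grep -qxF` guard matches the whole line as a fixed string, so repeated runs never duplicate the option.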


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            I'm not collecting spt_table_data at the moment, but I also haven't seen any soft lockups since the changes were made. So where do we go from here? Do I just add these options to all clients on 2.10.7? Or is there a patch imminent to prevent the issue with the default CPU topology?

            Kind regards,

            Campbell

            ys Yang Sheng added a comment -

            Hi, Campbell,

            Could you please tell me the status of the site? Are you still collecting spt_table data?

            Thanks,
            YangSheng

            ys Yang Sheng added a comment - edited

            Hi, Campbell,

            Yes, I am sorry, there was a typo in my comment. Please test with this pattern to see whether the lockup can be reproduced.

            BTW: The 'options libcfs cpu_npartitions=6' can be removed.

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            I had to modify it slightly to get it to work, as it complained:

            May 24 10:38:49 bravo2 kernel: LNetError: 21221:0:(linux-cpu.c:1151:cfs_cpu_init()) Failed to create cptab from pattern '[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]'

            I modified cpu_pattern to give the first set a partition number, so I have:

            alias ko2iblnd-opa ko2iblnd
            options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
            options libcfs cpu_npartitions=6
            options libcfs cpu_pattern=0[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]
            
            install ko2iblnd /usr/sbin/ko2iblnd-probe
            

            So I get:

            cpu_partition_table=
            0 : 0 2 4 6 8 10
            1 : 12 14 16 18 20 22
            2 : 1 3 5 7 9 11
            3 : 13 15 17 19 21 23

            Which looks like what we want, I assume.
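
Since the pattern string is easy to mistype, it can also be generated from the per-partition core lists rather than written by hand. A minimal sketch using the layout above (the core lists are this host's; adjust them for other topologies, e.g. from 'lscpu -p'):

```shell
# Build the libcfs cpu_pattern string from per-partition core lists, so
# the partition numbers cannot be mistyped or omitted. The four lists
# below match the 4-partition layout shown in this ticket.
pattern=""
i=0
for cores in "0,2,4,6,8,10" "12,14,16,18,20,22" "1,3,5,7,9,11" "13,15,17,19,21,23"; do
    pattern="${pattern}${i}[${cores}]"
    i=$((i + 1))
done
echo "options libcfs cpu_pattern=${pattern}"
```

This emits the same `options libcfs cpu_pattern=0[...]1[...]2[...]3[...]` line as above, with the leading partition number guaranteed to be present.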

            ys Yang Sheng added a comment -

            Hi, Campbell,

            I note that you have 2 NUMA nodes, so we need to partition explicitly as below:

            options libcfs cpu_pattern=[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]
            
            

            Or you can use 'modprobe libcfs cpu_pattern=[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]'

            Thanks,
            YangSheng

            ys Yang Sheng added a comment -

            Hi, Campbell,

            Please add the "options libcfs cpu_npartitions=6" as a NEW line. Alternatively, you can run 'modprobe libcfs cpu_npartitions=6'
            before mounting Lustre, which avoids changing any files.

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment - edited

            Hi YangSheng,

            I added the modprobe line and reloaded lustre modules, but it is not working:

            May 23 18:39:20 bravo2 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2
            May 23 18:46:16 bravo2 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2

            cpu_partition_table=
            0 : 0 2 4 6 8 10 12 14 16 18 20 22
            1 : 1 3 5 7 9 11 13 15 17 19 21 23

            /etc/modprobe.d/ko2iblnd.conf

            alias ko2iblnd-opa ko2iblnd
            options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4 libcfs cpu_npartitions=6
            
            install ko2iblnd /usr/sbin/ko2iblnd-probe
            

            Having a look at what I'm doing wrong

            ys Yang Sheng added a comment -

            Hi, Campbell,

            Please add this line to /etc/modprobe.d/ko2iblnd.conf:

            options libcfs cpu_npartitions=6
            
            

            And then reload the Lustre modules and verify whether the lockup is still hit. Please make sure the setting is effective with 'lctl get_param cpu_partition_table'.

            Thanks,
            YangSheng
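
The verification step can also be scripted for a fleet-wide check. A minimal sketch that counts partitions in cpu_partition_table output (the sample table here is the two-partition default reported elsewhere in this ticket; on a client, feed the variable from 'lctl get_param -n cpu_partition_table' instead):

```shell
# Count partitions in cpu_partition_table output. Each partition is
# printed as one "N : cpu cpu ..." line, so counting " : " lines gives
# the partition count. Sample data from this ticket; on a client use:
#   table=$(lctl get_param -n cpu_partition_table)
table='0 : 0 2 4 6 8 10 12 14 16 18 20 22
1 : 1 3 5 7 9 11 13 15 17 19 21 23'
npart=$(printf '%s\n' "$table" | grep -c ' : ')
echo "npartitions: $npart"
```

Comparing `npart` against the intended cpu_npartitions value (6 here) would flag any client where the option did not take effect.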


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            All clients have:

            cpu_partition_table=
            0 : 0 2 4 6 8 10 12 14 16 18 20 22
            1 : 1 3 5 7 9 11 13 15 17 19 21 23

            Regards,
            Campbell

            ys Yang Sheng added a comment -

            Hi, Campbell,

            Could you please collect the following data:

            # lctl get_param cpu_partition_table
            

            Thanks,
            YangSheng


            People

              ys Yang Sheng
              cmcl Campbell Mcleay (Inactive)
              Votes: 0
              Watchers: 8
