clients getting soft lockups on 2.10.7

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.10.7
    • None
    • EL 7.4.1708
    • 3
    • 9223372036854775807

    Description

      Getting occasional soft lockups on 2.10.7 clients

      kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [ptlrpcd_01_08:11711]

          Activity

            [LU-12194] clients getting soft lockups on 2.10.7
            ys Yang Sheng added a comment -

            Hi, Campbell,

            You can refer to the documentation at http://doc.lustre.org/lustre_manual.xhtml#dbdoclet.libcfstuning. However, we still don't have a detailed standard for CPT configuration, since it really depends on the situation.

            Thanks,
            Yangsheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            Are you able to confirm what the general rule is for partitioning?

            Thanks,
            Campbell

            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            What is the general rule for setting cpu_npartitions? Is it the number of NUMA node CPUs divided by the number of NUMA nodes?

            Thanks,
            Campbell

            ys Yang Sheng added a comment -

            Hi, Campbell,

            No, the CPT is set automatically, so we don't need to set it manually. That is done for UMA nodes, but it looks like it is not done on NUMA nodes.

            Thanks,
            Yangsheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Thanks Yangsheng. So the proposed patch will be to modify ko2iblnd.conf?

            ys Yang Sheng added a comment -

            Hi, Campbell,

            I think you can apply this change to all of the clients that might be impacted by this issue. I'll try to push a patch to make this change easier, but I think it could take a long time. So can we close this one first?

            BTW: you can go back to your original Lustre version to remove the debug patch.

            Thanks,
            Yangsheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            I'm not collecting spt_table_data at the moment, but I also haven't seen any soft lockups since the changes were made. So what next from here? Do I just add these options to all clients on 2.10.7? Or is there a patch imminent to prevent the issue with the default CPU topology?

            Kind regards,
            Campbell

            ys Yang Sheng added a comment -

            Hi, Campbell,

            Could you please tell me the status of the site? Are you still collecting spt_table data?

            Thanks,
            YangSheng

            ys Yang Sheng added a comment - - edited

            Hi, Campbell,

            Yes, I am sorry, I had a typo in my comment. Please test with this pattern to see whether the lockup can be reproduced.

            BTW: The 'options libcfs cpu_npartitions=6' can be removed.

            Thanks,
            YangSheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            I had to modify it slightly to get it to work, as it complained:

            May 24 10:38:49 bravo2 kernel: LNetError: 21221:0:(linux-cpu.c:1151:cfs_cpu_init()) Failed to create cptab from pattern '[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]'

            I modified cpu_pattern to have a partition number for the first set, so I have:

            alias ko2iblnd-opa ko2iblnd
            options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
            options libcfs cpu_npartitions=6
            options libcfs cpu_pattern=0[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]

            install ko2iblnd /usr/sbin/ko2iblnd-probe

            So I get:

            cpu_partition_table=
            0 : 0 2 4 6 8 10
            1 : 12 14 16 18 20 22
            2 : 1 3 5 7 9 11
            3 : 13 15 17 19 21 23

            Which looks like what we want, I assume.

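            As a rough illustration of why the first pattern was rejected and the second accepted, the sketch below parses the `N[cpu,cpu,...]` form used in this comment and prints a table in the same layout as cpu_partition_table. It is a hypothetical helper written for this thread, not the libcfs parser: the function names are made up, and it only understands the comma-separated form shown here, not whatever other syntax libcfs may accept.

```python
import re

# One group looks like "2[1,3,5,7,9,11]": a partition number, then a CPU list.
_GROUP = re.compile(r"\s*(\d+)\[([\d,\s]+)\]")


def parse_cpu_pattern(pattern):
    """Return {partition: [cpus]} for a pattern like '0[0,2]1[1,3]'.

    Illustrative only: every group must start with an explicit partition
    number, and no CPU may appear in more than one partition.
    """
    pattern = pattern.strip()
    table = {}
    seen = set()
    pos = 0
    while pos < len(pattern):
        match = _GROUP.match(pattern, pos)
        if match is None:
            snippet = pattern[pos:pos + 20]
            raise ValueError(f"cannot parse group at offset {pos}: {snippet!r}")
        part = int(match.group(1))
        cpus = [int(c) for c in match.group(2).split(",")]
        if part in table:
            raise ValueError(f"partition {part} listed twice")
        overlap = seen.intersection(cpus)
        if overlap:
            raise ValueError(f"CPUs in more than one partition: {sorted(overlap)}")
        table[part] = cpus
        seen.update(cpus)
        pos = match.end()
    return table


def format_table(table):
    """Render the table one partition per line, e.g. '0 : 0 2 4'."""
    return "\n".join(
        f"{part} : {' '.join(str(c) for c in cpus)}"
        for part, cpus in sorted(table.items())
    )


if __name__ == "__main__":
    working = "0[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]"
    rejected = "[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]"
    print(format_table(parse_cpu_pattern(working)))
    try:
        parse_cpu_pattern(rejected)
    except ValueError as err:
        print("rejected:", err)
```

            The pattern without a leading partition number fails at the very first group in this sketch, which mirrors the "Failed to create cptab" message above; the real kernel code may reject it for its own reasons.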
            ys Yang Sheng added a comment -

            Hi, Campbell,

            I note that you have 2 NUMA nodes, so we need to partition explicitly, as below:

            options libcfs cpu_pattern=[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]
            
            

            Or you can use 'modprobe cpu_pattern=[0,2,4,6,8,10]1[12,14,16,18,20,22]2[1,3,5,7,9,11]3[13,15,17,19,21,23]'

            Thanks,
            YangSheng

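            The layout proposed in this comment splits each of the two NUMA nodes into two partitions of six CPUs. A minimal sketch of how such a pattern string could be generated is shown below. Two things are assumptions rather than facts from this ticket: the topology (node 0 holding the even CPUs and node 1 the odd CPUs, inferred only from the CPU lists in the pattern) and the helper name `build_cpu_pattern`. Following the correction later in the thread, every group, including the first, is given an explicit partition number.

```python
def build_cpu_pattern(numa_cpus, parts_per_node):
    """Build a cpu_pattern string from a {numa_node: cpus} mapping.

    Each NUMA node's CPUs are split into `parts_per_node` equal partitions,
    numbered consecutively from 0. Hypothetical helper for illustration only.
    """
    groups = []
    for node in sorted(numa_cpus):
        cpus = sorted(numa_cpus[node])
        if len(cpus) % parts_per_node != 0:
            raise ValueError(
                f"node {node}: {len(cpus)} CPUs do not split evenly "
                f"into {parts_per_node} partitions"
            )
        size = len(cpus) // parts_per_node
        for i in range(parts_per_node):
            groups.append(cpus[i * size:(i + 1) * size])
    return "".join(
        f"{index}[{','.join(str(c) for c in group)}]"
        for index, group in enumerate(groups)
    )


if __name__ == "__main__":
    # Assumed topology: 24 CPUs, even IDs on node 0 and odd IDs on node 1.
    topology = {0: range(0, 24, 2), 1: range(1, 24, 2)}
    print("options libcfs cpu_pattern=" + build_cpu_pattern(topology, 2))
```

            On a real client the node-to-CPU mapping should be taken from the machine itself rather than assumed, and the number of partitions per node is a tuning choice; as noted elsewhere in this thread, there is no detailed standard for CPT configuration.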

            People

              ys Yang Sheng
              cmcl Campbell Mcleay (Inactive)