
clients getting soft lockups on 2.10.7

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7
    • Labels: None
    • Environment: EL 7.4.1708
    • Severity: 3

    Description

      Getting occasional soft lockups on 2.10.7 clients

      kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [ptlrpcd_01_08:11711]

          Activity

            [LU-12194] clients getting soft lockups on 2.10.7
            ys Yang Sheng added a comment -

            Hi, Campbell,

            I am testing the patch in our test cluster. Yes, I think it will land in the next few months. Until then, you can set it up via cpu_pattern.

            Thanks,
            YangSheng
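
For readers following the cpu_pattern suggestion above, here is a minimal sketch of how such a setting might be generated. The thread does not state the pattern actually used at this site, so the CPU ranges below are hypothetical, and the `CPT[cpulist]` syntax and the `options libcfs cpu_pattern=...` form are assumed from the libcfs tuning section of the Lustre manual; check them against the manual for your version before use. The helper names `build_cpu_pattern` and `modprobe_line` are illustrative only.

```python
def build_cpu_pattern(node_cpus):
    """Return a libcfs-style cpu_pattern string with one CPT per NUMA node.

    node_cpus maps a NUMA node id to its CPU list in range notation,
    e.g. {0: "0-7", 1: "8-15"} (hypothetical values, not from the ticket).
    """
    parts = []
    # Number the CPTs 0, 1, 2, ... in ascending NUMA node order.
    for cpt, node in enumerate(sorted(node_cpus)):
        parts.append(f"{cpt}[{node_cpus[node]}]")
    return " ".join(parts)


def modprobe_line(pattern):
    """Format the pattern as a module option line (assumed syntax)."""
    return f'options libcfs cpu_pattern="{pattern}"'


if __name__ == "__main__":
    # Hypothetical two-socket client with 8 cores per NUMA node.
    print(modprobe_line(build_cpu_pattern({0: "0-7", 1: "8-15"})))
```

On Linux, the per-node CPU lists can be read from /sys/devices/system/node/nodeN/cpulist; the sketch takes them as input rather than reading them so that it stays independent of the machine it runs on.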


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            I didn't see a patch in the 2.10.7 -> 2.10.8 changelog that will set NUMA topology. You mentioned it may take some time to get this patched; do you think it may get done within the next few months? I'm just wondering whether to wait for the patches and upgrade.

            Thanks,
            Campbell

            ys Yang Sheng added a comment -

            Hi, Campbell,

            You can refer to the manual at http://doc.lustre.org/lustre_manual.xhtml#dbdoclet.libcfstuning for details. However, we do not yet have a detailed standard for CPT configuration, since it really depends on the situation.

            Thanks,
            Yangsheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            Are you able to confirm what the general rule is for partitioning?

            Thanks,
            Campbell

            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            What is the general rule for setting cpu_npartitions? Is it the number of NUMA node CPUs divided by the number of NUMA nodes?

            Thanks,
            Campbell

            ys Yang Sheng added a comment -

            Hi, Campbell,

            No. The CPT is set automatically, so there is no need to set it manually. We do this for UMA nodes, but it looks like it is not done for NUMA nodes.

            Thanks,
            Yangsheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Thanks Yangsheng. So the proposed patch will be to modify ko2iblnd.conf?

            ys Yang Sheng added a comment -

            Hi, Campbell,

            I think you can apply this change to all clients that might be affected by this issue. I'll try to push a patch to make this change easier, but I think it could take a long time. So can we close this ticket first?

            BTW: you can go back to your original Lustre version to remove the debug patch.

            Thanks,
            Yangsheng


            cmcl Campbell Mcleay (Inactive) added a comment -

            Hi YangSheng,

            I'm not collecting spt_table_data at the moment, but I also haven't seen any soft lockups since the changes were made. So what next from here? Do I just add these options to all clients on 2.10.7? Or is there a patch imminent to prevent the issue with the default CPU topology?

            Kind regards,
            Campbell

            ys Yang Sheng added a comment -

            Hi, Campbell,

            Could you please tell me the status of the site? Are you still collecting spt_table data?

            Thanks,
            YangSheng


            People

              ys Yang Sheng
              cmcl Campbell Mcleay (Inactive)