Details

    • Task
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.0.0
    • RHEL 5.5
      2 MDS servers, 6 OSS servers handling 3 OSTs each (HW config: dual Intel Westmere processors with 6 cores each, 24 GB memory)
      Number of clients: 368
    • 10203

    Description

      Hi ,

      We are running Lustre 2.0 with 2 MDS servers and 6 OSS servers.

      We are facing an issue where the "lfs check servers" output varies from node to node.

      Please find below the outputs from two different clients, taken at the same time.

      We request your help in solving this issue.

      Also attached are the /var/log/messages files from the node showing this error.

      [root@cn367 ~]# lfs check servers

      scratch-OST0011-osc-ffff810c01a9c000 active.

      [root@mgmt00 ~]# lfs check servers

      error: check 'scratch-OST0011-osc-ffff810c056c4000' Resource temporarily unavailable

      Thanks & Regards,
      N.Chakravarthy.
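
      The per-node comparison described above can be sketched in a few lines of Python. This is only an illustration: the two line formats are assumed from the outputs quoted in this report, the helper names (`parse_check_servers`, `diff_views`) are hypothetical, and the trailing hex suffix is assumed to be a per-mount identifier that must be stripped before names from different clients can be compared.

```python
import re

# Line formats assumed from the two outputs quoted in this report.
ACTIVE_RE = re.compile(r"^(?P<target>\S+)\s+active\.$")
ERROR_RE = re.compile(r"^error: check '(?P<target>[^']+)'\s+(?P<reason>.+)$")
# Assumption: the trailing hex string differs per client mount.
SUFFIX_RE = re.compile(r"-[0-9a-f]{8,16}$")


def parse_check_servers(text):
    """Map each target name (suffix stripped) to 'active' or its error text."""
    status = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = ACTIVE_RE.match(line)
        if match:
            status[SUFFIX_RE.sub("", match.group("target"))] = "active"
            continue
        match = ERROR_RE.match(line)
        if match:
            status[SUFFIX_RE.sub("", match.group("target"))] = match.group("reason")
    return status


def diff_views(view_a, view_b):
    """Return targets whose status differs between two clients."""
    targets = sorted(set(view_a) | set(view_b))
    return {
        t: (view_a.get(t, "missing"), view_b.get(t, "missing"))
        for t in targets
        if view_a.get(t) != view_b.get(t)
    }


cn367 = parse_check_servers("scratch-OST0011-osc-ffff810c01a9c000 active.")
mgmt00 = parse_check_servers(
    "error: check 'scratch-OST0011-osc-ffff810c056c4000' "
    "Resource temporarily unavailable"
)
print(diff_views(cn367, mgmt00))
```

      Feeding it the captured output of "lfs check servers" from each client lists only the targets on which the clients disagree, which narrows the problem to specific OSTs and nodes.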

      Attachments

        1. cn363-computenode_messages.rar
          65 kB
        2. HOME_MESSAGES.rar
          957 kB
        3. performance.zip
          61 kB
        4. SCRATCH_MESSAGES.rar
          1.42 MB

        Activity

          [LU-828] Lustre Client Unstable

          Since you are an unsupported customer, the only thing I can suggest is that you upgrade to the latest Lustre 2.1.0 release to determine if this is fixing your problem. To get Whamcloud support for your system, please contact info@whamcloud.com for more information.

          adilger Andreas Dilger added a comment

          Cliff,

          We would appreciate your suggestions on this, since we are in bad shape. Please do the needful.

          prabhu.chakra Chakravarthy N added a comment

          Cliff,

          Just an update on this issue.

          We took downtime on the entire Lustre filesystem, and "ls" and "du" started working again.

          To my understanding, the restart resolved the issue by clearing the recovery state, the open files, and the client cache.

          Could you please suggest a permanent solution for this issue, such as clearing the cache or closing open files automatically, without running lfsck?

          We would appreciate your early help on this.

          prabhu.chakra Chakravarthy N added a comment

          If your infiniband network is dropping packets, that would cause this issue.

          cliffw Cliff White (Inactive) added a comment
          Hi Cliff,

          I have uploaded the following 3 archive files. They contain the syslog messages of the client node where the problem was noticed, and of the MGS, MDS, and OSS server nodes for both the /home and /scratch Lustre filesystems. Could you please look through the logs to see what is causing the issue and the unexpected symptoms reported here (such as "Resource temporarily unavailable" and hanging ls and cat commands)?

          cn363-computenode_messages.rar
          HOME_MESSAGES.rar
          SCRATCH_MESSAGES.rar

          [root@cn367 ~]# lfs check servers
          scratch-OST0011-osc-ffff810c01a9c000 active.

          [root@mgmt00 ~]# lfs check servers
          error: check 'scratch-OST0011-osc-ffff810c056c4000' Resource temporarily unavailable

          Additional symptoms:
          When we try to ls the directory /scratch/qgr/R8, it hangs and the resource becomes unavailable.
          With ls --color=none, we are able to list the files.
          But ls -l --color=none hangs again.
          We are not able to cat the file on which ls -l hangs; other files work fine.

          Thanks & Regards
          -Raghu

          ravibadiger24 Raghavendra Badiger added a comment (edited)

          Syslog messages of lustre MGS,MDS,OSS Server nodes for /scratch lustre filesystem

          ravibadiger24 Raghavendra Badiger added a comment

          Syslog messages of lustre MGS,MDS,OSS Server nodes for /home lustre filesystem

          ravibadiger24 Raghavendra Badiger added a comment

          Sample compute node syslog messages where issue is noticed

          ravibadiger24 Raghavendra Badiger added a comment

          We would appreciate your help ASAP, since the entire production environment is affected.

          Please do the needful.

          prabhu.chakra Chakravarthy N added a comment

          Just an update on this issue: the error appears only when we run "ls" on the Lustre filesystem. We checked on the non-problematic node as well, and the issue remains.

          Please suggest.

          prabhu.chakra Chakravarthy N added a comment

          People

            wc-triage WC Triage
            prabhu.chakra Chakravarthy N