
MDS (2-node DNE) running out of memory and crashing

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Severity: 4
    • Rank (Obsolete): 9223372036854775807

    Description

      2 node DNE MDS
      16 OSS
      2K clients

      An MDS node randomly runs out of memory and hangs.
      We have watched the MDS drain its memory in a matter of a few minutes, many times right after recovering from a previous hang.
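
      One way to see where the memory goes during such a drain is to rank the slab caches in a /proc/slabinfo snapshot, such as the attached slabinfo.txt. The following is only a minimal sketch: it assumes the standard slabinfo 2.x column order (name, active_objs, num_objs, objsize, ...) and estimates each cache's footprint as num_objs * objsize, which ignores per-slab overhead. The function name parse_slabinfo is hypothetical and not part of any Lustre tool.

```python
def parse_slabinfo(text):
    """Return (cache_name, approx_bytes) pairs, largest first.

    Assumes slabinfo 2.x layout: name, active_objs, num_objs, objsize, ...
    The size estimate is num_objs * objsize and ignores slab overhead.
    """
    usage = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, the version banner, and the '#' column header.
        if (not stripped
                or stripped.startswith("slabinfo - version")
                or stripped.startswith("#")):
            continue
        fields = stripped.split()
        if len(fields) < 4:
            continue
        try:
            num_objs = int(fields[2])
            objsize = int(fields[3])
        except ValueError:
            # Not a data row in the expected layout; ignore it.
            continue
        usage.append((fields[0], num_objs * objsize))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage


def top_caches(path, limit=10):
    """Read a slabinfo snapshot from 'path' and return the largest caches."""
    with open(path, "r", errors="replace") as handle:
        return parse_slabinfo(handle.read())[:limit]
```

      Calling top_caches("slabinfo.txt") on snapshots taken a minute apart would show which caches grow while the MDS memory drains.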

      Clients are generating a very large number of Lustre errors containing the string "ptlrpc_expire_one_request". Each node has logged from several hundred thousand to several million such errors. Here are the error counts from some nodes:

      comet-12-31 662616
      comet-10-06 690764
      comet-12-24 720396
      comet-12-25 735659
      comet-12-14 778073
      comet-12-33 840302
      comet-10-10 928322
      comet-12-33 945614
      comet-12-25 992288
      comet-10-15 1131711
      comet-12-25 1147043
      comet-10-07 1160876
      comet-12-30 1180270
      comet-10-03 1387072
      comet-10-02 2515764
      comet-10-02 3371128
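
      Per-node counts like those above can be reproduced from client logs with a short script. This is a sketch under assumptions: it expects syslog-style lines of the form "<month> <day> <time> <host> ..." (the actual layout of the attached logs may differ), and the function names are hypothetical.

```python
import gzip
import re
from collections import Counter

# Assumed syslog layout: "<month> <day> <time> <host> ...".
# Adjust the pattern if the real log lines are formatted differently.
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s")
NEEDLE = "ptlrpc_expire_one_request"


def count_expired_requests(lines):
    """Return a Counter mapping host name to number of matching lines."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


def format_report(counts):
    """Format counts in ascending order, one 'host count' pair per line."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return "\n".join(f"{host} {count}" for host, count in ordered)


def count_in_gzip(path):
    """Count matching lines per host in a gzip-compressed log file."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return count_expired_requests(handle)
```

      For example, print(format_report(count_in_gzip("clients_log.gz"))) would list hosts from the fewest to the most errors, assuming the log matches the layout described above.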

      I am attaching logs from both the clients and the server for one such incident.

      Attachments

        1. clients_log.gz
          622 kB
          Haisong Cai
        2. dmesg_mds.gz
          21 kB
          Haisong Cai
        3. dmesg.out
          396 kB
          Haisong Cai
        4. lustre-log.tgz
          9.35 MB
          Haisong Cai
        5. messages-19-6.gz
          92 kB
          Haisong Cai
        6. slabinfo.txt
          27 kB
          Haisong Cai

        Activity

          [LU-6607] MDS (2-node DNE) running out of memory and crashing
          pjones Peter Jones made changes -
          Resolution New: Won't Fix [ 2 ]
          Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
          pjones Peter Jones made changes -
          End date New: 04/Sep/15
          Start date New: 15/May/15
          pjones Peter Jones made changes -
          Link Original: This issue is related to JFC-10 [ JFC-10 ]
          jfc John Fuchs-Chesney (Inactive) made changes -
          Link New: This issue is related to JFC-10 [ JFC-10 ]
          haisong Haisong Cai (Inactive) made changes -
          Attachment New: dmesg.out [ 18831 ]
          Attachment New: slabinfo.txt [ 18832 ]
          laisiyao Lai Siyao made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]
          pjones Peter Jones made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Lai Siyao [ laisiyao ]
          mdiep Minh Diep made changes -
          Labels New: sdsc
          haisong Haisong Cai (Inactive) created issue -

          People

            laisiyao Lai Siyao
            haisong Haisong Cai (Inactive)
            Votes: 1
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: