  Lustre / LU-6607

MDS (2-node DNE) running out of memory and crashing

Details

    • Bug
    • Resolution: Won't Fix
    • Blocker
    • None
    • Lustre 2.7.0
    • 4
    • 9223372036854775807

    Description

      2-node DNE MDS
      16 OSS
      2K clients

      An MDS node randomly runs out of memory and hangs.
      We have watched the MDS exhaust its memory in a matter of a few minutes, often right after recovering from a previous hang.
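
One way to see where the memory goes while it drains is to rank slab caches by approximate size from a /proc/slabinfo capture. The sketch below is only illustrative: it assumes the slabinfo 2.x column layout (name, active_objs, num_objs, objsize, ...), and the cache names and numbers in `sample` are hypothetical, not taken from the attached slabinfo.txt.

```python
def parse_slabinfo(text):
    """Return a list of (cache_name, approx_bytes), largest first.

    Assumes the slabinfo 2.x layout:
    name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
    """
    usage = []
    for line in text.splitlines():
        # Skip blank lines, the version banner, and the column-header comment.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed line, ignore
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        # Approximate footprint: allocated objects times object size.
        usage.append((name, num_objs * objsize))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage


# Hypothetical sample, for illustration only.
sample = """slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
cache_b 10 10 4096 1 1 : tunables 24 12 8 : slabdata 10 10 0
cache_a 100 200 512 8 1 : tunables 54 27 8 : slabdata 25 25 0
"""

ranked = parse_slabinfo(sample)
for name, approx_bytes in ranked:
    print(f"{name:20s} {approx_bytes / 1024:10.1f} KiB")
```

Running this against captures taken a minute or so apart would show which cache is growing. The estimate ignores per-slab overhead, so it is a lower bound.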

      Clients are generating a huge number of Lustre errors containing the string "ptlrpc_expire_one_request", from several hundred thousand to several million per node. Here are the error counts from some of the nodes:

      comet-12-31 662616
      comet-10-06 690764
      comet-12-24 720396
      comet-12-25 735659
      comet-12-14 778073
      comet-12-33 840302
      comet-10-10 928322
      comet-12-33 945614
      comet-12-25 992288
      comet-10-15 1131711
      comet-12-25 1147043
      comet-10-07 1160876
      comet-12-30 1180270
      comet-10-03 1387072
      comet-10-02 2515764
      comet-10-02 3371128
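
Per-node counts like those above can be tallied with a short script. This is a minimal sketch, not the method used for this report: it assumes syslog-style lines where the hostname is the fourth whitespace-separated field, and the lines in `sample` are made-up placeholders, not excerpts from the attached logs.

```python
from collections import Counter

PATTERN = "ptlrpc_expire_one_request"


def count_expired_requests(lines):
    """Return a Counter mapping hostname -> number of lines containing PATTERN."""
    counts = Counter()
    for line in lines:
        if PATTERN not in line:
            continue
        fields = line.split()
        # Assumed layout: "<month> <day> <time> <host> ..."; skip malformed lines.
        if len(fields) >= 4:
            counts[fields[3]] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "Jan  1 00:00:01 host-a kernel: (ptlrpc_expire_one_request) placeholder text",
    "Jan  1 00:00:02 host-b kernel: (ptlrpc_expire_one_request) placeholder text",
    "Jan  1 00:00:03 host-a kernel: unrelated message",
    "Jan  1 00:00:04 host-a kernel: (ptlrpc_expire_one_request) placeholder text",
]

counts = count_expired_requests(sample)
# Print in ascending order of count, matching the list in this report.
for host, n in sorted(counts.items(), key=lambda item: item[1]):
    print(host, n)
```

For real use, pass an open log file instead of `sample`, since a file object iterates line by line.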

      I am attaching logs from both clients and servers for one such incident.

      Attachments

        1. dmesg_mds.gz
          21 kB
        2. lustre-log.tgz
          9.35 MB
        3. messages-19-6.gz
          92 kB
        4. clients_log.gz
          622 kB
        5. dmesg.out
          396 kB
        6. slabinfo.txt
          27 kB

          People

            Assignee: Lai Siyao (laisiyao)
            Reporter: Haisong Cai (haisong) (Inactive)
            Votes: 1
            Watchers: 6