
Help debug waiting_locks_callback causing client eviction

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.1.3
    • 2
    • 5857

    Description

      We are seeing the following error.

      Dec 13 08:35:39 nbp2-oss1 kernel: LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 351s: evicting client at 10.151.34.219@o2ib ns: filter-nbp2-OST0018_UUID lock: ffff8804c55d8480/0x1ca7e7e6c780ff4d lrc: 3/0,0 mode: PW/PW res: 182889173/0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xd281632991b12020 expref: 6 pid: 19246 timeout 7391670727
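      When many of these messages have to be correlated with client-side logs, it can help to pull the key fields (timer duration, evicted client NID, namespace, lock mode, resource) out of each line. The following is a minimal sketch in Python, assuming the message layout matches the single line quoted above; the regular expression and the `parse_eviction` helper are illustrative names, not part of any Lustre tool, and other Lustre versions may format the message differently.

```python
import re
from typing import Optional

# Pattern derived only from the one eviction line quoted in this ticket.
EVICTION_RE = re.compile(
    r"lock callback timer expired after (?P<seconds>\d+)s: "
    r"evicting client at (?P<client_nid>\S+)\s+"
    r"ns: (?P<namespace>\S+)\s+"
    r"lock: (?P<lock>\S+)\s+"
    r".*?mode: (?P<mode>\S+)\s+"
    r"res: (?P<resource>\S+)"
)

# The server-side message from the description, split for readability.
SAMPLE = (
    "Dec 13 08:35:39 nbp2-oss1 kernel: LustreError: "
    "0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback "
    "timer expired after 351s: evicting client at 10.151.34.219@o2ib "
    "ns: filter-nbp2-OST0018_UUID lock: ffff8804c55d8480/0x1ca7e7e6c780ff4d "
    "lrc: 3/0,0 mode: PW/PW res: 182889173/0 rrc: 5 type: EXT "
    "[0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 "
    "remote: 0xd281632991b12020 expref: 6 pid: 19246 timeout 7391670727"
)


def parse_eviction(line: str) -> Optional[dict]:
    """Return the key fields of an eviction message, or None if no match."""
    match = EVICTION_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["seconds"] = int(fields["seconds"])  # timer duration as a number
    return fields


if __name__ == "__main__":
    print(parse_eviction(SAMPLE))
```

      Running each OSS syslog line through `parse_eviction` gives the client NID and timestamp needed to find the matching client-side messages, such as the dirty page discard warning.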

      With the client evicted we get dirty_page_discards like this.

      Dec 13 08:35:40 r305i3n1 kernel: [1164772.491928] Lustre: 7178:0:(llite_lib.c:2283:ll_dirty_page_discard_warn()) nbp2: dirty page discard: 10.151.26.5@o2ib:/nbp2/fid: [0x5677ca33040:0x2d5:0x0]//mlellis/RunStilt/runs/20120523-Cherskii-d01-WRF-TEST-20121213.15.46.32.UTC/run_d01/Exe/Copy8/cdump may get corrupted (rc -4)

      We have seen this happen at the beginning of a job. We are now running lflush before the start of every job. Could lflush cause this?

      We are still trying to reproduce it and gather additional logs.

        Activity

          [LU-2499] Help debug waiting_locks_callback causing client eviction
          pjones Peter Jones added a comment -

          Thanks Mahmoud

          mhanafi Mahmoud Hanafi added a comment -

          this can be closed

          bobijam Zhenyu Xu added a comment -

          Do you have a detailed log around the time when this issue happens?

          morrone Christopher Morrone (Inactive) added a comment -

          See the scripts directory of this project:

          https://github.com/chaos/lustre-tools-llnl

          It is fairly simple. These days it could be done even shorter if we just used an "lctl set_param".

          We do still use it in the slurm epilog script at the end of every job. We're not seeing that problem. At least not specifically associated with lflush, to the best of my knowledge.

          But "lock callback timer expired" is a very, very common error that we have seen, for many different reasons. Many nodes dropping their locks at the same time could certainly provide the load that uncovers a bug, network problem, or something else. Full logs will be needed to figure out what happened in this case.

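          The comment above does not say which parameter the shorter "lctl set_param" form would set. As a hedged sketch only, the Python wrapper below assumes the goal is to drop a client's cached DLM locks with `lctl set_param ldlm.namespaces.*.lru_size=clear`, the parameter commonly used to clear the client lock LRU. It is not taken from the lflush script itself; the function names and the `namespace_glob` argument are hypothetical, and the command is only executed when `dry_run` is turned off.

```python
import shlex
import subprocess
from typing import List


def build_lock_flush_command(namespace_glob: str = "*") -> List[str]:
    """Build the lctl invocation that clears the lock LRU (assumed intent)."""
    # lctl expands the glob itself, so no shell is involved.
    return ["lctl", "set_param", f"ldlm.namespaces.{namespace_glob}.lru_size=clear"]


def flush_client_locks(namespace_glob: str = "*", dry_run: bool = True) -> str:
    """Return the command as a printable string; run it only if dry_run is False."""
    cmd = build_lock_flush_command(namespace_glob)
    printable = " ".join(shlex.quote(part) for part in cmd)
    if not dry_run:
        # Requires root on a Lustre client; raises CalledProcessError on failure.
        subprocess.run(cmd, check=True)
    return printable


if __name__ == "__main__":
    # Dry run by default: only shows what would be executed.
    print(flush_client_locks())
```

          Keeping the dry run as the default makes it easy to log the exact command from a job prolog or epilog before enabling it on production clients.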
          pjones Peter Jones added a comment -

          Chris

          I think that you were involved in the creation of lflush. Is LLNL still using this tool on your 2.1.x production system? If so, have you ever seen any errors of this nature as a result?

          Peter

          pjones Peter Jones added a comment -

          Bobijam

          lflush is a tool produced by LLNL. You may find some information on it by Googling. Could you please look into what conditions would trigger this error and the possible reasons?

          Thanks

          Peter

          People

            bobijam Zhenyu Xu
            mhanafi Mahmoud Hanafi