  Lustre / LU-11895

CPU lockup in LNetMDUnlink

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Environment: CentOS 7.6, kernel 3.10.0-957.1.3.el7_lustre.x86_64, MOFED 4.5
    • Severity: 3

    Description

      This just occurred on a Lustre 2.12.0 client running Robinhood. The machine has many cores: 2 x AMD EPYC 7401, so 96 CPU threads in total.

      [1488057.711176] CPU: 63 PID: 54246 Comm: ptlrpcd_07_06 Kdump: loaded Tainted: G           OEL ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
      [1488057.711177] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.3.6 04/20/2018
      [1488057.711178] task: ffff8abafc522080 ti: ffff8abaf4f54000 task.ti: ffff8abaf4f54000
      [1488057.711182] RIP: 0010:[<ffffffff897121e6>]  [<ffffffff897121e6>] native_queued_spin_lock_slowpath+0x126/0x200
      [1488057.711183] RSP: 0018:ffff8abaf4f57b78  EFLAGS: 00000246
      [1488057.711183] RAX: 0000000000000000 RBX: ffffffffc0ff5a97 RCX: 0000000001f90000
      [1488057.711184] RDX: ffff8a9affa9b780 RSI: 0000000000910000 RDI: ffff8adafa7b1b00
      [1488057.711185] RBP: ffff8abaf4f57b78 R08: ffff8afb3f9db780 R09: 0000000000000000
      [1488057.711185] R10: 0000000000000000 R11: 000000000000000f R12: ffff8aed569e2a00
      [1488057.711186] R13: 0005c4aa17c51860 R14: ffff8aed24f64b00 R15: 0000000000000007
      [1488057.711187] FS:  00007ea1d389b700(0000) GS:ffff8afb3f9c0000(0000) knlGS:0000000000000000
      [1488057.711188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1488057.711189] CR2: 00007ecf3e8c7ff6 CR3: 000000276ae10000 CR4: 00000000003407e0
      [1488057.711190] Call Trace:
      [1488057.711192]  [<ffffffff89d5bfcb>] queued_spin_lock_slowpath+0xb/0xf
      [1488057.711193]  [<ffffffff89d6a480>] _raw_spin_lock+0x20/0x30
      [1488057.711201]  [<ffffffffc0c23418>] cfs_percpt_lock+0x58/0x110 [libcfs]
      [1488057.711211]  [<ffffffffc0c889d8>] LNetMDUnlink+0x78/0x180 [lnet]
      [1488057.711250]  [<ffffffffc0f276bf>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
      [1488057.711287]  [<ffffffffc0f2c35e>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
      [1488057.711324]  [<ffffffffc0f2c83f>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
      [1488057.711362]  [<ffffffffc0f5cc5c>] ptlrpcd+0x28c/0x550 [ptlrpc]
      [1488057.711364]  [<ffffffff896d67b0>] ? wake_up_state+0x20/0x20
      [1488057.711402]  [<ffffffffc0f5c9d0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
      [1488057.711404]  [<ffffffff896c1c31>] kthread+0xd1/0xe0
      [1488057.711406]  [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40
      [1488057.711408]  [<ffffffff89d74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1488057.711410]  [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40
      

      Many CPUs were stuck in LNetMDUnlink, and the server eventually crashed with a hard lockup. A vmcore is available on request; vmcore-dmesg.txt is attached.
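
To get a rough idea of how many CPUs were reported in a given function, the attached log can be scanned for backtraces that contain that symbol. The following is only a sketch: the helper name count_cpus_in_symbol is hypothetical, and it assumes every backtrace is preceded by a "CPU: <n> PID: <pid>" header line, as in the excerpt above. Interleaved output from several CPUs could be misattributed.

```shell
#!/bin/sh
# Hypothetical helper: count distinct CPUs whose backtrace mentions a symbol.
# Assumes each trace starts with a "CPU: <n> PID: <pid>" header line,
# as in the excerpt above.
count_cpus_in_symbol() {
    # $1 = symbol name (e.g. LNetMDUnlink), $2 = log file
    awk -v sym="$1" '
        / CPU: [0-9]+ PID: / {
            # Remember the CPU number of the trace that follows.
            cpu = $0
            sub(/.* CPU: /, "", cpu)
            sub(/ .*/, "", cpu)
        }
        # Frames look like "symbol+0xOFF/0xLEN", so match "symbol+".
        cpu != "" && index($0, sym "+") { seen[cpu] = 1 }
        END { n = 0; for (c in seen) n++; print n }
    ' "$2"
}

# Run only when a log file is given on the command line.
if [ -n "${1:-}" ]; then
    count_cpus_in_symbol LNetMDUnlink "$1"
fi
```

Saved as a script (any file name), it would be run with the log as its only argument, for example vmcore-dmesg.txt. A CPU that is reported several times is counted once.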

      I kept the default lru_size/lru_max_age values (0/3900000) on this server, so I'll try reducing them as follows:

      • lru_size=100
      • lru_max_age=1200
        as on our Lustre 2.10 Robinhood server for Oak.
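
A minimal sketch of how these two values might be applied on the client, assuming the usual lctl set_param interface and the ldlm.namespaces.* parameter tree. The apply_param helper and the DRY_RUN variable are hypothetical, introduced only for illustration. The values are the ones listed above; the unit of lru_max_age has differed between Lustre releases, so it is worth checking against the running version before applying.

```shell
#!/bin/sh
# Sketch: apply the LDLM LRU settings listed above to all namespaces.
# DRY_RUN (hypothetical knob) defaults to 1, which only prints the commands.
LRU_SIZE=100
LRU_MAX_AGE=1200   # value taken from the report; unit depends on the release
DRY_RUN="${DRY_RUN:-1}"

apply_param() {
    # $1 = parameter name under ldlm.namespaces.*, $2 = value
    if [ "$DRY_RUN" = "1" ]; then
        echo "lctl set_param ldlm.namespaces.*.$1=$2"
    else
        lctl set_param "ldlm.namespaces.*.$1=$2"
    fi
}

apply_param lru_size "$LRU_SIZE"
apply_param lru_max_age "$LRU_MAX_AGE"
```

Running it with DRY_RUN=0 as root would issue the commands for real. Settings made with a plain lctl set_param do not survive a remount, so they would need to be reapplied (or made persistent by other means) after the client is restarted.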

      Any other recommendations are welcome.

      Thanks!
      Stephane

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: sthiell Stephane Thiell