Lustre / LU-17207

LNet: race between monitor thread stopping and discovery thread PUSH may cause a crash

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Severity: 3

    Description

      This crash showed up in Janitor testing for https://review.whamcloud.com/#/c/fs/lustre-release/+/52522/ (LU-17103):

      The latest Lustre function in the backtrace is lnet_attach_rsp_tracker:

      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffffa01e3acc>] lnet_attach_rsp_tracker.isra.29+0xcc/0x1a0 [lnet]
      PGD 0
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) veth crc32_generic crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs pcspkr i2c_piix4 i2c_core binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata [last unloaded: libcfs]
      CPU: 3 PID: 8711 Comm: lnet_discovery Kdump: loaded Tainted: G           OE  ------------   3.10.0-7.9-debug #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
      task: ffff8800aa96d550 ti: ffff8800b4348000 task.ti: ffff8800b4348000
      RIP: 0010:[<ffffffffa01e3acc>]  [<ffffffffa01e3acc>] lnet_attach_rsp_tracker.isra.29+0xcc/0x1a0 [lnet]
      RSP: 0018:ffff8800b434bcb8  EFLAGS: 00010282
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000017
      RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800b639d380
      RBP: ffff8800b434bce8 R08: 00000000d5555b28 R09: 632e65766f6d2d62
      R10: 0000000000000180 R11: ffff8800b434bb86 R12: ffff8800b639d380
      R13: ffff8800b639d380 R14: 0000000000000000 R15: 0000007175e68a80
      FS:  0000000000000000(0000) GS:ffff88013e380000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 00000000b6238000 CR4: 00000000000006e0
      Call Trace:
       [<ffffffffa01eaf8a>] LNetPut+0x29a/0x9c0 [lnet]
       [<ffffffffa01fd7bc>] lnet_peer_send_push+0x2ec/0x440 [lnet]
       [<ffffffffa0206760>] ? lnet_discovery_event_reply+0xc70/0xc70 [lnet]
       [<ffffffffa0208488>] lnet_peer_discovery+0x4a8/0x1710 [lnet]
       [<ffffffff817e8dce>] ? _raw_spin_unlock_irq+0xe/0x30
       [<ffffffff817e60fa>] ? __schedule+0x32a/0x7d0
       [<ffffffff810bb2a0>] ? wake_up_atomic_t+0x30/0x30
       [<ffffffffa0207fe0>] ? lnet_peer_merge_data+0x1230/0x1230 [lnet]
       [<ffffffff810ba114>] kthread+0xe4/0xf0
       [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140
       [<ffffffff817f3e5d>] ret_from_fork_nospec_begin+0x7/0x21
       [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140

      As pointed out by Chris Horn, this may be the result of the discovery thread issuing a push while the monitor thread is stopping:

      "Monitor thread is stopping and the ln_mt_resendqs are freed. Discovery then wakes and tries to issue push which attempts to dereference ln_mt_resendqs"

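The race described above can be sketched in user-space C. This is only an illustration under stated assumptions, not the fix that landed for this ticket: struct sketch_net, its fields, and the sketch_* functions are hypothetical stand-ins, and a pthread mutex stands in for whatever locking LNet actually uses. The idea shown is that the stopping side clears a "running" flag and detaches the queues under a lock before freeing them, and the push side checks that flag under the same lock instead of dereferencing the queues unconditionally.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the LNet state involved; not the real layout. */
struct sketch_net {
	pthread_mutex_t lock;   /* protects the two fields below */
	bool monitor_running;   /* cleared before the queues are freed */
	int *resendqs;          /* stand-in for ln_mt_resendqs */
};

/* Monitor start: allocate the queues, then publish them under the lock. */
int sketch_monitor_start(struct sketch_net *net, size_t nqueues)
{
	int *queues = calloc(nqueues, sizeof(*queues));

	if (queues == NULL)
		return -ENOMEM;

	pthread_mutex_lock(&net->lock);
	net->resendqs = queues;
	net->monitor_running = true;
	pthread_mutex_unlock(&net->lock);
	return 0;
}

/* Monitor stop: detach the queues under the lock, free them afterwards. */
void sketch_monitor_stop(struct sketch_net *net)
{
	int *queues;

	pthread_mutex_lock(&net->lock);
	net->monitor_running = false;
	queues = net->resendqs;
	net->resendqs = NULL;
	pthread_mutex_unlock(&net->lock);

	free(queues);
}

/*
 * Discovery push: refuse to touch the queues once the monitor has stopped.
 * The caller is assumed to pass a queue index below the nqueues given to
 * sketch_monitor_start().
 */
int sketch_discovery_push(struct sketch_net *net, size_t queue)
{
	int rc = 0;

	pthread_mutex_lock(&net->lock);
	if (!net->monitor_running || net->resendqs == NULL)
		rc = -ECANCELED;        /* monitor is gone; skip the push */
	else
		net->resendqs[queue]++; /* stand-in for attaching a tracker */
	pthread_mutex_unlock(&net->lock);

	return rc;
}
```

In this sketch a push attempted after sketch_monitor_stop() returns -ECANCELED rather than dereferencing a NULL queue pointer, which is the failure mode the oops above shows in lnet_attach_rsp_tracker.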
            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ssmirnov Serguei Smirnov
              Votes: 0
              Watchers: 6
