LU-9817: Multi-Rail Crash on message free


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.1, Lustre 2.11.0
    • Affects Version/s: Lustre 2.10.0
    • Labels: None
    • Severity: 3

    Description

      [ 3773.614054] BUG: unable to handle kernel paging request at ffff88007d283e58
      [ 3773.614736] IP: [<ffffffffa02a6e41>] lnet_return_tx_credits_locked+0x1f1/0x480 [lnet]
      [ 3773.615880] PGD 2e75067 PUD bcc1a067 PMD bca30067 PTE 800000007d283060
      [ 3773.616481] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [ 3773.617034] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) loop mbcache jbd2 rpcsec_gss_krb5 ata_generic syscopyarea pata_acpi sysfillrect sysimgblt ttm drm_kms_helper i2c_piix4 drm ata_piix serio_raw virtio_blk pcspkr i2c_core libata virtio_console virtio_balloon floppy nfsd ip_tables [last unloaded: libcfs]
      [ 3773.621922] CPU: 0 PID: 28885 Comm: socknal_sd00_01 Tainted: G           OE  ------------   3.10.0-debug #1
      [ 3773.622981] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 3773.623523] task: ffff880082b3aa00 ti: ffff880073cd0000 task.ti: ffff880073cd0000
      [ 3773.624614] RIP: 0010:[<ffffffffa02a6e41>]  [<ffffffffa02a6e41>] lnet_return_tx_credits_locked+0x1f1/0x480 [lnet]
      [ 3773.625715] RSP: 0018:ffff880073cd3cd0  EFLAGS: 00010282
      [ 3773.626390] RAX: 0000000000000000 RBX: ffff88006433ee00 RCX: 000000000d6a0d68
      [ 3773.626961] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88006671b280
      [ 3773.627526] RBP: ffff880073cd3d00 R08: 0000000000000000 R09: 0000000000000000
      [ 3773.628099] R10: 0000000000000000 R11: ffff880082b3b2d8 R12: ffff88006a839e00
      [ 3773.628664] R13: ffff880073c1ee00 R14: ffff88006a839ea8 R15: ffff88007d283e10
      [ 3773.629237] FS:  0000000000000000(0000) GS:ffff8800bc600000(0000) knlGS:0000000000000000
      [ 3773.643338] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 3773.643946] CR2: ffff88007d283e58 CR3: 000000002907b000 CR4: 00000000000006f0
      [ 3773.644521] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 3773.645100] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 3773.645674] Stack:
      [ 3773.646158]  ffff88007d283e00 ffff88006433ee00 0000000000000000 ffff880020a2cf80
      [ 3773.647203]  0000000000000000 ffff88006433ee10 ffff880073cd3d30 ffffffffa02995db
      [ 3773.648244]  ffff88006433ee00 0000000000000000 ffff880020a2cf80 ffff880020a2cf88
      [ 3773.649287] Call Trace:
      [ 3773.649789]  [<ffffffffa02995db>] lnet_msg_decommit+0xeb/0x700 [lnet]
      [ 3773.651054]  [<ffffffffa0299f89>] lnet_finalize+0x1e9/0x690 [lnet]
      [ 3773.651612]  [<ffffffffa01c8fe5>] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
      [ 3773.652215]  [<ffffffffa01cdc64>] ksocknal_scheduler+0x234/0x680 [ksocklnd]
      [ 3773.652784]  [<ffffffff810a4090>] ? wake_up_atomic_t+0x30/0x30
      [ 3773.653408]  [<ffffffffa01cda30>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
      [ 3773.654011]  [<ffffffff810a2eda>] kthread+0xea/0xf0
      [ 3773.654551]  [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140
      [ 3773.655145]  [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
      [ 3773.655684]  [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140
      [ 3773.656497] Code: 01 0f 84 1d 02 00 00 0f b7 43 58 89 c1 66 41 33 4f 48 66 f7 c1 fe ff 0f 85 d5 00 00 00 48 8b 7d d0 be 01 00 00 00 e8 6f cc ff ff <41> 0f b7 47 48 89 c2 66 33 53 58 66 f7 c2 fe ff 0f 84 7c fe ff
      [ 3773.658705] RIP  [<ffffffffa02a6e41>] lnet_return_tx_credits_locked+0x1f1/0x480 [lnet]
      [ 3773.659774]  RSP <ffff880073cd3cd0>
      [ 3773.660318] CR2: ffff88007d283e58
      

      Suspect code

      if (msg2->msg_tx_cpt != msg->msg_tx_cpt) {
              lnet_net_unlock(msg->msg_tx_cpt);
              lnet_net_lock(msg2->msg_tx_cpt);
      }
      (void) lnet_post_send_locked(msg2, 1);
      if (msg2->msg_tx_cpt != msg->msg_tx_cpt) {
              lnet_net_unlock(msg2->msg_tx_cpt);
              lnet_net_lock(msg->msg_tx_cpt);
      }
      

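      By the time lnet_post_send_locked() returns, lnet_finalize() could already have been called on msg2, resulting in it being freed. The second msg2->msg_tx_cpt comparison then reads freed memory, which is illegal. This is consistent with the oops above: it is a read fault in lnet_return_tx_credits_locked() on a kernel with DEBUG_PAGEALLOC enabled, where freed pages are unmapped.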
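
One common way to avoid this class of use-after-free is to copy the needed field into a local variable before making the call that may free the object. The following is a minimal user-space sketch of that pattern, not the patch that landed for this issue. Every name in it (demo_msg, demo_net_lock, demo_net_unlock, demo_post_send_locked, demo_return_credits) is a hypothetical stand-in, and the "lock" is just an integer that records which CPT is held.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct lnet_msg; only the field used here. */
struct demo_msg {
    int msg_tx_cpt;
};

/* Records which CPT "lock" is currently held; -1 means none. */
static int current_lock = -1;

static void demo_net_lock(int cpt)
{
    current_lock = cpt;
}

static void demo_net_unlock(int cpt)
{
    if (current_lock == cpt)
        current_lock = -1;
}

/*
 * Stand-in for lnet_post_send_locked(). It always frees the message
 * to model the worst case where the message is finalized during the call.
 */
static void demo_post_send_locked(struct demo_msg *msg)
{
    free(msg);
}

/*
 * Models the lock hand-off around the send. msg_cpt is the CPT of the
 * message being finalized. Returns the CPT whose lock is held on exit.
 */
static int demo_return_credits(int msg_cpt, struct demo_msg *msg2)
{
    int msg2_cpt;

    assert(msg2 != NULL);
    msg2_cpt = msg2->msg_tx_cpt;    /* cache before msg2 can be freed */

    demo_net_lock(msg_cpt);
    if (msg2_cpt != msg_cpt) {
        demo_net_unlock(msg_cpt);
        demo_net_lock(msg2_cpt);
    }
    demo_post_send_locked(msg2);    /* msg2 must not be touched after this */
    if (msg2_cpt != msg_cpt) {      /* uses the cached copy, not msg2 */
        demo_net_unlock(msg2_cpt);
        demo_net_lock(msg_cpt);
    }
    return current_lock;
}
```

In this sketch the second comparison never dereferences msg2, so it stays valid even if the message was freed during the send, and the caller still ends up holding the lock for its own CPT.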


            People

              ashehata Amir Shehata (Inactive)
              ashehata Amir Shehata (Inactive)