Lustre / LU-12537

recovery-mds-scale failover_mds: BUG: unable to handle kernel NULL pointer dereference at (null) IP: lnet_attach_rsp_tracker.isra.32+0xad/0x200 [lnet]

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0
    • Affects Version/s: Lustre 2.13.0
    • Labels: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      There is a race when starting LNet. The discovery thread is started before the monitor thread, so a PUT/GET may be issued before the monitor thread has initialized its data structures. This results in the following oops:

      [  297.984001] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  297.987525] IP: [<ffffffffc07db70d>] lnet_attach_rsp_tracker.isra.32+0xad/0x200 [lnet]
      [  297.990999] PGD 0 
      [  297.991954] Oops: 0000 [#1] SMP 
      [  297.993493] Modules linked in: ksocklnd(OE) ptlrpc(OE+) obdclass(OE) lnet(OE) libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) rdma_ucm(OE) mlx5_ib(OE) ib_ucm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_core(OE) rdma_cm(OE) ib_cm(OE) iw_cm(OE) mlx4_ib(OE) ib_core(OE) mlx4_en(OE) ptp pps_core mlx4_core(OE) mlx_compat(OE) devlink cirrus ttm drm_kms_helper syscopyarea ppdev sysfillrect sysimgblt fb_sys_fops drm i2c_piix4 pcspkr joydev i2c_core virtio_balloon parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk ata_piix libata serio_raw virtio_pci virtio_ring virtio floppy
      [  298.031286] CPU: 1 PID: 11179 Comm: lnet_discovery Tainted: G           OE  ------------   3.10.0-693.21.1.x3.2.152.x86_64 #1
      [  298.038606] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  298.042341] task: ffff8800b897cf10 ti: ffff880134cf8000 task.ti: ffff880134cf8000
      [  298.047182] RIP: 0010:[<ffffffffc07db70d>]  [<ffffffffc07db70d>] lnet_attach_rsp_tracker.isra.32+0xad/0x200 [lnet]
      [  298.053925] RSP: 0018:ffff880134cfbcc0  EFLAGS: 00010296
      [  298.057366] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000017
      [  298.061963] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800b8b6df80
      [  298.066598] RBP: ffff880134cfbcf0 R08: 00000000c565cd78 R09: 0000000000000000
      [  298.071218] R10: 0000000000000009 R11: fffffffffffffff0 R12: ffff8800b8b6df80
      [  298.075816] R13: ffff8800b8b6df80 R14: 0000004787d10555 R15: ffff880136265060
      [  298.080443] FS:  0000000000000000(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
      [  298.085693] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  298.089432] CR2: 0000000000000000 CR3: 00000000359c2000 CR4: 00000000000006e0
      [  298.094034] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  298.098707] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  298.103328] Call Trace:
      [  298.104940]  [<ffffffffc07e2751>] LNetGet+0x5d1/0xa80 [lnet]
      [  298.108750]  [<ffffffffc07e2e05>] lnet_send_ping+0x1a5/0x1d0 [lnet]
      [  298.112811]  [<ffffffffc07f24e8>] lnet_peer_send_ping+0xa8/0x250 [lnet]
      [  298.117082]  [<ffffffffc07f7a20>] lnet_peer_discovery+0x800/0x11a0 [lnet]
      [  298.121481]  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
      [  298.123330]  [<ffffffffc07f7220>] ? lnet_peer_merge_data+0xf90/0xf90 [lnet]
      [  298.126429]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [  298.129468]  [<ffffffff810c28a7>] ? finish_task_switch+0x57/0x170
      [  298.135321]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [  298.141100]  [<ffffffff816c4577>] ret_from_fork+0x77/0xb0
      [  298.146445]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [  298.152179] Code: 03 00 e8 17 b2 f7 ff 4d 3b 24 24 74 11 4c 89 e7 e8 79 89 b6 c0 4d 89 24 24 4d 89 64 24 08 48 8b 05 01 8b 03 00 48 63 d3 4c 89 e7 <48> 8b 14 d0 48 8b 72 08 e8 96 88 b6 c0 48 8b 3d 9f 88 03 00 89 
      [  298.173797] RIP  [<ffffffffc07db70d>] lnet_attach_rsp_tracker.isra.32+0xad/0x200 [lnet]
      [  298.182665]  RSP <ffff880134cfbcc0>
      [  298.187675] CR2: 0000000000000000
      

      The oops was here:

      /usr/src/debug/lustre-2.12.0.1_cray_136_g0e9df14/lnet/lnet/lib-move.c: 4623
      0xffffffffc07d670d <lnet_attach_rsp_tracker+173>:   mov    (%rax,%rdx,8),%rdx

      Which is:

      list_add_tail(&local_rspt->rspt_on_list, the_lnet.ln_mt_rstq[cpt]);

      And ln_mt_rstq is NULL:

      crash_x86_64> the_lnet | grep ln_mt_rstq
        ln_mt_rstq = 0x0,
      crash_x86_64>

      The startup code starts the discovery thread before the monitor thread:

              rc = lnet_peer_discovery_start();
              if (rc != 0)
                      goto err_destroy_push_target;

              rc = lnet_monitor_thr_start();
              if (rc != 0)
                      goto err_stop_discovery_thr;
      
      crash_x86_64> ps | grep '>'
      > 11163  11162   1  ffff880134cd8fd0  RU   0.0   18876   1032  modprobe
      > 11176      2   0  ffff880135efcf10  RU   0.0       0      0  [lnet_discovery]
      crash_x86_64>
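
      The ordering problem described above can be illustrated with a small userspace sketch. It is not the LNet code and not necessarily the fix that landed for this ticket. The names `mt_rstq`, `monitor_start`, `monitor_stop`, `attach_rsp_tracker`, `startup`, and the constant `NCPTS` are hypothetical stand-ins for `the_lnet.ln_mt_rstq`, the monitor thread setup, and `lnet_attach_rsp_tracker()`. The sketch assumes the response tracker queues are allocated only when the monitor is started, and it returns an error where the kernel dereferenced the NULL array.

```c
#include <stddef.h>
#include <stdlib.h>

#define NCPTS 4 /* hypothetical number of CPU partitions */

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->next = head;
	item->prev = head->prev;
	head->prev->next = item;
	head->prev = item;
}

/* Stand-in for the_lnet.ln_mt_rstq: NULL until the monitor has started. */
static struct list_head **mt_rstq;

/* Hypothetical teardown; safe on a NULL or partially built array. */
static void monitor_stop(void)
{
	int i;

	if (mt_rstq == NULL)
		return;
	for (i = 0; i < NCPTS; i++)
		free(mt_rstq[i]);
	free(mt_rstq);
	mt_rstq = NULL;
}

/* Hypothetical monitor setup: allocates one queue head per partition. */
static int monitor_start(void)
{
	int i;

	mt_rstq = calloc(NCPTS, sizeof(*mt_rstq));
	if (mt_rstq == NULL)
		return -2;
	for (i = 0; i < NCPTS; i++) {
		mt_rstq[i] = malloc(sizeof(**mt_rstq));
		if (mt_rstq[i] == NULL) {
			monitor_stop();
			return -2;
		}
		list_init(mt_rstq[i]);
	}
	return 0;
}

/* Returns -1 in the state seen in the crash dump (queue array is NULL). */
static int attach_rsp_tracker(struct list_head *rspt, int cpt)
{
	if (mt_rstq == NULL)
		return -1;
	list_add_tail(rspt, mt_rstq[cpt]);
	return 0;
}

/*
 * Simulates one startup. With monitor_first == 0 the "discovery ping" is
 * attempted before the queues exist, as in the report; with
 * monitor_first != 0 the queues are set up before any ping is sent.
 */
int startup(int monitor_first)
{
	struct list_head rspt;
	int mon_rc = 0;
	int rc;

	list_init(&rspt);
	if (monitor_first)
		mon_rc = monitor_start();
	rc = attach_rsp_tracker(&rspt, 0);
	if (!monitor_first)
		mon_rc = monitor_start();
	monitor_stop();
	return mon_rc != 0 ? mon_rc : rc;
}
```

      In this sketch, `startup(0)` reproduces the reported order and fails at the attach step, while `startup(1)` sets up the queues first and succeeds. A real fix could equally keep the thread start order and initialize the queues earlier, or have the send path check that the monitor state is ready.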
      

          People

            Assignee: hornc Chris Horn
            Reporter: hornc Chris Horn