Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.13.0
-
None
-
3
-
9223372036854775807
Description
A race of sorts when starting LNet. The discovery thread is started before the monitor thread, so we may have PUT/GET issued before the monitor thread has initialized its data structures. Results in oops:
[ 297.984001] BUG: unable to handle kernel NULL pointer dereference at (null) [ 297.987525] IP: [<ffffffffc07db70d>] lnet_attach_rsp_tracker.isra.32+0xad/0x200 [lnet] [ 297.990999] PGD 0 [ 297.991954] Oops: 0000 [#1] SMP [ 297.993493] Modules linked in: ksocklnd(OE) ptlrpc(OE+) obdclass(OE) lnet(OE) libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) rdma_ucm(OE) mlx5_ib(OE) ib_ucm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_core(OE) rdma_cm(OE) ib_cm(OE) iw_cm(OE) mlx4_ib(OE) ib_core(OE) mlx4_en(OE) ptp pps_core mlx4_core(OE) mlx_compat(OE) devlink cirrus ttm drm_kms_helper syscopyarea ppdev sysfillrect sysimgblt fb_sys_fops drm i2c_piix4 pcspkr joydev i2c_core virtio_balloon parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk ata_piix libata serio_raw virtio_pci virtio_ring virtio floppy [ 298.031286] CPU: 1 PID: 11179 Comm: lnet_discovery Tainted: G OE ------------ 3.10.0-693.21.1.x3.2.152.x86_64 #1 [ 298.038606] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 298.042341] task: ffff8800b897cf10 ti: ffff880134cf8000 task.ti: ffff880134cf8000 [ 298.047182] RIP: 0010:[<ffffffffc07db70d>] [<ffffffffc07db70d>] lnet_attach_rsp_tracker.isra.32+0xad/0x200 [lnet] [ 298.053925] RSP: 0018:ffff880134cfbcc0 EFLAGS: 00010296 [ 298.057366] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000017 [ 298.061963] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800b8b6df80 [ 298.066598] RBP: ffff880134cfbcf0 R08: 00000000c565cd78 R09: 0000000000000000 [ 298.071218] R10: 0000000000000009 R11: fffffffffffffff0 R12: ffff8800b8b6df80 [ 298.075816] R13: ffff8800b8b6df80 R14: 0000004787d10555 R15: ffff880136265060 [ 298.080443] FS: 0000000000000000(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000 [ 298.085693] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 298.089432] CR2: 0000000000000000 CR3: 00000000359c2000 CR4: 00000000000006e0 [ 298.094034] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 298.098707] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 298.103328] Call Trace: [ 298.104940] [<ffffffffc07e2751>] LNetGet+0x5d1/0xa80 [lnet] [ 298.108750] [<ffffffffc07e2e05>] lnet_send_ping+0x1a5/0x1d0 [lnet] [ 298.112811] [<ffffffffc07f24e8>] lnet_peer_send_ping+0xa8/0x250 [lnet] [ 298.117082] [<ffffffffc07f7a20>] lnet_peer_discovery+0x800/0x11a0 [lnet] [ 298.121481] [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30 [ 298.123330] [<ffffffffc07f7220>] ? lnet_peer_merge_data+0xf90/0xf90 [lnet] [ 298.126429] [<ffffffff810b4031>] kthread+0xd1/0xe0 [ 298.129468] [<ffffffff810c28a7>] ? finish_task_switch+0x57/0x170 [ 298.135321] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40 [ 298.141100] [<ffffffff816c4577>] ret_from_fork+0x77/0xb0 [ 298.146445] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40 [ 298.152179] Code: 03 00 e8 17 b2 f7 ff 4d 3b 24 24 74 11 4c 89 e7 e8 79 89 b6 c0 4d 89 24 24 4d 89 64 24 08 48 8b 05 01 8b 03 00 48 63 d3 4c 89 e7 <48> 8b 14 d0 48 8b 72 08 e8 96 88 b6 c0 48 8b 3d 9f 88 03 00 89 [ 298.173797] RIP [<ffffffffc07db70d>] lnet_attach_rsp_tracker.isra.32+0xad/0x200 [lnet] [ 298.182665] RSP <ffff880134cfbcc0> [ 298.187675] CR2: 0000000000000000
The oops was here:
/usr/src/debug/lustre-2.12.0.1_cray_136_g0e9df14/lnet/lnet/lib-move.c: 4623 0xffffffffc07d670d <lnet_attach_rsp_tracker+173>: mov (%rax,%rdx,8),%rdx
Which is:
list_add_tail(&local_rspt->rspt_on_list, the_lnet.ln_mt_rstq[cpt]);
And...:
crash_x86_64> the_lnet | grep ln_mt_rstq ln_mt_rstq = 0x0, crash_x86_64>
rc = lnet_peer_discovery_start(); if (rc != 0) goto err_destroy_push_target; rc = lnet_monitor_thr_start(); if (rc != 0) goto err_stop_discovery_thr;
crash_x86_64> ps | grep '>' > 11163 11162 1 ffff880134cd8fd0 RU 0.0 18876 1032 modprobe > 11176 2 0 ffff880135efcf10 RU 0.0 0 0 [lnet_discovery] crash_x86_64>