[LU-12010] qsd_upd_thread use after free on server shutdown Created: 25/Feb/19  Updated: 25/Feb/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This crash happens at times on server shutdown for me:

[170816.014563] Lustre: server umount lustre-MDT0000 complete
[170820.769998] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
[170820.770106] Lustre: lustre-MDT0001-osp-MDT0002: Connection to lustre-MDT0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[170820.774258] Lustre: Skipped 3 previous similar messages
[170820.775154] Lustre: lustre-MDT0001: Not available for connect from 0@lo (stopping)
[170820.775163] Lustre: Skipped 4 previous similar messages
[170820.788095] LustreError: Skipped 2 previous similar messages
[170821.231402] LustreError: 26613:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8802eab93b40 x1626148381397504/t0(0) o41->lustre-MDT0002-osp-MDT0001@0@lo:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[170821.235073] LustreError: 26613:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 1 previous similar message
[170822.774970] Lustre: 26620:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550816084/real 1550816084]  req@ffff8802fa0f7b40 x1626148381397248/t0(0) o400->MGC192.168.123.169@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1550816091 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[170822.781474] LustreError: 166-1: MGC192.168.123.169@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
[170823.045059] BUG: unable to handle kernel paging request at ffff8802b8f5ddcc
[170823.045059] IP: [<ffffffff813fb685>] do_raw_spin_lock+0x5/0xa0
[170823.045059] PGD 241b067 PUD 33ebfa067 PMD 33ea32067 PTE 80000002b8f5d060
[170823.055931] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[170823.055931] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic sb_edac edac_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_piix4 virtio_balloon virtio_console ip_tables rpcsec_gss_krb5 ata_generic drm_kms_helper pata_acpi ttm crct10dif_pclmul drm ata_piix drm_panel_orientation_quirks crct10dif_common serio_raw crc32c_intel i2c_core libata virtio_blk floppy [last unloaded: libcfs]
[170823.055931] 
[170823.055931] CPU: 7 PID: 27695 Comm: lquota_wb_lustr Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.6-debug #1
[170823.055931] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[170823.055931] task: ffff8802c3b30c40 ti: ffff8802aa504000 task.ti: ffff8802aa504000
[170823.055931] RIP: 0010:[<ffffffff813fb685>]  [<ffffffff813fb685>] do_raw_spin_lock+0x5/0xa0
[170823.055931] RSP: 0018:ffff8802aa507d70  EFLAGS: 00010002
[170823.055931] RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
[170823.055931] RDX: ffff8802aa507fd8 RSI: 0000000000000003 RDI: ffff8802b8f5ddc8
[170823.055931] RBP: ffff8802aa507d78 R08: 0000000000000010 R09: ffff8802daa98700
[170823.055931] R10: 0000000000000000 R11: 000000000000000f R12: ffff8802b8f5dc00
[170823.055931] R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
[170823.055931] FS:  0000000000000000(0000) GS:ffff88033dbc0000(0000) knlGS:0000000000000000
[170823.055931] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[170823.055931] CR2: ffff8802b8f5ddcc CR3: 0000000001c10000 CR4: 00000000001607e0
[170823.055931] Call Trace:
[170823.055931]  [<ffffffff817b9e30>] _raw_spin_lock_irqsave+0x30/0x40
[170823.055931]  [<ffffffff810c2db3>] __wake_up+0x23/0x50
[170823.055931]  [<ffffffffa0acfbb2>] qsd_upd_thread+0xb02/0xc90 [lquota]
[170823.055931]  [<ffffffff817b6cd0>] ? __schedule+0x410/0xa00
[170823.055931]  [<ffffffff810caae0>] ? wake_up_state+0x20/0x20
[170823.055931]  [<ffffffffa0acf0b0>] ? qsd_upd_add+0xf0/0xf0 [lquota]
[170823.055931]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[170823.055931]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[170823.055931]  [<ffffffff817c4c5d>] ret_from_fork_nospec_begin+0x7/0x21
[170823.055931]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140

The faulting address resolves to:

0x13be0 is in qsd_upd_thread (/home/green/git/lustre-release/lustre/quota/qsd_writeback.c:512).
507				qsd_start_reint_thread(qsd->qsd_type_array[qtype]);
508		}
509		lu_env_fini(env);
510		OBD_FREE_PTR(env);
511		thread_set_flags(thread, SVC_STOPPED);
512		wake_up(&thread->t_ctl_waitq);

So by the time we get to the wake_up() on line 512, the thread structure has already been freed; clearly a race with whatever frees it once SVC_STOPPED becomes visible. (DEBUG_PAGEALLOC unmaps freed pages, which is why the stale t_ctl_waitq access faults in do_raw_spin_lock() instead of silently corrupting memory.)
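
The window is easy to see in a user-space analog. Below is a minimal sketch (pthreads stand in for the kernel waitqueue; the struct and names are illustrative, not Lustre code): the worker publishes the "stopped" flag and then still dereferences the waitqueue, while the stopping side frees the structure as soon as it observes the flag.

/* User-space analog of the race; compile with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <unistd.h>

#define SVC_STOPPED 1

struct thread_ctl {			/* analog of ptlrpc_thread  */
	atomic_int	flags;		/* analog of t_flags        */
	pthread_mutex_t	lock;		/* the waitqueue's spinlock */
	pthread_cond_t	waitq;		/* analog of t_ctl_waitq    */
};

static void *worker(void *arg)
{
	struct thread_ctl *t = arg;

	/* ... main loop, as in qsd_upd_thread() ... */

	atomic_store(&t->flags, SVC_STOPPED);	/* thread_set_flags()  */
	/* If the stopper runs in this window, *t is already freed */
	/* and the wakeup below is a use after free.               */
	pthread_mutex_lock(&t->lock);		/* wake_up() takes the */
	pthread_cond_broadcast(&t->waitq);	/* waitqueue lock here */
	pthread_mutex_unlock(&t->lock);
	return NULL;
}

int main(void)
{
	struct thread_ctl *t = calloc(1, sizeof(*t));
	pthread_t tid;

	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->waitq, NULL);
	pthread_create(&tid, NULL, worker, t);

	/* Stopper: poll the flag, then free immediately; nothing */
	/* guarantees the worker is done touching *t here.        */
	while (atomic_load(&t->flags) != SVC_STOPPED)
		usleep(1000);
	free(t);			/* races with the wakeup above */

	pthread_join(tid, NULL);
	return 0;
}

The essential bug is that setting SVC_STOPPED and touching t_ctl_waitq are two separate steps, while the stop path treats the flag alone as permission to free the structure.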

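One common way to close this kind of window in the kernel, sketched below with hypothetical names (this is not the actual fix for this ticket), is to signal a struct completion as the thread's very last action and have the stop path wait_for_completion() before freeing: unlike an open-coded flag plus wake_up(), a completion is safe to free as soon as the waiter returns, because complete() and wait_for_completion() serialize on the completion's internal lock.

/* Hedged sketch of a teardown-safe handoff; names are hypothetical. */
#include <linux/completion.h>
#include <linux/slab.h>

struct upd_thread_ctl {
	struct completion	stopped;	/* init_completion() at setup */
	/* ... the rest of the per-thread state ... */
};

static int upd_thread_fn(void *arg)
{
	struct upd_thread_ctl *ctl = arg;

	/* ... main loop, exits once asked to stop ... */

	complete(&ctl->stopped);	/* last access to *ctl by this thread */
	return 0;
}

static void upd_thread_stop(struct upd_thread_ctl *ctl)
{
	/* ... request the thread to stop and wake it ... */
	wait_for_completion(&ctl->stopped);	/* thread is done with *ctl */
	kfree(ctl);				/* now safe to free */
}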
