[LU-19008] refcount_t: addition on 0; use-after-free in mdt_hsm_cdt_stop

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.17.0
    • Labels: None
    • Severity: 3

    Description

      It looks like mdt_hsm_cdt_stop stops some kthread too many times.

       [27200.621224] Lustre: DEBUG MARKER: == insanity test 2: Second Failure Mode: MDS/OST Sat May 10 10:33:11 PM UTC 2025 ========================================================== 22:33:11 (1746916391)
      [27222.571920] Lustre: 4857:0:(client.c:2445:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1746916396/real 1746916396]  req@ffff974aa10f9d40 x1831746590615040/t0(0) o400->MGC10.240.40.190@tcp@10.240.40.190@tcp:26/25 lens 224/224 e 0 to 1 dl 1746916412 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 projid:4294967295
      [27222.575331] LustreError: MGC10.240.40.190@tcp: Connection to MGS (at 10.240.40.190@tcp) was lost; in progress operations using this service will fail
      [27223.636590] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds2' ' /proc/mounts || true
      [27223.979889] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
      [27224.154704] Lustre: Failing over lustre-MDT0001
      [27224.162873] ------------[ cut here ]------------
      [27224.163480] refcount_t: addition on 0; use-after-free.
      [27224.164165] WARNING: CPU: 0 PID: 743453 at lib/refcount.c:25 refcount_warn_saturate+0x74/0x110
      [27224.165168] Modules linked in: dm_flakey tls osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc rfkill intel_rapl_msr intel_rapl_common virtio_balloon i2c_piix4 pcspkr joydev dm_mod drm fuse ext4 mbcache jbd2 ata_generic virtio_net crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ata_piix libata virtio_blk net_failover failover serio_raw [last unloaded: dm_flakey]
      [27224.170385] CPU: 0 PID: 743453 Comm: umount Kdump: loaded Tainted: G           OE     -------  ---  5.14.0-427.42.1.el9_4.x86_64 #1
      [27224.171730] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [27224.172408] RIP: 0010:refcount_warn_saturate+0x74/0x110
      [27224.173024] Code: 01 01 e8 ff 7b af ff 0f 0b c3 cc cc cc cc 80 3d ca e9 b1 01 00 75 cb 48 c7 c7 80 2a 58 9e c6 05 ba e9 b1 01 01 e8 dc 7b af ff <0f> 0b c3 cc cc cc cc 80 3d a9 e9 b1 01 00 75 a8 48 c7 c7 58 2a 58
      [27224.174889] RSP: 0018:ffffafa4c7233af0 EFLAGS: 00010282
      [27224.175484] RAX: 0000000000000000 RBX: ffff974ab3d0f000 RCX: 0000000000000027
      [27224.176267] RDX: 0000000000000027 RSI: 0000000000027ffb RDI: ffff974b3fc20848
      [27224.177050] RBP: ffff974a84d51cc0 R08: 0000000000000000 R09: 00000000ffff7fff
      [27224.177820] R10: ffffafa4c7233978 R11: ffffffff9efe7a48 R12: ffff974a84d51cf0
      [27224.178605] R13: ffff974aad8b8000 R14: 00000000fffffff4 R15: ffff974a9fdbb27a
      [27224.179381] FS:  00007fd43be33540(0000) GS:ffff974b3fc00000(0000) knlGS:0000000000000000
      [27224.180274] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [27224.180928] CR2: 00007ffec7444918 CR3: 000000002638c004 CR4: 00000000000606f0
      [27224.181697] Call Trace:
      [27224.182052]  <TASK>
      [27224.182341]  ? show_trace_log_lvl+0x1c4/0x2df
      [27224.182882]  ? show_trace_log_lvl+0x1c4/0x2df
      [27224.183432]  ? kthread_stop+0x176/0x180
      [27224.183936]  ? refcount_warn_saturate+0x74/0x110
      [27224.184469]  ? __warn+0x81/0x110
      [27224.184899]  ? refcount_warn_saturate+0x74/0x110
      [27224.185429]  ? report_bug+0x10a/0x140
      [27224.185903]  ? handle_bug+0x3c/0x70
      [27224.186366]  ? exc_invalid_op+0x14/0x70
      [27224.186826]  ? asm_exc_invalid_op+0x16/0x20
      [27224.187360]  ? refcount_warn_saturate+0x74/0x110
      [27224.187916]  kthread_stop+0x176/0x180
      [27224.188367]  mdt_hsm_cdt_stop+0x12b/0x250 [mdt]
      [27224.189174]  ? lu_context_fini+0xa7/0x190 [obdclass]
      [27224.190135]  mdt_fini+0x98/0x580 [mdt]
      [27224.190609]  mdt_device_fini+0x2b/0xc0 [mdt]
      [27224.191172]  obd_precleanup+0xdc/0x280 [obdclass]
      [27224.191789]  ? class_disconnect_exports+0x131/0x300 [obdclass]
      [27224.192514]  class_cleanup+0x2d7/0x600 [obdclass]
      [27224.193150]  class_process_config+0x1102/0x1ab0 [obdclass]
      [27224.193838]  ? class_manual_cleanup+0x161/0x7a0 [obdclass]
      [27224.194519]  class_manual_cleanup+0x43b/0x7a0 [obdclass]
      [27224.195217]  server_put_super+0x98f/0xb40 [ptlrpc]
      [27224.196434]  generic_shutdown_super+0x74/0x120
      [27224.196986]  kill_anon_super+0x14/0x30
      [27224.197435]  deactivate_locked_super+0x31/0xa0
      [27224.197962]  cleanup_mnt+0x100/0x160
      [27224.198414]  task_work_run+0x5c/0x90
      [27224.198872]  exit_to_user_mode_loop+0x122/0x130
      [27224.199416]  exit_to_user_mode_prepare+0xb6/0x100
      [27224.199965]  syscall_exit_to_user_mode+0x12/0x40
      [27224.200507]  do_syscall_64+0x69/0x90
      [27224.200956]  entry_SYSCALL_64_after_hwframe+0x77/0xe1
      [27224.201550] RIP: 0033:0x7fd43bd0df0b
      [27224.202045] Code: 1b bf 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 e1 be 0e 00 f7 d8
      [27224.203923] RSP: 002b:00007ffec74479f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [27224.204736] RAX: 0000000000000000 RBX: 00005628b4a759d0 RCX: 00007fd43bd0df0b
      [27224.205518] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005628b4a75820
      [27224.206311] RBP: 00005628b4a6fb30 R08: 0000000000000000 R09: 0000000000000000
      [27224.207099] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
      [27224.207867] R13: 00005628b4a75820 R14: 00005628b4a6fc40 R15: 00005628b4a6fb30
      [27224.208641]  </TASK>
      [27224.208952] ---[ end trace a3f94a89480afc5c ]---

      This later culminates in a crash (NULL pointer dereference):

       [27224.253517] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [27224.254285] #PF: supervisor write access in kernel mode
      [27224.254849] #PF: error_code(0x0002) - not-present page
      [27224.255436] PGD 0 P4D 0
      [27224.255768] Oops: 0002 [#1] PREEMPT SMP PTI
      [27224.256246] CPU: 0 PID: 743453 Comm: umount Kdump: loaded Tainted: G        W  OE     -------  ---  5.14.0-427.42.1.el9_4.x86_64 #1
      [27224.257429] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [27224.258058] RIP: 0010:kthread_stop+0x45/0x180
      [27224.258550] Code: 00 f0 0f c1 45 30 85 c0 0f 84 40 01 00 00 8d 50 01 09 c2 0f 88 05 01 00 00 f6 45 36 20 0f 84 12 01 00 00 48 8b 9d a8 0b 00 00 <f0> 80 0b 02 48 89 ef e8 8f f6 ff ff 48 89 ef e8 07 b6 01 00 48 8d
      [27224.260372] RSP: 0018:ffffafa4c7233af8 EFLAGS: 00010246
      [27224.260938] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
      [27224.261701] RDX: 0000000000000027 RSI: 0000000000027ffb RDI: ffff974b3fc20848
      [27224.262461] RBP: ffff974a84d51cc0 R08: 0000000000000000 R09: 00000000ffff7fff
      [27224.263213] R10: ffffafa4c7233978 R11: ffffffff9efe7a48 R12: ffff974a84d51cf0
      [27224.263954] R13: ffff974aad8b8000 R14: 00000000fffffff4 R15: ffff974a9fdbb27a
      [27224.264732] FS:  00007fd43be33540(0000) GS:ffff974b3fc00000(0000) knlGS:0000000000000000
      [27224.265577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [27224.266203] CR2: 0000000000000000 CR3: 000000002638c004 CR4: 00000000000606f0
      [27224.266977] Call Trace:
      [27224.267298]  <TASK>
      [27224.267578]  ? show_trace_log_lvl+0x1c4/0x2df
      [27224.268078]  ? show_trace_log_lvl+0x1c4/0x2df
      [27224.268570]  ? mdt_hsm_cdt_stop+0x12b/0x250 [mdt]
      [27224.269160]  ? __die_body.cold+0x8/0xd
      [27224.269597]  ? page_fault_oops+0x134/0x170
      [27224.270081]  ? kernelmode_fixup_or_oops+0x84/0x110
      [27224.270617]  ? exc_page_fault+0x62/0x150
      [27224.271073]  ? asm_exc_page_fault+0x22/0x30
      [27224.271548]  ? kthread_stop+0x45/0x180
      [27224.271984]  ? kthread_stop+0x176/0x180
      [27224.272443]  mdt_hsm_cdt_stop+0x12b/0x250 [mdt]
      [27224.273006]  ? lu_context_fini+0xa7/0x190 [obdclass]
      [27224.273643]  mdt_fini+0x98/0x580 [mdt]
      [27224.274123]  mdt_device_fini+0x2b/0xc0 [mdt]
      [27224.274645]  obd_precleanup+0xdc/0x280 [obdclass]
      [27224.275249]  ? class_disconnect_exports+0x131/0x300 [obdclass]
      [27224.275945]  class_cleanup+0x2d7/0x600 [obdclass]
      [27224.276544]  class_process_config+0x1102/0x1ab0 [obdclass]
      [27224.277223]  ? class_manual_cleanup+0x161/0x7a0 [obdclass]
      [27224.277888]  class_manual_cleanup+0x43b/0x7a0 [obdclass]
      [27224.278547]  server_put_super+0x98f/0xb40 [ptlrpc]
      [27224.279229]  generic_shutdown_super+0x74/0x120
      [27224.279735]  kill_anon_super+0x14/0x30
      [27224.280181]  deactivate_locked_super+0x31/0xa0
      [27224.280684]  cleanup_mnt+0x100/0x160
      [27224.281115]  task_work_run+0x5c/0x90
      [27224.281538]  exit_to_user_mode_loop+0x122/0x130
      [27224.282052]  exit_to_user_mode_prepare+0xb6/0x100
      [27224.282572]  syscall_exit_to_user_mode+0x12/0x40
      [27224.283096]  do_syscall_64+0x69/0x90
      [27224.283522]  entry_SYSCALL_64_after_hwframe+0x77/0xe1
      [27224.284084] RIP: 0033:0x7fd43bd0df0b
      [27224.284513] Code: 1b bf 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 e1 be 0e 00 f7 d8

      First recorded hit on Oct 8th, 2024, for 2.16.0rc1: https://testing.whamcloud.com/test_sets/d19b194f-c74f-4eba-b81d-406ec4f77cc4

      Hits periodically on master since then (7 occurrences to date).
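
      The traces above are consistent with kthread_stop() being reached for a thread that had already been stopped, so one way to picture the kind of guard that avoids this is a lock that makes "check whether the thread is running, then stop it" a single step. The following is only a userspace sketch using pthreads, not the Lustre patch: struct coordinator, coordinator_start() and coordinator_stop() are hypothetical names invented for illustration, and the real coordinator code may be structured quite differently.

```c
/* Userspace illustration only (pthreads); this is NOT the Lustre code.
 * All names here are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct coordinator {
	pthread_mutex_t lock;        /* serializes start and stop */
	pthread_t       thread;      /* valid only while running is true */
	bool            running;     /* protected by lock */
	atomic_bool     should_stop; /* polled by the worker thread */
	int             stop_count;  /* number of real joins, for demonstration */
};

/* Worker loop: spins until asked to stop. It never takes cdt->lock,
 * so joining it while holding the lock cannot deadlock. */
static void *coordinator_main(void *arg)
{
	struct coordinator *cdt = arg;

	while (!atomic_load(&cdt->should_stop))
		sched_yield();
	return NULL;
}

/* Start the worker if it is not already running. Returns 0 on success. */
static int coordinator_start(struct coordinator *cdt)
{
	int rc = 0;

	pthread_mutex_lock(&cdt->lock);
	if (!cdt->running) {
		atomic_store(&cdt->should_stop, false);
		rc = pthread_create(&cdt->thread, NULL, coordinator_main, cdt);
		if (rc == 0)
			cdt->running = true;
	}
	pthread_mutex_unlock(&cdt->lock);
	return rc;
}

/* Stop the worker at most once. Returns true only for the caller that
 * actually joined the thread; later or concurrent callers see
 * running == false and do nothing, so the thread handle is never
 * touched after it has been released. */
static bool coordinator_stop(struct coordinator *cdt)
{
	bool stopped = false;

	pthread_mutex_lock(&cdt->lock);
	if (cdt->running) {
		int rc;

		atomic_store(&cdt->should_stop, true);
		rc = pthread_join(cdt->thread, NULL);
		assert(rc == 0);
		(void)rc;
		cdt->running = false;
		cdt->stop_count++;
		stopped = true;
	}
	pthread_mutex_unlock(&cdt->lock);
	return stopped;
}
```

      With a coordinator initialized as { .lock = PTHREAD_MUTEX_INITIALIZER }, a second call to coordinator_stop() returns false instead of joining the thread again. Holding the lock across the join is acceptable in this sketch only because the worker never takes the same lock.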

        Activity

          pjones Peter Jones added a comment -

          Merged for 2.17

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/59425/
          Subject: LU-19008 hsm: add locking for coordinator thread stop
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 678e2bb63174cf9b4db0d47e1de671ad44b36643

          "Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/59425
          Subject: LU-19008 hsm: Add locking for thread stop
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 17a0b74669d631e72d5fc7e7bd853c050896df52

          People

            paf0186 Patrick Farrell
            green Oleg Drokin