Lustre / LU-18024

sanity-lsnapshot test_1b: NULL pointer dereference in queue_work via qmt_lvbo_free

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.16.0
    • Fix Version: Lustre 2.16.0

    Description

      This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c1d8852e-126c-4a30-af92-a8fa44082ee9

      test_1b failed with the following error:

      onyx-103vm4 crashed during sanity-lsnapshot test_1b
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4541 - 5.14.0-362.24.1.el9_3.x86_64
      servers: https://build.whamcloud.com/job/lustre-master/4541 - 5.14.0-362.24.1_lustre.el9.x86_64

      For about a month this has been a regular crash in sanity-lsnapshot test_1b. The traces differ somewhat, but they always end up in qmt_lvbo_free and then a NULL pointer dereference in __queue_work:

      [13893.061855] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-lsnapshot test 1b: mount snapshot without original filesystem mounted ========================================================== 08:20:07 \(1718785207\)
      [13893.273668] Lustre: DEBUG MARKER: == sanity-lsnapshot test 1b: mount snapshot without original filesystem mounted ========================================================== 08:20:07 (1718785207)
      [13893.439579] Lustre: DEBUG MARKER: /usr/sbin/lctl snapshot_create -F lustre -n lss_1b_0
      [13895.994830] Lustre: DEBUG MARKER: /usr/sbin/lctl snapshot_list -F lustre -n lss_1b_0 -d
      [13900.112891] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      [13900.441082] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
      [13900.721496] BUG: kernel NULL pointer dereference, address: 0000000000000102
      [13900.722489] #PF: supervisor read access in kernel mode
      [13900.723149] #PF: error_code(0x0000) - not-present page
      [13900.723783] PGD 0 P4D 0 
      [13900.724150] Oops: 0000 [#1] PREEMPT SMP PTI
      [13900.724697] CPU: 0 PID: 225194 Comm: umount Kdump: loaded Tainted: P           OE     -------  ---  5.14.0-362.24.1_lustre.el9.x86_64 #1
      [13900.726105] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [13900.726808] RIP: 0010:__queue_work+0x20/0x370
      [13900.727396] Code: 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 57 41 56 49 89 d6 41 55 41 54 41 89 fc 55 48 89 f5 53 48 83 ec 10 89 7c 24 04 <f6> 86 02 01 00 00 01 0f 85 ac 02 00 00 e8 fe c7 07 00 49 c7 c5 ac
      [13900.729479] RSP: 0018:ffffa5290a5e3938 EFLAGS: 00010082
      [13900.730140] RAX: ffffffffc1cd86b0 RBX: 0000000000000202 RCX: 0000000000000000
      [13900.730998] RDX: ffff990ab188e340 RSI: 0000000000000000 RDI: 0000000000002000
      [13900.731848] RBP: 0000000000000000 R08: ffff990aa7fda8b8 R09: ffffa5290a5e3940
      [13900.732701] R10: 0000000000000101 R11: 000000000000000f R12: 0000000000002000
      [13900.733574] R13: ffff990aa7fda82c R14: ffff990ab188e340 R15: 0000000000000000
      [13900.734429] FS:  00007f1bff822540(0000) GS:ffff990b3fc00000(0000) knlGS:0000000000000000
      [13900.735387] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [13900.736097] CR2: 0000000000000102 CR3: 00000000045f6003 CR4: 00000000001706f0
      [13900.736960] Call Trace:
      [13900.737318]  <TASK>
      [13900.737636]  ? show_trace_log_lvl+0x1c4/0x2df
      [13900.738212]  ? show_trace_log_lvl+0x1c4/0x2df
      [13900.738774]  ? queue_work_on+0x24/0x30
      [13900.739268]  ? __die_body.cold+0x8/0xd
      [13900.739765]  ? page_fault_oops+0x134/0x170
      [13900.740329]  ? kernelmode_fixup_or_oops+0x84/0x110
      [13900.740936]  ? exc_page_fault+0x62/0x150
      [13900.741474]  ? asm_exc_page_fault+0x22/0x30
      [13900.742034]  ? __pfx_qmt_lvbo_free+0x10/0x10 [lquota]
      [13900.742772]  ? __queue_work+0x20/0x370
      [13900.743272]  ? __wake_up_common_lock+0x91/0xd0
      [13900.743851]  queue_work_on+0x24/0x30
      [13900.744325]  qmt_lvbo_free+0xaf/0x160 [lquota]
      [13900.744929]  ldlm_resource_putref+0x18a/0x290 [ptlrpc]
      [13900.745721]  cfs_hash_for_each_relax+0x1ab/0x480 [libcfs]
      [13900.746468]  ? __pfx_ldlm_resource_clean+0x10/0x10 [ptlrpc]
      [13900.747268]  ? __pfx_ldlm_resource_clean+0x10/0x10 [ptlrpc]
      [13900.748069]  cfs_hash_for_each_nolock+0x12e/0x210 [libcfs]
      [13900.748755]  ldlm_namespace_cleanup+0x2b/0xc0 [ptlrpc]
      [13900.749514]  __ldlm_namespace_free+0x58/0x4f0 [ptlrpc]
      [13900.750288]  ldlm_namespace_free_prior+0x5a/0x1f0 [ptlrpc]
      [13900.751093]  mdt_fini+0xd6/0x570 [mdt]
      [13900.751631]  mdt_device_fini+0x2b/0xc0 [mdt]
      [13900.752224]  obd_precleanup+0x1e4/0x220 [obdclass]
      [13900.753213]  class_cleanup+0x2d5/0x600 [obdclass]
      [13900.753885]  class_process_config+0x10c0/0x1bc0 [obdclass]
      [13900.754627]  ? __kmalloc+0x19b/0x370
      [13900.755138]  class_manual_cleanup+0x439/0x7a0 [obdclass]
      [13900.755871]  server_put_super+0x7ee/0xa40 [ptlrpc]
      [13900.756604]  generic_shutdown_super+0x74/0x120
      [13900.757193]  kill_anon_super+0x14/0x30
      [13900.757681]  deactivate_locked_super+0x31/0xa0
      [13900.758272]  cleanup_mnt+0x100/0x160
      [13900.758775]  task_work_run+0x5c/0x90
      [13900.759257]  exit_to_user_mode_loop+0x122/0x130
      [13900.759854]  exit_to_user_mode_prepare+0xb6/0x100
      [13900.760450]  syscall_exit_to_user_mode+0x12/0x40
      [13900.761045]  do_syscall_64+0x69/0x90
      [13900.761515]  ? syscall_exit_to_user_mode+0x22/0x40
      [13900.762130]  ? do_syscall_64+0x69/0x90
      [13900.762619]  ? exc_page_fault+0x62/0x150
      [13900.763134]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lsnapshot test_1b - onyx-103vm4 crashed during sanity-lsnapshot test_1b

      Attachments

        Issue Links

          Activity

            [LU-18024] sanity-lsnapshot test_1b: NULL pointer dereference in queue_work via qmt_lvbo_free
            pjones Peter Jones added a comment -

            Merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56778/
            Subject: LU-18024 quota: fix the order of freeing qmt_lvbo_free_wq
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a0e766235d6d4c045184f0d8a1906a49ad262230

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56778/ Subject: LU-18024 quota: fix the order of freeing qmt_lvbo_free_wq Project: fs/lustre-release Branch: master Current Patch Set: Commit: a0e766235d6d4c045184f0d8a1906a49ad262230

            "Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56778
            Subject: LU-18024 quota: fix the order of freeing qmt_lvbo_free_wq
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 78fba83354320e65ffada65800272f9482e9348b

            gerrit Gerrit Updater added a comment - "Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56778 Subject: LU-18024 quota: fix the order of freeing qmt_lvbo_free_wq Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 78fba83354320e65ffada65800272f9482e9348b

            hongchao.zhang Hongchao Zhang added a comment -

            The new occurrence could be caused by use of "qmt_lvbo_free_wq" after it is freed in qmt_device_fini:

            LustreError: 340202:0:(ldlm_resource.c:1146:ldlm_resource_complain()) mdt-lustre-MDT0000_UUID: namespace resource [0x200000006:0x1020000:0x1f4].0x0 (ffff9fb626cd6480) refcount nonzero (1) after lock cleanup; forcing cleanup.
            LustreError: 340202:0:(lquota_entry.c:130:lqe_iter_cb()) $$$ Inuse quota entry  qmt:lustre-QMT0000 pool:dt-0x0 id:500 enforced:1 hard:56549476 soft:53856644 granted:5513252 time:0 qunit: 1048576 edquot:0 may_rel:0 revoke:0 default:no
            
            static void lqe_cleanup(struct cfs_hash *hash, bool free_all)
            {
                    struct lqe_iter_data    d;
                    int                     repeat = 0;
                    ENTRY;
            retry:
                    memset(&d, 0, sizeof(d));
                    d.lid_free_all = free_all;
            
                    cfs_hash_for_each_safe(hash, lqe_iter_cb, &d);
            
                    /* In most case, when this function is called on master or
                     * slave finalization, there should be no inuse quota entry.
                     *
                     * If the per-fs quota updating thread is still holding
                     * some entries, we just wait for it's finished. */
                    if (free_all && d.lid_inuse) {
                            CDEBUG(D_QUOTA, "Hash:%p has entries inuse: inuse:%lu, "
                                    "freed:%lu, repeat:%u\n", hash,
                                    d.lid_inuse, d.lid_freed, repeat);
                            repeat++;
                    schedule_timeout_interruptible(cfs_time_seconds(1));  <--- waits for the LDLM lock to be canceled
                            goto retry;
                    }
                    EXIT;
            }
            
            static struct lu_device *qmt_device_fini(const struct lu_env *env,
                                                     struct lu_device *ld)
            {
                   ... 
                   if (qmt->qmt_lvbo_free_wq) {
                            destroy_workqueue(qmt->qmt_lvbo_free_wq);
                            qmt->qmt_lvbo_free_wq = NULL;
                    }       
                            
                    /* kill pool instances, if any */
                    qmt_pool_fini(env, qmt);  <--- lqe_cleanup will be called in qmt_pool_fini
                    ...
            }
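
            A minimal sketch of the reordering implied by patch 56778 ("fix the order of freeing qmt_lvbo_free_wq"), based on the snippet above rather than the actual diff: destroy the workqueue only after qmt_pool_fini(), since lqe_cleanup() inside it can still trigger qmt_lvbo_free() -> queue_work() while waiting for LDLM lock cancellation.

            static struct lu_device *qmt_device_fini(const struct lu_env *env,
                                                     struct lu_device *ld)
            {
                    ...
                    /* kill pool instances first: lqe_cleanup() in here may wait
                     * for LDLM locks to be canceled, and that cancel path runs
                     * qmt_lvbo_free() -> queue_work(qmt->qmt_lvbo_free_wq, ...) */
                    qmt_pool_fini(env, qmt);

                    /* only now can the workqueue be torn down safely */
                    if (qmt->qmt_lvbo_free_wq) {
                            destroy_workqueue(qmt->qmt_lvbo_free_wq);
                            qmt->qmt_lvbo_free_wq = NULL;
                    }
                    ...
            }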
            
            yujian Jian Yu added a comment -

            Lustre 2.16.0 RC4 test session details:
            clients: https://build.whamcloud.com/job/lustre-master/4587 - 5.14.0-362.24.1.el9_3.x86_64
            servers: https://build.whamcloud.com/job/lustre-master/4587 - 5.14.0-362.24.1_lustre.el9.x86_64

            sanity test 160a crashed:

            Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
            Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
            Lustre: Failing over lustre-MDT0000 
            LustreError: 340202:0:(ldlm_resource.c:1146:ldlm_resource_complain()) mdt-lustre-MDT0000_UUID: namespace resource [0x200000006:0x1020000:0x1f4].0x0 (ffff9fb626cd6480) refcount nonzero (1) after lock cleanup; forcing cleanup.
            LustreError: 340202:0:(lquota_entry.c:130:lqe_iter_cb()) $$$ Inuse quota entry  qmt:lustre-QMT0000 pool:dt-0x0 id:500 enforced:1 hard:56549476 soft:53856644 granted:5513252 time:0 qunit: 1048576 edquot:0 may_rel:0 revoke:0 default:no
            BUG: kernel NULL pointer dereference, address: 0000000000000102
            #PF: supervisor read access in kernel mode 
            #PF: error_code(0x0000) - not-present page
            PGD 0 P4D 0
            Oops: 0000 [#1] PREEMPT SMP PTI
            CPU: 1 PID: 258344 Comm: ldlm_bl_02 Kdump: loaded Tainted: G        W  OE     -------  ---  5.14.0-362.24.1_lustre.el9.x86_64 #1
            Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            RIP: 0010:__queue_work+0x20/0x370
            Code: 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 57 41 56 49 89 d6 41 55 41 54 41 89 fc 55 48 89 f5 53 48 83 ec 10 89 7c 24 04 <f6> 86 02 01 00 00 01 0f 85 ac 02 00 00 e8 fe c7 07 00 49 c7 c5 ac 
            RSP: 0018:ffffbc80c11e7cc8 EFLAGS: 00010082
            RAX: 0000000000000000 RBX: 0000000000000202 RCX: ffff0a00ffffff04
            RDX: ffff9fb6043f5580 RSI: 0000000000000000 RDI: 0000000000002000
            RBP: 0000000000000000 R08: 000000000000000d R09: ffff9fb703d0ac30
            R10: ffffffffffffffff R11: 000000000000000f R12: 0000000000002000
            R13: ffff9fb60369082c R14: ffff9fb6043f5580 R15: 0000000000000000
            FS:  0000000000000000(0000) GS:ffff9fb6bfd00000(0000) knlGS:0000000000000000
            CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            CR2: 0000000000000102 CR3: 00000000349a6002 CR4: 00000000003706e0
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            Call Trace:
             <TASK>
             ? show_trace_log_lvl+0x1c4/0x2df
             ? show_trace_log_lvl+0x1c4/0x2df
             ? queue_work_on+0x24/0x30
             ? __die_body.cold+0x8/0xd
             ? page_fault_oops+0x134/0x170
             ? exc_page_fault+0x62/0x150
             ? asm_exc_page_fault+0x22/0x30
             ? __queue_work+0x20/0x370
             queue_work_on+0x24/0x30
             qmt_lvbo_free+0x133/0x1b0 [lquota]
             ldlm_resource_putref+0x18a/0x290 [ptlrpc]
             cfs_hash_for_each_relax+0x1ab/0x480 [libcfs]
             ? __pfx_ldlm_reprocess_res+0x10/0x10 [ptlrpc]
             ? __pfx_ldlm_reprocess_res+0x10/0x10 [ptlrpc]
             cfs_hash_for_each_nolock+0x12e/0x210 [libcfs]
             ldlm_reprocess_recovery_done+0x8b/0x100 [ptlrpc]
             ldlm_export_cancel_locks+0x177/0x180 [ptlrpc]
             ldlm_bl_thread_main+0x531/0x640 [ptlrpc]
             ? __pfx_autoremove_wake_function+0x10/0x10
             ? __pfx_ldlm_bl_thread_main+0x10/0x10 [ptlrpc]
             kthread+0xe0/0x100
             ? __pfx_kthread+0x10/0x10
             ret_from_fork+0x2c/0x50
             </TASK>
            Modules linked in: dm_flakey tls obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc intel_rapl_msr intel_rapl_common rfkill virtio_balloon pcspkr i2c_piix4 joydev drm fuse ext4 mbcache jbd2 ata_generic ata_piix crct10dif_pclmul libata crc32_pclmul crc32c_intel virtio_net ghash_clmulni_intel net_failover virtio_blk failover serio_raw [last unloaded: dm_flakey]
            CR2: 0000000000000102
            

            https://testing.whamcloud.com/test_sets/a3779853-11d9-48d9-ac9a-38ede2889ab3 

            pjones Peter Jones added a comment -

            Merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56661/
            Subject: LU-18024 quota: relate qmt_lvbo_free_wq and QMT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 846db2f7afb81e15e13fffdbcc86c5448e41768c

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56661/ Subject: LU-18024 quota: relate qmt_lvbo_free_wq and QMT Project: fs/lustre-release Branch: master Current Patch Set: Commit: 846db2f7afb81e15e13fffdbcc86c5448e41768c

            "Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56661
            Subject: LU-18024 quota: relate qmt_lvbo_free_wq and QMT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4aa91be222f2c2fe9fc770da40ffcc8e82e4d09d

            gerrit Gerrit Updater added a comment - "Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56661 Subject: LU-18024 quota: relate qmt_lvbo_free_wq and QMT Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 4aa91be222f2c2fe9fc770da40ffcc8e82e4d09d
            hongchao.zhang Hongchao Zhang added a comment - edited

            As per the stack trace, the crash should be caused by qmt_lvbo_free_wq being NULL:

            int qmt_lvbo_free(struct lu_device *ld, struct ldlm_resource *res)
            {
                    ...
                    if (res->lr_name.name[LUSTRE_RES_ID_QUOTA_SEQ_OFF] != 0) {
                            struct lquota_entry *lqe = res->lr_lvb_data;
            
                            queue_work(qmt_lvbo_free_wq, &lqe->lqe_work);
                    } else {
                    ...
            }
            
            static void __queue_work(int cpu, struct workqueue_struct *wq,
                                     struct work_struct *work)
            {
                    struct pool_workqueue *pwq;
                    struct worker_pool *last_pool;
                    struct list_head *worklist;
                    unsigned int work_flags;
                    unsigned int req_cpu = cpu;
            
                    /*
                     * While a work item is PENDING && off queue, a task trying to
                     * steal the PENDING will busy-loop waiting for it to either get
                     * queued or lose PENDING.  Grabbing PENDING and queueing should
                     * happen with IRQ disabled.
                     */
                    lockdep_assert_irqs_disabled();
            
            
                    /* if draining, only works from the same workqueue are allowed */
                    if (unlikely(wq->flags & __WQ_DRAINING) &&   <-------------------- the offset of flags is 0x100
                        WARN_ON_ONCE(!is_chained_work(wq)))
                            return;
                    rcu_read_lock();
                    ...
            }
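
            For reference (my own decode of the oops, worth double-checking): the bytes at the <f6> marker in both "Code:" lines are this __WQ_DRAINING test. __WQ_DRAINING is 1 << 16, so the tested bit lives in the byte at flags offset 0x100 + 2:

            f6 86 02 01 00 00 01        testb  $0x1, 0x102(%rsi)    /* rsi = wq */

            With wq == NULL (RSI is 0 in both register dumps), this reads address 0x102, exactly the faulting address reported in CR2.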
            

            the "qmt_lvbo_free_wq" is defined as a static variable of lquota module, it should be released during umounting snapshot
            mount and cause this crash.
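
            For context, patch 56661 ("relate qmt_lvbo_free_wq and QMT") presumably moves the workqueue from a module-static into the QMT device, tying its lifetime to the device that queues work on it. A rough sketch of that direction, with lu2qmt_dev() and the surrounding struct layout assumed for illustration:

            struct qmt_device {
                    ...
                    /* per-device workqueue: its lifetime is now tied to this
                     * QMT instance instead of to the lquota module */
                    struct workqueue_struct *qmt_lvbo_free_wq;
            };

            int qmt_lvbo_free(struct lu_device *ld, struct ldlm_resource *res)
            {
                    struct qmt_device *qmt = lu2qmt_dev(ld);
                    ...
                    if (res->lr_name.name[LUSTRE_RES_ID_QUOTA_SEQ_OFF] != 0) {
                            struct lquota_entry *lqe = res->lr_lvb_data;

                            /* queue on this device's workqueue, not a global one
                             * that another mount (e.g. a snapshot) may have freed */
                            queue_work(qmt->qmt_lvbo_free_wq, &lqe->lqe_work);
                    } else {
                    ...
            }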

            lixi_wc Li Xi added a comment -

            hongchao.zhang Would you please take a look?


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 10
