LU-17974

sanity-quota test_3b: (qmt_lock.c:957:qmt_id_lock_notify()) ASSERTION( lqe->lqe_is_global )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.6
    • Affects Version/s: Lustre 2.15.5
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Minh Diep <mdiep@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e6debda9-ee5f-4175-96f4-71fe2078af95

      test_3b failed with the following error:

      onyx-124vm6 crashed during sanity-quota test_3b
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-b2_15/92 - 4.18.0-477.27.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-b2_15/92 - 4.18.0-477.27.1.el8_lustre.x86_64


      [29347.756593] LustreError: 1786057:0:(qmt_lock.c:957:qmt_id_lock_notify()) ASSERTION( lqe->lqe_is_global ) failed:
      [29347.758422] LustreError: 1786057:0:(qmt_lock.c:957:qmt_id_lock_notify()) LBUG
      [29347.759616] Pid: 1786057, comm: qsd_reint_qpool 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Thu Jun 20 04:13:41 UTC 2024
      [29347.761346] Call Trace TBD:
      [29347.761975] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
      [29347.762930] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
      [29347.763764] [<0>] qmt_id_lock_notify+0x1ee/0x330 [lquota]
      [29347.764777] [<0>] qmt_site_recalc_cb+0x34b/0x550 [lquota]
      [29347.765708] [<0>] cfs_hash_for_each_tight+0x122/0x310 [libcfs]
      [29347.766709] [<0>] qmt_pool_recalc+0x375/0xa80 [lquota]
      [29347.767603] [<0>] kthread+0x134/0x150
      [29347.768319] [<0>] ret_from_fork+0x35/0x40
      [29347.769063] Kernel panic - not syncing: LBUG
      [29347.769804] CPU: 0 PID: 1786057 Comm: qsd_reint_qpool Kdump: loaded Tainted: P OE --------- - - 4.18.0-477.27.1.el8_lustre.x86_64 #1
      [29347.771895] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [29347.772859] Call Trace:
      [29347.773335] dump_stack+0x41/0x60
      [29347.773932] panic+0xe7/0x2ac
      [29347.774485] ? ret_from_fork+0x35/0x40
      [29347.775141] lbug_with_loc.cold.8+0x18/0x18 [libcfs]
      [29347.776002] qmt_id_lock_notify+0x1ee/0x330 [lquota]
      [29347.776858] qmt_site_recalc_cb+0x34b/0x550 [lquota]
      [29347.777706] ? qmt_pool_lqes_lookup_spec+0x340/0x340 [lquota]
      [29347.778678] cfs_hash_for_each_tight+0x122/0x310 [libcfs]
      [29347.779588] qmt_pool_recalc+0x375/0xa80 [lquota]
      [29347.780402] ? __schedule+0x2d9/0x870
      [29347.781050] ? qmt_sarr_get_idx+0x90/0x90 [lquota]
      [29347.781873] ? qmt_sarr_get_idx+0x90/0x90 [lquota]
      [29347.782697] kthread+0x134/0x150
      [29347.783271] ? set_kthread_struct+0x50/0x50
      [29347.783991] ret_from_fork+0x35/0x40
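
      For readers tracing the call chain in the stack above: qmt_pool_recalc() walks every quota entry in the pool's site hash via cfs_hash_for_each_tight(), and the per-entry callback qmt_site_recalc_cb() ends up in qmt_id_lock_notify(), which asserts that the lqe it is handed is global. Below is a minimal user-space C sketch of that iterate-and-notify shape; all structs and helpers here are simplified stand-ins, not the real Lustre API.

      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Simplified stand-in for struct lquota_entry. */
      struct lqe {
              unsigned int qid;
              bool         lqe_is_global;
      };

      /* Per-entry callback, the moral equivalent of qmt_site_recalc_cb(). */
      static int site_recalc_cb(struct lqe *lqe, void *data)
      {
              (void)data;
              /* ... usage recalculation would happen here ... */

              /* qmt_id_lock_notify() then demands a global lqe: */
              assert(lqe->lqe_is_global);   /* the ASSERTION that fired */
              printf("notify slaves about id %u\n", lqe->qid);
              return 0;
      }

      /* Stands in for cfs_hash_for_each_tight() over the site hash. */
      static void for_each_lqe(struct lqe *entries, int n,
                               int (*cb)(struct lqe *, void *), void *data)
      {
              for (int i = 0; i < n; i++)
                      cb(&entries[i], data);
      }

      int main(void)
      {
              struct lqe site[] = {
                      { .qid = 500, .lqe_is_global = true  },
                      { .qid = 600, .lqe_is_global = false },  /* qpool1-only */
              };

              /* Aborts on qid 600, mirroring the LBUG in the report. */
              for_each_lqe(site, 2, site_recalc_cb, NULL);
              return 0;
      }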

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-quota test_3b - onyx-124vm6 crashed during sanity-quota test_3b

Activity

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55536/
            Subject: LU-17974 quota: fix qmt_pool_lqes_lookup_spec
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 8089ca3aacee5ac059cd708c99c23c3a8da9e7b9

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55536/ Subject: LU-17974 quota: fix qmt_pool_lqes_lookup_spec Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 8089ca3aacee5ac059cd708c99c23c3a8da9e7b9
            Peter Jones added a comment:
            Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55535/
            Subject: LU-17974 quota: fix qmt_pool_lqes_lookup_spec
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c97b327758f06f6bf3229126e9aa7b36865e7b92

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55535/ Subject: LU-17974 quota: fix qmt_pool_lqes_lookup_spec Project: fs/lustre-release Branch: master Current Patch Set: Commit: c97b327758f06f6bf3229126e9aa7b36865e7b92

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55536
            Subject: LU-17974 quota: fix qmt_pool_lqes_lookup_spec
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 2412cbdbaf039334b943cbae17740150bb3ee7aa

            gerrit Gerrit Updater added a comment - "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55536 Subject: LU-17974 quota: fix qmt_pool_lqes_lookup_spec Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 2412cbdbaf039334b943cbae17740150bb3ee7aa

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55535
            Subject: LU-17974 quota: fix qmt_pool_lqes_lookup_spec
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 10b3f95958830aa369dd53e5b504fae6371d2be6

            gerrit Gerrit Updater added a comment - "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55535 Subject: LU-17974 quota: fix qmt_pool_lqes_lookup_spec Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 10b3f95958830aa369dd53e5b504fae6371d2be6

            Sergey Cheremencev added a comment:
            OK, the vmcore analysis proved my 2nd hypothesis. There is just one lqe, and it belongs to qpool1, i.e. it is a non-global lqe. The corresponding global lqe with the same ID is not enforced - that is why qmt_pool_lqes_lookup_spec did not add it to the lqe array. I'll send a fix.
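
            To make the confirmed scenario concrete, below is a minimal user-space C sketch of the invariant the fix has to restore: qmt_id_lock_notify() must only ever be handed a global lqe, so the lookup/notify path has to skip an ID whose only enforced lqe lives in a non-global pool. The types and helpers here are simplified, hypothetical stand-ins, not the actual Lustre structures.

            #include <assert.h>
            #include <errno.h>
            #include <stdbool.h>
            #include <stdio.h>

            /* Hypothetical, simplified stand-in for struct lquota_entry. */
            struct lqe {
                    bool lqe_is_global;  /* lives in the default (global) pool */
                    bool lqe_enforced;   /* enforcement enabled for this ID */
            };

            /*
             * Simplified stand-in for qmt_pool_lqes_lookup_spec(): collect only
             * the enforced lqes for one quota ID.  In the crash, the only
             * enforced lqe for the ID belonged to qpool1, so no global lqe
             * ended up in the array.
             */
            static int lqes_lookup_spec(struct lqe **pools, int npools,
                                        struct lqe **arr, int *cnt)
            {
                    *cnt = 0;
                    for (int i = 0; i < npools; i++)
                            if (pools[i] != NULL && pools[i]->lqe_enforced)
                                    arr[(*cnt)++] = pools[i];
                    return *cnt > 0 ? 0 : -ENOENT;
            }

            /* Mirrors the ASSERTION(lqe->lqe_is_global) at qmt_lock.c:957. */
            static void id_lock_notify(struct lqe *lqe)
            {
                    assert(lqe->lqe_is_global);
                    printf("notifying slaves via the global lqe\n");
            }

            int main(void)
            {
                    struct lqe global = { .lqe_is_global = true,  .lqe_enforced = false };
                    struct lqe pool1  = { .lqe_is_global = false, .lqe_enforced = true  };
                    struct lqe *pools[] = { &global, &pool1 };
                    struct lqe *arr[2];
                    int cnt;

                    if (lqes_lookup_spec(pools, 2, arr, &cnt) == 0) {
                            /*
                             * Pre-fix behaviour: arr[0] is the non-global qpool1
                             * lqe, and notifying on it trips the assertion.  The
                             * fix notifies only when a global lqe is present.
                             */
                            for (int i = 0; i < cnt; i++)
                                    if (arr[i]->lqe_is_global)
                                            id_lock_notify(arr[i]);
                    }
                    return 0;
            }
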
            Sergey Cheremencev added a comment:
            [29347.756593] LustreError: 1786057:0:(qmt_lock.c:957:qmt_id_lock_notify()) ASSERTION( lqe->lqe_is_global ) failed:
            [29347.758422] LustreError: 1786057:0:(qmt_lock.c:957:qmt_id_lock_notify()) LBUG
            [29347.759616] Pid: 1786057, comm: qsd_reint_qpool 4.18.0-477.27.1.el8_lustre.x86_64 #1 SMP Thu Jun 20 04:13:41 UTC 2024
            [29347.761346] Call Trace TBD:
            [29347.761975] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
            [29347.762930] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
            [29347.763764] [<0>] qmt_id_lock_notify+0x1ee/0x330 [lquota]
            [29347.764777] [<0>] qmt_site_recalc_cb+0x34b/0x550 [lquota]
            [29347.765708] [<0>] cfs_hash_for_each_tight+0x122/0x310 [libcfs]
            [29347.766709] [<0>] qmt_pool_recalc+0x375/0xa80 [lquota]
            [29347.767603] [<0>] kthread+0x134/0x150
            [29347.768319] [<0>] ret_from_fork+0x35/0x40
            [29347.769063] Kernel panic - not syncing: LBUG

            I haven't looked into the vmcore yet. I have 2 hypotheses about the above panic:
            1. qmt_pool_lqes_lookup_spec didn't find any lqe and returned 0. If qti_lqes(env)[index] somehow still held a pointer to a valid non-global lqe, that could be the reason for the assertion in qmt_id_lock_notify (a sketch of this pattern follows this comment). In such a case https://review.whamcloud.com/55518 will help here.

            2. qmt_pool_lqes_lookup_spec found an enforced lqe, but it is not global. If there is no enforced global lqe, we don't need to notify any quota slaves. If so, https://review.whamcloud.com/55518 won't help. At first glance this problem might exist even in master.

            After the vmcore analysis I will have a clear understanding of the 2nd item. If my guess is right, I'll prepare a fix.

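            Hypothesis 1 above describes a stale-pointer pattern in a reused per-thread scratch array. A minimal user-space C sketch of that pattern follows; qti_lqes here is a simplified stand-in for the per-thread state, not the real code.

            #include <stdbool.h>
            #include <stdio.h>

            struct lqe { bool lqe_is_global; };

            /* Per-thread scratch state, reused across lookups. */
            static struct lqe *qti_lqes[8];
            static int qti_lqes_cnt;

            /*
             * A lookup that finds no matching lqe: it resets the count but, in
             * the buggy pattern, leaves pointers from the previous call behind.
             */
            static int lookup_finds_nothing(void)
            {
                    qti_lqes_cnt = 0;
                    return 0;
            }

            int main(void)
            {
                    struct lqe stale = { .lqe_is_global = false };

                    /* Leftover from an earlier lookup on the same thread. */
                    qti_lqes[0] = &stale;
                    qti_lqes_cnt = 1;

                    lookup_finds_nothing();

                    /*
                     * A caller that indexes qti_lqes[] without trusting the
                     * count picks up the stale non-global lqe, and the
                     * lqe_is_global assertion fires.  Checking the count (and
                     * clearing the array, as https://review.whamcloud.com/55518
                     * does) avoids that path.
                     */
                    if (qti_lqes_cnt > 0 && qti_lqes[0]->lqe_is_global)
                            printf("global lqe present - notify slaves\n");
                    else
                            printf("no lqes found - skip notification\n");
                    return 0;
            }
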
            Sergey Cheremencev added a comment:
            [31293.894221] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d8
            [31293.895608] PGD 0 P4D 0 
            [31293.896080] Oops: 0002 [#1] SMP PTI
            [31293.896704] CPU: 0 PID: 1800421 Comm: qsd_reint_qpool Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-513.24.1.el8_lustre.x86_64 #1
            [31293.898789] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [31293.899737] RIP: 0010:mutex_lock+0x19/0x30
            [31293.900481] Code: 00 0f 1f 44 00 00 be 02 00 00 00 e9 d1 fb ff ff 90 0f 1f 44 00 00 53 48 89 fb e8 02 e0 ff ff 31 c0 65 48 8b 14 25 40 dc 01 00 <f0> 48 0f b1 13 74 06 48 89 df 5b eb ca 5b c3 cc cc cc cc 0f 1f 40
            [31293.903416] RSP: 0018:ffffb49c42397d68 EFLAGS: 00010246
            [31293.904301] RAX: 0000000000000000 RBX: 00000000000000d8 RCX: 0000000000000000
            [31293.905458] RDX: ffff9d78a459d000 RSI: ffffffffc1822720 RDI: 00000000000000d8
            [31293.906620] RBP: ffffb49c42397e20 R08: 0000000000000752 R09: ffff9d78a64f1000
            [31293.907792] R10: ffffb49c42397d28 R11: ffff9d78c7851750 R12: ffff9d78e4879800
            [31293.908954] R13: 0000000000000000 R14: 00000000000000d8 R15: 0000000000000002
            [31293.910117] FS:  0000000000000000(0000) GS:ffff9d793fc00000(0000) knlGS:0000000000000000
            [31293.911430] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [31293.912449] CR2: 00000000000000d8 CR3: 000000002fe10006 CR4: 00000000001706f0
            [31293.913620] Call Trace:
            [31293.914096]  ? __die_body+0x1a/0x60
            [31293.914733]  ? no_context+0x1ba/0x3f0
            [31293.915386]  ? __bad_area_nosemaphore+0x16c/0x1c0
            [31293.916188]  ? do_page_fault+0x37/0x12d
            [31293.916850]  ? page_fault+0x1e/0x30
            [31293.917452]  ? mutex_lock+0x19/0x30
            [31293.918066]  ? mutex_lock+0xe/0x30
            [31293.918662]  qmt_site_recalc_cb+0x31a/0x550 [lquota]
            [31293.919615]  ? qmt_pool_lqes_lookup_spec+0x340/0x340 [lquota]
            [31293.920582]  cfs_hash_for_each_tight+0x122/0x310 [libcfs]
            [31293.921557]  qmt_pool_recalc+0x375/0xa80 [lquota]
            [31293.922369]  ? __schedule+0x2d9/0x870
            [31293.923011]  ? qmt_sarr_get_idx+0x90/0x90 [lquota]
            [31293.923833]  ? qmt_sarr_get_idx+0x90/0x90 [lquota]
            [31293.924649]  kthread+0x134/0x150
            [31293.925233]  ? set_kthread_struct+0x50/0x50
            [31293.925950]  ret_from_fork+0x35/0x40 

            The failure above should definitely be fixed by https://review.whamcloud.com/55518 "LU-16341 quota: fix panic in qmt_site_recalc_cb".

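            The faulting address is telling: 0xd8 (CR2/RDI/RBX) is far too small to be a valid kernel pointer, which is the classic signature of reaching a struct member through a NULL base pointer - mutex_lock() was handed the address of a lock embedded at offset 0xd8 inside a quota entry whose pointer was NULL. A small C illustration (the struct layout below is hypothetical):

            #include <stdio.h>
            #include <stddef.h>

            /* Hypothetical layout: a lock 0xd8 bytes into the entry. */
            struct lqe_like {
                    char          pad[0xd8];  /* fields preceding the lock */
                    unsigned long lqe_lock;   /* stand-in for a struct mutex */
            };

            int main(void)
            {
                    /*
                     * With a NULL entry pointer, &entry->lqe_lock is just
                     * NULL + offsetof(...), so mutex_lock() faults at exactly
                     * that small address - matching CR2 = 00000000000000d8.
                     */
                    printf("NULL->lqe_lock would fault at 0x%zx\n",
                           offsetof(struct lqe_like, lqe_lock));
                    return 0;
            }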

            Andreas Dilger added a comment:
            I've backported patch https://review.whamcloud.com/55518 "LU-16341 quota: fix panic in qmt_site_recalc_cb" to b2_15 in the hope that it is related (including the test exclusion patches for test_14 and test_1b for older versions of the code that do not have this fix). However, I haven't done any in-depth analysis of the actual crashes to know whether that is the source of the problem. I did see that 2.15.5-RC1 had included patch https://review.whamcloud.com/55035 "LU-17034 quota: tmp fix against memory corruption", which landed AFTER LU-16341, so it might still be that bug with a slightly different stack.

            "Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55526
            Subject: LU-17974: tests: add more test params
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 66b9ec5c3a575ef8379955af3d6addd83e10022e

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55526 Subject: LU-17974 : tests: add more test params Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 66b9ec5c3a575ef8379955af3d6addd83e10022e

            People

              Assignee: Sergey Cheremencev
              Reporter: Maloo