
[LU-6089] qsd_handler.c:1139:qsd_op_adjust()) ASSERTION( qqi ) failed

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • Lustre 2.7.0
    • None
    • 3
    • 16950

    Description

      Had this crash happen on the tip of master as of yesterday while running test 132 of sanity.sh:

      Lustre: 3948:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1420636700/real 1420636700]  req@ffff8800290ba380 x1489642625567436/t0(0) o250->MGC192.168.20.154@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1420636706 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3948:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      LustreError: 39:0:(qsd_handler.c:1139:qsd_op_adjust()) ASSERTION( qqi ) failed: 
      LustreError: 39:0:(qsd_handler.c:1139:qsd_op_adjust()) LBUG
      Pid: 39, comm: kswapd0
      
      Call Trace:
      [<ffffffffa0779895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa0779e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      [<ffffffffa0f25028>] qsd_op_adjust+0x478/0x580 [lquota]
      [<ffffffffa100a597>] osd_object_delete+0x217/0x2f0 [osd_ldiskfs]
      [<ffffffffa091e0c1>] lu_object_free+0x81/0x1a0 [obdclass]
      [<ffffffffa091f167>] lu_site_purge+0x2e7/0x4e0 [obdclass]
      [<ffffffffa091f4e8>] lu_cache_shrink+0x188/0x310 [obdclass]
      [<ffffffff81138dba>] shrink_slab+0x12a/0x1a0
      [<ffffffff8113c0da>] balance_pgdat+0x59a/0x820
      [<ffffffff8113c494>] kswapd+0x134/0x3b0
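
      For illustration only, here is a minimal userspace sketch of the failure pattern the trace shows; it is not the Lustre code and every identifier in it is a made-up stand-in. A shrinker-driven purge frees a cached object whose delete path still dereferences per-type quota state (the "qqi") that teardown has already released, which is where the ASSERTION( qqi ) fires:

      /*
       * Minimal userspace model, not Lustre code: all identifiers below are
       * hypothetical stand-ins for the structures named in the trace above.
       */
      #include <assert.h>
      #include <stdio.h>
      #include <stdlib.h>

      struct qtype_info { int qtype; };            /* stands in for the per-type "qqi" */

      struct quota_slave {
          struct qtype_info *qqi[2];               /* freed and NULLed on unmount */
      };

      struct cached_object {
          struct quota_slave *qsd;                 /* back-pointer used at delete time */
      };

      /* delete-time quota adjustment, as called from the object free path */
      static void op_adjust(struct quota_slave *qsd, int qtype)
      {
          struct qtype_info *qqi = qsd->qqi[qtype];
          assert(qqi);                             /* models the LASSERT(qqi) that fires in the trace */
          printf("adjusting quota for type %d\n", qqi->qtype);
      }

      /* unmount tears down the per-type info while objects may still be cached */
      static void quota_fini(struct quota_slave *qsd)
      {
          for (int i = 0; i < 2; i++) {
              free(qsd->qqi[i]);
              qsd->qqi[i] = NULL;
          }
      }

      /* shrinker-driven purge: frees a cached object, which calls op_adjust() */
      static void cache_shrink(struct cached_object *obj)
      {
          op_adjust(obj->qsd, 0);
      }

      int main(void)
      {
          struct quota_slave qsd = {
              .qqi = { calloc(1, sizeof(struct qtype_info)),
                       calloc(1, sizeof(struct qtype_info)) },
          };
          struct cached_object obj = { .qsd = &qsd };

          quota_fini(&qsd);                        /* unmount runs first ...            */
          cache_shrink(&obj);                      /* ... then reclaim trips the assert */
          return 0;
      }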
      

          Activity


            adilger Andreas Dilger added a comment:

            I've now hit this once while testing http://review.whamcloud.com/11258 on master instead of the 12515 patch, in a thread doing memory reclaim during an unmount operation, though it was not a thread involved in the unmount itself:

            Lustre: Failing over testfs-OST0000
            Lustre: server umount testfs-OST0000 complete
            Lustre: Failing over testfs-OST0001
            general protection fault: 0000 [#1] SMP 
            Pid: 2170, comm: java Tainted: P---------------    2.6.32-431.29.2.el6_lustre.g36cd22b.x86_6
            RIP: 0010:[<ffffffffa07a4cf6>] [<ffffffffa07a4cf6>] qsd_op_adjust+0xb6/0x580 [lquota]
            Process java (pid: 2170, threadinfo ffff8800d00ce000, task ffff880037c61540)
            Call Trace:
            osd_object_delete+0x217/0x2f0 [osd_ldiskfs]
            lu_object_free+0x81/0x1a0 [obdclass]
            lu_site_purge+0x2e7/0x4e0 [obdclass]
            lu_cache_shrink+0x188/0x310 [obdclass]
            shrink_slab+0x12a/0x1a0
            do_try_to_free_pages+0x3f7/0x610
            try_to_free_pages+0x92/0x120
            __alloc_pages_nodemask+0x47e/0x8d0
            kmem_getpages+0x62/0x170
            fallback_alloc+0x1ba/0x270
            ____cache_alloc_node+0x99/0x160
            user_path_parent+0x31/0x80
            sys_renameat+0xb8/0x3a0
            sys_rename+0x1b/0x20
            

            I've also gone back and retested the latest version of 12515 and have not hit this problem in sanity.sh as I had twice before.
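
            For illustration only, a minimal userspace sketch (not Lustre code and not a proposed patch; every name is made up) of the kind of serialization the two paths would need: the delete path re-checks the per-type quota info under a lock and pins it with a reference before use, so a concurrent teardown cannot free it out from under a thread doing reclaim:

            /*
             * Userspace model only; the locking and refcounting here are
             * illustrative, not how Lustre's quota slave actually does it.
             */
            #include <pthread.h>
            #include <stdio.h>
            #include <stdlib.h>

            struct qtype_info {
                int refcount;
            };

            struct quota_slave {
                pthread_mutex_t lock;
                struct qtype_info *qqi;      /* NULL once teardown has run */
            };

            /* delete-time adjustment: bail out quietly if teardown already ran */
            static void op_adjust_guarded(struct quota_slave *qsd)
            {
                struct qtype_info *qqi;

                pthread_mutex_lock(&qsd->lock);
                qqi = qsd->qqi;
                if (qqi == NULL) {
                    pthread_mutex_unlock(&qsd->lock);
                    printf("quota state already torn down, skipping adjustment\n");
                    return;
                }
                qqi->refcount++;             /* pin the info while it is in use */
                pthread_mutex_unlock(&qsd->lock);

                printf("adjusting quota, refcount=%d\n", qqi->refcount);

                pthread_mutex_lock(&qsd->lock);
                qqi->refcount--;
                pthread_mutex_unlock(&qsd->lock);
            }

            /* unmount-time teardown: detach under the same lock before freeing */
            static void quota_fini_guarded(struct quota_slave *qsd)
            {
                struct qtype_info *qqi;

                pthread_mutex_lock(&qsd->lock);
                qqi = qsd->qqi;
                qsd->qqi = NULL;             /* new callers now see NULL and bail */
                pthread_mutex_unlock(&qsd->lock);

                free(qqi);                   /* simplification: assumes no caller still holds a reference */
            }

            int main(void)
            {
                struct quota_slave qsd = { .qqi = calloc(1, sizeof(struct qtype_info)) };

                pthread_mutex_init(&qsd.lock, NULL);
                op_adjust_guarded(&qsd);     /* before teardown: adjustment proceeds */
                quota_fini_guarded(&qsd);    /* unmount detaches and frees the info  */
                op_adjust_guarded(&qsd);     /* after teardown: skipped, no crash    */
                pthread_mutex_destroy(&qsd.lock);
                return 0;
            }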


            adilger Andreas Dilger added a comment:

            I reverted the 12515 patch, and while I've observed the original LU-5242 problem of not being able to unmount/remount the filesystem quickly, I haven't had any crashes, whereas I previously hit this crash 2 out of 2 times in sanity.sh while the patch was applied.


            adilger Andreas Dilger added a comment:

            Hit a very similar crash again in osd_object_delete->qsd_op_adjust() when unmounting in sanity.sh test_65j, and I am again testing http://review.whamcloud.com/12515 from LU-5242. It looks like I restarted another test after my previous run that passed sanity.sh, but I don't recall for sure whether I reverted the patch at that time. I'll need to test without this patch again.


            adilger Andreas Dilger added a comment:

            This has the same symptoms as LU-5331, but I'm running the latest master (v2_6_92_0-14-g37145b3 + http://review.whamcloud.com/12515). I'll try reverting that patch to see if it fixes the problem.


            People

              Assignee: wc-triage WC Triage
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: