[LU-4249] exception RIP: lqe64_hash_keycmp+12 Created: 13/Nov/13  Updated: 26/Feb/15  Resolved: 08/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.5.3

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

our source tree is at https://github.com/jlan/lustre-nas
Server running 2.4.0-4nasS


Attachments: File debug.tgz     File service188.nov13.2013.tgz     File syslog.tgz    
Issue Links:
Duplicate
is duplicated by LU-4807 repeating DQACQ failed with -37 Closed
Related
is related to LU-4633 exception RIP: qsd_entry_iter_cb+29 Resolved
Severity: 3
Rank (Obsolete): 11583

 Description   

OSS crashed with a general protection fault at lqe64_hash_keycmp.

See attached files.
Looks like we are crashing here

libcfs/libcfs/hash.c

static cfs_hlist_node_t *
cfs_hash_bd_lookup_intent(cfs_hash_t *hs, cfs_hash_bd_t *bd,
        ...
        cfs_hlist_for_each(ehnode, hhead) {             <<<<<<<< CRASH
                if (!cfs_hash_keycmp(hs, key, ehnode))
                        continue;

cfs_hlist_for_each is defined as hlist_for_each, but hlist_for_each doesn't seem to be defined anywhere.
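
For reference, hlist_for_each() comes from the kernel itself (include/linux/list.h), not from the Lustre tree, which is why it doesn't show up in the Lustre sources. On the 2.6.32-era kernels involved here it is essentially the following (modulo a prefetch hint in some kernel versions):

#define hlist_for_each(pos, head) \
        for (pos = (head)->first; pos; pos = pos->next)

So the walk simply follows the chain's next pointers, and a chain node that has already been freed (and poisoned) can make the very next keycmp callback fault.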

See attached backtrace info.



 Comments   
Comment by Niu Yawei (Inactive) [ 14/Nov/13 ]

hlist_for_each() is a kernel function. Mahmoud, do you know if this happened while shutting down the OST? Thanks.

Comment by Mahmoud Hanafi [ 14/Nov/13 ]

This was not during shutdown

Comment by Mahmoud Hanafi [ 15/Nov/13 ]

We hit this bug again today. Is there any progress on a cause/fix?

Comment by Andreas Dilger [ 15/Nov/13 ]

Mahmoud, could you please attach the console log from before the crash and the oops message itself.

Comment by Mahmoud Hanafi [ 16/Nov/13 ]

Here is the hour before the oops, and the oops message itself.
[-- MARK -- Fri Nov 15 11:00:00 2013]
LustreError: 13805:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -37, flags:0x2 qsd:nbp7-OST004e qtype:usr id:12157 enforced:1 granted:3639632 pending:0 waiting:0 req:1 usage:3449480 qunit:1048576 qtune:262144 edquot:0
LustreError: 13805:0:(qsd_handler.c:344:qsd_req_completion()) Skipped 52 previous similar messages
LustreError: 13891:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -37, flags:0x2 qsd:nbp7-OST002a qtype:usr id:12157 enforced:1 granted:4194304 pending:0 waiting:0 req:1 usage:4009436 qunit:1048576 qtune:262144 edquot:0
LustreError: 13891:0:(qsd_handler.c:344:qsd_req_completion()) Skipped 58 previous similar messages
LustreError: 14039:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -37, flags:0x2 qsd:nbp7-OST002e qtype:usr id:12157 enforced:1 granted:4157724 pending:0 waiting:0 req:1 usage:3963432 qunit:1048576 qtune:262144 edquot:0
LustreError: 14039:0:(qsd_handler.c:344:qsd_req_completion()) Skipped 48 previous similar messages
Lustre: nbp7-OST004e: haven't heard from client fefdd2bf-1bc8-70b6-3aeb-de9d92f342d8 (at 10.151.1.109@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff88155de95400, cur 1384543303 expire 1384543153 last 1384543076
Lustre: Skipped 10 previous similar messages
Lustre: nbp7-OST0042: haven't heard from client fefdd2bf-1bc8-70b6-3aeb-de9d92f342d8 (at 10.151.1.109@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff881a0cb53000, cur 1384543303 expire 1384543153 last 1384543076
LNet: 2974:0:(o2iblnd_cb.c:2348:kiblnd_passive_connect()) Conn stale 10.151.17.83@o2ib [old ver: 12, new ver: 12]
LustreError: 14039:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -37, flags:0x2 qsd:nbp7-OST002e qtype:usr id:12157 enforced:1 granted:4157724 pending:0 waiting:0 req:1 usage:3963432 qunit:1048576 qtune:262144 edquot:0
LustreError: 14039:0:(qsd_handler.c:344:qsd_req_completion()) Skipped 562 previous similar messages
general protection fault: 0000 1 SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:03.0/0000:86:00.0/host15/rport-15:0-0/target15:0:0/15:0:0:39/state

Entering kdb (current=0xffff8810259ed500, pid 306) on processor 19 Oops: (null)
due to oops @ 0xffffffffa0c72f2c
r15 = 0x5a5a5a5a5a5a5a5a r14 = 0x0000000000000000
r13 = 0xffff880e9509ef28 r12 = 0xffff8810259ef9d0
bp = 0xffff8810259ef940 bx = 0xffff881f55c5f900
r11 = 0x6000000000000000 r10 = 0x0000000000001e70
r9 = 0x000000000000e56f r8 = 0x0000000000000003
ax = 0x0000000000002f7d cx = 0x0000000000000000
dx = 0x0000000000000000 si = 0x5a5a5a5a5a5a5a5a
di = 0xffff880e9509ef28 orig_ax = 0xffffffffffffffff
ip = 0xffffffffa0c72f2c cs = 0x0000000000000010
flags = 0x0000000000010206 sp = 0xffff8810259ef930
ss = 0x0000000000000018 &regs = 0xffff8810259ef898
[19]kdb> [-- MARK -- Fri Nov 15 12:00:00 2013]
[-- root@localhost attached -- Fri Nov 15 12:01:54 2013]
[-- root@localhost detached -- Fri Nov 15 12:08:44 2013]
[-- root@localhost attached -- Fri Nov 15 12:10:07 2013]

Comment by Niu Yawei (Inactive) [ 18/Nov/13 ]

Looks like the lqe is freed unexpectedly, but I don't see from the code how it could be freed in such a case. Anyway, the message:

LustreError: 13891:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -37, flags:0x2 qsd:nbp7-OST002a qtype:usr id:12157 enforced:1 granted:4194304 pending:0 waiting:0 req:1 usage:4009436 qunit:1048576 qtune:262144 edquot:0

indicates that something was wrong. If this error message can be reproduced, could you reproduce it and capture logs on both the OST and the MDT with D_QUOTA & D_TRACE enabled? I'd like to figure out why the acquire request always fails with -ENOLCK.

Comment by Mahmoud Hanafi [ 18/Nov/13 ]

Uploaded debug logs:

ftp://ftp.whamcloud.com/uploads/LU-4249.debuglogs.tgz

Comment by Niu Yawei (Inactive) [ 19/Nov/13 ]

The numerous ENOLCK errors are possibly caused by a race:

  • The per-ID quota lock is being canceled on the quota slave, and the corresponding lqe_lockh is cleared; see qsd_id_blocking_ast():
                    lqe_write_lock(lqe);
                    if (lustre_handle_equal(&lockh, &lqe->lqe_lockh)) {
                            /* Clear lqe_lockh & reset qunit to 0 */
                            qsd_set_qunit(lqe, 0);
                            memset(&lqe->lqe_lockh, 0, sizeof(lqe->lqe_lockh));
                            lqe->lqe_edquot = false;
                            rel = true;
                    }
                    lqe_write_unlock(lqe);
    
  • Just after the lqe_lockh is cleared, the reply of a pending DQACQ is received and the lqe_lockh is set back to the old lockh again; see qsd_req_completion():
            /* Set the lqe_lockh */
            if (lustre_handle_is_used(lockh) &&
                !lustre_handle_equal(lockh, &lqe->lqe_lockh))
                    lustre_handle_copy(&lqe->lqe_lockh, lockh);
    
    

Then pre-acquire will always try to get the lock with the stale lockh, and fail with -ENOLCK.

But I don't see why it would cause a crash in lqe64_hash_keycmp(); anyway, I'll cook up a patch to fix the race first.
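
Below is a minimal, self-contained user-space sketch of the interleaving described above. The names (fake_lqe, lockh, blocking_ast, req_completion) are illustrative stand-ins, not the real Lustre symbols; the two join calls simply force the problematic ordering rather than racing for real. Compile with -lpthread.

/* Model of the suspected ordering: the blocking AST clears the per-ID
 * lock handle, then a late DQACQ reply copies the stale handle back,
 * so later pre-acquires keep using a cookie that no longer maps to a
 * granted lock and fail with -ENOLCK. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct fake_lqe {
        pthread_mutex_t lock;   /* stands in for lqe_write_lock() */
        uint64_t        lockh;  /* stands in for the lqe_lockh cookie */
};

static struct fake_lqe lqe = { PTHREAD_MUTEX_INITIALIZER, 0xc0ffee };

/* qsd_id_blocking_ast() path: lock cancelled, handle cleared */
static void *blocking_ast(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lqe.lock);
        lqe.lockh = 0;          /* memset(&lqe->lqe_lockh, 0, ...) */
        pthread_mutex_unlock(&lqe.lock);
        return NULL;
}

/* qsd_req_completion() path: the reply of a DQACQ sent while the old
 * lock was still granted carries the old handle and copies it back */
static void *req_completion(void *arg)
{
        uint64_t reply_lockh = 0xc0ffee;

        (void)arg;
        pthread_mutex_lock(&lqe.lock);
        if (reply_lockh != 0 && reply_lockh != lqe.lockh)
                lqe.lockh = reply_lockh;   /* stale cookie restored */
        pthread_mutex_unlock(&lqe.lock);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, blocking_ast, NULL);
        pthread_join(t1, NULL);      /* AST clears the handle... */
        pthread_create(&t2, NULL, req_completion, NULL);
        pthread_join(t2, NULL);      /* ...then the reply puts it back */

        printf("lqe_lockh = %#lx (stale: lock already cancelled)\n",
               (unsigned long)lqe.lockh);
        return 0;
}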

Comment by Niu Yawei (Inactive) [ 19/Nov/13 ]

patch to fix the -ENOLCK error: http://review.whamcloud.com/8322

Comment by Niu Yawei (Inactive) [ 22/Nov/13 ]

Looks like my analysis of the ENOLCK isn't quite right: the DQACQ should have held a lock reference, so the lock can't be canceled while there is a pending DQACQ. Mahmoud, do you know what kind of operations could cause the problem? Is your source tree the same as ours?

Comment by Peter Jones [ 25/Nov/13 ]

Niu

NASA does include some patches. There is a link to their tree in the Environment field above.

Peter

Comment by Mahmoud Hanafi [ 25/Nov/13 ]

We had another crash and we do have a crash dump, but it can only be viewed by a US citizen.

I tried to get more info on the DQACQ error: I got a list of users that are generating the error and then looked at their quotas. I found some odd things. For some users the reported file count is much less than what they are actually using. For example:

  # find /nobackupp7/ppritche/ -type f -user ppritche | wc -l
  22986
  # lfs quota -u ppritche /nobackupp7
  Disk quotas for user ppritche (uid 11563):
       Filesystem     kbytes      quota      limit grace  files  quota   limit grace
      /nobackupp7 2040023252 2300000000 2600000000     -     30  75000  150000     -

lfs quota says he is using 30 files, but find shows >22,000 files. There are other examples like this.

Comment by Keith Mannthey (Inactive) [ 25/Nov/13 ]

I am able to look at this crash dump as I have done before. I sent an email.

Comment by Niu Yawei (Inactive) [ 26/Nov/13 ]

Thank you, Keith, let's see if there is anything new in the new crash.

For some users the reported file count is much less than what they are actually using

Is this a new system or one upgraded from an old system? Does 'lfs quota' show newly created files? Can quotacheck fix the inconsistency? (Run 'tune2fs -O ^quota' to disable quota, then 'tune2fs -O quota' to enable it again, with up-to-date e2fsprogs; see LU-3861.)

Comment by Mahmoud Hanafi [ 26/Nov/13 ]

This is a filesystem that was upgraded from 2.1.5 to 2.4. We initially ran into this file undercount and ran the tune2fs commands. One thing to note about these users: their files are initially copied to this filesystem as root, and then their ownership is changed. Here is an example:

/nobackupp7                          <-- top-level directory
/nobackupp7/ARRIVALS.anudelma        <-- owned by user anudelma
/nobackupp7/ARRIVALS.anudelma/jliu7  <-- owned by jliu7, as are all files under this dir

  # lfs quota -u jliu7 /nobackupp7
  Disk quotas for user jliu7 (uid 11925):
       Filesystem     kbytes  quota  limit grace  files  quota  limit grace
      /nobackupp7 4340116920      0      0     -      0      0      0     -

So when the ownership is changed, the files are not accounted for.

Question about using tune2fs:
Once we unmount the OST and run 'tune2fs -O ^quota', do we need to mount and unmount it before running 'tune2fs -O quota'?

Keith,
I uploaded the core file; see your email for info.

Mahmoud

Comment by Keith Mannthey (Inactive) [ 26/Nov/13 ]

I will take a good look at this crash and report back.

Comment by Niu Yawei (Inactive) [ 27/Nov/13 ]

So when the ownership is changed, the files are not accounted for.

Is this reproducible? Are there any error messages in the log?

Question about using tune2fs:
Once we unmount the OST and run 'tune2fs -O ^quota', do we need to mount and unmount it before running 'tune2fs -O quota'?

No, there is no need to mount and unmount in between.

Comment by Mahmoud Hanafi [ 21/Dec/13 ]

Keith,

We need an update on this. Any luck extracting info from the core files?

Comment by Cliff White (Inactive) [ 13/Jan/14 ]

I have taken this over from Keith; very sorry about the delay. I have the dump unpacked and should be able to get you some information this week.

Comment by Cliff White (Inactive) [ 14/Jan/14 ]

I see you are using a non-standard build - would it be possible for you to upload your vmlinux for this crash dump?

Comment by Mahmoud Hanafi [ 14/Jan/14 ]

I uploaded LU-4249RPMS.tgz to ftp.whamcloud.com, which has the file you need.

Comment by Mahmoud Hanafi [ 14/Jan/14 ]

Cliff,
Have you had a chance to look at the crash dump? We hit this bug 3 times this morning.

Thanks,
Mahmoud

Comment by Niu Yawei (Inactive) [ 15/Jan/14 ]

Mahmoud, Cliff has already checked the crash dump, and it shows a crash in the same place as the first one you posted in this ticket.

I'm not sure if it's related to the cfs_hash bug, but you should apply e44670a0301d3edf13ebd1bc728ad4c797369de2 (LU-4362).

Comment by Mahmoud Hanafi [ 15/Jan/14 ]

We will apply LU-4362 and test.

Thanks.

Comment by Mahmoud Hanafi [ 30/Jan/14 ]

Looks like LU-4362 did NOT fix this crash. We just had 3 OSSes crash at the same time. Looks like this is very sensitive to MDS recovery.

LustreError: 13476:0:(qsd_handler.c:344:qsd_req_completion()) Skipped 9 previous similar messages
LustreError: 167-0: nbp7-MDT0000-lwp-OST0002: This client was evicted by nbp7-MDT0000; in progress operations using this service will fail.
LustreError: Skipped 1 previous similar message
Lustre: Evicted from MGS (at 10.151.27.38@o2ib) after server handle changed from 0xc262b1e69b1d10fc to 0xe69d9b7903f78200
Lustre: MGC10.151.27.38@o2ib: Connection restored to MGS (at 10.151.27.38@o2ib)
Lustre: Skipped 2 previous similar messages
Lustre: 82396:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST000e: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 82396:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST000e: reconciliation failed. [0x0:0x0:0x0], -37
Lustre: nbp7-OST0002: deleting orphan objects from 0x0:7609774 to 0x0:7610145
Lustre: nbp7-OST000e: deleting orphan objects from 0x0:7567992 to 0x0:7568097
Lustre: nbp7-OST001a: deleting orphan objects from 0x0:7599891 to 0x0:7600225
general protection fault: 0000 1 SMP
last sysfs file: /sys/devices/pci0000:40/0000:40:03.2/0000:44:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:27/state
15 out of 16 cpus in kdb, waiting for the rest, timeout in 10 second(s)
...1 cpu is not in kdb, its state is unknown

Entering kdb (current=0xffff880bdf4a3540, pid 18116) on processor 0 Oops: (null)
due to oops @ 0xffffffffa0c59f2c
r15 = 0x5a5a5a5a5a5a5a5a r14 = 0x0000000000000000
r13 = 0xffff880950324f60 r12 = 0xffff880be47436c0
bp = 0xffff880be4743630 bx = 0xffff8806048505c0
r11 = 0x0000000000000000 r10 = 0x00000000000007ca
r9 = 0x0000000000000001 r8 = 0x0000000000000003
ax = 0x0000000000002ccb cx = 0x0000000000000000
dx = 0x0000000000000000 si = 0x5a5a5a5a5a5a5a5a
di = 0xffff880950324f60 orig_ax = 0xffffffffffffffff
ip = 0xffffffffa0c59f2c cs = 0x0000000000000010
flags = 0x0000000000010206 sp = 0xffff880be4743620
ss = 0x0000000000000018 &regs = 0xffff880be4743588

Comment by Hongchao Zhang [ 30/Jan/14 ]

From the assembly code of lqe64_hash_keycmp, the problem should be that a lquota_entry managed by cfs_hash has been destroyed:

0xffffffffa0bf8f2c <lqe64_hash_keycmp+12>:	cmp    %rax,0x10(%rsi)   <-- 0x10(%rsi) is the "lqe->lqe_id.qid_uid"
    [exception RIP: lqe64_hash_keycmp+12]
    RIP: ffffffffa0bf8f2c  RSP: ffff88058e5736d0  RFLAGS: 00010206
    RAX: 0000000000002f7d  RBX: ffff880fd6450e00  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 5a5a5a5a5a5a5a5a  RDI: ffff88058f1b2f60
    RBP: ffff88058e5736d0   R8: 0000000000000003   R9: 0000000000000000
    R10: ffff88058f1b2f28  R11: ffffffffa0cac2f0  R12: ffff88058e573760
    R13: ffff88058f1b2f60  R14: 0000000000000000  R15: 5a5a5a5a5a5a5a5a

The RSI is 0x5a5a5a5a5a5a5a5a (a memory-poisoning pattern), which means the corresponding lquota_entry has been freed.
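
For context, here is a small user-space model of why a stale hnode pointer faults in the key compare. The layout is hypothetical (fake_hlist_node, fake_lqe, fake_keycmp are not the real Lustre types); only the offsets mirror what the disassembly shows: the hash node sits at offset 0 of the entry and the 64-bit id at offset 0x10, so the callback effectively performs the "cmp %rax,0x10(%rsi)" above on whatever node pointer the chain walk handed it.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout mirroring the offsets seen in the disassembly:
 * the hash node is the first member, the 64-bit id lives at 0x10. */
struct fake_hlist_node { struct fake_hlist_node *next, **pprev; };

struct fake_lqe {
        struct fake_hlist_node lqe_hash;   /* offset 0x00 */
        uint64_t               lqe_id;     /* offset 0x10 on x86_64 */
};

static int fake_keycmp(const void *key, struct fake_hlist_node *hnode)
{
        /* container_of() is a no-op here because lqe_hash is first */
        struct fake_lqe *lqe = (struct fake_lqe *)hnode;
        return lqe->lqe_id == *(const uint64_t *)key;
}

int main(void)
{
        struct fake_lqe e = { { NULL, NULL }, 0x2f7d };
        uint64_t key = 0x2f7d;

        printf("offsetof(lqe_id) = %#zx\n", offsetof(struct fake_lqe, lqe_id));
        printf("keycmp = %d\n", fake_keycmp(&key, &e.lqe_hash));
        /* With hnode == 0x5a5a5a5a5a5a5a5a (poison from a freed entry)
         * the same dereference is what generates the GPF. */
        return 0;
}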

Comment by Mahmoud Hanafi [ 30/Jan/14 ]

Can we expect a patch soon? This is our most critical issue.

Comment by Johann Lombardi (Inactive) [ 31/Jan/14 ]

Hi Mahmoud,

One possible scenario is that the quota code screws up the refcount on the lquota_entry structure, causing a lqe to be freed while still referenced in the hash table. We are working on a diagnostic patch under http://review.whamcloud.com/#/c/9070. We would like to make sure it passes our regression tests before you apply it. Stay tuned.
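
A minimal sketch of the kind of check such a diagnostic patch adds (hypothetical, simplified single-threaded stand-ins; the real code uses atomic refcounts and LBUG(), and the actual patch is the Gerrit change above): the put path asserts that an entry whose refcount drops to zero is no longer linked into the hash, which is exactly the message seen later in this ticket.

#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for struct lquota_entry: a refcount plus a flag
 * saying whether the entry is still linked into the site hash table. */
struct fake_lqe {
        int  lqe_ref;
        bool lqe_hashed;
};

/* Sketch of a putref with the diagnostic check: freeing an entry the
 * hash still references means a get/put imbalance somewhere, so trip
 * an assertion instead of freeing and leaving a dangling pointer in
 * the hash chain (the real code LBUG()s with the message below). */
static void fake_lqe_putref(struct fake_lqe *lqe)
{
        assert(lqe->lqe_ref > 0);
        if (--lqe->lqe_ref == 0) {
                /* "Freeing quota entry while it is still referenced
                 * in the hash!!!" */
                assert(!lqe->lqe_hashed);
                free(lqe);
        }
}

int main(void)
{
        struct fake_lqe *lqe = calloc(1, sizeof(*lqe));

        lqe->lqe_ref = 1;        /* only one reference left... */
        lqe->lqe_hashed = true;  /* ...but the hash still links it */
        fake_lqe_putref(lqe);    /* trips the diagnostic assert */
        return 0;
}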

Comment by Mahmoud Hanafi [ 12/Mar/14 ]

We got a crash and here is the debug output.

LustreError: 167-0: nbp7-MDT0000-lwp-OST004a: This client was evicted by nbp7-MDT0000; in progress operations using this service will fail.
Lustre: nbp7-MDT0000-lwp-OST003e: Connection restored to nbp7-MDT0000 (at 10.151.27.38@o2ib)
Lustre: Evicted from MGS (at 10.151.27.38@o2ib) after server handle changed from 0x77507a47a4c36f84 to 0x5127d6849d7f5d1c
Lustre: nbp7-MDT0000-lwp-OST004a: Connection restored to nbp7-MDT0000 (at 10.151.27.38@o2ib)
Lustre: 45157:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST0032: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 45157:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST0032: reconciliation failed. [0x0:0x0:0x0], -37
Lustre: 45187:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST0026: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 45187:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST0026: reconciliation failed. [0x0:0x0:0x0], -37
Lustre: 45188:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST0042: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 45188:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST0042: reconciliation failed. [0x0:0x0:0x0], -37
Lustre: 45189:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST002a: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 45189:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST002a: reconciliation failed. [0x0:0x0:0x0], -37
Lustre: nbp7-OST0026: deleting orphan objects from 0x0:16242960 to 0x0:16242977
Lustre: nbp7-OST002a: deleting orphan objects from 0x0:16088539 to 0x0:16088557
Lustre: nbp7-OST002e: deleting orphan objects from 0x0:15914638 to 0x0:15914671
Lustre: nbp7-OST003a: deleting orphan objects from 0x0:15238739 to 0x0:15238764
Lustre: nbp7-OST0042: deleting orphan objects from 0x0:15604504 to 0x0:15604524
Lustre: nbp7-OST003e: deleting orphan objects from 0x0:16407770 to 0x0:16407841
Lustre: nbp7-OST004a: deleting orphan objects from 0x0:15822277 to 0x0:15822318
Lustre: nbp7-OST004e: deleting orphan objects from 0x0:15730335 to 0x0:15730375
Lustre: nbp7-OST0036: deleting orphan objects from 0x0:14913994 to 0x0:14914030
Lustre: nbp7-OST0046: deleting orphan objects from 0x0:15378324 to 0x0:15378342
Lustre: nbp7-OST0032: deleting orphan objects from 0x0:15902313 to 0x0:15902368
Lustre: nbp7-OST0052: deleting orphan objects from 0x0:15714908 to 0x0:15714926
Lustre: 45191:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST0046: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 45191:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST0046: reconciliation failed. [0x0:0x0:0x0], -37
Lustre: 45209:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST003e: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 45209:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST003e: reconciliation failed. [0x0:0x0:0x0], -37
LustreError: 11901:0:(lquota_internal.h:272:lqe_putref()) $$$ Freeing quota entry while it is still referenced in the hash!!!
 qsd:nbp7-OST002a qtype:usr id:11920 enforced:1 granted:31459296 pending:0 waiting:0 req:0 usage:31396440 qunit:4194304 qtune:524288 edquot:0
LustreError: 11901:0:(lquota_internal.h:273:lqe_putref()) LBUG

Comment by Johann Lombardi (Inactive) [ 12/Mar/14 ]

There should be a stack trace dumped after the LBUG. Could you please paste it here?

Comment by Mahmoud Hanafi [ 12/Mar/14 ]

This is from a second host that crashed at the same time. It has the call trace:

Lustre: nbp7-MDT0000-lwp-OST003b: Connection restored to nbp7-MDT0000 (at 10.151.27.38@o2ib)
Lustre: Evicted from MGS (at 10.151.27.38@o2ib) after server handle changed from 0x77507a47a4c430e4 to 0x5127d6849d7f8850
Lustre: nbp7-MDT0000-lwp-OST0033: Connection restored to nbp7-MDT0000 (at 10.151.27.38@o2ib)
Lustre: 47116:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST003f: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 47116:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST003f: reconciliation failed. [0x0:0x0:0x0], -37
Lustre: 47117:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST002f: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 47117:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST002f: reconciliation failed. [0x0:0x0:0x0], -37
Lustre: 47122:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST004f: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 47122:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST004f: reconciliation failed. [0x0:0x0:0x0], -37
Lustre: nbp7-OST002b: deleting orphan objects from 0x0:16291625 to 0x0:16291649
Lustre: nbp7-OST003b: deleting orphan objects from 0x0:16000564 to 0x0:16000656
Lustre: nbp7-OST0027: deleting orphan objects from 0x0:16520477 to 0x0:16520523
Lustre: nbp7-OST0033: deleting orphan objects from 0x0:16359404 to 0x0:16359437
Lustre: nbp7-OST0047: deleting orphan objects from 0x0:16321198 to 0x0:16321227
Lustre: nbp7-OST0043: deleting orphan objects from 0x0:16268525 to 0x0:16268561
Lustre: nbp7-OST002f: deleting orphan objects from 0x0:15751870 to 0x0:15751911
Lustre: nbp7-OST003f: deleting orphan objects from 0x0:16136079 to 0x0:16136110
Lustre: nbp7-OST004b: deleting orphan objects from 0x0:15683379 to 0x0:15683400
Lustre: nbp7-OST004f: deleting orphan objects from 0x0:16583505 to 0x0:16583534
Lustre: nbp7-OST0037: deleting orphan objects from 0x0:16228767 to 0x0:16228810
Lustre: nbp7-OST0053: deleting orphan objects from 0x0:16334906 to 0x0:16334966
Lustre: 47127:0:(qsd_reint.c:365:qsd_reconciliation()) nbp7-OST004b: failed to report quota. [0x200000006:0x20000:0x0], -37
Lustre: 47127:0:(qsd_reint.c:525:qsd_reint_main()) nbp7-OST004b: reconciliation failed. [0x0:0x0:0x0], -37
LustreError: 12423:0:(lquota_internal.h:272:lqe_putref()) $$$ Freeing quota entry while it is still referenced in the hash!!!
 qsd:nbp7-OST003f qtype:usr id:9637 enforced:1 granted:2111482660 pending:0 waiting:0 req:0 usage:2111481984 qunit:4096 qtune:1024 edquot:0
LustreError: 12423:0:(lquota_internal.h:273:lqe_putref()) LBUG
Pid: 12423, comm: lquota_wb_nbp7-

Call Trace:
 [<ffffffffa0486895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0486e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0c761ef>] qsd_upd_thread+0xc8f/0xdd0 [lquota]
 [<ffffffff81063be0>] ? default_wake_function+0x0/0x20

Entering kdb (current=0xffff881023ab4040, pid 12423) on processor 24 Oops: (null)
due to oops @ 0x0

Comment by Niu Yawei (Inactive) [ 14/Mar/14 ]

I haven't seen how the quota code could screw up the refcount so far; I will continue to investigate further. Johann, do you have any ideas? Thanks.

Comment by Niu Yawei (Inactive) [ 17/Mar/14 ]

One thing that confuses me: qsd_reconciliation() reported "failed to report quota" with -ENOLCK, which means qsd_reconciliation() called qsd_adjust() and it returned -ENOLCK; however, I didn't see the corresponding error message from qsd_req_completion() in the log:

        /* despite -EDQUOT & -EINPROGRESS errors, the master might still
         * grant us back quota space to adjust quota overrun */
        if (ret != 0 && ret != -EDQUOT && ret != -EINPROGRESS) {
                if (ret != -ETIMEDOUT && ret != -ENOTCONN &&
                   ret != -ESHUTDOWN && ret != -EAGAIN)
                        /* print errors only if return code is unexpected */
                        LQUOTA_ERROR(lqe, "DQACQ failed with %d, flags:0x%x",
                                     ret, reqbody->qb_flags);
                GOTO(out, ret);
        }

Comment by Niu Yawei (Inactive) [ 20/Mar/14 ]

Hi, Mahmoud

I don't see why "DQACQ failed with -37" wasn't printed in the log; is it possible that the running code isn't the same as 2.4.0-4nasS? Do you have other 2.4 systems that are also suffering from this problem? Thanks.

As Johann pointed out, the error message could probably have been flooded out. Could you enable quota debug (D_QUOTA) on all OSTs and provide the Lustre logs as well the next time it's reproduced? Thank you.

Comment by Mahmoud Hanafi [ 20/Mar/14 ]

We have a second filesystem running the same code (2.4.1-5.2nasS); it doesn't have any of the ENOLCK errors. The filesystem with the errors was a 2.1.5 filesystem that was upgraded to 2.4; the other started out on 2.4. We do want to track down the ENOLCK errors.

We have D_QUOTA enabled, but when we see a crash the host panics and we are not able to get a debug dump.

Additional info: these past few crashes have occurred after the MDS crashed and finished recovery; then we would see one or more OSSes crash. But we have seen this crash without an MDS crash as well.

I took a current debug dump of all OSSes and the MDS and uploaded it to the ftp site (nbp7.debug.tgz).

Comment by Shuichi Ihara (Inactive) [ 25/Mar/14 ]

Hi, we hit the same issue (filed as LU-4807). The following error messages have been showing up every 10 minutes, even now. Would turning on quota debug on the OSS and collecting Lustre debug logs help?

Mar 25 00:29:51 ddnoss4 kernel: LustreError: 4503:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -37, flags:0x2 qsd:home2-OST0026 qtype:grp id:3303 enforced:1 granted:7364460 pending:0 waiting:0 req:1 usage:7165364 qunit:4194304 qtune:524288 edquot:0
Mar 25 00:39:51 ddnoss4 kernel: LustreError: 4503:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -37, flags:0x2 qsd:home2-OST0026 qtype:grp id:3303 enforced:1 granted:7364460 pending:0 waiting:0 req:1 usage:7165364 qunit:4194304 qtune:524288 edquot:0
Mar 25 00:49:51 ddnoss4 kernel: LustreError: 4503:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -37, flags:0x2 qsd:home2-OST0026 qtype:grp id:3303 enforced:1 granted:7364460 pending:0 waiting:0 req:1 usage:7165364 qunit:4194304 qtune:524288 edquot:0
Mar 25 00:59:51 ddnoss4 kernel: LustreError: 4503:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -37, flags:0x2 qsd:home2-OST0026 qtype:gr

Comment by Niu Yawei (Inactive) [ 25/Mar/14 ]

Hi, Ihara

If the OSS is still showing this error message every 10 minutes, collecting debug logs (ideally both MDT and OST logs with D_QUOTA enabled) would help us figure out why DQACQ failed with -ENOLCK. Thank you.

Comment by Niu Yawei (Inactive) [ 26/Mar/14 ]

It seems there is a small race window in qsd_intent_lock(); I updated the patch.

Comment by Shuichi Ihara (Inactive) [ 26/Mar/14 ]

Hi Niu,

I uploaded the debug log to ftp.whamcloud.com/uploads/LU-4249.
Hope this helps with the analysis.

Comment by Niu Yawei (Inactive) [ 27/Mar/14 ]

Hi, Ihara/Mahmoud
Could you apply the debug patch (http://review.whamcloud.com/#/c/9808/) to collect more information when the bug is hit again? (Please make sure autotest has passed before applying the patch, and the previous debug patch needs to be reverted before applying this one.) Thank you.

Comment by Jay Lan (Inactive) [ 27/Mar/14 ]

Niu, do you want me to undo patch at http://review.whamcloud.com/#/c/9070 or to keep it, in addition to #9808? Thanks!

Comment by Jay Lan (Inactive) [ 27/Mar/14 ]

The #9070 needs to be reverted before applying #9808.

However, there are still conflicts in three files. I believe your patch was not generated against the 2.4 branch. The three files with conflicts were:

lustre/quota/qmt_pool.c
lustre/quota/qsd_lock.c
lustre/quota/qsd_writeback.c

One conflict in qmt_pool.c at qmt_pool_lqe_lookup():

        if (qid->qid_uid == 0) {
                /* caller wants to access grace time, no need to look up the
                 * entry since we keep a reference on ID 0 all the time */
                lqe = pool->qpi_grace_lqe[qtype];
<<<<<<< HEAD
                lqe_getref(lqe);
                GOTO(out, 0);
=======
                lqe_getref(lqe, LQE_REF_IDX_MAX);
                GOTO(out, lqe);
>>>>>>> 6ec5b50... LU-4249 quota: lqe reference debug patch
        }

        /* now that we have the pool, let's look-up the quota entry in the
         * right quota site */

The tip of the 2.4 branch should have the GOTO line like this:
        GOTO(out, 0);
but yours is
        GOTO(out, lqe);

I think one example is enough for you to see the difference.

Comment by Niu Yawei (Inactive) [ 28/Mar/14 ]

Sorry, I mistakenly made the patch against master; here is the patch for b2_4: http://review.whamcloud.com/9833

This new patch doesn't address the conflict in lustre/quota/qsd_writeback.c, because there is a quota fix introduced in 2.4.2 (LU-3460, http://review.whamcloud.com/#/c/8169/); I suggest you apply that fix first.

Comment by Jay Lan (Inactive) [ 28/Mar/14 ]

Hi Niu, I cherry-picked LU-3460 and then #9833. It applied cleanly now. Thanks!

Comment by Niu Yawei (Inactive) [ 03/Apr/14 ]

Hi Javed

Looks like your post isn't related to this bug; could you open another ticket to track it? Thanks.

Comment by javed shaikh (Inactive) [ 03/Apr/14 ]

Thanks Niu. Deleted from here.

Comment by Shuichi Ihara (Inactive) [ 03/Apr/14 ]

Hi Niu, so, finally, which patches exactly should we apply against b2_4 if we want the debug patches enabled?

Comment by Niu Yawei (Inactive) [ 03/Apr/14 ]

Hi Niu, so, finally, which patches exactly should we apply against b2_4 if we want the debug patches enabled?

The fix for LU-3460 (http://review.whamcloud.com/#/c/8169/) and the debug patch http://review.whamcloud.com/9833.

Comment by Shuichi Ihara (Inactive) [ 18/Apr/14 ]

Niu, we applied the patches, and the same error messages are still showing up. Attached are the recent debug log and syslog messages from after applying the patches.

Comment by Niu Yawei (Inactive) [ 18/Apr/14 ]

Hi, Ihara, the ENOLCK error message problem is addressed in another ticket; please see LU-4920.

This debug patch is to collect information on the lqe refcount when the system crashes with "exception RIP: ..."; please keep the debug patch applied until the crash problem is reproduced. Thanks.

Comment by Niu Yawei (Inactive) [ 04/Jul/14 ]

http://review.whamcloud.com/10988

Comment by Niu Yawei (Inactive) [ 09/Jul/14 ]

b2_4: http://review.whamcloud.com/11019
b2_5: http://review.whamcloud.com/11020

Comment by Peter Jones [ 08/Aug/14 ]

Landed for 2.7

Comment by Jay Lan (Inactive) [ 08/Aug/14 ]

Do we need to undo #9833 (the debug patch) and apply #11019?
Please advise.

Also please land #11019 and #11020.

Comment by Peter Jones [ 08/Aug/14 ]

I will leave it to Niu to comment definitively about whether the debug patch needs to be removed but my understanding is that #11019 is intended to be the fix for b2_4. The fixes for maintenance branches will be landed when a release is scheduled for those branches.

Comment by Jay Lan (Inactive) [ 08/Aug/14 ]

But wouldn't that cause your engineers who work on future patches to have a different code base from ours, with its accumulated b2_4/b2_5 patches?

Comment by Peter Jones [ 08/Aug/14 ]

Well, even without this happening it would still be quite possible that the release you baseline on lags behind the tip of the maintenance branch. If such a situation arises where there are multiple "floating" patches that affect the same area, then we can always assist in resolving the conflict. I would think that introducing patch dependencies should make this easier, though. It's probably simpler to talk about this on the next call than back and forth in a ticket.

Comment by Niu Yawei (Inactive) [ 11/Aug/14 ]

Do we need to undo #9833 (the debug patch) and apply #11019?
Please advise.

Yes, that debug patch is a little heavier than the former one (but it provides more detailed information); it wasn't meant to be carried in a production system all the time. Thanks.
