[LU-2821] osd_handler.c:2482:osd_declare_object_ref_del()) ASSERTION( dt_object_exists(dt) && !dt_object_remote(dt) ) failed Created: 15/Feb/13  Updated: 27/Feb/20  Resolved: 27/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5565 (osd_handler.c:1959:osd_attr_set()) A... Resolved
Severity: 3
Rank (Obsolete): 6826

 Description   

Had this assertion failure in replay-single test 27:

[447277.695681] Lustre: DEBUG MARKER: == replay-single test 27: |X| open(O_CREAT), unlink two, replay, close two (test mds_cleanup_orphans) == 04:43:29 (1360921409)
[447278.101651] Turning device loop0 (0x700000) read-only
[447278.145558] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[447278.152329] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[447278.795747] Removing read-only on unknown block (0x700000)
[447290.601469] LDISKFS-fs (loop0): recovery complete
[447290.615599] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
[447290.655601] LustreError: 5393:0:(ldlm_resource.c:805:ldlm_resource_complain()) Namespace MGC192.168.10.211@tcp resource refcount nonzero (1) after lock cleanup; forcing cleanup.
[447290.656331] LustreError: 5393:0:(ldlm_resource.c:805:ldlm_resource_complain()) Skipped 1 previous similar message
[447290.656915] LustreError: 5393:0:(ldlm_resource.c:811:ldlm_resource_complain()) Resource: ffff88008def0e78 (111542254400876/0/0/0) (rc: 1)
[447290.657404] LustreError: 5393:0:(ldlm_resource.c:811:ldlm_resource_complain()) Skipped 1 previous similar message
[447290.657864] LustreError: 5393:0:(ldlm_resource.c:1404:ldlm_resource_dump()) --- Resource: ffff88008def0e78 (111542254400876/0/0/0) (rc: 2)
[447292.228086] Lustre: lustre-MDT0000-mdc-ffff88005bca2bf0: Connection restored to lustre-MDT0000 (at 0@lo)
[447292.228559] Lustre: Skipped 23 previous similar messages
[447292.241230] LustreError: 5456:0:(osd_handler.c:2482:osd_declare_object_ref_del()) ASSERTION( dt_object_exists(dt) && !dt_object_remote(dt) ) failed: 
[447292.241794] LustreError: 5456:0:(osd_handler.c:2482:osd_declare_object_ref_del()) LBUG
[447292.242247] Pid: 5456, comm: tgt_recov
[447292.242486] 
[447292.242486] Call Trace:
[447292.242862]  [<ffffffffa0cbd915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[447292.243133]  [<ffffffffa0cbdf17>] lbug_with_loc+0x47/0xb0 [libcfs]
[447292.243393]  [<ffffffffa081ee38>] osd_declare_object_ref_del+0x98/0x130 [osd_ldiskfs]
[447292.243831]  [<ffffffffa098632b>] lod_declare_ref_del+0x3b/0xd0 [lod]
[447292.244094]  [<ffffffffa075fd32>] orph_declare_index_delete+0x112/0x2e0 [mdd]
[447292.244523]  [<ffffffffa0760579>] __mdd_orphan_cleanup+0x679/0xca0 [mdd]
[447292.244828]  [<ffffffffa076f4ad>] mdd_recovery_complete+0xed/0x170 [mdd]
[447292.245135]  [<ffffffffa0894bb5>] mdt_postrecov+0x35/0xd0 [mdt]
[447292.245394]  [<ffffffffa0896638>] mdt_obd_postrecov+0x78/0x90 [mdt]
[447292.245703]  [<ffffffffa1257f90>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
[447292.246005]  [<ffffffffa1268b01>] target_recovery_thread+0xb31/0x1610 [ptlrpc]
[447292.246449]  [<ffffffffa1267fd0>] ? target_recovery_thread+0x0/0x1610 [ptlrpc]
[447292.246866]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[447292.247113]  [<ffffffffa1267fd0>] ? target_recovery_thread+0x0/0x1610 [ptlrpc]
[447292.247552]  [<ffffffffa1267fd0>] ? target_recovery_thread+0x0/0x1610 [ptlrpc]
[447292.247971]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[447292.248208] 
[447292.304220] Kernel panic - not syncing: LBUG

Crashdump and modules in /exports/crashdumps/192.168.10.211-2013-02-15-04\:43\:51
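
For reference, the failed LASSERT is a precondition check at the top of osd_declare_object_ref_del(): the object whose link count is about to be declared for decrement must still exist locally and must not be a remote (cross-MDT) object. The standalone sketch below models just that check; the struct layout and helper bodies are simplified stand-ins for illustration, not the actual osd_handler.c code.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for a dt_object: only the two properties the
 * failed assertion inspects are modelled here. */
struct dt_object {
        bool exists;   /* object has a backing inode on this target */
        bool remote;   /* object actually lives on another MDT */
};

static bool dt_object_exists(const struct dt_object *dt) { return dt->exists; }
static bool dt_object_remote(const struct dt_object *dt) { return dt->remote; }

/* Declaring an nlink decrement only makes sense for an object that
 * still exists on this target; a false condition means the caller
 * raced with an unlink/destroy, which is what the LBUG reports. */
static void declare_ref_del(const struct dt_object *dt)
{
        assert(dt_object_exists(dt) && !dt_object_remote(dt));
        /* ... credit reservation for the nlink update would follow ... */
}

int main(void)
{
        struct dt_object ok = { .exists = true, .remote = false };

        declare_ref_del(&ok);          /* passes */
        printf("precondition held\n");
        /* In the crash above the orphan object had already been
         * destroyed, i.e. exists == false, so the check fired. */
        return 0;
}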



 Comments   
Comment by Di Wang [ 16/Feb/13 ]

According to the debug log, there is clearly a race in orphan cleanup between mdd_close() and mdd_recovery_complete():

00000004:00080000:3.0:1360921424.522237:0:5430:0:(mdd_object.c:1966:mdd_close()) Object [0x280001b71:0x13:0x0] is deleted from orphan list, OSS objects to be destroyed.
00000040:00100000:3.0:1360921424.522262:0:5430:0:(llog_osd.c:465:llog_osd_write_rec()) added record 0x53: idx: 2, 64
00000100:00100000:3.0:1360921424.522362:0:5430:0:(service.c:2020:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg00_001:8af42fe9-8c1e-88fd-452d-5295dd97c2d5+7:5216:x1427028696914976:12345-0@lo:35 Request procesed in 2989us (3023us total) trans 158913789954 rc 0/0
00000100:00100000:3.0:1360921424.522366:0:5430:0:(nrs_fifo.c:245:nrs_fifo_req_stop()) NRS stop fifo request from 12345-0@lo, seq: 0
00000100:00100000:3.0:1360921424.522402:0:5216:0:(client.c:1803:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc multiop:8af42fe9-8c1e-88fd-452d-5295dd97c2d5:5216:1427028696914976:0@lo:35
00000080:00200000:0.0:1360921424.522740:0:5219:0:(llite_lib.c:1416:ll_setattr_raw()) lustre: setattr inode ffff88008e1b3b08/fid:[0x280001b71:0x14:0x0] from 0 to 18446612134838689624, valid 41
00000004:00080000:2.0:1360921424.522756:0:5456:0:(mdd_orphans.c:394:orph_key_test_and_del()) Found orphan [0x280001b71:0x13:0x0], delete it
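
To make that interleaving concrete, here is a minimal user-space model of the race (all names here, such as struct model and the two *_path() functions, are invented for illustration; this is not Lustre code). The recovery thread scans the orphan index, the last close on the unlinked file removes the orphan entry and destroys the object, and the cleanup thread then acts on its stale view and declares a ref_del against an object that no longer exists, which is where the assertion fires.

#include <stdbool.h>
#include <stdio.h>

/* Invented, simplified model of the state both paths touch:
 * an orphan-index (PENDING) entry and the object it points at. */
struct model {
        bool orphan_entry;   /* entry still present in the orphan index */
        bool object_exists;  /* backing object still allocated */
};

/* Path A: last mdd_close() of an already-unlinked file removes the
 * orphan entry and destroys the object. */
static void close_path(struct model *m)
{
        m->orphan_entry = false;
        m->object_exists = false;
        printf("close path: orphan removed, object destroyed\n");
}

/* Path B: recovery-time orphan cleanup acts on the index entry it saw
 * when it started scanning, which may be stale by now. */
static void recovery_cleanup_path(const struct model *m, bool saw_entry)
{
        if (!saw_entry)
                return;
        /* The crashed code declared the nlink drop unconditionally; with
         * the interleaving in main() the object is already gone, which is
         * exactly the LASSERT in osd_declare_object_ref_del(). */
        printf("cleanup path: declare ref_del, object_exists=%d\n",
               m->object_exists);
}

int main(void)
{
        struct model m = { .orphan_entry = true, .object_exists = true };
        bool stale_view = m.orphan_entry;      /* cleanup scans the index */

        close_path(&m);                        /* client close races in   */
        recovery_cleanup_path(&m, stale_view); /* acts on the stale view  */
        return 0;
}
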
Comment by Mikhail Pershin [ 23/Feb/13 ]

Looks like another assertion that may no longer be correct, because the locking is now only around the execution phase rather than around the whole transaction.
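
A rough sketch of the two-phase pattern the comment refers to (declare the operation, then execute it inside the transaction), with the lock taken only around the execution phase. All names here are illustrative, not Lustre code: if the object lock does not also cover the declare step, another thread can destroy the object in the window between the two phases, so asserting existence at declare time becomes racy.

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Invented miniature of the declare/execute split used by the OSD
 * transaction API; only the locking window is of interest here. */
struct obj {
        pthread_mutex_t lock;
        bool exists;
};

/* Phase 1: declare - reserve credits for the nlink update.  The failed
 * assertion lives in this phase.  The object lock is NOT held here. */
static int declare_ref_del(struct obj *o)
{
        return o->exists ? 0 : -ENOENT;
}

/* Phase 2: execute - the lock covers only this part, so a concurrent
 * destroy can slip in between declare_ref_del() and execute_ref_del(). */
static void execute_ref_del(struct obj *o)
{
        pthread_mutex_lock(&o->lock);
        /* ... the actual nlink decrement would happen here ... */
        pthread_mutex_unlock(&o->lock);
}

int main(void)
{
        struct obj o = { .exists = true };

        pthread_mutex_init(&o.lock, NULL);
        if (declare_ref_del(&o) == 0)
                execute_ref_del(&o);
        pthread_mutex_destroy(&o.lock);
        return 0;
}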

Comment by Oleg Drokin [ 07/Oct/13 ]

I just hit this once more.
So, how about having some patches for the locking here?

Comment by Stephen Champion [ 08/Jan/14 ]

I have hit what appears to be this bug while running acceptance testing on my modified 2.4.1 with a rebuild of e2fsprogs-1.42.7.wc2. The MDS trips the assertion while running large-scale test 3a.

Build Version: sgi241a18.rhel6---CHANGED-2.6.32-358.23.2.el6sgi241a18.rhel6.x86_64.lustre
LNet: Added LNI 192.168.21.13@o2ib1 [8/256/0/180]
LDISKFS-fs (sdc): recovery complete
LDISKFS-fs (sdc): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: accfs1-MDT0000: used disk, loading
LustreError: 11-0: accfs1-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: accfs1-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
LustreError: 11488:0:(mdt_open.c:1497:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff880322817000 x1456290811566604/t0(8589934684) o101->83ac397d-ea05-8390-b7e2-d950b803713b@192.168.21.6@o2ib1:0/0 lens 568/1136 e 0 to 0 dl 1388882068 ref 1 fl Interpret:/4/0 rc 0/0
LustreError: 11488:0:(mdt_open.c:1497:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff880320d09800 x1456290811566608/t0(8589934685) o101->83ac397d-ea05-8390-b7e2-d950b803713b@192.168.21.6@o2ib1:0/0 lens 568/1136 e 0 to 0 dl 1388882068 ref 1 fl Interpret:/4/0 rc 0/0
Lustre: accfs1-MDT0000: Recovery over after 0:02, of 2 clients 2 recovered and 0 were evicted.
LustreError: 11540:0:(osd_handler.c:2609:osd_declare_object_ref_del()) ASSERTION( dt_object_exists(dt) && !dt_object_remote(dt) ) failed:
LustreError: 11540:0:(osd_handler.c:2609:osd_declare_object_ref_del()) LBUG
Pid: 11540, comm: tgt_recov

Call Trace:
[<ffffffffa07a9895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa07a9e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa1065289>] osd_declare_object_ref_del+0x99/0x130 [osd_ldiskfs]
[<ffffffffa121f06b>] lod_declare_ref_del+0x3b/0xd0 [lod]
[<ffffffffa0f096e2>] orph_declare_index_delete+0x112/0x2e0 [mdd]
[<ffffffffa0f09f29>] __mdd_orphan_cleanup+0x679/0xca0 [mdd]
[<ffffffffa0f1937d>] mdd_recovery_complete+0xed/0x170 [mdd]
[<ffffffffa114acb5>] mdt_postrecov+0x35/0xd0 [mdt]
[<ffffffffa114c178>] mdt_obd_postrecov+0x78/0x90 [mdt]
[<ffffffffa0a49530>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
[<ffffffffa0a448ee>] ? ldlm_reprocess_all_ns+0x3e/0x110 [ptlrpc]
[<ffffffffa0a59c54>] target_recovery_thread+0xc74/0x1970 [ptlrpc]
[<ffffffffa0a58fe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20

This failure only occurs following a failure in the lfsck tests; if the lfsck tests do not run (and fail), this test passes. performance-sanity may run successfully between the tests with no impact on the crash. The lfsck tests conclude with:

I/O read: 8MB, write: 0MB, rate: 21.07MB/s
lfsck : @@@@@@ FAIL: e2fsck -d -v -t -t -f -n --mdsdb /data1/cluster-tools/out/1041154/20140107.125659/mdsdb /dev/sdc returned 4, should be <= 1

I have a pile of the resulting cores and logs, and can make them available.

Comment by Andreas Dilger [ 27/Feb/20 ]

Close old bug that hasn't been seen in a long time.
