[LU-2821] osd_handler.c:2482:osd_declare_object_ref_del() ASSERTION( dt_object_exists(dt) && !dt_object_remote(dt) ) failed
Created: 15/Feb/13  Updated: 27/Feb/20  Resolved: 27/Feb/20
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6826 |
| Description |
|
Had this assertion failure in replay-single test 26:

[447277.695681] Lustre: DEBUG MARKER: == replay-single test 27: |X| open(O_CREAT), unlink two, replay, close two (test mds_cleanup_orphans) == 04:43:29 (1360921409)
[447278.101651] Turning device loop0 (0x700000) read-only
[447278.145558] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[447278.152329] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[447278.795747] Removing read-only on unknown block (0x700000)
[447290.601469] LDISKFS-fs (loop0): recovery complete
[447290.615599] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts:
[447290.655601] LustreError: 5393:0:(ldlm_resource.c:805:ldlm_resource_complain()) Namespace MGC192.168.10.211@tcp resource refcount nonzero (1) after lock cleanup; forcing cleanup.
[447290.656331] LustreError: 5393:0:(ldlm_resource.c:805:ldlm_resource_complain()) Skipped 1 previous similar message
[447290.656915] LustreError: 5393:0:(ldlm_resource.c:811:ldlm_resource_complain()) Resource: ffff88008def0e78 (111542254400876/0/0/0) (rc: 1)
[447290.657404] LustreError: 5393:0:(ldlm_resource.c:811:ldlm_resource_complain()) Skipped 1 previous similar message
[447290.657864] LustreError: 5393:0:(ldlm_resource.c:1404:ldlm_resource_dump()) --- Resource: ffff88008def0e78 (111542254400876/0/0/0) (rc: 2)
[447292.228086] Lustre: lustre-MDT0000-mdc-ffff88005bca2bf0: Connection restored to lustre-MDT0000 (at 0@lo)
[447292.228559] Lustre: Skipped 23 previous similar messages
[447292.241230] LustreError: 5456:0:(osd_handler.c:2482:osd_declare_object_ref_del()) ASSERTION( dt_object_exists(dt) && !dt_object_remote(dt) ) failed:
[447292.241794] LustreError: 5456:0:(osd_handler.c:2482:osd_declare_object_ref_del()) LBUG
[447292.242247] Pid: 5456, comm: tgt_recov
[447292.242486]
[447292.242486] Call Trace:
[447292.242862]  [<ffffffffa0cbd915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[447292.243133]  [<ffffffffa0cbdf17>] lbug_with_loc+0x47/0xb0 [libcfs]
[447292.243393]  [<ffffffffa081ee38>] osd_declare_object_ref_del+0x98/0x130 [osd_ldiskfs]
[447292.243831]  [<ffffffffa098632b>] lod_declare_ref_del+0x3b/0xd0 [lod]
[447292.244094]  [<ffffffffa075fd32>] orph_declare_index_delete+0x112/0x2e0 [mdd]
[447292.244523]  [<ffffffffa0760579>] __mdd_orphan_cleanup+0x679/0xca0 [mdd]
[447292.244828]  [<ffffffffa076f4ad>] mdd_recovery_complete+0xed/0x170 [mdd]
[447292.245135]  [<ffffffffa0894bb5>] mdt_postrecov+0x35/0xd0 [mdt]
[447292.245394]  [<ffffffffa0896638>] mdt_obd_postrecov+0x78/0x90 [mdt]
[447292.245703]  [<ffffffffa1257f90>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
[447292.246005]  [<ffffffffa1268b01>] target_recovery_thread+0xb31/0x1610 [ptlrpc]
[447292.246449]  [<ffffffffa1267fd0>] ? target_recovery_thread+0x0/0x1610 [ptlrpc]
[447292.246866]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[447292.247113]  [<ffffffffa1267fd0>] ? target_recovery_thread+0x0/0x1610 [ptlrpc]
[447292.247552]  [<ffffffffa1267fd0>] ? target_recovery_thread+0x0/0x1610 [ptlrpc]
[447292.247971]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[447292.248208]
[447292.304220] Kernel panic - not syncing: LBUG

Crashdump and modules in /exports/crashdumps/192.168.10.211-2013-02-15-04\:43\:51
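The LASSERT that fires at osd_handler.c:2482 is a precondition of the declare phase: the object whose nlink is about to be decremented must still exist and be local. Below is a minimal user-space model of that check; dt_object_exists() and dt_object_remote() are the real helper names quoted in the log, but the struct, the assert() stand-in for LASSERT(), and main() are illustrative only, not the osd_handler.c code.

```c
/* Minimal user-space model of the precondition that failed.  The helper
 * names dt_object_exists() and dt_object_remote() are taken from the log;
 * everything else here is illustrative only. */
#include <assert.h>
#include <stdbool.h>

struct dt_object_model {            /* hypothetical stand-in for dt_object */
	bool exists;                /* object still backed by an inode     */
	bool remote;                /* object lives on another target      */
};

static bool dt_object_exists(const struct dt_object_model *dt) { return dt->exists; }
static bool dt_object_remote(const struct dt_object_model *dt) { return dt->remote; }

/* Declare phase of an nlink decrement: the object must still exist and be
 * local, otherwise the assertion (an LBUG in the kernel) fires. */
static void declare_object_ref_del(const struct dt_object_model *dt)
{
	assert(dt_object_exists(dt) && !dt_object_remote(dt));
	/* ... in the real code: reserve transaction credits for the update ... */
}

int main(void)
{
	struct dt_object_model orphan = { .exists = true, .remote = false };

	declare_object_ref_del(&orphan);     /* normal path: succeeds */

	/* If another thread destroys the object first (the race discussed in
	 * the comments below), the same call aborts, mirroring the LBUG. */
	orphan.exists = false;
	declare_object_ref_del(&orphan);
	return 0;
}
```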
| Comments |
| Comment by Di Wang [ 16/Feb/13 ] |
|
According to the debug log, there is clearly a race in orphan cleanup between mdd_close() and mdd_recovery_complete():

00000004:00080000:3.0:1360921424.522237:0:5430:0:(mdd_object.c:1966:mdd_close()) Object [0x280001b71:0x13:0x0] is deleted from orphan list, OSS objects to be destroyed.
00000040:00100000:3.0:1360921424.522262:0:5430:0:(llog_osd.c:465:llog_osd_write_rec()) added record 0x53: idx: 2, 64
00000100:00100000:3.0:1360921424.522362:0:5430:0:(service.c:2020:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg00_001:8af42fe9-8c1e-88fd-452d-5295dd97c2d5+7:5216:x1427028696914976:12345-0@lo:35 Request procesed in 2989us (3023us total) trans 158913789954 rc 0/0
00000100:00100000:3.0:1360921424.522366:0:5430:0:(nrs_fifo.c:245:nrs_fifo_req_stop()) NRS stop fifo request from 12345-0@lo, seq: 0
00000100:00100000:3.0:1360921424.522402:0:5216:0:(client.c:1803:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc multiop:8af42fe9-8c1e-88fd-452d-5295dd97c2d5:5216:1427028696914976:0@lo:35
00000080:00200000:0.0:1360921424.522740:0:5219:0:(llite_lib.c:1416:ll_setattr_raw()) lustre: setattr inode ffff88008e1b3b08/fid:[0x280001b71:0x14:0x0] from 0 to 18446612134838689624, valid 41
00000004:00080000:2.0:1360921424.522756:0:5456:0:(mdd_orphans.c:394:orph_key_test_and_del()) Found orphan [0x280001b71:0x13:0x0], delete it
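In other words, the existence check made when the orphan index entry is found and the later declaration of the ref_del are not covered by one critical section, so the close path can destroy the object in between. A small pthread sketch of that check-then-act window follows; all names are hypothetical, not the mdd code, only the shape of the race.

```c
/* Check-then-act sketch of the race described above: the "close" thread
 * destroys the orphan while the "cleanup" thread has already found it but
 * has not yet declared the ref_del.  Build with: cc -pthread race_sketch.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct orphan_entry {
	bool exists;
	pthread_mutex_t lock;
};

static struct orphan_entry orphan = {
	.exists = true,
	.lock   = PTHREAD_MUTEX_INITIALIZER,
};

/* mdd_close() analogue: last close of an unlinked file destroys the object. */
static void *close_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&orphan.lock);
	orphan.exists = false;
	pthread_mutex_unlock(&orphan.lock);
	return NULL;
}

/* __mdd_orphan_cleanup() analogue: the lookup and the later declare step
 * are two separate critical sections, so the close thread can run between
 * them and invalidate the "found" result. */
static void *cleanup_thread(void *arg)
{
	(void)arg;

	pthread_mutex_lock(&orphan.lock);
	bool found = orphan.exists;            /* "Found orphan ..., delete it" */
	pthread_mutex_unlock(&orphan.lock);

	usleep(1000);                          /* widen the race window */

	pthread_mutex_lock(&orphan.lock);
	if (found && !orphan.exists)
		printf("declaring ref_del on a destroyed object -> LASSERT would fire\n");
	pthread_mutex_unlock(&orphan.lock);
	return NULL;
}

int main(void)
{
	pthread_t closer, cleaner;

	pthread_create(&cleaner, NULL, cleanup_thread, NULL);
	pthread_create(&closer, NULL, close_thread, NULL);
	pthread_join(closer, NULL);
	pthread_join(cleaner, NULL);
	return 0;
}
```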
| Comment by Mikhail Pershin [ 23/Feb/13 ] |
|
Looks like another assertion that may no longer be correct: the locking is now held around the execution phase only, not around the whole transaction.
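If that reading is right, the cleanup path either needs the lock held across both declare and execute again, or it needs to tolerate the object vanishing, e.g. by re-checking existence and skipping the orphan instead of asserting. A tiny sketch of the second option, with hypothetical names only (not the actual mdd/osd change):

```c
/* Sketch of the "tolerate the race" option: re-check existence right before
 * declaring the ref_del and skip the orphan instead of asserting. */
#include <stdbool.h>
#include <stdio.h>

struct obj {
	bool exists;
};

/* Returns 0 if the decrement was declared, -2 (an -ENOENT-style result) if
 * the object already vanished and the orphan entry can simply be skipped. */
static int declare_ref_del_tolerant(const struct obj *o)
{
	if (!o->exists) {
		printf("orphan already destroyed by the close path, skipping\n");
		return -2;
	}
	printf("declaring nlink decrement\n");
	return 0;
}

int main(void)
{
	struct obj live = { .exists = true };
	struct obj gone = { .exists = false };

	declare_ref_del_tolerant(&live);   /* normal orphan cleanup     */
	declare_ref_del_tolerant(&gone);   /* raced with close: no LBUG */
	return 0;
}
```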
| Comment by Oleg Drokin [ 07/Oct/13 ] |
|
I just hit this once more. |
| Comment by Stephen Champion [ 08/Jan/14 ] |
|
I have hit what appears to match this bug while running acceptance on my modified 2.4.1 with a rebuild of e2fsprogs-1.42.7.wc2. The MDS trips while running large-scale 3a.

Build Version: sgi241a18.rhel6---CHANGED-2.6.32-358.23.2.el6sgi241a18.rhel6.x86_64.lustre

Call Trace:

This failure only occurs following a failure in the lfsck tests. If lfsck tests do not run,
I/O read: 8MB, write: 0MB, rate: 21.07MB/s

I have a pile of the resulting cores and logs, and can make them available.
| Comment by Andreas Dilger [ 27/Feb/20 ] |
|
Close old bug that hasn't been seen in a long time. |