[LU-13195] replay-single test_118: dt_declare_record_write() ASSERTION( dt->do_body_ops ) failed Created: 04/Feb/20  Updated: 06/Jan/23  Resolved: 09/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.15.0, Lustre 2.12.10

Type: Bug Priority: Major
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-10143 LBUG dt_object.h:2166:dt_declare_reco... Resolved
is duplicated by LU-13295 replay-single test_118: ASSERTION( dt... Resolved
Related
is related to LU-13411 processing of update logs couldn't de... Resolved
is related to LU-15769 llog_osd_next_block()): invalid llog ... In Progress
is related to LU-13469 MDS hung during mount Resolved
is related to LU-15139 sanity test_160h: dt_record_write() A... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for liuying <emoly.liu@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/ca28353a-46ca-11ea-91a9-52540065bddc

test_118 failed with the following error:

== replay-single test 118: invalidate osp update will not cause update log corruption ================ 17:21:23 (1580750483)
CMD: trevis-19vm4 lctl set_param fail_loc=0x1705

and the following stack trace on the console:

[10146.524712] LustreError: 19994:0:(dt_object.h:2191:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed: 
[10146.525745] LustreError: 19994:0:(dt_object.h:2191:dt_declare_record_write()) LBUG
[10146.526542] Pid: 19994, comm: mdt_out00_000 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Sat Jan 18 23:01:59 UTC 2020
[10146.527632] Call Trace:
[10146.527905]  [<ffffffffc0c348ac>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[10146.528605]  [<ffffffffc0c3495c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[10146.529291]  [<ffffffffc10f65eb>] out_write_add_exec+0x13b/0x1b0 [ptlrpc]
[10146.530275]  [<ffffffffc10eed43>] out_write+0x333/0x370 [ptlrpc]
[10146.530971]  [<ffffffffc10f1086>] out_handle+0x1566/0x1bb0 [ptlrpc]
[10146.531652]  [<ffffffffc10e7eca>] tgt_request_handle+0x95a/0x1610 [ptlrpc]
[10146.532417]  [<ffffffffc108b816>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[10146.533234]  [<ffffffffc108f8a4>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]

This issue happened several times in Maloo testing but no further logs were collected.
https://testing.whamcloud.com/sub_tests/121c5288-447b-11ea-bffa-52540065bddc
https://testing.whamcloud.com/sub_tests/7cefb86c-4362-11ea-86b2-52540065bddc

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-single test_118 - trevis-19vm4 crashed during replay-single test_118



Comments
Comment by Andreas Dilger [ 07/Feb/20 ]

This looks like it is the same as LU-10143.

Comment by Gu Zheng (Inactive) [ 23/Mar/20 ]

Another instance:
https://testing.whamcloud.com/test_sets/2ee75f70-b67b-447d-ba2a-db2f67c0bd0f

Comment by Alex Zhuravlev [ 09/Apr/20 ]

looking at the logs from sanity with the same symptom, it looks like sometimes an llog cookie is shared by two transactions:

1586485804.112492:0:13975:0:(update_trans.c:75:top_multiple_thandle_dump()) lustre-MDT0001-osd tmt 000000004e86531f refcount 1 committed 1 result -5 batchid 17179870236
1586485804.112493:0:13975:0:(update_trans.c:85:top_multiple_thandle_dump()) st 000000001da13291 obd lustre-MDT0000-osp-MDT0001 committed 1 started 1 stopped 1 result -5 sub_th 000000003da75353
1586485804.112494:0:13975:0:(update_trans.c:92:top_multiple_thandle_dump())  cookie [0x2:0x2341:0x2].873
1586485804.112495:0:13975:0:(update_trans.c:85:top_multiple_thandle_dump()) st 000000000cc313ae obd lustre-MDT0001-osd committed 1 started 1 stopped 1 result 0 sub_th 000000000764ad4c
1586485804.112497:0:13975:0:(update_trans.c:92:top_multiple_thandle_dump())  cookie [0x2:0x40002340:0x2].921
1586485804.112497:0:13975:0:(update_trans.c:92:top_multiple_thandle_dump())  cookie [0x2:0x40002340:0x2].921

...
1586485804.112723:0:13975:0:(update_trans.c:75:top_multiple_thandle_dump()) lustre-MDT0001-osd tmt 0000000046070051 refcount 1 committed 1 result 0 batchid 17179870237
1586485804.112725:0:13975:0:(update_trans.c:85:top_multiple_thandle_dump()) st 00000000a7f7ee4f obd lustre-MDT0000-osp-MDT0001 committed 1 started 1 stopped 1 result 0 sub_th 00000000222fbaf6
1586485804.112726:0:13975:0:(update_trans.c:92:top_multiple_thandle_dump())  cookie [0x2:0x2341:0x2].873
1586485804.112727:0:13975:0:(update_trans.c:85:top_multiple_thandle_dump()) st 00000000adeff453 obd lustre-MDT0001-osd committed 1 started 1 stopped 1 result 0 sub_th 0000000090f0b368
1586485804.112728:0:13975:0:(update_trans.c:92:top_multiple_thandle_dump())  cookie [0x2:0x40002340:0x2].922

notice that cookie [0x2:0x2341:0x2].873 appears in both transaction dumps.

the consequence is that cancellation of the first record (873) destroys the corresponding llog file, which is not supposed to be referenced any more, but then the recovery process finds the duplicate cookie and tries to repeat the cancellation/destroy.
while it's possible to handle this (by checking whether the llog still exists - see the sketch below), I think it's important to understand the root cause of the duplicates.
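
a minimal sketch of that workaround (check whether the llog still exists before repeating the cancel/destroy); the names are invented and this is not the actual Lustre code path:

#include <errno.h>
#include <stdbool.h>

/* toy stand-in for a plain llog referenced by a catalog cookie */
struct toy_llog {
	bool destroyed;        /* set once the llog file has been unlinked */
	int  live_records;     /* records still present in the bitmap */
};

/* cancel one record; destroy the llog when the last record goes away.
 * a duplicate cookie returns -ENOENT instead of hitting an assertion. */
static int toy_cancel_record(struct toy_llog *log)
{
	if (log->destroyed)
		return -ENOENT;        /* already gone: skip the second destroy */

	if (--log->live_records == 0)
		log->destroyed = true; /* last record cancelled -> destroy the file */

	return 0;
}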

Comment by Alex Zhuravlev [ 10/Apr/20 ]

in another case it was the same transaction being cancelled twice in two different threads:

.121703:0:5542:0:(update_trans.c:75:top_multiple_thandle_dump()) lustre-MDT0001-osd tmt 00000000cb6dafd8 refcount 2 committed 1 result 0 batchid 17179870235
.121705:0:5542:0:(update_trans.c:85:top_multiple_thandle_dump()) st 0000000084a8ca02 obd lustre-MDT0000-osp-MDT0001 committed 1 started 0 stopped 0 result 0 sub_th           (null)
.121706:0:5542:0:(update_trans.c:92:top_multiple_thandle_dump())  cookie [0x2:0x2341:0x2].871
.121707:0:5542:0:(update_trans.c:85:top_multiple_thandle_dump()) st 00000000a1d7d236 obd lustre-MDT0001-osd committed 1 started 0 stopped 0 result 0 sub_th           (null)
.121708:0:5542:0:(update_trans.c:92:top_multiple_thandle_dump())  cookie [0x2:0x40002340:0x2].902

.121919:0:5539:0:(update_trans.c:75:top_multiple_thandle_dump()) lustre-MDT0001-osd tmt 00000000cb6dafd8 refcount 1 committed 1 result 0 batchid 17179870235
.121920:0:5539:0:(update_trans.c:85:top_multiple_thandle_dump()) st 0000000084a8ca02 obd lustre-MDT0000-osp-MDT0001 committed 1 started 0 stopped 0 result 0 sub_th           (null)
.121922:0:5539:0:(update_trans.c:92:top_multiple_thandle_dump())  cookie [0x2:0x2341:0x2].871
.121923:0:5539:0:(update_trans.c:85:top_multiple_thandle_dump()) st 00000000a1d7d236 obd lustre-MDT0001-osd committed 1 started 0 stopped 0 result 0 sub_th           (null)
.121924:0:5539:0:(update_trans.c:92:top_multiple_thandle_dump())  cookie [0x2:0x40002340:0x2].902

dist_txn-1 D 0 5539 2 0x80000000
Call Trace:
? __schedule+0x2ad/0xb00
schedule+0x34/0x80
lbug_with_loc+0x79/0x80 [libcfs]
? osp_md_write+0x4c3/0x540 [osp]
? dt_record_write+0x2a/0x110 [obdclass]
? llog_osd_write_rec+0x717/0x1dd0 [obdclass]
? llog_write_rec+0x36a/0x4e0 [obdclass]
? llog_cancel_arr_rec+0x746/0xb20 [obdclass]
? llog_cat_cancel_arr_rec+0x143/0x400 [obdclass]
? llog_cat_cancel_records+0x4f/0x1a0 [obdclass]
? distribute_txn_commit_thread+0x580/0x11d0 [ptlrpc]
? __kthread_parkme+0x39/0x80

00000100:00100000:1.0:1586499582.124714:0:5542:0:(client.c:1697:ptlrpc_send_new_req()) Sending RPC req@0000000053ab4ec6 pname:cluuid:pid:xid:nid:opc:job tgt_recover_1:lustre-MDT0001-mdtlov_UUID:5542:1663564960200128:192.168.122.97@tcp:1000:tgt_recover_1.0

Comment by Andreas Dilger [ 10/Apr/20 ]

Is the mechanism for generating this number properly locked? Are you sure that it is being printed consistently?

Comment by Alex Zhuravlev [ 12/Apr/20 ]

sorry, that was a wrong understanding. I've got a couple more logs and it's actually not a duplicated id, but rather correct (in some sense) logic:
the llog is destroyed because its last record has just been cancelled. normally this doesn't happen because that log is the current one, but after recovery it's not current anymore as we usually create a new llog (iirc).
there is no current llog during recovery as we don't generate new llog records (though we can write records from the update log), so the current llog is NULL at recovery; this can be used to skip the llog destroy, but then the tricky thing is when to destroy those logs. will try to catch this..
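
a hedged sketch of the "skip destroy while there is no current llog" idea, with invented names (this is not the real llog_cat code):

#include <stdbool.h>

struct toy_plain_llog {
	int  live_records;
	bool destroy_deferred;   /* marked instead of destroyed during recovery */
};

/* 'in_recovery' models "the current catalog llog is NULL" */
static void toy_on_cancel(struct toy_plain_llog *log, bool in_recovery)
{
	if (--log->live_records > 0)
		return;

	if (in_recovery)
		/* postpone: replayed updates may still refer to this llog;
		 * the open question above is when to actually destroy it */
		log->destroy_deferred = true;
	/* else: the normal path would destroy the now-empty plain llog here */
}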

Comment by Alex Zhuravlev [ 13/Apr/20 ]

so far the change (postponing llog destroy) has been running for many hours and I saw this condition happening sometimes.
trying to construct a scripted test for the case..

Comment by Alex Zhuravlev [ 17/Apr/20 ]

one identified issue is a race between llog append/cancel and invalidation. basically any change to an llog involves at least two changes: llh_count and the bitmap (plus a new record in the case of append). to save on RPC size we modify llh_count and the bitmap using two separate dt_write() calls. sometimes invalidation comes in between these two writes and one of them doesn't get into the transaction, but the transaction itself proceeds with just the first write.
I think in this case the llog should invalidate its internal state (bitmap, counters) and re-initialize it (in this call or in the next one); the transaction then must be marked properly and aborted as a whole. working on a patch..
this is not the only root cause for the issue, AFAICS. there is another one (or a few) leading to similar symptoms.
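
a minimal sketch of that race and of aborting the transaction as a whole when the second write is lost; everything below is illustrative (toy names), not the llog_osd code:

#include <errno.h>
#include <stdbool.h>

struct toy_llog_hdr {
	bool          valid;   /* cleared by invalidation */
	int           count;   /* stands in for llh_count */
	unsigned long bitmap;  /* stands in for the record bitmap */
};

/* each helper models one buffered dt_write(); it fails if the cached
 * header was invalidated before the write could be queued */
static int toy_write_count(struct toy_llog_hdr *h, int count)
{
	if (!h->valid)
		return -ESTALE;
	h->count = count;
	return 0;
}

static int toy_write_bitmap(struct toy_llog_hdr *h, unsigned long bitmap)
{
	if (!h->valid)
		return -ESTALE;
	h->bitmap = bitmap;
	return 0;
}

/* append record 'idx': both writes must land or the transaction is aborted,
 * so a half-updated header (count bumped, bit not set) never reaches disk */
static int toy_append(struct toy_llog_hdr *h, int idx, bool *abort_tx)
{
	int rc = toy_write_count(h, h->count + 1);

	if (rc == 0)
		rc = toy_write_bitmap(h, h->bitmap | (1UL << idx));
	if (rc != 0)
		*abort_tx = true;  /* mark and abort the transaction as a whole */
	return rc;
}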

Comment by Alex Zhuravlev [ 20/Apr/20 ]

can't reproduce reliably, but this is my reconstruction: for some reason OSP invalidates a failed request and all subsequent requests in the sending queue (osp_send_update_thread() -> osp_invalidate_request()), which basically should invalidate all cached internal state like the llog's bitmaps.
all new transactions will notice that and refresh the llog's data, but all already-declared transactions will still proceed. this is quite a short window given that everything is running in cache, all locks have been acquired already, etc., but it is still possible.
then depending on luck the llog can be corrupted in different ways (in the majority of cases llh_count ends up less than the actual number of records, leading to an early llog destroy).
I'm trying to construct a test, but I'm not sure how to bind the two events: request invalidation and resuming a suspended-after-declaration transaction.
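
a toy model of that window (pure assumption, nothing here is Lustre code), showing how a transaction declared before the invalidation can leave llh_count below the real number of records:

#include <stdio.h>

int main(void)
{
	int records = 10, llh_count = 10;

	/* transaction A is declared and snapshots the header */
	int a_count = llh_count;

	/* invalidation: new transactions re-read the header and append */
	records += 2;
	llh_count += 2;                 /* 12 records, llh_count = 12 */

	/* A was declared before the invalidation and still proceeds */
	records += 1;
	llh_count = a_count + 1;        /* blindly writes 11 */

	/* 13 records exist but the header claims 11: cancels will drive
	 * llh_count to zero early and the llog gets destroyed too soon */
	printf("records=%d llh_count=%d\n", records, llh_count);
	return 0;
}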

Comment by Alex Zhuravlev [ 26/Apr/20 ]

a few different issues have been identified so far:

  • llog_cat_refresh() causes duplicated llog ids - if some write failed, then to reinitialize the llog state the master MDS re-reads the remote llog and gets an old lgh_last_idx, which has already been returned to another transaction
  • object invalidation doesn't work in at least one case - osp_md_write() doesn't put the object on the invalidation list for a subsequent transaction, so a race is possible which in turn results in a lost -ESTALE, and then lgh_count won't match the actual number of records (see the sketch after this list)
  • partial writes - I've seen this in the logs, but I'm not sure about the exact sequence leading to this
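
a very rough illustration of the second item above - a write completion callback that marks the cached object stale on error, so it doesn't depend on the invalidation list; names are invented and this is not the osp_md_write() interpreter:

#include <stdbool.h>

struct toy_osp_object {
	bool stale;    /* forces a re-read of cached llog state when set */
};

/* completion callback attached when the remote write is queued; a failed
 * write always marks the object stale, independent of any other list */
static void toy_write_interpret(struct toy_osp_object *obj, int rc)
{
	if (rc != 0)
		obj->stale = true;
}

/* a later transaction must refresh the object before trusting its state */
static bool toy_object_needs_refresh(const struct toy_osp_object *obj)
{
	return obj->stale;
}
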
Comment by Gerrit Updater [ 27/Apr/20 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38385
Subject: LU-13195 osp: track destroyed OSP object
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 52bc4e0824a185e05551a3d95bb721c61baa6b40

Comment by Alexander Boyko [ 27/Apr/20 ]

Alex, I have a draft fix for invalidation. I'm pushing it to gerrit today.

Comment by Alex Zhuravlev [ 27/Apr/20 ]

Alex, I have a draft fix for invalidation. I'm pushing it to gerrit today.

for xattr_set or for more ops? the write path misses invalidation as well; I did that by adding an interpreter for osp_md_write() which doesn't depend on the invalidation list.

Comment by Alexander Boyko [ 27/Apr/20 ]

Fix for any operation.

Comment by Gerrit Updater [ 27/Apr/20 ]

Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/38386
Subject: LU-13195 osp: invalidate objects for error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1107a36f44fd2d8c195c9bee5ad3db565efe766e

Comment by Gerrit Updater [ 27/Apr/20 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38387
Subject: LU-13195 osp: invalidate object on write error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 40eb50739b4f836cc6161e0efc866060cb267ac8

Comment by Alexander Boyko [ 27/Apr/20 ]

Alex, I'm using the following debug patch to catch llog corruption:

diff --git a/lustre/target/out_lib.c b/lustre/target/out_lib.c
index e8bdfd1..9eac332 100644
--- a/lustre/target/out_lib.c
+++ b/lustre/target/out_lib.c
@@ -697,6 +697,20 @@ static int out_tx_write_exec(const struct lu_env *env, struct thandle *th,
        if (OBD_FAIL_CHECK(OBD_FAIL_OUT_ENOSPC)) {
                rc = -ENOSPC;
        } else {
+               struct lu_attr la;
+               unsigned int *buf = (unsigned int*) arg->u.write.buf.lb_buf;
+               dt_attr_get(env, dt_obj, &la);
+
+               if (arg->u.write.buf.lb_len > LLOG_MIN_REC_SIZE &&
+                   __le32_to_cpu(buf[2]) == UPDATE_REC &&
+                   arg->u.write.pos > la.la_size) {
+               CDEBUG(D_HA, "write with sparse "DFID" pos %llu buf %p, len %lu, size %llu\n",
+               PFID(lu_object_fid(&dt_obj->do_lu)), arg->u.write.pos,
+               arg->u.write.buf.lb_buf, (unsigned long)arg->u.write.buf.lb_len,
+               la.la_size);
+
+                       LBUG();
+               }
                dt_write_lock(env, dt_obj, MOR_TGT_CHILD);
                rc = dt_record_write(env, dt_obj, &arg->u.write.buf,
                                     &arg->u.write.pos, th);
Comment by Alex Zhuravlev [ 27/Apr/20 ]

thanks for sharing. I used this one (to catch it on the remote side):

diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
index 8d8d8b4d7e..318e5aa4dc 100644
--- a/lustre/osd-ldiskfs/osd_handler.c
+++ b/lustre/osd-ldiskfs/osd_handler.c
@@ -1788,6 +1788,8 @@ static struct thandle *osd_trans_create(const struct lu_env *env,
 
 	oti->oti_ins_cache_depth++;
 
+	oh->ot_bh = NULL;
+
 	RETURN(th);
 }
 
@@ -1967,6 +1969,32 @@ static void osd_trans_stop_cb(struct osd_thandle *oth, int result)
 	}
 }
 
+void osd_llog_hdr_check(const struct lu_fid *fid, void *buf)
+{
+	struct llog_log_hdr *llh = buf;
+	int found = 0;
+	int i = 0;
+	int max;
+	int last = -1, first = -1;
+
+	max = 128 * 8;
+	while (i < max) {
+		i = find_next_bit((unsigned  long *)LLOG_HDR_BITMAP(llh), max, i);
+		if (i >= max)
+			break;
+		if (first <= 0)
+			first = i;
+		last = i;
+		i++;
+		found++;
+	}
+	CDEBUG(D_HA, "llog osd_write "DFID", llh_count=%d, found %d, first %d, last %d\n",
+		PFID(fid), (int)llh->llh_count, found, first, last);
+	LASSERTF(llh->llh_count == found,
+		"llog osd_write "DFID", llh_count=%d, found %d, first %d, last %d\n",
+		PFID(fid), (int)llh->llh_count, found, first, last);
+}
+
 /*
  * Concurrency: shouldn't matter.
  */
@@ -1986,6 +2014,12 @@ static int osd_trans_stop(const struct lu_env *env, struct dt_device *dt,
 
 	oh = container_of0(th, struct osd_thandle, ot_super);
 
+	if (oh->ot_bh) {
+		osd_llog_hdr_check(&oh->ot_fid, oh->ot_bh->b_data);
+		brelse(oh->ot_bh);
+		oh->ot_bh = NULL;
+	}
+
 	remove_agents = oh->ot_remove_agents;
 
 	qtrans = oh->ot_quota_trans;
diff --git a/lustre/osd-ldiskfs/osd_internal.h b/lustre/osd-ldiskfs/osd_internal.h
index 6c62ca450b..8aedc5e9ff 100644
--- a/lustre/osd-ldiskfs/osd_internal.h
+++ b/lustre/osd-ldiskfs/osd_internal.h
@@ -426,6 +426,8 @@ struct osd_thandle {
 	ktime_t oth_started;
 #endif
 	struct list_head	ot_trunc_locks;
+	struct buffer_head	*ot_bh;
+	struct lu_fid		ot_fid;
 };
 
 /**
diff --git a/lustre/osd-ldiskfs/osd_io.c b/lustre/osd-ldiskfs/osd_io.c
index ce2af335ad..6563e8bd2b 100644
--- a/lustre/osd-ldiskfs/osd_io.c
+++ b/lustre/osd-ldiskfs/osd_io.c
@@ -1680,9 +1680,11 @@ static int osd_ldiskfs_writelink(struct inode *inode, char *buffer, int buflen)
 	return 0;
 }
 
+void osd_llog_hdr_check(const struct lu_fid *fid, void *buf);
+
 static int osd_ldiskfs_write_record(struct dt_object *dt, void *buf,
 				    int bufsize, int write_NUL, loff_t *offs,
-				    handle_t *handle)
+				    handle_t *handle, struct osd_thandle *oth)
 {
 	struct inode *inode = osd_dt_obj(dt)->oo_inode;
         struct buffer_head *bh        = NULL;
@@ -1786,6 +1788,25 @@ static int osd_ldiskfs_write_record(struct dt_object *dt, void *buf,
 				sync = false;
 			}
 		}
+		if (block == 0 && bh && !create) {
+			const struct lu_fid *fid = lu_object_fid(&dt->do_lu);
+			if (fid_seq(fid) == 0x200002341 || fid_seq(fid) == 0x200002342) {
+				if (size == 4) {
+					unsigned char *bb = buf;
+					CDEBUG(D_HA, "write %d @ %d: %02x %02x %02x %02x\n",
+						(int)size, boffs, bb[0], bb[1], bb[2], bb[3]);
+				}
+
+				if (oth->ot_bh == NULL) {
+					get_bh(bh);
+					oth->ot_fid = *fid;
+					oth->ot_bh = bh;
+					osd_llog_hdr_check(fid, bh->b_data);
+				} else {
+					LASSERT(oth->ot_bh == bh);
+				}
+			}
+		}
 		memcpy(bh->b_data + boffs, buf, size);
 		err = ldiskfs_handle_dirty_metadata(handle, NULL, bh);
                 if (err)
@@ -1856,7 +1878,7 @@ static ssize_t osd_write(const struct lu_env *env, struct dt_object *dt,
 		result = osd_ldiskfs_writelink(inode, buf->lb_buf, buf->lb_len);
 	else
 		result = osd_ldiskfs_write_record(dt, buf->lb_buf, buf->lb_len,
-						  is_link, pos, oh->ot_handle);
+						  is_link, pos, oh->ot_handle, oh);
 	if (result == 0)
 		result = buf->lb_len;
Comment by Gerrit Updater [ 10/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38387/
Subject: LU-13195 osp: invalidate object on write error
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9e1071b517578ed3752efb1412017c8f93cd333b

Comment by Gerrit Updater [ 18/Jun/20 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38977
Subject: LU-13195 obdclass: show FID for corrupted llog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b381c6b220c888b7eacd0abf947add60bbf11231

Comment by Andreas Dilger [ 03/Jul/20 ]

+1 on master with this patch applied https://review.whamcloud.com/38387 "LU-13195 osp: invalidate object on write error"
https://testing.whamcloud.com/test_sets/16bbb988-9976-40c3-86d3-e673eec6e820

Comment by Chris Horn [ 12/Aug/20 ]

+1 on master https://testing.whamcloud.com/test_sessions/3d28761f-80fe-402e-9d21-53f977d88d40

Comment by Gerrit Updater [ 19/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38977/
Subject: LU-13195 obdclass: show FID for corrupted llog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5b74e0466edbfbbf9f336171de1adc5e583e9475

Comment by Artem Blagodarenko (Inactive) [ 11/Dec/20 ]

+1 https://testing.whamcloud.com/test_sets/eca8222e-939a-437a-be41-0df8325c411d

Comment by James Nunez (Inactive) [ 07/Jan/21 ]

Alex -

Is this the same issue https://testing.whamcloud.com/test_sets/dfff72b1-060d-4d53-b15b-b8a989d5c8d5 ?

Comment by Emoly Liu [ 20/Jan/21 ]

+1 on master: https://testing.whamcloud.com/test_sets/757f2536-f148-46a7-8992-2ddef45293bf

Comment by Andreas Dilger [ 27/Jan/21 ]

+1 on master https://testing.whamcloud.com/test_sets/27e54b38-cf46-466f-8fec-4cf25e9bb57a

Lustre: DEBUG MARKER: == replay-single test 118: invalidate osp update will not cause update log corruption
Lustre: DEBUG MARKER: lctl set_param fail_loc=0x1705
Lustre: *** cfs_fail_loc=1705, val=0***
Lustre: Skipped 1 previous similar message
LustreError: 760439:0:(dt_object.h:2274:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed: [0x20003ccc2:0x1:0x0] doesn't exit
Comment by Andreas Dilger [ 03/Aug/21 ]

+1 on master https://testing.whamcloud.com/test_sets/cb680395-862a-482a-98bb-c7b3da79cf68

Comment by Andreas Dilger [ 03/Aug/21 ]

Alex, is patch https://review.whamcloud.com/38385 "LU-13195 osp: track destroyed OSP object" expected to fix this problem, or is that more of a diagnostic patch?

This subtest crashed in about 10 of 200 runs in the past week.

Comment by Etienne Aujames [ 03/Sep/21 ]

It seems also to happen on runtest test_1:
https://testing.whamcloud.com/test_sets/842d914c-2cce-427d-a5a7-4aadb1a36d21

Comment by Andreas Dilger [ 04/Sep/21 ]

This seems to be failing fairly often on master recently - "15.15% of most recent 66 runs, 34 skipped (all branches)"

Comment by Alex Zhuravlev [ 04/Sep/21 ]

yes, looking at this one in background..

Comment by Gerrit Updater [ 06/Sep/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44853
Subject: LU-13195 obdclass: catch llog corruption
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 513265c887b42247b93c23efe8e8d17c84013841

Comment by Gerrit Updater [ 17/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/38385/
Subject: LU-13195 osp: track destroyed OSP object
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f5a8f1bcf5563f96cf6ba0e5de5a99a1ea524cc6

Comment by Peter Jones [ 17/Sep/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 20/Sep/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44993
Subject: LU-13195 osp: object creation needs a version
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a79be4c9521b4c35f5efcf813350e7097cfd4210

Comment by Alex Zhuravlev [ 22/Sep/21 ]

this still happens on master. my understanding is that the eviction in test 118 can cause different operations to abort. the last few times I checked, it was the llog object creation that was invalidated. I'm working on a reliable reproducer.

Comment by Gerrit Updater [ 24/Sep/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45042
Subject: LU-13195 obdclass: check another theory
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3affa067b1ff7a5a4577eadbd269ae300ccb22cd

Comment by Alex Zhuravlev [ 28/Sep/21 ]

with the latest https://review.whamcloud.com/45042 I can't reproduce the LBUG() anymore. basically it's a race - an intentionally failed create doesn't invalidate the request-in-progress properly, then that surviving request (containing a write to the object that just failed to be created) flies to the remote MDT and we get the LBUG().
with the LBUG resolved I observed another issue - a few OSP structures from that improperly invalidated request can leak. I think this is a slightly different issue and I plan to fix it with another patch.
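
a hedged sketch of the generation-check idea behind https://review.whamcloud.com/45042 ("osp_send_update_req() should check generation"); the names and layout below are assumptions, not the actual OSP code:

#include <errno.h>
#include <stdatomic.h>

struct toy_osp {
	atomic_int generation;   /* bumped on every invalidation */
};

struct toy_update_req {
	int built_gen;           /* generation sampled when the request was built */
};

static void toy_invalidate(struct toy_osp *osp)
{
	/* cached state (llog headers, etc.) is stale from now on */
	atomic_fetch_add(&osp->generation, 1);
}

/* refuse to send a request that was built before the last invalidation,
 * so it never reaches the remote MDT carrying writes to a failed object */
static int toy_send_update_req(struct toy_osp *osp, struct toy_update_req *req)
{
	if (req->built_gen != atomic_load(&osp->generation))
		return -ESTALE;
	/* ...send over the network... */
	return 0;
}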

Comment by Gerrit Updater [ 19/Oct/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45042/
Subject: LU-13195 osp: osp_send_update_req() should check generation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dff1e0d21c8c6bb20d63669252190795198bc49f

Comment by Peter Jones [ 19/Oct/21 ]

So... complete it seems

Comment by Etienne Aujames [ 25/Oct/21 ]

+1 in runtest (with all the fix patches):
https://testing.whamcloud.com/test_sets/7110dd40-b540-469d-a773-874769fe527b

[ 4327.451873] Lustre: DEBUG MARKER: copying 607 files from /etc /bin to /mnt/lustre/d1.runtests/etc /bin at Fri Oct 22 22:51:39 UTC 2021
[ 4334.578564] LustreError: 12949:0:(dt_object.h:2310:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed: [0x200011571:0x1:0x0] doesn't exit
[ 4334.581584] LustreError: 12949:0:(dt_object.h:2310:dt_declare_record_write()) LBUG
[ 4334.583011] Pid: 12949, comm: mdt_out00_001 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Mon Oct 4 16:46:22 UTC 2021
[ 4334.585079] Call Trace TBD:
[ 4334.585913] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[ 4334.586882] [<0>] lbug_with_loc+0x43/0x80 [libcfs]
[ 4334.588281] [<0>] out_write_add_exec+0x17d/0x1e0 [ptlrpc]
[ 4334.589374] [<0>] out_write+0x166/0x380 [ptlrpc]
[ 4334.590282] [<0>] out_handle+0x16af/0x20e0 [ptlrpc]
[ 4334.591293] [<0>] tgt_request_handle+0xc93/0x1a00 [ptlrpc]
[ 4334.592391] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[ 4334.593680] [<0>] ptlrpc_main+0xc06/0x1550 [ptlrpc]
[ 4334.594667] [<0>] kthread+0x112/0x130
[ 4334.595389] [<0>] ret_from_fork+0x35/0x40
Comment by Alex Zhuravlev [ 25/Oct/21 ]

damn... looking

Comment by Andreas Dilger [ 09/Nov/21 ]

Sorry, the new failure was LU-15139, which is similar but not identical.

Comment by Gerrit Updater [ 18/Mar/22 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46863
Subject: LU-13195 osp: invalidate object on write error
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a6555d5b7f9a76250e8460adcb3d8a089356f490

Comment by Gerrit Updater [ 07/Apr/22 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47010
Subject: LU-13195 osp: osp_send_update_req() should check generation
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: e555f223a6ad340e7f326478cd09ba36b4b8bbb2

Comment by Gerrit Updater [ 05/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46863/
Subject: LU-13195 osp: invalidate object on write error
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: be237a523e1208888f8f7d10e2a88709ea823a74

Comment by Gerrit Updater [ 20/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47010/
Subject: LU-13195 osp: osp_send_update_req() should check generation
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: b18246d8b78a308c32a5f78eee581f16dae5dc44
