[LU-8562] osp_precreate_cleanup_orphans/osp_precreate_reserve race may cause data loss Created: 29/Aug/16 Updated: 16/Jul/19 Resolved: 16/Feb/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sergey Cheremencev | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
osp_statfs_interpret can clear error in opd_pre_status despite of the
Below is a reproducer that works only on a single-node setup:
diff --git a/lustre/tests/conf-sanity.sh b/lustre/tests/conf-sanity.sh
index c64ebab..f5026dc 100755
--- a/lustre/tests/conf-sanity.sh
+++ b/lustre/tests/conf-sanity.sh
@@ -6796,6 +6796,32 @@ test_97() {
}
run_test 97 "ldev returns correct ouput when querying based on role"
+test_98() {
+	local_mode || { skip "Need single node setup"; return; }
+	local cmp=0
+	local dev=$FSNAME-OST0000-osc-MDT0000
+	setupall
+
+	createmany -o $DIR1/$tfile-%d 50000&
+	cmp=$!
+	# MDT->OST reconnection causes MDT<->OST last_id synchronization
+	# via osp_precreate_cleanup_orphans.
+	for i in $(seq 0 100); do
+		for k in $(seq 0 10); do
+			$LCTL --device $dev deactivate
+			$LCTL --device $dev activate
+		done
+		ls -asl $MOUNT | grep '???' && \
+			(kill -9 $cmp &>/dev/null; \
+			error "File hasn't object on OST")
+		ps -A -o pid | grep $cmp 1>/dev/null || break
+	done
+	wait $cmp
+	stopall
+}
+run_test 98 "Race MDT->OST reconnection with create"
+
+
|
| Comments |
| Comment by Sergey Cheremencev [ 31/Aug/16 ] |
| Comment by Sergey Cheremencev [ 11/Oct/16 ] |
|
We observed that the patch needs to be changed. |
| Comment by Gerrit Updater [ 23/Dec/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/22211/ |
| Comment by Peter Jones [ 23/Dec/16 ] |
|
Landed for 2.10 |
| Comment by Mikhail Pershin [ 24/Dec/16 ] |
|
reopen due to |
| Comment by Ned Bass [ 28/Dec/16 ] |
|
Why was this not a blocker for 2.9? |
| Comment by Ned Bass [ 30/Dec/16 ] |
|
I was testing out patch 22211 and (if my understanding is correct) may have found a defect. It seems osp_precreate_thread() can get stuck because d->opd_got_disconnected never gets reset. When opd_got_disconnected is set, osp_precreate_cleanup_orphans() returns early with EAGAIN and can't clear d->opd_pre_recovering. And because d->opd_pre_recovering can't be cleared we always hit the break statement below and don't clear d->opd_got_disconnected. So osp_precreate_cleanup_orphans() is stuck always failing.
 	while (osp_precreate_running(d)) {
 		/*
 		 * need to be connected to OST
 		 */
 		while (osp_precreate_running(d)) {
+			if (d->opd_pre_recovering &&
+			    d->opd_imp_connected)
+				break;
 			l_wait_event(d->opd_pre_waitq,
 				     !osp_precreate_running(d) ||
 				     d->opd_new_connection,
 				     &lwi);
 			if (!d->opd_new_connection)
 				continue;
 			d->opd_new_connection = 0;
 			d->opd_got_disconnected = 0;
 			break;
 		}
 		if (!osp_precreate_running(d))
 			break;
 		LASSERT(d->opd_obd->u.cli.cl_seq != NULL);
 		/* Sigh, fid client is not ready yet */
 		if (d->opd_obd->u.cli.cl_seq->lcs_exp == NULL)
 			continue;
 		/* Init fid for osp_precreate if necessary */
 		rc = osp_init_pre_fid(d);
 		if (rc != 0) {
 			class_export_put(d->opd_exp);
 			d->opd_obd->u.cli.cl_seq->lcs_exp = NULL;
 			CERROR("%s: init pre fid error: rc = %d\n",
 			       d->opd_obd->obd_name, rc);
 			continue;
 		}
 		osp_statfs_update(d);
 		/*
 		 * Clean up orphans or recreate missing objects.
 		 */
 		rc = osp_precreate_cleanup_orphans(&env, d);
-		if (rc != 0)
+		if (rc != 0) {
+			schedule_timeout_interruptible(cfs_time_seconds(1));
 			continue;
+		}
 		/*
 		 * connected, can handle precreates now
 		 */
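
The stuck state described in this comment can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not Lustre code: `struct model_dev`, `model_cleanup_orphans()` and `model_precreate_loop()` are hypothetical names, the four flags merely mirror the `opd_*` fields mentioned above, and the `clear_on_shortcut` switch is an illustrative remedy for comparison, not necessarily what any landed patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the flags in the OSP device. */
struct model_dev {
	bool pre_recovering;	/* mirrors opd_pre_recovering */
	bool imp_connected;	/* mirrors opd_imp_connected */
	bool new_connection;	/* mirrors opd_new_connection */
	bool got_disconnected;	/* mirrors opd_got_disconnected */
};

/* Models only the early return: bail out with -EAGAIN while the
 * disconnect flag is set, otherwise finish recovery. */
static int model_cleanup_orphans(struct model_dev *d)
{
	if (d->got_disconnected)
		return -EAGAIN;
	d->pre_recovering = false;
	return 0;
}

/* Returns the pass on which cleanup first succeeds, or -1 if it
 * never succeeds within max_passes. */
static int model_precreate_loop(bool clear_on_shortcut, int max_passes)
{
	/* Assumed starting point: recovering, import connected, and a
	 * disconnect was recorded earlier. */
	struct model_dev d = {
		.pre_recovering = true,
		.imp_connected = true,
		.new_connection = false,
		.got_disconnected = true,
	};
	int pass;

	assert(max_passes > 0);
	for (pass = 1; pass <= max_passes; pass++) {
		if (d.pre_recovering && d.imp_connected) {
			/* Shortcut: leaves the wait loop without
			 * resetting got_disconnected. */
			if (clear_on_shortcut)
				d.got_disconnected = false; /* hypothetical remedy */
		} else if (d.new_connection) {
			d.new_connection = false;
			d.got_disconnected = false;
		} else {
			return -1; /* the real thread would sleep here */
		}
		if (model_cleanup_orphans(&d) == 0)
			return pass;
	}
	return -1;
}
```

In this model, `model_precreate_loop(false, 10)` never gets past the `-EAGAIN` return, while `model_precreate_loop(true, 10)` succeeds on the first pass, which matches the reasoning that the flag must be cleared somewhere on the shortcut path.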
|
| Comment by Gerrit Updater [ 07/Jan/17 ] |
|
Ned Bass (bass6@llnl.gov) uploaded a new patch: https://review.whamcloud.com/24758 |
| Comment by Gerrit Updater [ 24/Jan/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24758/ |
| Comment by Minh Diep [ 16/Feb/17 ] |
|
Landed in Lustre 2.10 |