[LU-13602] Skip unknown FLR component types Created: 27/May/20  Updated: 02/Aug/21  Resolved: 29/Jul/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Li Xi Assignee: Qian Yingjin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13637 Combining RO-PCC/RW-PCC/HSM with FLR HLD Open
is related to LU-10499 Readonly Persistent Client Cache support Closed
is related to LU-14319 Make foreign layout as a mirror compo... Open
is related to LU-14389 crash in lov_delete_composite() with ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Currently, lov_init_composite() quits with an error when it reads an unknown LOV pattern:

			CERROR("%s: unknown composite layout entry type %i\n",
			       lov2obd(dev->ld_lov)->obd_name,
			       lsm->lsm_entries[i]->lsme_pattern);

Since FLR will be used for upcoming new features, like PCC-RO, an old client should still be able to read the old format mirrors and ignore the new types of mirrors.

On the MDS, it should skip mirrors with unknown/foreign component types and mark them stale when selecting which component to use for writes.
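The requested behaviour can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Lustre code: `sketch_entry`, `sketch_pattern`, `sketch_pattern_known()` and `sketch_init_composite()` are hypothetical names, and the pattern values are made up. The idea is that an unrecognized entry is marked unusable and skipped instead of aborting the whole layout initialization.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pattern values for illustration; not Lustre's on-disk values. */
enum sketch_pattern {
	SKETCH_PATTERN_RAID0 = 1,
	SKETCH_PATTERN_MDT   = 2,
};

/* Hypothetical stand-in for one composite layout entry. */
struct sketch_entry {
	unsigned int pattern;	/* layout type of this component */
	bool usable;		/* set by init: may this node do I/O on it? */
};

/* True only for the layout types this (old) node understands. */
static bool sketch_pattern_known(unsigned int pattern)
{
	return pattern == SKETCH_PATTERN_RAID0 ||
	       pattern == SKETCH_PATTERN_MDT;
}

/*
 * Walk all entries. Instead of failing on the first unknown pattern,
 * mark that entry unusable and continue with the next one.
 * Returns the number of usable entries.
 */
static size_t sketch_init_composite(struct sketch_entry *entries, size_t count)
{
	size_t usable = 0;

	for (size_t i = 0; i < count; i++) {
		entries[i].usable = sketch_pattern_known(entries[i].pattern);
		if (entries[i].usable)
			usable++;
	}
	return usable;
}
```

A caller would then treat the layout as unreadable only if no usable entry remains.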



 Comments   
Comment by Gerrit Updater [ 27/Jul/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/39513
Subject: LU-13602 flr: skip unknown FLR component types
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3bb436e0172f1a2e3a0fa4806cf15f3c1d9682e3

Comment by Andreas Dilger [ 26/Nov/20 ]

Yingjin, could you please add a short description of what you think is needed to implement this.

To my thinking, the main places that need to be changed are mirror selection for writes (on the MDS) and reads (on the client) to skip unknown component types (e.g. unknown layout types, foreign components, PCC/HSM components on remote nodes, etc) in addition to the normal skip of STALE components. For PCC-RO the client can select the local cache copy if it is not STALE.
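A rough sketch of the two selection paths described above, under stated assumptions: `sketch_mirror` and both functions are hypothetical names, and the rule that a local cache copy is never chosen as the write target is an assumption made for this illustration. It is not the real MDS or client code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-mirror summary used only for this sketch. */
struct sketch_mirror {
	bool stale;		/* mirror is marked STALE */
	bool understood;	/* every component type in it is recognized */
	bool local_cache;	/* mirror is a PCC-RO copy on this node */
};

/*
 * Client read path: skip STALE and unknown mirrors, prefer a valid
 * local cache copy. Returns the mirror index, or -1 if none is usable.
 */
static int sketch_pick_read_mirror(const struct sketch_mirror *m, size_t count)
{
	int fallback = -1;

	for (size_t i = 0; i < count; i++) {
		if (m[i].stale || !m[i].understood)
			continue;
		if (m[i].local_cache)
			return (int)i;
		if (fallback < 0)
			fallback = (int)i;
	}
	return fallback;
}

/*
 * MDS write path: pick the first non-STALE, understood, non-cache mirror
 * as primary and mark every other mirror (including unknown ones) STALE.
 * Returns the primary index, or -1 if no mirror can take the write.
 */
static int sketch_pick_write_mirror(struct sketch_mirror *m, size_t count)
{
	int primary = -1;

	for (size_t i = 0; i < count && primary < 0; i++) {
		if (!m[i].stale && m[i].understood && !m[i].local_cache)
			primary = (int)i;
	}
	if (primary < 0)
		return -1;

	for (size_t i = 0; i < count; i++) {
		if ((int)i != primary)
			m[i].stale = true;
	}
	return primary;
}
```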

Is there some other change that is needed here which I'm missing?

Comment by Qian Yingjin [ 27/Nov/20 ]

Hi Andreas,

The problem is not PCC-RW, as it is actually an HSM solution and the file data can be restored when a conflicting access happens.

The main problem is with PCC-RO. When a file is PCC-RO cached on a client, the components on the Lustre OSTs and the PCC copies are both in the up-to-date state (not STALE). All clients can perform read operations on the components with data on Lustre OSTs.
We use the layout lock to protect the validity of the PCC read-only caching service.
When a file is PCC-RO cached on a client, the layout on the MDT is flagged with the L.rdonly state. When a client modifies a file in the L.rdonly state, it must clear the read-only state on the MDT with an intent lock request carrying a write layout intent. When the MDT receives this write layout intent request, it acquires the EX layout lock on the file, which revokes all layout locks on clients and invalidates the cached layouts.
However, an old client without PCC-RO support cannot recognize that the file is in PCC-RO state after the server grants it the layout lock, so it can write directly to the component with data on Lustre OSTs without clearing the L.rdonly state on the MDT. In this case, some clients will continue to read data from their local PCC-RO copy while the old client is writing the file content, which results in inconsistent access.

To solve this problem, the solutions may be:

  • Add a PCC-RO connection flag;
  • When a client without PCC-RO support requests a layout lock on a file that is in L.rdonly state on the MDT, the MDT first clears the L.rdonly flag on the layout (this may invalidate all PCC-RO cached copies on clients), and then returns the layout to the old client?
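The two points above could look roughly like this on the MDT side. This is a sketch under stated assumptions: the bit values, `sketch_layout` and `sketch_prepare_layout_for_client()` are hypothetical and only illustrate the decision, not any real Lustre interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SKETCH_CONNECT_PCCRO		0x1ULL	/* client advertises PCC-RO support */
#define SKETCH_LAYOUT_PCC_RDONLY	0x1U	/* layout is in "L.rdonly" state */

struct sketch_layout {
	uint32_t flags;
};

/*
 * Called before a layout is returned to a client. If the client did not
 * advertise PCC-RO support and the file is in L.rdonly state, clear the
 * flag first. Returns true when the flag was cleared, meaning the caller
 * must also invalidate the PCC-RO cached copies on other clients.
 */
static bool sketch_prepare_layout_for_client(struct sketch_layout *lo,
					     uint64_t client_connect_flags)
{
	bool client_knows_pccro =
		(client_connect_flags & SKETCH_CONNECT_PCCRO) != 0;

	if (client_knows_pccro || !(lo->flags & SKETCH_LAYOUT_PCC_RDONLY))
		return false;

	lo->flags &= ~SKETCH_LAYOUT_PCC_RDONLY;
	return true;
}
```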

Regards,
Qian

Comment by Andreas Dilger [ 27/Nov/20 ]

When you write "for a old client without PCC-RO support", shouldn't the normal FLR handling cover this condition? We already prevent clients that do not understand FLR from accessing files with mirror components, and PCC-RO should be handled in the same way as any other FLR mirror. The main difference should be that a PCC-RO client has the added option of reading the file from the local cache copy (if the file exists there). Any non-PCC client can only read from the non-PCC mirror(s) of the file, but there should always be a non-PCC mirror for PCC-RO. I can't see any reason why a non-PCC client reading from a file should cause the PCC-RO cache copies to be invalidated?

It should only be writing to an FLR file that should cause the layout to be invalidated, and all but one of the mirrors to be marked stale. This should already be handled by normal FLR mechanisms to revoke the layout lock from the clients (including PCC-RO clients), and the PCC client should only be able to access the local cache copy again when it has a layout lock, and the data version of the cache copy is still valid (i.e. the data was not modified). It may be that the PCC code needs to be changed so that it computes and stores the "data version" of only the non-PCC mirror copy instead of the whole layout, so that the PCC copy does not get invalidated just because another mirror was added, or similar.
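The validity rule for the local cache copy can be written down as a small check. This is a sketch only; `sketch_pcc_copy` and `sketch_pcc_copy_usable()` are hypothetical names, and how the data version is actually computed is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record kept with the local cache copy. */
struct sketch_pcc_copy {
	/* data version of the non-PCC mirror, recorded at attach time */
	uint64_t cached_data_version;
};

/*
 * The local cache copy may be used only while the client holds a layout
 * lock and the data version of the non-PCC mirror is unchanged, i.e. the
 * file data was not modified since the copy was attached.
 */
static bool sketch_pcc_copy_usable(const struct sketch_pcc_copy *copy,
				   bool has_layout_lock,
				   uint64_t current_data_version)
{
	return has_layout_lock &&
	       copy->cached_data_version == current_data_version;
}
```

Because only the non-PCC mirror's data version is compared, adding another mirror to the layout would not by itself invalidate the copy in this sketch.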

Comment by Qian Yingjin [ 28/Nov/20 ]

Combining with FLR, we need to extend the data structure `enum lov_comp_md_flags`.

/**
 * on-disk data for lcm_flags. Valid if lcm_magic is LOV_MAGIC_COMP_V1.
 */
enum lov_comp_md_flags {
	/* the least 4 bits are used by FLR to record file state */
	LCM_FL_NONE             = 0x0,
	LCM_FL_RDONLY           = 0x1,
	LCM_FL_WRITE_PENDING    = 0x2,
	LCM_FL_SYNC_PENDING     = 0x4,
	LCM_FL_PCC_RDONLY       = 0x8, /* The file is RO-PCC cached */
	LCM_FL_FLR_MASK         = 0xF,
};

Please note that the flag LCM_FL_PCC_RDONLY is different from LCM_FL_RDONLY here.
A PCC-RO cached file could be in the state:

  • LCM_FL_PCC_RDONLY | LCM_FL_RDONLY: all components are synced and in the up-to-date state, and then one client attaches the file into the PCC backend FS.
  • LCM_FL_PCC_RDONLY | LCM_FL_WRITE_PENDING: the file was once modified by a client, and the data content of the components is not all synced. The MDT has already picked a primary replica and marked the other components as STALE. At this time, a client can still PCC-RO attach the file. For this client, the primary component and the PCC copy are both in the up-to-date state.
  • ...

In my opinion, PCC-RO attaching a file in LCM_FL_WRITE_PENDING state does not need to sync all mirrors and transition the file into LCM_FL_RDONLY state, which may be time-consuming. In our implementation, we only need to add a new PCC-RO foreign layout component and add LCM_FL_PCC_RDONLY to the FLR flags.

Because we add a new LCM_FL_PCC_RDONLY state, an old client may not understand this new FLR flag, which may result in inconsistent access.

Regards,
Qian

Comment by Qian Yingjin [ 28/Nov/20 ]

The old on-disk data for lcm_flags are:

/**
 * on-disk data for lcm_flags. Valid if lcm_magic is LOV_MAGIC_COMP_V1.
 */
enum lov_comp_md_flags {
	/* the least 2 bits are used by FLR to record file state */
	LCM_FL_NONE          = 0,
	LCM_FL_RDONLY           = 1,
	LCM_FL_WRITE_PENDING    = 2,
	LCM_FL_SYNC_PENDING     = 3,
	LCM_FL_FLR_MASK         = 0x3,
};

We have also changed the value of LCM_FL_SYNC_PENDING...

Maybe the new lov_comp_md_flags should be:

/**
 * on-disk data for lcm_flags. Valid if lcm_magic is LOV_MAGIC_COMP_V1.
 */
enum lov_comp_md_flags {
	/* the least 2 bits are used by FLR to record file state */
	LCM_FL_NONE          = 0x00,
	LCM_FL_RDONLY           = 0x01,
	LCM_FL_WRITE_PENDING    = 0x02,
	LCM_FL_SYNC_PENDING     = 0x03,
	LCM_FL_PCC_RDONLY       = 0x10,
	LCM_FL_FLR_MASK         = 0x13,
};

This keeps the old FLR flag values unchanged and adds the new LCM_FL_PCC_RDONLY flag at 0x10.
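To illustrate why keeping the old values matters, here is a small sketch using the values from the proposal above. `OLD_LCM_FL_FLR_MASK`, `sketch_state_seen_by_old_client()` and `sketch_is_pcc_rdonly()` are hypothetical names introduced only for this example; the assumption is that an old client extracts the FLR state with the old 0x3 mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as proposed above. */
enum lov_comp_md_flags {
	LCM_FL_NONE		= 0x00,
	LCM_FL_RDONLY		= 0x01,
	LCM_FL_WRITE_PENDING	= 0x02,
	LCM_FL_SYNC_PENDING	= 0x03,
	LCM_FL_PCC_RDONLY	= 0x10,
	LCM_FL_FLR_MASK		= 0x13,
};

/* Mask used by the old on-disk format (hypothetical name). */
#define OLD_LCM_FL_FLR_MASK	0x3

/* FLR state as an old client would decode it: the low two bits only. */
static unsigned int sketch_state_seen_by_old_client(uint16_t lcm_flags)
{
	return lcm_flags & OLD_LCM_FL_FLR_MASK;
}

/* New clients additionally test the separate PCC-RO bit. */
static bool sketch_is_pcc_rdonly(uint16_t lcm_flags)
{
	return (lcm_flags & LCM_FL_PCC_RDONLY) != 0;
}
```

With these values, a file flagged `LCM_FL_PCC_RDONLY | LCM_FL_WRITE_PENDING` still decodes as `LCM_FL_WRITE_PENDING` under the old mask. By contrast, a value of 0x4 for LCM_FL_SYNC_PENDING would decode as LCM_FL_NONE under the old 0x3 mask.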

Comment by Qian Yingjin [ 01/Dec/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40791
Subject: LU-10499 pcc: introducing OBD_CONNECT2_PCCRO flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d2638ccd5c1a327840c68d875eb30a0a3f4f895d

Comment by Gerrit Updater [ 01/Dec/20 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/40813
Subject: LU-13602 pcc: add LCM_FL_PCC_RDONLY layout flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2dc8af64b58ab5cc6d3ae4964ceda4271d2baef4

Comment by Andreas Dilger [ 14/Dec/20 ]

Yingjin, neither of these patches addresses the main issue of this ticket, which is to allow the MDS and clients to skip layout types that they do not understand. In particular, the new PCC code uses a foreign layout as part of a file, and if that file is being modified by a client then the PCC component needs to be marked STALE. The same should be done for any other layout type that the MDS does not understand.

On the client, it should not try to read from a component type that it doesn't understand.

Comment by Qian Yingjin [ 07/Jan/21 ]

Updated the patch https://review.whamcloud.com/#/c/39513/.

It will skip the unknown FOREIGN layout component that the client does not support or understand.

Comment by Gerrit Updater [ 22/Jan/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39513/
Subject: LU-13602 flr: skip unknown FLR component types
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 61a002cd8631ecd35a0326dba68f0eb6e48dc44f

Comment by Gerrit Updater [ 28/Jul/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40813/
Subject: LU-13602 pcc: add LCM_FL_PCC_RDONLY layout flag
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: adc1bbbf20e0a8a53274aa4590ed0935f954d1bc

Comment by Peter Jones [ 29/Jul/21 ]

Landed for 2.15

Comment by James A Simmons [ 02/Aug/21 ]

For some reason the change in the LCM_FL_PCC_RDONLY patch exposes a bug that I have seen with the port to the native Linux client. With the patch I get:

Lustre: DEBUG MARKER: == sanity test 63b: async write errors should be returned to fsync =================================== 10:58:33 (1627916313)
Lustre: *** cfs_fail_loc=406, val=0***
LustreError: 20036:0:(osc_request.c:2586:osc_build_rpc()) prep_req failed: -12
LustreError: 20036:0:(osc_cache.c:2259:osc_check_rpcs()) Write request failed with -12
LustreError: 20042:0:(lu_object.c:215:lu_object_put()) ASSERTION( list_empty(&top->loh_lru) ) failed:
LustreError: 20042:0:(lu_object.c:215:lu_object_put()) LBUG
kernel: Pid: 20042, comm: lctl 5.7.0-rc7+ #1 SMP PREEMPT Sat Jul 31 13:04:46 EDT 2021
kernel: Call Trace:
kernel: libcfs_call_trace+0x62/0x80 [libcfs]
kernel: lbug_with_loc+0x41/0xa0 [libcfs]
kernel: lu_object_put+0x31b/0x340 [obdclass]
kernel: vvp_pgcache_current+0x7e/0x150 [lustre]
kernel: seq_read+0x200/0x3c0
kernel: full_proxy_read+0x4d/0x70
kernel: vfs_read+0xb3/0x17
kernel: ksys_read+0x5f/0xd0
kernel: do_syscall_64+0x6d/0x4f0
kernel: entry_SYSCALL_64_after_hwframe+0x49/0xb3
kernel: Kernel panic - not syncing: LBUG

Generated at Sat Feb 10 03:02:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.