  1. Lustre
  2. LU-20180

LMR2a: allow multiple lu_dirent per striped directory


Details

    • Improvement
    • Resolution: Unresolved
    • Minor

    Description

      As part of Lustre Metadata Redundancy, a replicated (striped) directory must be able to store multiple lu_dirent structs for the same name in different directory shards. For a replicated directory, the number of stripes/shards must be at least the number of replicas. The most straightforward way to select the shards for the lu_dirent replicas is to use lmv_name_to_stripe_index(name) to select stripe N for the "primary" dirent, and to store the backup dirents on stripe N+1, stripe N+2, ... (modulo lmv_stripe_count). This is easy to understand and implement, and since the dirhash function itself should select an initial MDTn shard uniformly for each name, the global distribution of dirents should be uniform across MDTs independent of the number of shards. With proper MDT selection for shards at directory creation time that takes failure_domain into account (LU-19066), it should be possible to avoid having all replicas of a given dirent offline at once. The uniform pattern for MDT replica selection also makes it easier to select well-placed MDTs to minimize the chance of multiple replicas failing concurrently, compared to a random MDT selection that just ensures different failure domains.
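
The placement rule above can be sketched in a few lines. This is only an illustration under stated assumptions: name_to_stripe_index() is a stand-in for lmv_name_to_stripe_index(), CRC32 is used purely as a placeholder and is not one of Lustre's directory hash functions, and dirent_stripes() is a hypothetical helper name.

```python
import zlib


def name_to_stripe_index(name: str, stripe_count: int) -> int:
    """Stand-in for lmv_name_to_stripe_index(); CRC32 is NOT Lustre's dirhash."""
    return zlib.crc32(name.encode()) % stripe_count


def dirent_stripes(name: str, stripe_count: int, replica_count: int) -> list[int]:
    """Return the stripe index for each copy of the dirent.

    Element 0 is the primary stripe N; the following elements are the
    backup stripes N+1, N+2, ... taken modulo the stripe count.
    """
    # The directory needs at least as many stripes as dirent copies,
    # otherwise two copies would land on the same shard.
    if not 1 <= replica_count <= stripe_count:
        raise ValueError("need 1 <= replica_count <= stripe_count")
    primary = name_to_stripe_index(name, stripe_count)
    return [(primary + i) % stripe_count for i in range(replica_count)]
```

For example, dirent_stripes("file.txt", 4, 3) returns three distinct consecutive stripe indices, wrapping around to stripe 0 after stripe 3.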

      Regardless of how LMV selects which shards to use for the lu_dirent copies, a replicated directory will contain multiple entries with the same name, which should not all be returned by a readdir() operation. It is undesirable to do a full readdir and sort the entries for uniqueness, given that some directories may have tens of millions or potentially billions of entries. The replica entries should be marked with a LUDA_REPLICA flag in lde_attrs so that clients can skip them during normal readdir() processing. When doing readdir() reconstruction for an offline directory shard, the client can hash lde_name to determine the primary shard number for that name, and show the replica dirent if that shard is offline.
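
A minimal sketch of the client-side filter described above, assuming a single backup copy per name. The LUDA_REPLICA bit value, the Dirent class, and the helper names are all hypothetical, and CRC32 again stands in for the real directory hash.

```python
import zlib
from dataclasses import dataclass

LUDA_REPLICA = 0x1000  # hypothetical bit value, not taken from Lustre headers


@dataclass
class Dirent:
    name: str        # corresponds to lde_name
    attrs: int = 0   # corresponds to lde_attrs


def primary_stripe(name: str, stripe_count: int) -> int:
    """Stand-in for hashing lde_name to its primary shard (CRC32 placeholder)."""
    return zlib.crc32(name.encode()) % stripe_count


def visible(entry: Dirent, stripe_count: int, offline=frozenset()) -> bool:
    """Decide whether readdir() should return this entry to the caller."""
    # Primary entries are always returned.
    if not entry.attrs & LUDA_REPLICA:
        return True
    # Replica entries are skipped unless the shard holding the primary
    # copy of this name is offline (readdir reconstruction).
    return primary_stripe(entry.name, stripe_count) in offline
```

With no shards offline every replica is skipped with a single flag test, so no sorting or duplicate tracking is needed in the common case.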

      One caveat is that if a directory was being split or migrated when an MDT went offline, then the hashed "primary" stripe index of a dirent may not reflect its current stripe index, so a dirent may be shown or skipped incorrectly in this case. It seems desirable to reserve a sequence of bits in lde_attrs (e.g. 4 bits, up to 16 replicas) to store the "replica number" for the lu_dirent as LUDA_REPLICA1, LUDA_REPLICA2, etc. Then when performing readdir reconstruction, it is possible to subtract the replica number from the stripe number where the replica was found (modulo the stripe count) to determine if the primary lu_dirent shard is offline before displaying the LUDA_REPLICA1 entry. If multiple shards are offline, the LUDA_REPLICA2 dirent would be shown only if both the primary and LUDA_REPLICA1 shards are offline, and so on for higher replica numbers.
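
The replica-number scheme can be illustrated as follows. This is a sketch only: the bit position of the 4-bit field and the function names are assumptions made for the example, with replica number 0 meaning the primary dirent.

```python
LUDA_REPLICA_SHIFT = 12                       # hypothetical position in lde_attrs
LUDA_REPLICA_MASK = 0xF << LUDA_REPLICA_SHIFT  # 4 bits: replica numbers 0..15


def replica_number(attrs: int) -> int:
    """Extract the replica number from lde_attrs (0 means primary)."""
    return (attrs & LUDA_REPLICA_MASK) >> LUDA_REPLICA_SHIFT


def set_replica_number(attrs: int, n: int) -> int:
    """Return attrs with the replica number field set to n."""
    if not 0 <= n <= 0xF:
        raise ValueError("replica number must fit in 4 bits")
    return (attrs & ~LUDA_REPLICA_MASK) | (n << LUDA_REPLICA_SHIFT)


def show_in_reconstruction(attrs: int, found_stripe: int,
                           stripe_count: int, offline: set) -> bool:
    """Show a dirent only if every lower-numbered copy is on an offline shard.

    Copy k of a name lives k stripes after the primary, so the copies
    numbered below n live on found_stripe - 1, ..., found_stripe - n
    (modulo the stripe count), the last of which is the primary.
    """
    n = replica_number(attrs)
    return all((found_stripe - k) % stripe_count in offline
               for k in range(1, n + 1))
```

Because the primary stripe is derived from where the replica was actually found rather than from the name hash, the check does not depend on the hash still matching the current layout.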


            People

              wc-triage WC Triage
              adilger Andreas Dilger