
Add more diagrams and formatted documentation
dannyhchen committed Oct 11, 2018
1 parent 424821d commit 906aa45
Showing 4 changed files with 17 additions and 2 deletions.
19 changes: 17 additions & 2 deletions README.md
@@ -11,7 +11,8 @@ Marmaray is a generic Hadoop data ingestion and dispersal framework and library.
Marmaray describes a number of abstractions to support the ingestion of any source to any sink. They are described at a high-level below to help developers understand the architecture and design of the overall system.

This system has been canonically used to ingest data into a Hadoop data lake and to disperse data from the data lake to online data stores, usually with lower-latency semantics. The framework was intentionally designed, however, not to be tightly coupled to this particular use case, and it can move data from any source to any sink.

**End-to-End Job Flow**

The figure below illustrates the high-level flow of how Marmaray jobs are orchestrated, independent of the specific source or sink.

@@ -23,6 +24,14 @@ During this process, a configuration defining specific attributes for each sourc

The following sections give an overview of each of the major components that enable the job flow previously illustrated.

**High-Level Architecture**

The architecture diagram below illustrates the fundamental building blocks and abstractions in Marmaray that enable its overall job flow. These generic components make it straightforward to extend Marmaray with support for new sources and sinks.

<p align="center">
<img src="docs/images/High_Level_Architecture.png">
</p>
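
To make the roles of these building blocks concrete, the sketch below models the any-source-to-any-sink idea with deliberately simplified, hypothetical interfaces (`Source`, `DataConverter`, `Sink`, `Pipeline`). These are illustrative stand-ins under the assumption of a common intermediate record format; they are not Marmaray's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified abstractions for illustration only; the real
// Marmaray components are richer and run on Spark.
interface Source<S> {
    List<S> read();                      // pull raw records from the source system
}

interface DataConverter<I, O> {
    O convert(I input);                  // e.g. source format -> Avro, or Avro -> sink format
}

interface Sink<T> {
    void write(List<T> records);         // push converted records to the sink system
}

// A generic pipeline: any source and any sink meet at a common intermediate
// representation A (the role the AvroPayload plays in Marmaray).
final class Pipeline<S, A, T> {
    private final Source<S> source;
    private final DataConverter<S, A> toIntermediate;
    private final DataConverter<A, T> toSinkFormat;
    private final Sink<T> sink;

    Pipeline(Source<S> source, DataConverter<S, A> toIntermediate,
             DataConverter<A, T> toSinkFormat, Sink<T> sink) {
        this.source = source;
        this.toIntermediate = toIntermediate;
        this.toSinkFormat = toSinkFormat;
        this.sink = sink;
    }

    void run() {
        List<T> out = new ArrayList<>();
        for (S record : source.read()) {
            out.add(toSinkFormat.convert(toIntermediate.convert(record)));
        }
        sink.write(out);
    }
}
```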

**Avro Payload**

The central component of Marmaray’s architecture is what we call the AvroPayload, a wrapper that pairs Avro’s GenericRecord binary encoding format with relevant metadata for our data processing needs.
@@ -40,7 +49,6 @@ This is illustrated in the figure below:
</p>



**Data Model**

The central component of our architecture is the AvroPayload. The AvroPayload acts as a wrapper around Avro’s GenericRecord binary encoding format along with relevant metadata for our data processing needs. One of the major benefits of Avro data (GenericRecord) is that once an Avro schema is registered with Spark, data is only sent during internode network transfers and disk writes, which are highly optimized. Running Avro data on top of Spark’s architecture also lets us take advantage of Spark’s data compression and encryption features. These benefits factor heavily into helping our Spark jobs handle data at large scale more efficiently. Avro includes a schema to specify the structure of the data being encoded while also supporting schema evolution. For large data files, we take advantage of the fact that each record is encoded with the same schema and this schema only needs to be defined once in the file, which reduces overhead. To support our any-source to any-sink architecture, we require that all ingestion sources define converters from their schema format to Avro and that all dispersal sinks define converters from the Avro schema to the native sink data model (e.g., ByteBuffers for Cassandra).
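
As a rough sketch of this idea (and not Marmaray’s actual AvroPayload class), the example below pairs an Avro GenericRecord with a small metadata map. The schema, field names, and metadata keys are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// A simplified stand-in for an AvroPayload-style wrapper: an Avro
// GenericRecord plus a map of processing metadata.
public final class AvroPayloadSketch {

    private final GenericRecord record;
    private final Map<String, String> metadata;

    public AvroPayloadSketch(final GenericRecord record, final Map<String, String> metadata) {
        this.record = record;
        this.metadata = metadata;
    }

    public GenericRecord getData() {
        return record;
    }

    public Map<String, String> getMetadata() {
        return metadata;
    }

    public static void main(final String[] args) {
        // The schema is defined once and shared by every record encoded with it.
        final Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Trip\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"fare\",\"type\":\"double\"}]}");

        final GenericRecord trip = new GenericData.Record(schema);
        trip.put("id", "trip-123");
        trip.put("fare", 14.5);

        // Illustrative metadata keys only.
        final Map<String, String> meta = new HashMap<>();
        meta.put("source", "kafka");
        meta.put("ingestionTime", "2018-10-11T00:00:00Z");

        final AvroPayloadSketch payload = new AvroPayloadSketch(trip, meta);
        System.out.println(payload.getData() + " " + payload.getMetadata());
    }
}
```
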
@@ -80,6 +88,9 @@ All Marmaray jobs need a persistent store, known as the metadata manager, to sto

When a job begins execution, an in-memory copy of the current metadata is created and shared with the job components that need to update it during execution. If the job fails, this in-memory copy is discarded, ensuring that the next run starts from the previously saved state of the last successful run. If the job succeeds, the in-memory copy is saved to the persistent store. Because the metadata manager holds this copy in memory, there is currently a limit on the amount of metadata a job can store.

<p align="center">
<img src="docs/images/Metadata_Manager.png">
</p>
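
The sketch below shows the save-on-success pattern just described using a toy, file-backed store. The class name, key-value model, and Java serialization are placeholders for illustration, not Marmaray’s actual metadata manager.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Toy checkpoint store: load the last saved metadata into memory at job
// start, mutate only the in-memory copy while the job runs, and persist it
// only if the job succeeds.
public final class MetadataManagerSketch {

    private final Path store;
    private final Map<String, String> inMemory;   // working copy for the current run

    @SuppressWarnings("unchecked")
    public MetadataManagerSketch(final Path store) throws IOException, ClassNotFoundException {
        this.store = store;
        if (Files.exists(store)) {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(store))) {
                this.inMemory = (Map<String, String>) in.readObject();
            }
        } else {
            this.inMemory = new HashMap<>();
        }
    }

    public void set(final String key, final String value) {
        inMemory.put(key, value);                 // updates stay in memory during the run
    }

    public String get(final String key) {
        return inMemory.get(key);
    }

    // Called only after the job succeeds; a failed job never calls this, so
    // the next run starts from the last successfully saved state.
    public void saveChanges() throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(store))) {
            out.writeObject(new HashMap<>(inMemory));
        }
    }
}
```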

**Fork Operator**

@@ -89,6 +100,10 @@ The internal execution engine of Spark performs all operations in a manner of la

A provided ForkFunction is used by the ForkOperator to tag each datum with a valid or error annotation. These ForkOperators are called by our data converters during job execution. Users can then filter to get the desired collection of tagged records. The tagged records are persisted in Spark so the raw input does not have to be re-read and the transformation re-applied for each filter. By default we currently use DISK_ONLY persistence to avoid memory overhead and pressure. These components are used in DataConverters to split the input stream into two streams (output + error), but they can also split it into more than two streams, with overlapping records if desired. For example, we could split an input stream of the integers 1 to 6 into an even-number stream (2, 4, 6), an odd-number stream (1, 3, 5), and a multiple-of-3 stream (3, 6).

<p align="center">
<img src="docs/images/ForkOperator_ForkFunction.png">
</p>
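
The sketch below reproduces the integer example above in plain Spark Java code: a tagging step in the spirit of a ForkFunction, DISK_ONLY persistence of the tagged records, and per-tag filters. It is a standalone illustration of the pattern, not Marmaray’s actual ForkOperator API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

public final class ForkSketch {

    public static void main(final String[] args) {
        try (JavaSparkContext sc = new JavaSparkContext("local[*]", "fork-sketch")) {
            final JavaRDD<Integer> input = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

            // Tagging step: annotate each datum with every stream it belongs to.
            // A record may carry several tags, so the output streams can overlap.
            final JavaRDD<Tuple2<Integer, List<String>>> tagged = input.map(x -> {
                final List<String> tags = new ArrayList<>();
                tags.add(x % 2 == 0 ? "even" : "odd");
                if (x % 3 == 0) {
                    tags.add("multiple-of-3");
                }
                return new Tuple2<>(x, tags);
            });

            // Persist the tagged records so the filters below do not recompute them.
            tagged.persist(StorageLevel.DISK_ONLY());

            final JavaRDD<Integer> evens = tagged.filter(t -> t._2().contains("even")).map(Tuple2::_1);
            final JavaRDD<Integer> odds = tagged.filter(t -> t._2().contains("odd")).map(Tuple2::_1);
            final JavaRDD<Integer> threes = tagged.filter(t -> t._2().contains("multiple-of-3")).map(Tuple2::_1);

            System.out.println(evens.collect());   // [2, 4, 6]
            System.out.println(odds.collect());    // [1, 3, 5]
            System.out.println(threes.collect());  // [3, 6]
        }
    }
}
```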

**JobDag**

The JobDag component orchestrates and performs the actual execution of the Job. It does the following:
Binary file added docs/images/ForkOperator_ForkFunction.png
Binary file added docs/images/High_Level_Architecture.png
Binary file added docs/images/Metadata_Manager.png
