IlerngutCOM

What is a storage repository that holds a vast amount of raw data in its original format until the business needs it multiple choice?

1 Jahrs vor

Kommentare: 0

Ansichten: 68

22.The Crunch Factory is one of the fourth-largest gyms operating in Australia, and each gymoperates its system with individual databases. Because of this, the company failed to develop anydata-capturing standards and now faces the challenges associated with low-quality enterprisewide

Inhaltsverzeichnis Show

Data Lakes Architecture are storage repositories for large volumes of data. Certainly, one of the greatest features of this solution is the fact that you can store all your data in native format within it.
While data flows through the Lake, you may think of it as a next step of logical data processing.
Data Lake Architecture: Important Components
Other important aspects
What is a storage repository that holds a vast amount of raw data in its original format until the business needs it multiple choice question?
What is a storage repository that holds a vast amount of raw data in its original format until the business needs it data broker data lake data map data point?
What is a storage repository that holds?
What is a data lake used for?

Upload your study docs or become a

Course Hero member to access this document

Upload your study docs or become a

Course Hero member to access this document

What is a storage repository that holds a vast amount of raw data in its original format until the

Get answer to your question and much more

What is a technique for establishing a match, or balance, between the source data and the

Get answer to your question and much more

What is a type of distributed ledger, consisting of blocks of data that maintain a permanent and

Get answer to your question and much more

What is an area of a website that stores information about products in a database? - Dynamiccatalog

What is an individual item on a graph or a chart? - Data point

What is an individual item on a graph or a chart? - Data point

What is an interactive website kept constantly updated and relevant to the needs of itscustomers through the use of a database?

Get answer to your question and much more

What is an interactive website that uses a database to constantly update in order to remain

Get answer to your question and much more

What is an organized collection of data? - Data Set

Get answer to your question and much more

Get answer to your question and much more

Data Lakes Architecture are storage repositories for large volumes of data. Certainly, one of the greatest features of this solution is the fact that you can store all your data in native format within it.

For instance, you might be interested in the ingestion of:

Operational data (sales, finances, inventory)
Auto-generated data (IoT devices, logs)
Human-generated data (social media posts, emails, web content) either coming from inside, or from outside the organization.

Layering

We may think of Data Lakes as single repositories. However, we have the flexibility to divide them into separate layers. From our experience, we can distinguish 3-5 layers that can be applied to most cases.

These layers are:

Raw
Standardized
Cleansed
Application
Sandbox

However, Standardized and Sanbox are considered to be optional for most implementations. Let’s dive into the details to help you understand their purpose.

Raw data layer – also called the Ingestion Layer/Landing Area, because it is literally the sink of our Data Lake. The main objective is to ingest data into Raw as quickly and as efficiently as possible. To do so, data should remain in its native format. We don’t allow any transformations at this stage. With Raw, we can get back to a point in time, since the archive is maintained. No overriding is allowed, which means handling duplicates and different versions of the same data. Despite allowing the above, Raw still needs to be organized into folders. From our experience we advise customers to start with generic division: subject area/data source/object/year/month/day of ingestion/raw data. It is important to mention that end users shouldn’t be granted access to this layer. The data here is not ready to be used, it requires a lot of knowledge in terms of appropriate and relevant consumption. Raw is quite similar to the well-known DWH staging.
Standardized data layer – may be considered as optional in most implementations. If we anticipate that our Data Lake Architecture will grow fast, this is the right direction. The main objective of this layer is to improve performance in data transfer from Raw to Curated. Both daily transformations and on-demand loads are included. While in Raw, data is stored in its native format, in Standardized we choose the format that fits best for cleansing. The structure is the same as in the previous layer but it may be partitioned to lower grain if needed.
Cleansed data layer – also called Curated Layer/Conformed Layer. Data is transformed into consumable data sets and it may be stored in files or tables. The purpose of the data, as well as its structure at this stage is already known. You should expect cleansing and transformations before this layer. Also, denormalization and consolidation of different objects is common. Due to all of the above, this is the most complex part of the whole Data Lake solution. In regards to organizing your data, the structure is quite simple and straightforward. For example: Purpose/Type/Files. Usually, end users are granted access only to this layer.
Application data layer – also called the Trusted Layer/Secure Layer/Production Layer, sourced from Cleansed and enforced with any needed business logic. These might be surrogate keys shared among the application, row level security or anything else that is specific to the application consuming this layer. If any of your applications use machine learning models that are calculated on your Data Lake, you will also get them from here. The structure of the data will remain the same, as in Cleansed.
Sandbox data layer – another layer that might be considered optional, is meant for advanced analysts’ and data scientists’ work. Here they can carry out their experiments when looking for patterns or correlations. Whenever you have an idea to enrich your data with any source from the Internet, Sandbox is the proper place for this.

While data flows through the Lake, you may think of it as a next step of logical data processing.

Data Lake Architecture: Important Components

Since we have covered the most vital parts of Data Lakes, its layers; we may now move on to the other logical components that create our solution. Let’s look at the diagram below:

Security – even though you will not expose the Data Lakes to a broad audience, it is still very important that you think this aspect through, especially during the initial phase and architecturing. It’s not like relational databases, with an artillery of security mechanisms. Be careful and never underestimate this aspect.
Governance – monitoring and logging (or lineage) operations will become crucial at some point for measuring performance and adjusting the Data Lake.
Metadata – data about data, so mostly all the schemas, reload intervals, additional descriptions of the purpose of data, with descriptions on how it is meant to be used.
Stewardship – depending on the scale you might need, either separate team (role) or delegate this responsibility to the owners (users), possibly through some metadata solutions.
Master Data – an essential part of serving ready-to-use data. You need to either find a way to store your MD on the Data Lake, or reference it while executing ELT processes.
Archive – in case you have an additional relational DWH solution. You might face some performance and storage related problems in the area. Data Lakes are often used to keep some archive data that comes originally from DWH.
Offload – and again, in case you have other relational DWH solutions, you might want to use this area in order to offload some time/resource consuming ETL processes to your Data Lake, which might be way cheaper and faster.
Orchestration + ELT processes – as data is being pushed from the Raw Layer, through the Cleansed to the Sandbox and Application layer, you need a tool to orchestrate the flow. Most likely, you will need to apply transformations. Either you choose an orchestration tool capable of doing so, or you need some additional resources to execute them.

Other important aspects

You may think of Data Lakes as the Holy Grail of self-organizing storage. I have heard “Let’s ingest in, and it’s done” so many times. In fact, the reality is different and with this approach we will end up with something called Data Swamp.

Literally, it is an implementation of Data Lake Architecture storage, but it lacks either clear layer division or other components discussed in the article. Over time it becomes so messy, that getting the data we were looking for is nearly impossible. We should not undermine the importance of security, governance, stewardship, metadata and master data management.

A well-planned approach of designing these areas is essential to any Data Lake implementation. I highly encourage everyone to think of the desired structure they would like to work with. On the other hand, being too strict in these areas will cause Data Desert (opposite to Data Swamp). The Data Lake itself should be more about empowering people, rather than overregulating.

Most of the above problems may be solved by planning the desired structure inside your Data Lake Layers and by putting reliable owners in charge.

From our experience, we see that the organization of Data Lakes can be influenced by:

Time partitioning
Data load patterns (Real-time, Streaming, Incremental, Full load, One time)
Subject areas/source
Security boundaries
Downstream app/purpose/uses
Owner/stewardship
Retention policies (temporary, permanent, time-fixed)
Business impact (Critical, High, Medium, Low)
Confidential classification (Public information, Internal use only, Supplier/partner confidential, Personally identifiable information, Sensitive – financial)

Summing up

To sum up, let’s go over the main objectives, what implementing any Data Lake should accomplish. With the above knowledge, their explanation is going to be simple:

3 v’s (Velocity, Variety, Volume). We may operate on a variety of data, high in volume, with incredible velocity. The important fact is that velocity stands here not only for the processing time, but also for time to value, since it’s easier to build prototypes and explore data.
Reduced effort to ingest data (Raw Layer), delay work to plan the schema and create models until the value of the data is known.
Facilitate advanced analytics scenarios, new use cases with new types of data. Starting MVPs with data that are ready to use. This saves both time and money.
Store large volumes of data cost efficiently. There is no need to think if the data will be used, it can be stored just-in-case. Properly governed and managed data can be collected till the day we realize that it might be useful.

What is a storage repository that holds a vast amount of raw data in its original format until the business needs it multiple choice question?

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications. While a traditional data warehouse stores data in hierarchical dimensions and tables, a data lake uses a flat architecture to store data, primarily in files or object storage.

What is a storage repository that holds a vast amount of raw data in its original format until the business needs it data broker data lake data map data point?

If you're not already familiar with the term, a “data lake” is generally defined as an expansive collection of data that's held in its original format until needed. Data lakes are repositories of raw data, collected over time, and intended to grow continually.

What is a storage repository that holds?

A storage repository is essentially logical disk space made available through a file system on top of physical storage hardware.

What is a data lake used for?

A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semistructured, and unstructured data. It can store data in its native format and process any variety of it, ignoring size limits. Learn more about modernizing your data lake on Google Cloud.

storage repository holds vast amount raw data original format business multiple choice

zusammenhängende Posts

What is a storage repository that holds a vast amount of raw data in its original format until the business needs it data broker data lake data map data point?

What is a storage repository that holds a vast amount of raw data in its original format until the business needs it data broker data lake data map data point?

What is the duplication of data or the storage of the same data in multiple places quizlet?

What is the duplication of data or the storage of the same data in multiple places quizlet?

_____ is a nonvolatile, chip-based storage, often used in mobile phones, cameras, and mp3 players.

_____ is a nonvolatile, chip-based storage, often used in mobile phones, cameras, and mp3 players.

Encrypting the data within databases and storage devices gives an added layer of security.

Encrypting the data within databases and storage devices gives an added layer of security.

Using a local repository, administrators can limit what users can install on their systems

Using a local repository, administrators can limit what users can install on their systems

Which component of printer sharing in windows is a storage location for pending print jobs?

Which component of printer sharing in windows is a storage location for pending print jobs?

What is the set of processes that plans for and controls the efficient and effective transportation and storage of suppliers from suppliers to customers?

What is the set of processes that plans for and controls the efficient and effective transportation and storage of suppliers from suppliers to customers?

What are the major computer hardware, data storage, input, and output technologies used in business

What are the major computer hardware, data storage, input, and output technologies used in business

What is a flash memory storage device that contains its own processor to manage its storage?

What is a flash memory storage device that contains its own processor to manage its storage?

What kind of storage is an internet service that provides storage to computer or mobile users group of answer choices?

What kind of storage is an internet service that provides storage to computer or mobile users group of answer choices?