Source: Talent Path Blog

Talent Path Blog Data Workflows 101: Extracting, Transforming, and Loading (ETL)

By Oswald Vinueza, Data Science Consultant at Talent PathWhat is ETL?When dealing with databases, data lakes, and data warehouses a common procedure in the data workflow is to extract, transform, and load-otherwise known as ETL. ETL is the process of exacting data from multiple sources, then transforming it in a way that fits the business requirements for data analysis, and then loading into a final repository. Although the big three platforms (AWS, Azure and Goggle Cloud) have packaged ETL solutions, it is recommended to have a comprehensive understanding of the business requirements and needs before jumping to use these tools or writing extensive code. Data ExtractionWhen getting started, here are some key pointers to keep in mind: Understand the data source(s) and check for data quality. Have a plan for optimal data extraction. Think about data exploration and cleansing as part of transforming the data. Does the transformed data meet the requirements? (Are the parties of interest satisfied with this step?) Understand the transformed data's end destination. (What kind of cost are involved? Is there a need to keep backup copy? Do you need high level security?) Once it is time to complete data extractions, often there are a diverse range of data types and sources. Some of the most common sources are databases, web services and flat files. During the extraction process, scientists should take into consideration the performance and the response time it takes to directly obtain it and if it is already optimized for reporting and analysis. Note, this process changes the way the data is originally reported and will cause errors on the code used for the extraction step. Data Cleaning and Transformation The data cleansing part of the ETL process deals with the detection of invalid, duplicate, and inconsistent data to improve the overall quality and usability of the data. While there are several suitable ways of dealing with data cleansing, it can be simplified into the following guidelines: Analyze the metadata and its properties to help detect quality issues. Resolve problems such as structural conflicts, inconsistent aggregation, contradictory values or data entry errors. Verify for correctness and effectiveness of the cleansing step. If there are any unresolved issues is time to consult or bring to someone's attention. Data transformation helps create the structure in which the data will be finally stored. Once the cleansing process is completed, we can move to further processing of the data. Some common transformations include: Normalization which is a process that gets rid of redundancies in data tables. 2NF and 3NF are the most common. Merging which is commonly used on related fields or strings that can be one data column. Splitting is commonly used on string type fields, to add help add, or refine information. Aggregation for things such as weekly profit margins or monthly totals. Data Loading Once the data transformation is completed, we can proceed to load the data into a database or data warehouse. It is important to keep in mind that having set constraints under this model can help deal the final integrity of the data tables, but at the same time, can cause performance bottlenecks when dealing with copious quantities of data. Some considerations for this step are: Removing unwanted indexes. Check loading times, performance and bottleneck issues. If you are on a paid cloud environment, check your cost for storage. Check for completeness and inaccuracies against the transformed data. Disadvantages of Traditional ETL Process Depending on the specificity of some business requirements, data transformation tends to be unique and require custom scripts, which can be complicated and troublesome if the data sources change the way which they are reported. Transformed data will limit the exploration by parties that may want to dig deeper into the source data. ETL best works with small datasets and incremental loads that require complex transformations. Are you curious about becoming a data analyst?Contact a Talent Advisor

Read full article »
Est. Annual Revenue
$5.0-25M
Est. Employees
100-250
Kip Wright's photo - President & CEO of Talent Path

President & CEO

Kip Wright

CEO Approval Rating

90/100

Read more