Content of the course
In a traditional data team, we typically find the data engineer, the data analyst and the data scientist. The data engineer prepares and maintains the infrastructure the data team needs, while the data analyst and the data scientist use the data hosted in that infrastructure to answer questions and solve problems. The difference between the latter two is that data scientists tend to focus more on building complex statistical models and machine learning algorithms, while data analysts concentrate on exploring data and creating reports to drive business decisions.
Data engineers are great software engineers, but they often lack training in how the data will actually be used by business users. This gap between technical implementation and business understanding is where analytics engineering comes in. Analytics engineers combine the technical skills of data engineering with the business acumen of analytics, bridging the divide and ensuring data is both well-structured and business-relevant: they introduce good software engineering practices to the work of data analysts and data scientists.
Analytics engineers may use different tools according to the objective they are pursuing. It can be:
This week, we’ll be focusing on the last two parts.
Let’s recap the differences between ETL and ELT.
In ETL (Extract, Transform, Load), data is extracted and transformed before being loaded into the data warehouse. This means the transformation happens in a staging area outside the warehouse. The process is more rigid and requires more storage and computing resources, since data must be transformed before loading. It also means the data in the warehouse is more stable and compliant, because it is cleaned before it ever arrives.
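As a minimal sketch of the ETL flow described above (all names and data are hypothetical, and an in-memory SQLite database stands in for the warehouse): the transform step runs in a staging area, so only clean rows are ever loaded.

```python
import sqlite3

def extract():
    # Pretend these rows came from an operational source system.
    return [("alice", "  NY "), ("bob", None), ("carol", "LA")]

def transform(rows):
    # Staging-area cleanup, outside the warehouse:
    # trim whitespace and drop incomplete rows.
    return [(name, city.strip()) for name, city in rows if city is not None]

def load(rows, conn):
    # Only the already-clean rows reach the warehouse.
    conn.execute("CREATE TABLE users (name TEXT, city TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)

warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(transform(extract()), warehouse)
print(warehouse.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # → 2
```

Note that the raw, uncleaned rows never exist inside the warehouse, which is exactly what makes the approach stable but rigid.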
In ELT (Extract, Load, Transform), raw data is loaded directly into the data warehouse first, and transformations happen within the warehouse itself. This approach is more flexible and leverages the computing power of modern cloud data warehouses, making it the preferred method for many modern data stacks.
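The ELT flow can be sketched the same way (hypothetical names again, with in-memory SQLite standing in for the warehouse): raw data lands untouched, and the cleanup is expressed as SQL that runs inside the warehouse itself.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Load: raw data lands as-is, nulls and all.
warehouse.execute("CREATE TABLE raw_users (name TEXT, city TEXT)")
warehouse.executemany(
    "INSERT INTO raw_users VALUES (?, ?)",
    [("alice", "  NY "), ("bob", None), ("carol", "LA")],
)

# Transform: a cleaned table derived from the raw one,
# using the warehouse's own compute.
warehouse.execute("""
    CREATE TABLE stg_users AS
    SELECT name, TRIM(city) AS city
    FROM raw_users
    WHERE city IS NOT NULL
""")

print(warehouse.execute("SELECT COUNT(*) FROM stg_users").fetchone()[0])  # → 2
```

Because the raw table is still there, the transformation can be rewritten and re-run at any time without re-extracting from the source, which is the flexibility the paragraph above refers to.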