Category Archives: WTA Insights Project, Azure Data Factory

8. ADF schedule triggers

So far, we have executed the pipelines in Debug mode or run them once using the Trigger now option. To automate future loads of the csv files, we will now look at how to schedule pipeline executions using schedule triggers. There are three types of triggers in ADF: Schedule, which runs pipelines on a periodic schedule (every hour, every day, and so on); Tumbling window, which runs over fixed-size, non-overlapping time intervals; and Event-based, which fires in response to storage events.
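Below is a minimal sketch of what a schedule trigger looks like in ADF's underlying JSON, assuming a daily recurrence; the trigger name and the pipeline it starts (PL_Copy_Players) are illustrative placeholders.

```json
{
  "name": "TR_Daily",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2021-06-01T06:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "PL_Copy_Players",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

The recurrence frequency can be Minute, Hour, Day, Week or Month, and the trigger only begins firing once it has been published and started.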
7. ADF integration with GitHub
Recall that, when we provisioned the Azure Data Factory resource, we chose to configure Git later. In this post, I will show you how to configure source control from the Azure Data Factory UI. The pipelines, as well as all the code, scripts and files associated with the WTA Insights project, are available on my GitHub repository.
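The same Git integration can also be expressed declaratively through the factory's repoConfiguration property in an ARM template. The sketch below assumes a GitHub repository; the account, repository and factory names are placeholders.

```json
{
  "type": "Microsoft.DataFactory/factories",
  "apiVersion": "2018-06-01",
  "name": "wta-insights-adf",
  "location": "westeurope",
  "properties": {
    "repoConfiguration": {
      "type": "FactoryGitHubConfiguration",
      "accountName": "<github-account>",
      "repositoryName": "<repository-name>",
      "collaborationBranch": "main",
      "rootFolder": "/"
    }
  }
}
```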
6. Parameters, variables and loops – part 2
Let’s build a third pipeline to copy the wta_matches files.

Matches Pipeline

Take a moment to explore the matches files in github. The files follow a consistent naming convention: wta_matches_yyyy.csv, where yyyy represents the year of the WTA season and is a value between 1968 and 2020. Explore the files in raw view; a sketch of the loop this naming convention enables appears below.
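Because the year is the only varying part of the file name, a ForEach activity can iterate over the range of seasons and copy one file per iteration. This is a sketch, assuming a parameterized source dataset (DS_HTTP_Matches with a FileName parameter) and a sink dataset (DS_ADLS_Matches) in the style of the earlier pipelines; all names are illustrative.

```json
{
  "name": "ForEachYear",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@range(1968, 53)", "type": "Expression" },
    "activities": [
      {
        "name": "CopyMatchesFile",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "DS_HTTP_Matches",
            "type": "DatasetReference",
            "parameters": {
              "FileName": "@concat('wta_matches_', string(item()), '.csv')"
            }
          }
        ],
        "outputs": [
          { "referenceName": "DS_ADLS_Matches", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}
```

Note that range(1968, 53) takes a start index and a count, so it yields the 53 values 1968 through 2020, and item() returns the current year inside the loop.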
5. Parameters, variables and loops – part 1
In the previous post, we created a simple pipeline that fetches wta_players.csv from HTTP (github) and stores it in our data lake. We are now going to build another pipeline that fetches the ranking files.

Rankings Pipeline

Take a moment to explore the ranking files in github.
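The key building block for this kind of pipeline is a dataset whose file name is supplied at run time. A minimal sketch of such a parameterized dataset, assuming the HTTP linked service from the earlier posts (LS_HTTP_GitHub) and illustrative names:

```json
{
  "name": "DS_HTTP_Rankings",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_HTTP_GitHub",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "FileName": { "type": "string" }
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "HttpServerLocation",
        "relativeUrl": { "value": "@dataset().FileName", "type": "Expression" }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```

Each activity that references the dataset then passes a concrete value for FileName, which is how a single dataset definition can serve every ranking file.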
4. Our first pipeline
Let’s start by taking baby steps. Our first pipeline will copy wta_players.csv from github to our data lake. Then, we will learn to make some bigger steps: we will implement more complex logic in our pipelines and make use of parameters, variables and loops. A second pipeline will fetch the wta_rankings csv files.
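Stripped down to its JSON definition, this first pipeline is a single Copy activity between two datasets. This is a sketch, assuming the source and sink datasets covered in the Datasets post; the names DS_HTTP_Players and DS_ADLS_Players are illustrative.

```json
{
  "name": "PL_Copy_Players",
  "properties": {
    "activities": [
      {
        "name": "CopyPlayers",
        "type": "Copy",
        "inputs": [
          { "referenceName": "DS_HTTP_Players", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "DS_ADLS_Players", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}
```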
3. Datasets
Now that we have defined the connection information needed by ADF to connect to github and to our ADLS, by creating two linked services, the next step is to tell ADF what data to use from within those data sources. For this we need to create datasets. Datasets identify data within the linked data stores, such as tables, files, folders and documents.
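As a sketch, the source dataset for the players file might look like the following delimited-text dataset, pointing at wta_players.csv relative to the HTTP linked service's base URL; the names are illustrative.

```json
{
  "name": "DS_HTTP_Players",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_HTTP_GitHub",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "HttpServerLocation",
        "relativeUrl": "wta_players.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```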
2. Linked services
On the Azure portal, go to the newly created data factory and click on the Author & Monitor tile. This will launch the Azure Data Factory user interface in a separate tab. On the Let’s get started page, click on the expand button in the top-left corner to expand the left sidebar. There are 4 tabs in the sidebar: Home, Author, Monitor and Manage.
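Under the hood, each linked service is a small JSON document holding connection information. Two sketches follow, one for the github source over HTTP and one for the ADLS Gen2 sink; the base URL, storage account and names are placeholders, and in practice the account key would be kept in a secure credential store rather than inline.

```json
{
  "name": "LS_HTTP_GitHub",
  "properties": {
    "type": "HttpServer",
    "typeProperties": {
      "url": "https://raw.githubusercontent.com/<account>/tennis_wta/master/",
      "authenticationType": "Anonymous"
    }
  }
}
```

```json
{
  "name": "LS_ADLS_DataLake",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account>.dfs.core.windows.net",
      "accountKey": {
        "type": "SecureString",
        "value": "<storage-account-key>"
      }
    }
  }
}
```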
1. Provisioning Azure Data Factory
The next piece of the puzzle is to fetch the csv files from the tennis_wta repository on github. For that we need to prepare another Azure resource – a data factory. On the Azure portal, select + Create a resource, in the upper left-hand corner, then do a quick search for data factory.
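For anyone who prefers infrastructure-as-code over portal clicks, the same factory can be provisioned from an ARM template. A minimal sketch, with a placeholder factory name and region:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.DataFactory/factories",
      "apiVersion": "2018-06-01",
      "name": "wta-insights-adf",
      "location": "westeurope",
      "identity": { "type": "SystemAssigned" },
      "properties": {}
    }
  ]
}
```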
ADF Series
Azure Data Factory (ADF) is Azure’s cloud-managed service for ETL and ELT processes. It is similar to SSIS, but in the cloud. ADF can connect to various data sources, on-premises or in the cloud. If the data you want to access is on-premises, you will need to configure a data management gateway (now called a self-hosted integration runtime) to connect to it.
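An on-premises store surfaces in ADF as a linked service that routes through that runtime via connectVia. A sketch, assuming a self-hosted integration runtime already registered under the illustrative name SelfHostedIR and a hypothetical SQL Server source:

```json
{
  "name": "LS_OnPremSql",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=<server>;Database=<database>;Integrated Security=True"
    },
    "connectVia": {
      "referenceName": "SelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```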