February 7, 2020
Azure offers multiple analytical tools nowadays. In this blog, I want to talk about Azure HDInsight and Azure Databricks and give a bit of background on them. One of the main questions is when you would choose one over the other.
First, let’s call Azure HDInsight what it is: Apache Hadoop running on Microsoft Azure. This means that we now have a Hadoop cluster available in the cloud. Let’s start with some background on Hadoop:
Hadoop: An open-source framework for storing data and running applications on clusters. It offers massive storage for any kind of data and lots of processing power, and it can handle a virtually “limitless” number of concurrent tasks. Hadoop is an open-source project of the Apache Software Foundation and is now named Apache Hadoop.
In Azure, we can pick from several cluster types, depending on what the circumstances call for (for example Hadoop, Spark, Kafka, HBase, or Interactive Query).
We can only select one cluster type when configuring an HDInsight cluster. An HDInsight cluster cannot be turned off, so it can result in high costs during periods of low use. Active Directory integration with HDInsight needs a few components to work: you will need the Enterprise Security Package (ESP), and for this you will also need to deploy Azure Active Directory Domain Services. Microsoft provides a high-availability guarantee.
In short, Azure HDInsight provides the most popular open-source frameworks, easily accessible from the portal. If you need a combination of multiple clusters, for example HDInsight Kafka for your streaming together with Interactive Query, this would be a great choice.
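To make the cost point about a cluster that cannot be turned off a bit more concrete, here is a minimal Python sketch. It compares a cluster billed for every hour of the month with one billed only for the hours it actually works. The function name `monthly_cost` and the hourly rate are hypothetical placeholders, not real Azure prices, and the model ignores storage and other charges.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month (8760 / 12)


def monthly_cost(hourly_rate: float, busy_hours: float, always_on: bool) -> float:
    """Return the monthly compute cost of a cluster.

    hourly_rate: price per cluster-hour (hypothetical, not a real Azure price)
    busy_hours:  hours per month in which the cluster actually does work
    always_on:   True if the cluster is billed for every hour of the month
    """
    if not 0 <= busy_hours <= HOURS_PER_MONTH:
        raise ValueError("busy_hours must be between 0 and HOURS_PER_MONTH")
    # An always-on cluster is billed for the full month, busy or idle.
    billed_hours = HOURS_PER_MONTH if always_on else busy_hours
    return hourly_rate * billed_hours


if __name__ == "__main__":
    rate = 4.0    # hypothetical price per cluster-hour
    busy = 146    # cluster is busy about 20% of the month
    print("always on :", monthly_cost(rate, busy, always_on=True))
    print("on demand :", monthly_cost(rate, busy, always_on=False))
```

With these made-up numbers the always-on cluster costs five times as much for the same amount of work, which is the kind of gap to look for in low-use situations.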
Azure Databricks is a newer service provided by Microsoft. Let’s start with some background information about Spark and Databricks:
Spark: A general-purpose distributed data processing engine that can be used in a wide range of circumstances. It comes with a set of libraries, for example for SQL, machine learning, graph computing, and stream processing. Spark does not provide storage, only a computation engine. Spark extends the Hadoop MapReduce model so that it works in a more optimized way.
Databricks: Databricks was founded by the creators of Spark. The team behind Databricks keeps optimizing the Apache Spark engine to run faster and faster. The Databricks platform is said to provide around five times the performance of open-source Apache Spark. With Databricks, you get collaborative notebooks, integrated workflows, and enterprise security, all in a fully managed cloud platform.
Azure Databricks runs on a premium Spark cluster, which is faster than open-source Spark. Azure Databricks is a PaaS solution, so it doesn’t require a lot of admin work after the initial setup. It provides security through the Azure Active Directory integration without any need for custom configuration. In short, it brings you all the advantages of Databricks, but in Azure.
The choice between Azure HDInsight and Azure Databricks depends on the use case that you want to solve. The biggest question is how the data scientists are going to work. If they are going to work without collaborating, it could be wiser to choose Azure HDInsight. If there will be a lot of collaboration, Azure Databricks can go the extra mile thanks to its shared notebooks and readily available workflows.
If you only need a Spark cluster, Azure Databricks will give you that, with better performance than an open-source Spark cluster.
If you would like a Kafka-based streaming service that is connected to a transformation tool, then the combination of HDInsight Kafka and Azure Databricks is the right solution.
If you have a lot of long-running jobs that need high power, then Azure HDInsight could be a better fit than Azure Databricks.
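The rules of thumb above can be written down as a small decision helper. This is only a sketch of the reasoning in this post: the `Workload` fields, the `recommend` function, and especially the order in which the rules are checked are my own assumptions, and a real decision will involve more factors than four yes/no questions.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a use case (all fields are illustrative)."""
    collaborative: bool = False            # data scientists share notebooks
    spark_only: bool = False               # only a Spark cluster is needed
    kafka_streaming: bool = False          # Kafka-based streaming plus transformation
    long_running_heavy_jobs: bool = False  # many long, high-power jobs


def recommend(w: Workload) -> str:
    """Map a workload to the service suggested by the rules of thumb above."""
    # Assumption: the Kafka combination is checked first, since it needs both services.
    if w.kafka_streaming:
        return "HDInsight Kafka + Azure Databricks"
    if w.long_running_heavy_jobs:
        return "Azure HDInsight"
    if w.collaborative or w.spark_only:
        return "Azure Databricks"
    # Without collaboration, HDInsight could be the wiser choice.
    return "Azure HDInsight"


if __name__ == "__main__":
    print(recommend(Workload(collaborative=True)))
    print(recommend(Workload(kafka_streaming=True)))
    print(recommend(Workload(long_running_heavy_jobs=True)))
```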