RapidMiner Radoop documentation, version 9.9
Azure HDInsight 4.0
Configuring the Hadoop cluster
RapidMiner Radoop supports version 4.0 of Azure HDInsight, a cloud-based Hadoop service built on the Hortonworks Data Platform (HDP) distribution.
If you don't have an HDInsight cluster running in the Azure network, you can follow the Azure documentation to create one. Make sure to select Spark as a cluster type.
Radoop does not yet support Azure Data Lake Storage Gen2 as primary storage or the Enterprise Security Package on HDInsight 4.0.
Hive setup
Radoop implements part of its more complex functionality through custom Hive functions (UDF, UDAF and UDTF) that extend the capabilities of HiveServer2.
- Install RapidMiner Radoop UDF JAR files
- Register Hive UDF functions for Radoop
Networking
If your network allows direct access to all of the cluster nodes (working DNS and reverse DNS for all hostnames, including the alias), you can skip this step.
Otherwise, follow the general description of the networking setup for accessing a Hadoop cluster. In an isolated network setup, Radoop users will need the connection details of a deployed Radoop Proxy.
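The direct-access condition above (forward and reverse DNS for every cluster hostname) can be checked from the client machine before deciding whether a Radoop Proxy is needed. The following is a minimal sketch using only the Python standard library; the function name `check_dns` and its injectable `forward`/`reverse` parameters are illustrative assumptions and are not part of Radoop. The comparison is deliberately simple: it expects the reverse lookup to return the same name (or an alias) that was queried, so short names versus fully qualified names may need extra handling in practice.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def check_dns(
    hostnames: Iterable[str],
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], Tuple[str, List[str], List[str]]] = socket.gethostbyaddr,
) -> Dict[str, Optional[str]]:
    """Map each hostname to None if DNS and reverse DNS agree, else to a problem text."""
    results: Dict[str, Optional[str]] = {}
    for host in hostnames:
        # Forward lookup: hostname -> IPv4 address.
        try:
            ip = forward(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[host] = f"forward lookup failed: {exc}"
            continue
        # Reverse lookup: IP address -> (primary name, aliases, addresses).
        try:
            name, aliases, _ = reverse(ip)
        except OSError as exc:  # socket.herror is a subclass of OSError
            results[host] = f"reverse lookup failed for {ip}: {exc}"
            continue
        known = {name.lower(), *(alias.lower() for alias in aliases)}
        if host.lower() in known:
            results[host] = None
        else:
            results[host] = f"reverse lookup of {ip} returned {name}"
    return results
```

Calling `check_dns([...])` with the node hostnames of your cluster returns a dictionary; any entry that is not `None` suggests that direct access will not work for that node and that the general networking setup or a Radoop Proxy is required.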
Setting up the connection in RapidMiner Studio
We strongly recommend using the Import from Cluster Manager tool to create the connection, as several advanced properties required for correct operation are seamlessly gathered from the cluster during the import process.
- Use Import from Cluster Manager to create the connection directly from the configuration retrieved from Ambari.
- On the Hadoop tab, under Advanced Hadoop Parameters, provide storage credentials for the primary storage of the HDInsight cluster.
  - Azure Storage credentials: On the Azure storage dashboard, find the Access keys tab. Copy one of the keys and set it as the value of the fs.azure.account.key.<storage_name>.blob.core.windows.net parameter in your Radoop Connection.
- On the Hive tab, enter the Database Name to connect to. Choose a database where privileges for all operations are granted for the given user. Tick UDFs are installed manually. 
- If you use Radoop Proxy, a Radoop Proxy Connection must already exist for it. As a final step, tick Use Radoop Proxy on the Radoop Proxy tab and select the Radoop Proxy Connection that was created for this cluster.
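The storage credential parameter from the Hadoop tab step follows a fixed naming pattern, with the storage account name substituted for `<storage_name>`. As a small illustration, the sketch below builds that parameter name and rejects obviously invalid account names (Azure storage account names are 3 to 24 characters, lowercase letters and digits only). The helper name `azure_storage_key_param` and the account name `mystorage` are hypothetical; the access key itself is still copied from the Access keys tab and entered as the parameter value.

```python
import re


def azure_storage_key_param(storage_name: str) -> str:
    """Return the Advanced Hadoop Parameter name for a storage account's access key."""
    # Azure storage account names: 3-24 characters, lowercase letters and digits.
    if not re.fullmatch(r"[a-z0-9]{3,24}", storage_name):
        raise ValueError(f"not a valid Azure storage account name: {storage_name!r}")
    return f"fs.azure.account.key.{storage_name}.blob.core.windows.net"


# 'mystorage' is a placeholder account name used only for illustration.
print(azure_storage_key_param("mystorage"))
# fs.azure.account.key.mystorage.blob.core.windows.net
```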
