Tuesday, January 17, 2017

Big Data Experiments - Running Apache Spark on Windows 7

Running Spark on Linux is one thing; running it on Windows is a very different experience. The last few days were frustrating: I had been trying hard to set up Apache Spark on my desktop and run a very simple example, and it finally worked today. In this post I document my experience so that others can avoid these problems.

First let me explain my environment:
OS: Windows 7 64 Bit
Processor: i5
RAM: 8 GB

Based on a project requirement I wanted to test, I chose the following version of Spark, downloaded from the Spark website:
spark-1.6.0-bin-hadoop2.6.tgz
As a prerequisite I had the following version of Oracle Java installed, with JAVA_HOME set up appropriately:
java version "1.8.0_25"
I use a batch script for this setup, which is very handy:
jdk1.8.bat
@echo off
echo Setting JAVA_HOME
set JAVA_HOME=C:\jdk1.8.0_25-windows\java-windows
echo setting PATH
set PATH=%JAVA_HOME%\bin;%PATH%
echo Display java version
java -version

Then I set up Scala and SBT, which I downloaded from their respective sites:
scala version 2.11.0-M8
sbt 0.13.13
I downloaded winutils.exe based on the advice in this Stack Overflow answer:
http://stackoverflow.com/questions/25481325/how-to-set-up-spark-on-windows
I then set up the necessary access permissions for c:\tmp\hive based on advice from a blog post.
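For reference, the permission fix usually suggested is a winutils chmod on that directory. A minimal sketch, assuming winutils.exe has been placed under %HADOOP_HOME%\bin and that c:\tmp\hive is the scratch directory in question (create it first if it does not exist):

>mkdir c:\tmp\hive
>%HADOOP_HOME%\bin\winutils.exe chmod -R 777 \tmp\hive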


Then I created a batch script to set it all up:
envscala.bat
@echo off
REM set Hadoop, Scala, Spark and SBT related dirs
set USERNAME=pridash4
set HADOOP_HOME=c:\rcs\hadoop-2.6.5
set SCALA_HOME=C:\scala-2.11.0-M8\scala-2.11.0-M8
set SPARK_HOME=C:\spark-1.6.0-bin-hadoop2.6
set SBT_HOME=C:\sbt-launcher-packaging-0.13.13
set PATH=%HADOOP_HOME%\bin;%SCALA_HOME%\bin;%SBT_HOME%\bin;%SPARK_HOME%\bin;%PATH%
Then I ran the following commands:
>jdk1.8.bat
>envscala.bat
>spark-shell.bat

Everything started, but then it all stopped at one error:
    The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- 
This wasted almost a full day; despite trying all the suggested steps, I still got the error.

Then I re-read everything and found this Stack Overflow post, which gave me the idea to install the Hadoop binaries themselves and run the command below:
http://stackoverflow.com/questions/40409838/the-root-scratch-dir-tmp-hive-on-hdfs-should-be-writable-current-permissions

hadoop fs -chmod -R 777 /tmp/hive
Thus started my new adventure: installing Hadoop 2.6 based on the Apache documentation below:
https://wiki.apache.org/hadoop/Hadoop2OnWindows

I downloaded the binaries from the Apache website, extracted them, and copied winutils.exe into the Hadoop bin directory. I ran the above hadoop command, but when I ran spark-shell again I started getting new errors. After a lot of searching, I went back to the Hadoop 2.6 binaries and installed the Microsoft Visual C++ 2010 Redistributable Package (x86), which provides the Microsoft DLLs that winutils.exe needs. Then I re-ran the steps from the Apache documentation above.
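A quick sanity check at this point, before touching Spark again, is to confirm that the native pieces load. This is just a sketch, assuming HADOOP_HOME and PATH are set as in envscala.bat above:

>hadoop version
>winutils.exe ls \tmp\hive

If winutils.exe complains about a missing MSVCR100.dll, the Visual C++ 2010 redistributable is still not in place.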

Hadoop itself did not start, but spark-shell did, and I was able to use it.
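The very simple example I had originally wanted to run now works. A minimal smoke test typed at the shell prompt (Spark 1.6's spark-shell creates the SparkContext sc for you; counting the even numbers from 1 to 100 should return 50):

scala> val data = sc.parallelize(1 to 100)
scala> data.filter(_ % 2 == 0).count()
res0: Long = 50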

I know this is not as detailed or as concrete a write-up as one might expect, but this approach was helpful in that I did not have to rebuild Spark and Hadoop from scratch for my system.

Hoping this will be of help to others. Bye, and have a great day.
