Can I use Hadoop with PHP?

Hadoop is a natural first choice if you want to start with big data analysis. Getting started with it is a bit tricky ...

Posted by Fichtl on 20.07.2016

... it's not very easy to install, and there are some parts you have to know about, like YARN or HDFS ...

So first: what is Hadoop? Hadoop consists of two important parts: a clustered file system called HDFS, and a MapReduce framework for writing MapReduce jobs in Java (or any other programming language). There is a great tutorial online for installing Hadoop on Ubuntu (DigitalOcean). After installing it, you should check whether your HDFS works. HDFS comes with many command-line tools similar to those you know from normal Linux distributions, like ls, cp, mv and so on.


Let's create some content in your new distributed file system ... Have a look at the list of commands in the documentation if you want to know more about HDFS.

HDFS is great if you really work with big data: you can span volumes across many servers and so create huge virtual disk drives.

# create a directory and list the content of root
hadoop fs -mkdir /test-1234
hadoop fs -ls /

# create a local text file and put it on your hdfs
echo "This is a nice text file." > test.txt
hadoop fs -copyFromLocal test.txt /test-1234/test.txt

# output a hdfs file
hadoop fs -cat /test-1234/test.txt

# If you get: "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native"


MapReduce is an algorithm to crunch big data into small results: the first step is to map the input data to key/value pairs, and the second step is to reduce these pairs until only your results are left.

So it is very simple, and that's the reason why you can easily split the work into chunks and run it on multiple cores and even multiple machines. Hadoop is the job-runner framework that splits your jobs and runs them in parallel. Usually MapReduce jobs for Hadoop are written in Java; you can see an example in the official Hadoop tutorial.
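To see the two phases without any cluster at hand, here is a tiny self-contained PHP sketch of a word count done MapReduce-style (the function names mapLines and reducePairs are made up for this illustration; a real streaming job reads from STDIN instead):

```php
<?php
// "Map" phase: turn each input line into key/value pairs (word, 1).
function mapLines(array $lines): array {
    $pairs = [];
    foreach ($lines as $line) {
        foreach (preg_split('/\s+/', trim($line)) as $word) {
            if ($word !== '') {
                $pairs[] = [$word, 1];
            }
        }
    }
    return $pairs;
}

// "Reduce" phase: sum the values per key until only the results are left.
function reducePairs(array $pairs): array {
    $counts = [];
    foreach ($pairs as [$word, $one]) {
        $counts[$word] = ($counts[$word] ?? 0) + $one;
    }
    return $counts;
}

$result = reducePairs(mapLines(["big data", "big results"]));
print_r($result); // big => 2, data => 1, results => 1
```

Because the map step treats every line independently and the reduce step only needs all pairs with the same key, both phases can be distributed across machines — which is exactly what Hadoop does for you.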

But the current version of Hadoop has a feature called Hadoop Streaming: it lets you create jobs that read their data from stdin and write their results to stdout, so you can use any programming language to write your Hadoop jobs.

Here is a simple example with the Unix commands cat and wc (word count):

hadoop jar hadoop-streaming-2.7.2.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
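Because mapper and reducer are just ordinary programs reading stdin and writing stdout, you can dry-run the same pipeline locally with plain shell pipes before involving Hadoop at all (a sketch with made-up sample input; `sort` stands in for Hadoop's shuffle/sort phase between the two steps):

```shell
# Approximate one map task and one reduce task with pipes:
# input -> mapper (/bin/cat) -> shuffle (sort) -> reducer (/usr/bin/wc)
printf 'hello world\nhello hadoop\n' | /bin/cat | sort | /usr/bin/wc
# -> 2 lines, 4 words, 25 characters
```

This trick also works for testing your own mapper and reducer scripts before submitting a job.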

PHP Example

So what would the mapper and reducer look like in PHP? Here is an example of simple scripts for counting words in input files.

# mapper.php
#!/usr/bin/php
<?php
while (($line = fgets(STDIN)) !== false) {
    $words = explode(" ", $line);
    foreach ($words as $word) {
        echo $word."\n";
    }
}
echo PHP_EOL;

# reducer.php
#!/usr/bin/php
<?php
$counts = [];
while (($line = fgets(STDIN)) !== false) {
    $word = trim($line);
    if (! $word) {
        continue;
    }
    if (! isset($counts[$word])) {
        $counts[$word] = 0;
    }
    $counts[$word]++;
}
foreach ($counts as $word => $count) {
    echo $word.': '.$count."\n";
}
echo PHP_EOL;

# Call it ...
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -input /test \
    -output /test-out \
    -mapper /usr/local/hadoop/scripts/mapper.php \
    -reducer /usr/local/hadoop/scripts/reducer.php

Hadoop feeds all input files from the -input folder line by line into the mapper, the output of the mapper is fed into the reducer, and the output of the reducer is written into text files in the -output folder. Look at the output folder with these commands:

  • hadoop fs -ls /test-out
  • hadoop fs -cat /test-out/part-00000

That was my first big data experiment with Hadoop and PHP.