Sunday, February 28, 2016

Apache Virtual host for hosting a website

First, you need to install the Apache server on your machine.

Then navigate to the /var/www folder and create a directory for the site:

sudo mkdir -p /var/www/example.com/

This creates a folder for our site, example.com. Next we have to set the ownership and permissions of the new folder:

sudo chown -R $USER:$USER /var/www/example.com/
sudo chmod -R 755 /var/www
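
The configuration file shown further below points its DocumentRoot at a public_html subdirectory, so you may also want to create that now and drop in a placeholder page for testing (a minimal sketch; index.html is just a sample file):

mkdir -p /var/www/example.com/public_html
echo "<h1>example.com works!</h1>" > /var/www/example.com/public_html/index.html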

Then place your website files inside the example.com folder (the configuration below serves them from the public_html subdirectory). After that we need to create a configuration file for the site, starting from the default one:

sudo cp /etc/apache2/sites-available/000-default.conf   /etc/apache2/sites-available/example.com.conf

Inside the configuration file we need a virtual host definition like this:

<VirtualHost *:80>
    ServerAdmin admin@example.com
    ServerName example.com
    ServerAlias www.example.com
    DocumentRoot /var/www/example.com/public_html
    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

Some of these directives may already exist in the copied file, so you don't need to write everything from scratch. Just make sure you end up with a configuration like the one above.

To enable the new configuration file, enter the following command:

sudo a2ensite example.com.conf

After that, restart Apache so the new virtual host takes effect:

sudo service apache2 restart
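
To verify the setup, you can check the configuration syntax, and if the domain does not resolve yet you can point it at your own machine for local testing (a sketch; 127.0.0.1 and example.com are placeholders for your own setup):

sudo apache2ctl configtest
echo "127.0.0.1 example.com" | sudo tee -a /etc/hosts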

Wednesday, February 24, 2016

Read a File as a Stream, Look Up Some Data, and Write the Results to a File

This topic may feel a bit difficult to understand. The following diagram will help to explain the concept behind this project.

A simple Java application continuously writes to a file, which acts as a continuous data stream for Flink. A lookup file contains some reference data; we need to check whether there are matching records in both files and write those matches to an output file.

In this scenario we will first look at the source code of the continuous file-write application. This application writes some data to the file every second.

This is the complete Java program:





import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;

/**
 * Writes a random (mobile number, cell ID) record to data.txt once per second.
 * When the mobile number falls between 5000000000 and 6000000000 it is also
 * appended to lookupfile.txt, so the lookup file gets some matching data too.
 */
public class App
{
    public static void main(String[] args)
    {
        new Timer().scheduleAtFixedRate(new TimerTask()
        {
            public void run()
            {
                try
                {
                    // Generate a random 10-digit mobile number and a two-digit cell ID.
                    long mobileNumber = (long) Math.floor(Math.random() * 9000000000L) + 1000000000L;
                    int cellID = (int) Math.floor(Math.random() * 99) + 10;

                    File file = new File("/home/hadoop/lookup_example/data.txt");
                    FileWriter outFile = new FileWriter(file, true);

                    final DateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss");
                    final Date date = new Date();
                    PrintWriter out = new PrintWriter(outFile);

                    // Append one tab-separated record to the main data file.
                    out.println(mobileNumber + "\t" + cellID); // +"\t"+(dateFormat.format(date)));
                    out.close();

                    // Occasionally append the number to the lookup file so the join has matches.
                    if (mobileNumber > 5000000000L && mobileNumber < 6000000000L)
                    {
                        File lookupFile = new File("/home/hadoop/lookup_example/lookupfile.txt");
                        FileWriter lookupOutFile = new FileWriter(lookupFile, true);

                        PrintWriter lookupOut = new PrintWriter(lookupOutFile);

                        lookupOut.println(mobileNumber);
                        lookupOut.close();
                    }
                }
                catch (Exception e)
                {
                    // Ignore write errors and try again on the next tick.
                }
            }
        }, new Date(), 1000);
    }
}
This program writes data to data.txt and lookupfile.txt. data.txt contains our main data stream, while lookupfile.txt is updated from time to time when a value falls inside the specified range, so that the lookup file also has some data to match against.
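
For example (the numbers are random, so yours will differ), data.txt fills up with tab-separated lines such as

5123456789	42
8820134567	17

while lookupfile.txt receives only the numbers between 5000000000 and 6000000000, one per line.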

Next we will move on to the Flink project.
To create a Flink project from the command line, just type the following command:

mvn archetype:generate -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-java -DarchetypeVersion=0.10.1

This will also prompt you to fill in the groupId and artifactId parameters. After the project is created successfully, build a program like this:


import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileMonitoringFunction.WatchType;
import org.apache.flink.streaming.api.windowing.assigners.TumblingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// Holds one record parsed from the lookup file: the mobile number and a placeholder value.
class class1 {
    String val1;
    String val2;

    class1(String v1, String v2) {
        val1 = v1;
        val2 = v2;
    }
}

// Holds one record parsed from the main data file: the mobile number and the cell ID.
class class2 {
    String val1;
    String val2;

    class2(String v1, String v2) {
        val1 = v1;
        val2 = v2;
    }

    public String toString() {
        return val1 + "," + val2;
    }
}

// Extracts the join key (the mobile number) from a main data record.
class MyKeySelector implements KeySelector<class2, String> {

    @Override
    public String getKey(class2 value) throws Exception {
        return value.val1;
    }
}

// Extracts the join key (the mobile number) from a lookup record.
class MyKeySelector2 implements KeySelector<class1, String> {

    @Override
    public String getKey(class1 value) throws Exception {
        return value.val1;
    }
}

public class lookup {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Monitor the lookup file and process only the lines appended to it.
        DataStream<class1> lookupData = env
                .readFileStream("/home/hadoop/lookup_example/lookupfile.txt", 1000, WatchType.PROCESS_ONLY_APPENDED)
                .flatMap(new SplitterLookup());

        lookupData.writeAsText("/home/hadoop/lookup_example/out_1.txt", FileSystem.WriteMode.OVERWRITE);

        // Monitor the main data file in the same way.
        DataStream<class2> dataStream = env
                .readFileStream("/home/hadoop/lookup_example/data.txt", 1000, WatchType.PROCESS_ONLY_APPENDED)
                .flatMap(new Splitter());

        // Join the two streams on the mobile number within a one-second tumbling window.
        DataStream<String> d1 = dataStream.join(lookupData)
                .where(new MyKeySelector())
                .equalTo(new MyKeySelector2())
                .window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)))
                .apply(new MyFlatJoinFunction());

        d1.writeAsText("/home/hadoop/lookup_example/out.txt", FileSystem.WriteMode.OVERWRITE);
        env.execute("Window Wordcount");
    }

    // Combines a matching pair of records into one comma-separated output line.
    public static class MyFlatJoinFunction implements JoinFunction<class2, class1, String> {

        @Override
        public String join(class2 first, class1 second) throws Exception {
            return first.val1 + "," + first.val2 + "," + second.val1 + "," + second.val2;
        }
    }

    // Splits each line of the lookup file into one record per word.
    public static class SplitterLookup implements FlatMapFunction<String, class1> {

        @Override
        public void flatMap(String sentence, Collector<class1> out) throws Exception {
            for (String word : sentence.split(" ")) {
                out.collect(new class1(word, " "));
            }
        }
    }

    // Splits each tab-separated line of the data file into a (mobile number, cell ID) record.
    public static class Splitter implements FlatMapFunction<String, class2> {

        @Override
        public void flatMap(String sentence, Collector<class2> out) throws Exception {
            String[] values = sentence.split("\t");
            out.collect(new class2(values[0], values[1]));
        }
    }
}
The main Flink documentation mentions that a normal flat file can be used as a DataStream source, but that did not work for this case: you have to use readFileStream in order to achieve a continuous read from the file.


  • It is advisable to create classes to store the parsed file records. This also gives easy access to the individual values.
  • You need to create a KeySelector for each stream; they extract the keys that are compared to decide whether two records match and should be written to the output file.
  • You also need to use a window to group the data together before the join is applied.
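
To try the job out, one approach (a sketch; the jar name depends on the groupId, artifactId and version you chose when generating the project) is to package it with Maven and submit it to a running Flink cluster:

mvn clean package
bin/flink run target/<your-artifactId>-<version>.jar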
Tuesday, February 23, 2016

    Flink installation

    Download the Flink archive (http://www.apache.org/dyn/closer.cgi/flink/flink-0.8.1/flink-0.8.1-bin-hadoop1.tgz) to your local home directory and add its bin directory to your PATH in the .bashrc file (a sketch follows the commands below).

    $ tar xzf flink-*.tgz   # Unpack the downloaded archive
    $ cd flink-0.8.1
    $ bin/start-local.sh    # Start Flink
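
    A minimal sketch of the .bashrc addition mentioned above, assuming the archive was extracted to ~/flink-0.8.1:

    export PATH=$HOME/flink-0.8.1/bin:$PATH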

    Check the JobManager’s web front end at http://localhost:8081 and make sure everything is up and running.

    That's it.

    To check that Flink runs correctly, we can use the WordCount example.

    Download test data:

    $ wget -O hamlet.txt http://www.gutenberg.org/cache/epub/1787/pg1787.txt

    You now have a text file called hamlet.txt in your working directory.
    Start the example program:

    $ bin/flink run ./examples/flink-java-examples-0.8.1-WordCount.jar file://`pwd`/hamlet.txt file://`pwd`/wordcount-result.txt

    You will find a file called wordcount-result.txt in your current directory.

    Now you have a working copy of Flink on your local machine.


    Creating a Java project using Maven from the command line

    First you need an installed and properly configured Maven instance. To check your installation, issue this command and make sure it prints the proper version details:

    mvn -version

    If you run into difficulties here, please go through the steps mentioned in this link:

    http://kingalawakatech.blogspot.com/2016/02/install-maven-in-cent-os.html

    If all looks good, we can jump into creating a project.

    Before creating the project, navigate to the location where you want it to be created. Then type the following command to start the project creation. Please make sure that you are connected to the internet, because Maven sometimes needs to download files during this step.


    mvn archetype:generate

    This will list the project archetypes that Maven can build, and you can select the number appropriate to your requirement. Then it will begin the project creation. During this process you will be asked to enter some details of the project, such as the groupId and artifactId. Here is a short explanation of those parameters:

    groupId : the ID of the project group (e.g. com.company.app)
    artifactId : the project name (e.g. myApp)
    version : the version of the project (e.g. 1.0)
    package : the typical Java package; please avoid Java keywords in it.

    After giving those values you will have a newly created Java project.
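
    Alternatively, the same values can be passed on the command line to skip the interactive prompts. A minimal sketch, using placeholder values for groupId and artifactId and the standard quickstart archetype:

    mvn archetype:generate -DgroupId=com.company.app -DartifactId=myApp -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false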

    To clean the Maven project you can use
    mvn clean

    To build the project you can use
    mvn package

    Install Maven on CentOS

    First of all you need to download the latest version of Maven. You can get it from this link:

    http://mirrors.gigenet.com/apache/maven/maven-3/

    Select the version that you want to use. Please make sure you download it to your local home directory to avoid unnecessary permission issues.

    tar -zxvf apache-maven-3.2.3-bin.tar.gz 
    cd <location that you extract the zip file>
    sudo ln -s apache-maven-3.2.3 maven

    The next step is to set up the Maven environment variables in a shared profile so that all users on the system get them at login time. Note that M2_HOME below points to /usr/local/maven, so either create the maven symlink in /usr/local or adjust M2_HOME to match the location where you extracted Maven.

    su -c "vi /etc/profile.d/maven.sh"

    # Add the following lines to maven.sh
    export M2_HOME=/usr/local/maven
    export M2=$M2_HOME/bin
    PATH=$M2:$PATH

    You also need to edit your own .bashrc file and add the same environment variables (a sketch follows below). Then issue the command

    source .bashrc 

    so that the changes take effect in your current shell.
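
    A minimal sketch of the lines to add to ~/.bashrc, assuming the same install location as above:

    export M2_HOME=/usr/local/maven
    export M2=$M2_HOME/bin
    export PATH=$M2:$PATH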

    Then issue the following command to check whether you have installed Maven correctly:

    mvn -version

    Analyse Tweets using Flume, Hadoop and Hive

    This is a great tutorial for working with all of these new technologies. Please follow the link below to get hands-on experience in this area.

    Before this step you also need to create a directory in HDFS; you can use the familiar Linux-style file system commands with HDFS (an example follows the link below). This tutorial on managing files in HDFS will be helpful:

    http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
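
    For example, creating a directory for the incoming tweets could look like this (the path is only a placeholder; use whatever directory the tutorial configures as the Flume HDFS sink):

    hadoop fs -mkdir -p /user/flume/tweets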

    This is the tutorial that you need to follow to work with real data from Twitter:

    http://www.thecloudavenue.com/2013/03/analyse-tweets-using-flume-hadoop-and.html

    When you create the conf/flume.conf file, please make sure there are no spaces between the property names and their values.
    e.g.:

    // this will not work
    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks = HDFS

    // this is the correct format
    TwitterAgent.sources=Twitter
    TwitterAgent.channels=MemChannel
    TwitterAgent.sinks=HDFS


    Install Hive on CentOS

    Please use a directory inside your home directory, because otherwise you may run into permission issues.

    This is the link that I used for this installation, and it worked fine for me. Please ask about any questions that you come across.

    Hadoop Installation on CentOS

    This is a really easy task, but you should be careful. Below is the link to the tutorial that I followed for the installation.

    1. Put all your files inside a folder in your home directory. Some tutorials put those files in /usr and other locations, which can sometimes cause problems because of insufficient execution permissions.
    2. Please make sure that you have properly done the SSH key generation part, because it is really needed when working with Hadoop. That SSH key procedure lets you log in to localhost without a password (a minimal sketch of the usual commands follows this list). You can check that the SSH setup took effect by issuing
              ssh localhost
              which should log you in and show the time of your last successful login.
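
    A minimal sketch of the usual passwordless-SSH setup, assuming an RSA key with an empty passphrase:

              ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
              cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
              chmod 0600 ~/.ssh/authorized_keys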

    This is the tutorial that I followed; if you have any questions, please post them here.

    http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/

    Thanks.