Using Docker to Build a Data Acquisition Pipeline with Kafka and HBase

March 3rd, 2015

It’s hard to miss that Docker has been taking off lately. Distributed computing has become ubiquitous, but the tools for developing in a distributed environment are still evolving. It can be a challenge to see a multiple-system-spanning application through development, testing, and deployment. Virtual machines provide a useful simplifying abstraction, allowing software dependencies to be configured independently of physical hardware. Containers carry this a step further by running in an isolated process on the host OS, bypassing the need to emulate virtualized hardware, and Docker is essentially a tool for managing Linux containers. Docker also provides functionality for sharing container images, aiding collaboration.

In this post, I’ll walk through an example of using Docker to develop a data acquisition pipeline to ingest mobile app GPS data using Kafka and HBase. This is based on part of the backend infrastructure for our Caltrain Rider app, which helps riders find out when to catch the train, but the methods I’ll talk about are broadly applicable. We’ll demonstrate several Docker containers working seamlessly together: a Java REST server for acquiring data, a Kafka instance for managing messaging of GPS data to multiple consumers, an HBase server for log storage, and a Zookeeper instance for managing Kafka and HBase. Thanks to Docker Compose (formerly Fig), we’ll see that configuring and managing all of these containers is also quite simple. All of the code for this post is available on GitHub.

Sittin’ on the Docker Container

If you’re following along, first make sure you’ve installed Docker (on OSX/Windows, boot2docker) and Compose. Go ahead and clone the GitHub repository, which has more detailed instructions for installing these dependencies, and all the code required to build the project. The repository also has the final versions of the Dockerfiles and docker-compose.yml that we will be writing from scratch in this post. This example uses boot2docker on OSX; minor modifications to some of the commands shown will be necessary on Linux or Windows.

Let’s start small with a single container: the REST server. As mentioned above, this server provides an endpoint for GPS data from a mobile app (and, in the next section, also acts as a Kafka producer). In this scenario, we already have a working Java server which is built with Maven, and we want to run it in a Docker container. We can do this by creating a Dockerfile in the project directory with a single line:

FROM maven:3.2.5-jdk-8u40-onbuild

This line tells Docker to use a specific image from the official Maven repository, which can be easily replaced with other Maven or JDK versions (similar images are available for your language of choice). The -onbuild suffix refers to an image which will automatically add the project directory to the container and run mvn install. Now we can build and run the container with Docker, from the geolocation project directory:

docker build -t svdsdemo/geolocationservice .
docker run -p 8080:8080 svdsdemo/geolocationservice java -jar target/geolocationservice-1.0-SNAPSHOT.jar

The first line tells Docker to build the Dockerfile in the local directory and tag the resulting image as svdsdemo/geolocationservice. The second line runs that image, forwarding the image’s port 8080 to the Docker machine’s port 8080, and uses the command java -jar target/geolocationservice-1.0-SNAPSHOT.jar to run the application. If you’re using boot2docker, the Docker machine is not the local host, but rather a virtual machine. You can find this machine’s IP address by running boot2docker ip. Now we can verify that the REST server is running by asking for a test value to be returned as a JSON object:

$ curl http://$(boot2docker ip):8080/geoLocation/v1/testValue?a=b

{
  "a": [
    "b"
  ]
}
And it works! However, the real goal of this service is to write GPS logs to HBase. We can try to PUT some GPS data:

$ curl -X PUT http://$(boot2docker ip):8080/geoLocation/v1/setGeoLocation/FAKEID/40.11/88.27/200/1.0

But we’ll get an error that we’re unable to connect to Kafka. We’ll come back to this in the next section, but first let’s look at a couple of tweaks to our current container.
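As an aside, it helps to see what the -onbuild image actually does. Sketched from the behavior described above (the exact contents may vary by version), its Dockerfile layers ONBUILD triggers onto the base Maven image, and those triggers fire when our own image is built:

```dockerfile
FROM maven:3.2.5-jdk-8u40
WORKDIR /usr/src/app
# these triggers run during the *child* image's build: they copy in the
# whole project and run the full Maven build, dependency downloads and all
ONBUILD ADD . /usr/src/app
ONBUILD RUN mvn install
```

Because the entire project directory is added in one step before mvn install, any change to the source invalidates that cached layer and everything after it, which is exactly the problem the next tweak addresses.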

In the current build, if you make any changes to the Java source, Maven has to start from scratch downloading every dependency. This doesn’t make sense because our dependencies don’t change very often, so we can get faster builds by taking advantage of Docker’s built-in caching. Our updated Dockerfile does away with the Maven -onbuild image, in favor of building just the POM first (which changes only occasionally), then building the rest of the code (which changes more frequently), so that the dependencies from the POM are already in the image when we build the code:

FROM maven:3.2.5-jdk-8u40

RUN mkdir --parents /usr/src/app
WORKDIR /usr/src/app
# selectively add the POM file
ADD pom.xml /usr/src/app/

# get all the downloads out of the way & cached
RUN mvn verify clean --fail-never
ADD . /usr/src/app
RUN mvn verify

At this point, we’ll make one more addition to make our lives much simpler once we start adding in more containers: docker-compose. In the root project directory, we’ll create docker-compose.yml:

service:
  build: ./geolocationservice/
  command: java -jar target/geolocationservice-1.0-SNAPSHOT.jar
  ports:
    - "8080:8080"

This puts all the arguments we were giving to Docker into a configuration file, allowing us to start up our container by running:

$ docker-compose build
$ docker-compose up

So far this isn’t much of a simplification, but it will really pay off in the next section when we start doing container linking!

Kafka Trial

Now we’re ready to introduce Kafka. Thanks to publicly available Docker images, all we need to do is add a few lines to our docker-compose.yml:

zookeeper:
  image: oddpoet/zookeeper
  hostname: zookeeper
  command:
    - "2181"
  ports:
    - "2181:2181"

kafka:
  image: wurstmeister/kafka
  hostname: kafka
  ports:
    - "9092:9092"
  links:
    - zookeeper:zk

service:
  build: ./geolocationservice/
  command: java -jar target/geolocationservice-1.0-SNAPSHOT.jar
  ports:
    - "8080:8080"
  links:
    - kafka

With almost no work, we have three containers, all talking to each other. The magic step is the link property, which maps a hostname to the target container. For example, in the service container, the hostname kafka will link to the kafka container. The zookeeper and kafka images I chose use slightly different configuration properties to set the server port: the zookeeper image takes the port as a command argument, while the kafka image takes it as an environment variable. Now when we hit the API:

$ curl -X PUT http://$(boot2docker ip):8080/geoLocation/v1/setGeoLocation/FAKEID/40.11/88.27/200/1.0

{
  "id": "FAKEID",
  "latitude": 40.11,
  "longitude": 88.27,
  "epoch": 200,
  "accuracy": 1.0
}
We get the expected output!
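Notice that the response simply echoes the path segments of the PUT, which map onto id, latitude, longitude, epoch, and accuracy in order. A small Python helper (hypothetical, not part of the project) makes the URL structure explicit:

```python
def set_geolocation_url(host, device_id, latitude, longitude, epoch, accuracy):
    # path segments after setGeoLocation are, in order:
    # id / latitude / longitude / epoch (ms) / accuracy
    return ('http://%s:8080/geoLocation/v1/setGeoLocation/%s/%s/%s/%s/%s'
            % (host, device_id, latitude, longitude, epoch, accuracy))

print(set_geolocation_url('192.168.59.103', 'FAKEID', 40.11, 88.27, 200, 1.0))
```

With boot2docker, host would be whatever boot2docker ip reports (192.168.59.103 is shown here only as the historical default).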

Next we’ll verify that Kafka is actually receiving the GPS logs by adding a simple consumer. Like the server, this has already been built as a Java project, and we can reuse the same Dockerfile. We just append the following lines to our docker-compose.yml:

basicconsumer:
  build: ./genericconsumer/
  command: java -jar target/genericconsumer-1.0-SNAPSHOT-jar-with-dependencies.jar --zookeeper zookeeper:2181 --groupid 11 --topicname GeoLocation --threads 2 --consumerclass com.svds.genericconsumer.consumers.BasicConsumer
  links:
    - kafka
    - zookeeper

Note that this service is configured via several command-line arguments, which aren’t relevant to this post. The end result is a very simple Kafka consumer that writes the messages it receives to stdout. We can see it working by running:

$ docker-compose build
$ docker-compose up -d
$ curl -X PUT http://$(boot2docker ip):8080/geoLocation/v1/setGeoLocation/FAKEID/40.11/88.27/200/1.0
$ docker-compose logs basicconsumer
$ docker-compose stop

Here, docker-compose up -d runs the container in the background, and docker-compose logs [service] shows us the logs from just the specified service. If the message has been received, we will see a line like basicconsumer_1 | CONSUMER: FAKEID/40.11/88.27/200/1.0 in the log. We now have four containers working together, and in the next section we’ll add HBase.
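The BasicConsumer just echoes the raw Kafka message, but the fields are easy to recover from that log line. Here’s a hypothetical helper (not part of the project) that parses the logged payload back into a record:

```python
def parse_consumer_line(line):
    # a consumer log line looks like:
    #   CONSUMER: FAKEID/40.11/88.27/200/1.0
    _, payload = line.split('CONSUMER: ', 1)
    device_id, lat, lon, epoch, accuracy = payload.split('/')
    return {'id': device_id,
            'latitude': float(lat),
            'longitude': float(lon),
            'epoch': int(epoch),
            'accuracy': float(accuracy)}
```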

Turn up the HBase

You might be noticing a pattern here: we’re going to add a few more lines to our docker-compose.yml to start using HBase.

hbase:
  image: kevinsvds/hbase
  hostname: hbase
  ports:
    - "9090:9090"
    - "9095:9095"
    - "60000:60000"
    - "60010:60010"
    - "60020:60020"
    - "60030:60030"
  links:
    - zookeeper

geolocationconsumer:
  build: ./genericconsumer/
  command: java -jar target/genericconsumer-1.0-SNAPSHOT-jar-with-dependencies.jar --zookeeper zookeeper:2181 --groupid 22 --topicname GeoLocation --threads 2 --consumerclass com.svds.genericconsumer.consumers.GeoLocationConsumer --parameters zk=zookeeper,hbaseTable=gps-test
  links:
    - kafka
    - zookeeper
    - hbase

Here, the hbase image is a fork I made of oddpoet/hbase-local that includes a Thrift API, which we can use to look at HBase tables with Python. The geolocationconsumer service is identical to genericconsumer, but with different command-line arguments that configure the connection to HBase.

Another thing I haven’t mentioned yet is configuring the hostname. This actually turns out to be quite important for HBase (and caused me some headaches), because the Java API gets the connection information to HBase by asking Zookeeper for the HBase server’s hostname, so the hostname of the HBase service must be resolvable by the Java service. By default, the service hostname is set to the container id, which the geolocationconsumer does not recognize. However, thanks to container linking, it will recognize the hostname hbase.

Let’s bring up the new image and insert some data:

$ docker-compose build
$ docker-compose up -d
$ curl -X PUT http://$(boot2docker ip):8080/geoLocation/v1/setGeoLocation/FAKEID/40.11/88.27/1420183320000/1.0
$ curl -X PUT http://$(boot2docker ip):8080/geoLocation/v1/setGeoLocation/FAKEID/40.11/88.27/1420183321000/1.0
$ curl -X PUT http://$(boot2docker ip):8080/geoLocation/v1/setGeoLocation/FAKEID/40.11/88.27/1420183322000/1.0
$ curl -X PUT http://$(boot2docker ip):8080/geoLocation/v1/setGeoLocation/FAKEID/40.11/88.27/1420183323000/1.0

There are several ways to verify that this worked. Here we’ll do it by connecting to HBase with Python. If needed, pip install happybase, then create a short Python script (or run these commands in IPython):

import subprocess

import happybase

# connect to HBase's Thrift server on the boot2docker VM
# (substitute your Docker host's address if you aren't using boot2docker)
host = subprocess.check_output(['boot2docker', 'ip']).strip()
connection = happybase.Connection(host, 9090)
table = connection.table('gps-test')
for k, data in table.scan():
    print k, data

If everything worked, we should see the logs we just sent showing up as rows in the HBase table.

2015-1-2-1420183320000-FAKEID {'locations:id': 'FAKEID', 'locations:latitude': '40.11', 'locations:epoch': '1420183320000', 'locations:longitude': '88.27', 'locations:accuracy': '1.0'}
2015-1-2-1420183321000-FAKEID {'locations:id': 'FAKEID', 'locations:latitude': '40.11', 'locations:epoch': '1420183321000', 'locations:longitude': '88.27', 'locations:accuracy': '1.0'}
2015-1-2-1420183322000-FAKEID {'locations:id': 'FAKEID', 'locations:latitude': '40.11', 'locations:epoch': '1420183322000', 'locations:longitude': '88.27', 'locations:accuracy': '1.0'}
2015-1-2-1420183323000-FAKEID {'locations:id': 'FAKEID', 'locations:latitude': '40.11', 'locations:epoch': '1420183323000', 'locations:longitude': '88.27', 'locations:accuracy': '1.0'}
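Note the row-key pattern: a date prefix derived from the epoch, followed by the epoch and the device id, so a device’s readings for a given day sort together. The exact scheme lives in GeoLocationConsumer; this Python sketch is only an assumption reverse-engineered from the rows above:

```python
import datetime

def row_key(device_id, epoch_ms):
    # prefix the key with the UTC date encoded in the epoch (milliseconds),
    # then append the full epoch and the device id
    d = datetime.datetime.utcfromtimestamp(epoch_ms / 1000.0)
    return '%d-%d-%d-%d-%s' % (d.year, d.month, d.day, epoch_ms, device_id)

print(row_key('FAKEID', 1420183320000))  # → 2015-1-2-1420183320000-FAKEID
```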

From here, we could shut down the services by running docker-compose stop, but suppose we wanted to make some changes to the consumer without bringing down and restarting every other service. We could instead run docker-compose stop basicconsumer, make a few changes to the source, then docker-compose build basicconsumer and docker-compose up -d --no-deps basicconsumer (the --no-deps flag tells docker-compose not to recreate the dependencies, because they are already running).

Also, by default, running docker-compose up will recreate the containers, which will restore everything to a clean state—in particular, it will destroy any data currently saved in HBase. If this isn’t desired, you can instead run docker-compose up --no-recreate or docker-compose up --no-recreate -d to run in the background.

This is just a starting point, and there’s a lot more that can be done! For example, you can drop in additional Kafka consumers to do additional processing on the GPS logs, or on entirely different log data. Or you could scale HBase (or Kafka) from a single-instance pseudo-distributed install to a fully distributed cluster. Or you could add on Impala or Hive to query the log data.

Here’s a helpful cheat sheet of common commands for reference as you go back through what we covered in this post. Have fun!