Editor’s Note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here. We’re just finishing up at Strata + Hadoop World in San Jose this week, where we discussed data platforms and more.
Building or rebuilding a data platform can be a daunting task, as most questions that need to be asked have open-ended answers. Different people may answer each question a different way, including the dreaded response of “well, it depends.” Truthfully, it does depend. But that doesn’t mean you have to guess and use your gut. At SVDS, we believe that you can safely navigate the waters of building a data platform by knowing beforehand what your priorities are, and what specific outcomes you desire.
Examining Your Data
You must understand your data before you can build anything, but many organizations underestimate the importance of this. Failing to understand your data can mean the difference between choosing to embark on a project that you can accomplish, and taking on a project that is in no way tractable. Take these steps to make sure you’re fully informed.
1. Ask yourself if you have the right data. If you find that you don’t, then start identifying potential sources of the data you need. You may already have them within your enterprise. For example, what if you want to find out how long customers are spending on different regions of various pages of your website, but all you have are HTTP access logs? Access logs contain only some of the information you need, not all of it. You would probably have to add client-side instrumentation to collect the rest of the data.
2. Contemplate how your data will contribute to solving your business problems. For example, will you use it to generate monthly reports that help you decide where to direct quarterly resources? Or, will you use it to construct a dashboard to help direct your responses to social media? The differentiation here is between real-time and batch-oriented data platforms. Because the outcomes are different, each is built differently.
3. Think about how important your data is to the strategic interests of your company. Do you want to use it to drive decision making, or are you happy to let it confirm decisions that have already been made? Could you use it to construct a feedback loop that would let you observe and decide? If data is core to your strategic vision, you probably want to think about ways of making it secure. This is difficult in the currently evolving landscape of big data technologies, as many open source technologies are not secure out of the box.
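To make the access-log example from step 1 concrete, here is a minimal sketch in Python that parses one line in the common "combined" log format and lists the fields it yields. The sample line and the function name are made up for illustration, and your server's log format may differ. Nothing in the parsed fields says how long a visitor spent on a region of the page, which is why client-side instrumentation would be needed.

```python
import re

# Pattern for the widely used "combined" access log format.
# Assumption: your web server logs in this format; adjust the pattern if not.
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_log_line(line):
    """Return the fields of one combined-format line, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None


# Hypothetical sample line, invented for this example.
SAMPLE = (
    '203.0.113.7 - - [10/Oct/2016:13:55:36 -0700] '
    '"GET /products/index.html HTTP/1.1" 200 2326 '
    '"http://example.com/home" "Mozilla/5.0"'
)

fields = parse_access_log_line(SAMPLE)
# The log records who requested what and when, but not time spent on the page.
print(sorted(fields))
```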
Understanding your data is a key component of a comprehensive data strategy. This strategy will drive what you do with your data. Further, you must recognize that time and resource constraints will limit how many things you can do with your data. Focus on the projects that are most important or will bring you the most value.
Design With Specific Outcomes in Mind
It would be easy to construct a list of features you would want in an ideal data platform, and then set out to build or buy it. Many enterprises are guilty of this practice. Unfortunately, this “checklist of features” mentality leads to bloated applications with features that are used infrequently. Additionally, complicated platforms are expensive to maintain in terms of training, support, and maintenance.
Focus instead on assembling a platform that will deliver the outcomes you desire. It does not matter if you are building or buying: well-defined business needs should drive your technical decisions. I have outlined a few of these needs below. Note that this is not an exhaustive list.
Big data newcomers sometimes have difficulty understanding the distinction between a data store (NoSQL or traditional RDBMS) and a searchable index, thinking they are interchangeable. This mistake becomes obvious when you try conducting full-text searches on a key-value store or relational database. Performance collapses as the data grows, because every database record must be scanned for every query. Although you can largely get away with the opposite (using a full-text index as a non-relational database), you want to make sure to use the right tool for the job.
If search performance is something you desire, be sure to include a full-text index in your platform. Obvious choices these days are Apache Solr and Elasticsearch, both of which can be distributed across servers.
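The difference between scanning a data store and querying an index can be sketched in a few lines of Python. This is a toy illustration, not how Solr or Elasticsearch are implemented: the records, function names, and whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records keyed by ID, standing in for rows in a key-value store.
RECORDS = {
    1: "customer reported slow checkout page",
    2: "checkout succeeded after retry",
    3: "customer asked about shipping",
}


def scan_search(records, term):
    """Full scan: every record is read and tokenized for every query."""
    return sorted(rid for rid, text in records.items() if term in text.split())


def build_inverted_index(records):
    """Map each term to the set of record IDs containing it (built once, up front)."""
    index = defaultdict(set)
    for rid, text in records.items():
        for token in text.split():
            index[token].add(rid)
    return index


def index_search(index, term):
    """One dictionary lookup per query instead of a pass over all records."""
    return sorted(index.get(term, set()))


index = build_inverted_index(RECORDS)
# Both approaches return the same IDs; only the work done per query differs.
print(scan_search(RECORDS, "checkout"))
print(index_search(index, "checkout"))
```

The scan does work proportional to the total amount of text on every query, while the index pays that cost once at build time. Real full-text engines add analysis, ranking, and distribution on top of this basic idea.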
Data can take on a life of its own. During different phases it is written, read, transformed, aggregated, and maybe eventually deleted. Data that is rarely read can be stored cheaply using “cold storage” systems. The trade-off here usually plays out in terms of very slow read performance. Because of this, you should make sure that the data is archived in a format that requires only simple transforms or no transforms at all.
Some data is always “hot,” meaning it needs to stay in memory for fast access. A few storage systems excel at this, but they require significantly more memory on the hardware they run on. On the other hand, if you plan to do a lot of data transformation (e.g., computing percentiles and other statistics), you will need to make sure that the machines running the transformations have enough CPU power.
Many organizations fail to see far enough into the future to plan for the eventual demise of their data. Fearful of throwing any of it away, they keep every last byte, which ends up clogging their data pipelines. It may be wise to consider moving older data into cold storage so that the processing pipeline (which operates on more current data) runs unencumbered and with good performance.
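As a rough sketch of that idea, the Python below splits records into a “hot” set that stays in the pipeline and a “cold” set that is a candidate for archiving, based on age. The record layout, the `timestamp` field name, and the 90-day threshold are assumptions chosen for illustration; the right cutoff depends on how your data is actually read.

```python
from datetime import datetime, timedelta


def split_by_age(records, now, max_hot_age):
    """Partition records into (hot, cold) lists by comparing age to max_hot_age."""
    hot, cold = [], []
    for record in records:
        age = now - record["timestamp"]
        (hot if age <= max_hot_age else cold).append(record)
    return hot, cold


# Hypothetical records; a fixed "now" keeps the example reproducible.
NOW = datetime(2016, 3, 31)
RECORDS = [
    {"id": "a", "timestamp": datetime(2016, 3, 30)},
    {"id": "b", "timestamp": datetime(2015, 6, 1)},
    {"id": "c", "timestamp": datetime(2016, 1, 15)},
]

hot, cold = split_by_age(RECORDS, NOW, max_hot_age=timedelta(days=90))
print([r["id"] for r in hot], [r["id"] for r in cold])
```

In practice the cold set would be written to an archive in a format that needs little or no transformation to read back, as noted above.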
Integration and Exposing Data
Oftentimes, big data platforms exist to feed data into other systems (e.g., reporting, EDW). You want to make sure that your data platform can be easily integrated into these external systems. Data can be exposed through the file system layer (networked drive or HDFS), or via an API. If your data is exposed through an API, make sure ahead of time that the client software you will use to access it is high quality and well supported for your particular language or development environment.
Since APIs are exposed over network interfaces, security must be taken into consideration. Please make sure your APIs are sufficiently locked down.
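What “locked down” means will vary by platform, but one small building block is checking a credential on every request. The sketch below is only an illustration under stated assumptions: it assumes a bearer token in an `Authorization` header, and the function name and token values are hypothetical. It uses Python's standard `hmac.compare_digest` so the comparison does not reveal how much of a guessed token was correct. It is not a complete security design; transport encryption, token storage, and authorization rules are out of scope here.

```python
import hmac


def is_authorized(headers, expected_token):
    """Check an 'Authorization: Bearer <token>' header against the expected token."""
    value = headers.get("Authorization", "")
    scheme, _, presented = value.partition(" ")
    if scheme != "Bearer" or not presented:
        return False
    # compare_digest takes time independent of where the first mismatch occurs.
    return hmac.compare_digest(presented.encode(), expected_token.encode())


# Hypothetical token, invented for this example; never hard-code real secrets.
EXPECTED = "example-token-123"
print(is_authorized({"Authorization": "Bearer example-token-123"}, EXPECTED))
print(is_authorized({"Authorization": "Bearer wrong-token"}, EXPECTED))
```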
With whatever platform you end up with, it is important that you retain the ability to iterate and experiment as you construct data pipelines and applications. Any modern platform will be expected to stay online for the next 5+ years, but with a scale and scope more expansive than the previous generation of data systems. Your platform should support both investigative work and your production workloads as the needs of your enterprise change. We’re interested in how the community is tackling this problem — please share your experiences in the comments.