
Strata-Hadoop World Summit Recap

A few weeks ago, I stopped by Strata-Hadoop World in San Jose, a meeting of thousands of data scientists, programmers and developers celebrating the infinite possibilities of Big Data. But, at the risk of sounding glib and out of place, don’t use that phrase. The term Big Data is quickly reaching tired cliché status, almost on par with “Cloud.”


The term “Big Data” connotes millions or billions of pieces of data that are difficult or impossible for traditional processing tools, like your laptop computer, to handle.

The cutting edge of distributed data storage and computing is not just in processing those “Big” data sets, but in doing so at incredible speeds and in collating unstructured data. Apache Hadoop itself is built on the idea of using huge fields of computers, each working on one piece of a processing project. Splitting the work across all those processors makes huge computational jobs easier and faster.
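To make the "pieces of a processing project" idea concrete, here is a minimal sketch of the split, map and reduce pattern that Hadoop popularized. It is a toy that runs on a single machine, with threads standing in for separate computers, and it does not use Hadoop's actual API. The function names (`split_into_chunks`, `map_chunk`, `reduce_counts`, `word_count`) are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, n_chunks):
    """Deal the lines out into n_chunks roughly equal pieces, one per worker."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map step: a worker counts the words in its own piece only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-worker counts into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


def word_count(lines, n_workers=4):
    """Count words by splitting the input, mapping in parallel, then reducing."""
    chunks = split_into_chunks(lines, n_workers)
    # Threads stand in for the separate machines of a real cluster (assumption).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = ["big data is big", "data lakes hold data", "big jobs split well"]
    print(word_count(sample, n_workers=2).most_common(2))
```

No worker ever needs to see the whole data set. Each one handles its own slice, and only the small partial results are combined at the end, which is why adding more machines can shorten the job.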

One presenter talked about querying and aggregating billions of lines of a table in less than 5 seconds. He mentioned it only in passing, and then got to the good stuff.

He’s right: the most interesting problems aren’t in tables. Think about it. Most of the world doesn’t exist in the neat structure of CSV and Excel files. It’s in legal documents, architectural plans or large images. This event is about discovering new ways of using the combined processing power of thousands of commodity computers to store and compute petabytes of that type of data: images, videos, audio and TIFF files.

But how do you use that information? How do you create applications that everyday users can use to make better decisions?

That’s what the smart teams attending Strata-Hadoop are working to solve. The real-world applications are myriad. One big one: the advancements made in distributed processing are moving computer scientists closer to Machine Learning and “AI.” After all, a human mind doesn’t process tabular data; it processes a huge variety of sensory inputs. In fact, the event took place not long after Google’s AlphaGo beat European Go champion Fan Hui 5-0 in a man v. machine matchup reminiscent of IBM’s Deep Blue chess matches against Garry Kasparov and Watson’s Jeopardy! showdown with Ken Jennings.

In addition to its board game applications, distributed data processing is a sought-after commodity. Businesses want the ability to build “Data Lakes” that bring together information from different parts of the company that don’t play well together. Data scientists want the ability to collaborate across different fields of study to make new discoveries, using data that traditionally hasn’t been compatible.

Judging by the massive attendance and buzzing atmosphere, this is a sector poised for more growth in the near future.