I was at the Strata-Hadoop conference in Barcelona at the end of November, and noted a few trends in Big Data analysis which may be of general interest.
Hadoop platform take-up was assumed
In the early stages of a tooling lifecycle there seems to be a level of evangelism amongst people who get together to talk about it. Hadoop has passed that stage. Admittedly this was a Hadoop event, so you would expect a higher degree of familiarity than elsewhere, but I was surprised at just how mainstream Hadoop felt.
Spark jumps the gap
While MapReduce on Hadoop allows fast, efficient number crunching, it isn’t ideal for complex queries or computations. It’s good for production-scale implementation of pre-defined calculations, but not for the exploration needed to define the process in the first place. Data scientists, including me, were talking about using Spark to fill this gap: it keeps data resident in memory for fast multi-machine processing, comes with a machine learning library, can be used easily from R or Python, and can run the complex tasks needed for investigative analysis.
Storm rumbling in the background
Real-time import of streamed data was a constant theme. There was a lot of discussion on just how much real-time processing is actually necessary: data may stream in, but results are often not needed in anything like real-time. The use of new architectures (including lambda and kappa) to handle this data pipe was discussed. While Storm was mentioned for genuine real-time processing, most speakers seemed to find reasons for their process outputs to be batch, or micro-batch, and so avoided moving to genuine real-time processing.
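To make the micro-batch idea concrete, here is a minimal sketch in Python. It is an illustration only: the names `micro_batches` and `process_batch` are hypothetical, and it groups events by count, whereas real micro-batch systems usually group by time window.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to batch_size events from an iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


def process_batch(batch):
    """Hypothetical batch-level calculation: event count and total."""
    return {"count": len(batch), "total": sum(batch)}


# Events arrive one at a time, but results are produced once per batch.
results = [process_batch(batch) for batch in micro_batches(range(10), 4)]
```

The data still streams in continuously, but each result is only as fresh as the last completed batch, which is the trade-off most speakers seemed happy to accept.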
Space – the final frontier – again
Algorithms used to have to be designed to work within the limited storage space of early computers, for instance by holding a few summary variables, or acting on data subsets. As disks became cheaper, this constraint was removed. Now that data storage and processing are memory resident, well-designed, storage-efficient algorithms and procedures are becoming necessary again. Experience from previous generations of computing suddenly looks extremely useful.
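As an illustration of "holding a few summary variables", the sketch below computes a running mean and variance in a single pass using Welford's online algorithm, a standard technique that I am choosing as an example (it was not one named at the conference). Only three numbers are kept in memory, however long the data stream is. The class name `RunningStats` is my own.

```python
import math


class RunningStats:
    """Running count, mean and variance via Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary variables."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 with fewer than 2 values."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def std_dev(self):
        return math.sqrt(self.variance)


stats = RunningStats()
# A generator stands in for a stream: values are never held in a list.
for value in (v for v in [2, 4, 4, 4, 5, 5, 7, 9]):
    stats.update(value)
```

The full data set is never stored, so the same code works unchanged on a stream far larger than memory.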
Biggest trend – bringing the data to the consumers/domain experts
Historically, consumers would ask questions, then data experts would go away, collect and store data, work out how to process it and then produce reports (hopefully) answering the questions. The trend is for consumers to query the data themselves. This places a number of requirements on data capture and processing: the data has to be cleaned and stored in a way that facilitates the queries, with flexible, transparent tooling that ensures the results of a query written by a (probably statistically untrained) user are not misleading. The role of the data scientist changes from report generator to educator, adviser and validator.