Urban Data Hackathon – TfL Datasets

Spent the weekend, appropriately, on a busman’s holiday looking at data from the TfL data sets.

Saw a few interesting approaches. Some would really shine with a couple of extra days to complete the analysis.

Here’s the presentation Phil & I produced of our recommendations for the underlying topological reference data model.

My summer 2015 paintings…

We spent some time at Lyme in the summer. Here’s how I unwound. Click on the images for a bigger view

For anyone interested: oil on board, started en plein air (i.e. from life, out in the open) and finished off at home to add details.

All the images are Copyright Mary-Ann Claridge 2015. All Rights Reserved

Toys or Tasks – Questions to consider when tooling up for Big Data

From my keynote talk at the Cambridge Wireless event:

‘Delivering Big Data: Practical solutions with an emphasis on open source’ – Big Data SIG event – London – 23 April 2015

The slides are now available from the CW website

While giving an overview of the tooling available, I presented questions that should be considered in selecting tools for Big Data projects:

Where is the area of difficulty? ‘Traditional’ tools such as Excel may be quite adequate for a quick look at a sample of data, or familiar tools such as R may allow domain experts to model data and devise processing algorithms more quickly.

CW - Toys - Investigation

How fast will data be received, and how much will there be? This will determine whether tools such as Hadoop are needed.

How soon will processed results be needed? If this is a genuine streaming application, Storm should be considered, but often what is initially thought of as a streaming project is simply a data capture problem with a ‘fast batch’ processing chain.

CW - Toys - Production

Different tooling may well be used at investigation and production stages. While re-use is desirable, it may sometimes be better to use the best tool for each stage, making a clean start into production after the task is clearly defined.

So, now I’m a Unicorn…

At today’s Hortonworks seminar on Hadoop for Modern Data Architectures, there was a lot of discussion around the skills required to get maximum value from the floods of data pouring in from sensors, websites, machine logs, phone data and the rest of the Big Data deluge. At the Hortonworks Data Science workshop, the kinds of experience needed in a Data Scientist were listed as:

  • Research Scientist
  • Engineer
  • Platform selector
  • Analyst
  • Integrator
  • Database schema designer

Apparently this skill set is so rare that the Hortonworks team recommend their customers recruit data science teams, rather than hunting this mythical Unicorn.

[Image: unicorn, from their presentation]

For years, I’ve struggled to know what label to put on my CV – apparently I should say Unicorn.

Strata-Hadoop in Barcelona – Impressions & Trends

I was at the Strata-Hadoop conference in Barcelona at the end of November, and noted a few trends in Big Data analysis which may be of general interest.

Hadoop platform take-up was assumed

In the early stages of a tooling lifecycle there seems to be a level of evangelism amongst people who get together to talk about it. Hadoop has passed that stage. Admittedly this was a Hadoop event, so you would expect a higher degree of familiarity than elsewhere, but I was surprised at just how mainstream Hadoop felt.

Spark jumps the gap

While MapReduce on Hadoop allows very fast efficient number crunching, it isn’t ideal for complex queries or computations. It’s good for production scale implementation of pre-defined calculations, but not for the explorations necessary to define the process. Data scientists, including me, were talking about use of Spark to fill this gap – it has memory resident data for fast multi-machine processing, comes with a machine learning library, provides easy extension from R or Python and can run the complex tasks needed for investigative analysis.

Storm rumbling in the background

Real-time import of streamed data was a constant theme. There was a lot of discussion on just how much real-time processing is actually necessary – data may stream in, but results are often not needed in anything like real-time. The use of new architectures (including lambda and kappa) to handle this data pipeline was discussed. While Storm was mentioned for genuine real-time processing, most speakers seemed to find reasons for their process outputs to be batch, or micro-batch, and so avoided moving to genuine real-time processing.
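To make the micro-batch idea concrete, here is a minimal sketch in plain Python. It is not taken from any of the talks, and the names `micro_batches` and `batch_totals` are my own, purely for illustration. Records arrive one at a time, but results are only produced once per small batch rather than once per record.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Group an incoming stream into small fixed-size batches.

    Hypothetical helper: the final batch may be shorter than batch_size.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def batch_totals(stream: Iterable[float], batch_size: int) -> List[float]:
    """Produce one result per micro-batch instead of one per record."""
    return [sum(batch) for batch in micro_batches(stream, batch_size)]


# Seven readings processed in batches of three give three results.
totals = batch_totals(range(1, 8), 3)
```

A real pipeline would more likely batch by time window than by record count, but the trade-off is the same: a small delay in the results in exchange for much simpler processing.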

Space – the final frontier – again

Algorithms used to have to be designed to work within the limited storage space of early computers, for instance by holding a few summary variables, or acting on data subsets. As disks became cheaper, this constraint was removed. Now that data storage and processing are memory resident, well-designed, storage-efficient algorithms and procedures are becoming necessary again. Experience from previous generations of computing suddenly looks extremely useful.
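As an illustration of the ‘few summary variables’ approach, here is a minimal sketch of a running mean and variance using Welford’s online method. The class name `RunningStats` and the sample readings are made up for the example. Only three numbers are held in memory, however many values pass through.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and variance held in three summary variables."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new value into the summaries (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; defined as 0.0 until two values have been seen."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


# Illustrative readings only; each value is seen once and never stored.
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
```

The same pattern extends to minima, maxima and counts per category, all of which can be maintained without keeping the raw data in memory.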

Biggest trend – bringing the data to the consumers/domain experts

Historically, consumers would ask questions, then data experts would go away, collect and store data, work out how to process it and then produce reports (hopefully) answering the questions. The trend is for consumers to query the data themselves. This places a number of requirements on data capture and processing: the data has to be cleaned and stored to facilitate the queries, with flexible, transparent tooling that ensures the results obtained by a (probably statistically uneducated) user’s query are not misleading. The role of the data scientist changes from report generator to educator, adviser and validator.

Other newspaper articles based on my analysis of BestCourse4Me data

Times Higher Education:

Girls score higher than male coursemates on Ucas points


But data show that grades in each subject may be rather more equal

Lower-grade foreign queue-jumpers a ‘myth’


Overseas students beat UK peers’ tariff scores on many courses, analysis suggests
