What big data doesn’t show us: Experts share their views

02nd Sep ’15, 11:00 AM in Resources

Mridula Pai Contributor

At RISE, one of the most influential tech conferences in the world, Tim Culpan, a journalist with Bloomberg, moderated a panel discussion on 31st July 2015. On the panel were Raymond Russell, CTO and Co-founder of Corvil; Linda Jiang, Vice President of Umeng; and Suresh Shankar, Founder of Crayon Data. They came together to discuss their views on what’s next in the world of big data.

Raymond Russell is Founder & CTO of Corvil. He is also the co-inventor of its core technology. As Chief Technology Officer, Raymond is focused on driving continued advancement and fulfilment of Corvil’s innovations with a focus on applications and infrastructure performance.

Linda Jiang is VP at Umeng and has more than 15 years of commercial and technical experience in the mobile & internet industries. She also has significant international business development and product management experience with proven success in the UK and China.

Suresh Shankar is a big data and analytics evangelist, entrepreneur and innovator; he established his second start-up, Crayon Data, in Singapore in 2012. Recognised today as one of the world’s top big data companies, Crayon is on a mission to simplify the world’s choices with its SimplerChoices™ platform and products, MAYA and YODA.

Rise discussion

Tim asked each of the panellists four questions:

1. What specific problem or challenge does your company’s approach to big data solve?
2. What problem or challenge has now arisen as a result of technologies or products we’ve developed?
3. What real world problems, human problems can big data NOT solve, and why not?
4. How do you approach the balance between quality and quantity of data in your business? How have you got it wrong, and right?

1. What specific problem or challenge does your company’s approach to big data solve?

Linda: When a developer launches an app on a marketplace, it is the first time it has been placed in the hands of users outside a very small group of initial testers. How will the majority of users make use of the app? Will users find the app intuitive to use, or will they struggle to navigate even the basics, or find that the app crashes all the time? Who are your users? Are they the demographic you were aiming for, or something else entirely? Who are your high-value users? How many users were still using your app in the last 7 days? What is the best user acquisition strategy? How do you provide personalized content and notifications to users? Umeng Analytics turns users into thousands, or even hundreds of millions, of testers, providing essential user feedback to the developer that in the past was impossible to obtain. Umeng Analytics helps developers know their users better, engage their users efficiently, and maximize monetization opportunities.
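To make one of Linda's questions concrete, counting the users who were active in the last 7 days can be sketched in a few lines. This is an illustrative sketch only, not Umeng's implementation: the `(user_id, session_date)` record shape and the function names `active_users` and `retention_rate` are assumptions made for the example.

```python
from datetime import date, timedelta

def active_users(sessions, as_of, window_days=7):
    """Return the set of user IDs with at least one session in the window.

    sessions: iterable of (user_id, session_date) pairs (hypothetical schema).
    The window covers the `window_days` days ending on `as_of`, inclusive.
    """
    start = as_of - timedelta(days=window_days - 1)
    return {user for user, day in sessions if start <= day <= as_of}

def retention_rate(sessions, cohort, as_of, window_days=7):
    """Share of a cohort (a set of user IDs) that was active in the window."""
    if not cohort:
        return 0.0
    still_active = active_users(sessions, as_of, window_days) & cohort
    return len(still_active) / len(cohort)

# Made-up sample data for illustration.
sessions = [
    ("u1", date(2015, 7, 30)),
    ("u2", date(2015, 7, 20)),  # outside the 7-day window
    ("u3", date(2015, 7, 25)),  # first day of the window, counted
    ("u1", date(2015, 7, 31)),
]
today = date(2015, 7, 31)

print(sorted(active_users(sessions, today)))
print(retention_rate(sessions, {"u1", "u2", "u3"}, today))
```

At real scale this kind of count would be computed incrementally or approximately rather than over an in-memory list, but the definition of the metric is the same.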

Raymond: Modern businesses are automated, electronified, and conducted at an ever-increasing pace. We’re making decisions, reacting to changes, and innovating at ever shorter timescales. In this context, the network represents both an opportunity and a challenge: if you’re looking for data on how your business is performing, the network is an ideal place to gather it. The network is how you connect to the outside world, and all your transactions, all your interactions with your customers and your counterparties, flow through the network. If you could tap into that data, you could get an authoritative, independent view of the state of your business, of how well it is performing, and of how well the infrastructure it’s running on is operating and delivering to the business.

That’s the opportunity. The challenge is that raw network data is incredibly difficult to deal with: huge volumes of data traverse the network at tremendous rates, and the business data that you’d like to get access to is encoded in wildly different formats and protocols, is often encrypted, and is tunneled in multiple different networking layers, physical and virtual. It’s raw, unstructured, and generally opaque.

That’s the challenge that Corvil is addressing: we create a platform that captures network data, decodes the application and business content from it, structures it and enhances it with analytics, and makes it accessible. The GUI supports human operations and engineering teams, while the API can feed the data and analytics into other data systems.

Suresh: The problem we are solving in my start-up, Crayon Data, is the problem of choice. Today, we have too much choice in every aspect of our life. More is becoming less. Can we provide fewer, more relevant, more personalized choices to every individual on the planet, in a way that converts the ‘misery of choosing’ (think 8-10 searches, 10-12 often conflicting reviews, and 45 minutes of time for every decision you make today, from dining to shopping to entertainment to investments…) into the ‘magic of choice’?

To do this, we have ‘mapped the world’s tastes’, much as Google has mapped the world’s roads. We know almost every restaurant, movie, book, shop, event, etc. in the world, and we have created affinities between them. So if we know you like 3 restaurants in HK, we can tell you the 5 restaurants you are most likely to enjoy in Rio de Janeiro. We have done this using our unique Choice Equation: Choice = f(Taste, Influence, Context, Behaviour).

2. What problem or challenge has now arisen as a result of technologies or products we’ve developed?

Linda: 64,000 apps are using our analytics services, app DAU on our platform is 600M, and we process 7.9 billion sessions every day. With the rapid growth of data volume, a lot of technical challenges have arisen: how to ensure data transfer quality, how to store large volumes of data efficiently, and how to achieve real-time data processing. What is the best solution for data backup and recovery?

Data privacy is also a challenge for all data companies, particularly for a mobile data company. A mobile phone is so personal and so full of personal data. For a data company, the question is where to draw the line so that end users’ personal data is protected while the power of the data is still maximized.

Suresh: I can think of 4.

A. These kinds of technologies use powerful algorithms, but because so much is automated, they can put a lot of traditional IT, analytics and business people out of work. It’s like the way Google makes librarians a bit less useful. While it is true that this kind of big data technology creates new kinds of jobs (think computer scientists, data scientists), and creates a lot of them (the data growth rate far outstrips our ability to process and analyze it), the fact is that it is also a job destroyer.

B. Such analytics create a vast filter bubble, where algorithms show you “more of the same”. This can create a narrow, and hence distorted, worldview. It destroys serendipity and discovery. The counterpoint is that the same algorithm can actually increase serendipity and discovery; it depends only on how you set the parameters.

C. A few people who control the algorithm will control the world. Think what happens every time FB or Google changes their algorithm. Frightening.

D. Where do you draw the line between privacy and personalization? Isn’t this all too intrusive? Yes, and we should be careful, but it is also a choice. Today the trade-off is implicit and unknown to you. If it were made explicit, the choice would be easier. For example, if you had to pay a cent for every Google search you made, would you prefer to get it for free and allow them to show you a few ads? The vast majority of the world may say yes. The challenge is that individuals do not have control over this data. But new laws, businesses and models are emerging that will address this issue.

3. What real world problems, human problems can big data NOT solve, and why not?

Raymond: The essence of technology is automation: harnessing machines to do the grunt work faster, more reliably, and more cheaply than humans can. There are lots of different ways to look at the rise of big data and at what has made it successful, but a key part of it has been automating the grunt work involved in data analysis. It allows more people to access more data more easily and more flexibly than ever before, and to bring their creativity and insights to bear on the problems illuminated by the data. You don’t need to be a DBA to touch the data, and you don’t need to be an expert in SQL table design to run complex queries; data scientists can focus on the data content rather than on the mechanics of dealing with data. Given that perspective, it makes sense that the kind of problems big data can solve are those where a lack of data or analysis capability is the bottleneck, while big data cannot solve problems that lie in the creative realm.

In the context of what Corvil does, the data we gather from the network can help identify where there are issues with the business, locate where infrastructure may be impeding it, and allow our customers to tackle and solve customer experience problems, for example. But it doesn’t necessarily help them come up with new business ideas, or work out how to attract and win new customers. Of course, other data such as demographics and social media may well help inform that process, but there’s no substitute for the creativity required in innovating.

To give a concrete example from one of the industries we have done a lot of work in: electronic trading is a highly competitive business that depends sensitively on the speed with which market participants can react to changing market conditions and execute their buy and sell orders faster than their competitors. We provide critical data about how successful their trades are in the context of the performance of their infrastructure and that of their competitors. We can identify whether their trades are unsuccessful, and to what extent the failures to execute are due to infrastructure performance problems, and traders can use that data to modify the behaviour of their algorithms. However, all that data can’t help them come up with new trading strategies to pursue, new assets to add to their repertoire, or new markets to operate in.

Linda: The usefulness of “big data” is becoming more and more apparent, and I think we will see more use of data in helping to automate mundane or simple tasks.

For instance, big data can, or in the future will be able to, go out and scour the internet or the offline space and gather together 5 beautiful wedding dresses specially selected to meet a bride’s demands and personal preferences. The dresses will all be great, but big data will never be able to make the final choice. Which dress, if any, the bride finally chooses will be based on so much more than a line of code can encompass.

Suresh: It can solve an awful lot of them. Evidence-based approaches are becoming common in areas like health care, match-making, smarter cities, water management, philanthropy, climate change… It works for many small real-world problems like poverty alleviation, where many experiments can be modelled and measured scientifically.

But if there is any problem that requires human intuition, irrationality, empathy, emotion… technology cannot help. At best, big data can help mimic or simulate these, but it can never substitute for them. A doctor sometimes knows what is wrong without a single test. A mother can tell that her child is troubled. A great leader will find a way to inspire people to do things that can never be rationalized.

If there is a need to make a leap of faith, a machine cannot do it.

And the one thing human beings have that machines lack: a sense of humour!

4. How do you approach the balance between quality and quantity of data in your business? How have you got it wrong, and right?

Suresh: IMHO, this is a false trade-off. In most cases, more data wins. But if you want in-depth, highly specific answers, then you need quality.

Raymond: As Suresh commented, more data is usually better, but only on the ingestion side! A fundamental principle in statistics is that you’ll get better estimates and more reliable inferences the more data you use in your analysis. This works well if you’re doing statistics and feeding more data into the same analytics; however, if you’re just plugging in more data sources in parallel and hosing them down the line to a human operator, then you can easily overwhelm them.
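The statistical principle Raymond mentions can be illustrated with the standard error of a sample mean. As a simplifying assumption (not something stated in the panel), take n independent observations x_1, …, x_n, each with the same standard deviation σ:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\operatorname{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
```

Under that assumption the uncertainty of the estimate shrinks with the square root of the sample size, so quadrupling the amount of data halves the standard error. A human operator's attention does not scale the same way, which is the asymmetry Raymond is pointing at.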

A simple example of this is in alerting: our platform can detect anomalies, threshold violations, and so on, and raise them as alerts, either to a human operator or to a fault collector. When we introduce a new analytic in our platform (we might expand the range of protocol errors we detect, or discover a useful correlation that provides an early indication of an incipient problem) and use that new analytic to generate alerts, it’s easy to light up the fault collector dashboard. That noise is counterproductive: it’s distracting, and can swamp other alerts that are better tuned to guide remedial action. This is an example of more data that is not necessarily better data; the right way to deal with it, of course, is more analysis, perhaps aggregating these alerts into related groups that an operator can digest and act on as a whole.
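The idea of aggregating alerts into related groups can be sketched as follows. This is a minimal illustration under assumed conventions, not Corvil's method: the alert fields (`ts`, `source`, `kind`), the 60-second window and the function name `group_alerts` are all hypothetical.

```python
from collections import defaultdict

def group_alerts(alerts, window_seconds=60):
    """Aggregate alerts that share a (source, kind) key into time-bounded groups.

    alerts: iterable of dicts with 'ts' (seconds), 'source' and 'kind'
    (a hypothetical schema). An alert joins the latest group for its key if
    it arrives within `window_seconds` of that group's first alert;
    otherwise it starts a new group.
    """
    groups_by_key = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["source"], alert["kind"])
        key_groups = groups_by_key[key]
        if key_groups and alert["ts"] - key_groups[-1]["first_ts"] <= window_seconds:
            # Same burst: bump the counter instead of raising a new alert.
            key_groups[-1]["count"] += 1
            key_groups[-1]["last_ts"] = alert["ts"]
        else:
            key_groups.append({
                "source": alert["source"],
                "kind": alert["kind"],
                "first_ts": alert["ts"],
                "last_ts": alert["ts"],
                "count": 1,
            })
    return [group for key_groups in groups_by_key.values() for group in key_groups]

# Made-up alert stream: a burst of protocol errors, one latency alert,
# and a later, unrelated protocol error.
raw_alerts = [
    {"ts": 0, "source": "gw1", "kind": "protocol_error"},
    {"ts": 10, "source": "gw1", "kind": "protocol_error"},
    {"ts": 5, "source": "gw2", "kind": "latency"},
    {"ts": 20, "source": "gw1", "kind": "protocol_error"},
    {"ts": 200, "source": "gw1", "kind": "protocol_error"},
]
grouped = group_alerts(raw_alerts)
for group in grouped:
    print(group)
```

Five raw alerts collapse into three groups here, so the operator sees one entry for the burst with a count instead of three separate lines.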

On the question of data quality, it turns out that you need to consider carefully how you collect data from the network. When drinking from a fire hose, you’re bound to spill some, and the network is absolutely a fire hose of data. The risks are that you drop data – this is obviously a problem, but luckily also a very obvious problem; more problematic are ways that the data ends up compromised in non-evident ways. Maybe the timestamps you’re using as the basis of your forensic analysis are compromised, and your whole analysis of cause and effect is thrown off. What appear to be production issues may actually be problems in your data collection.

This is something that we have done well at Corvil: we health-check all our data-sources in multiple ways, and ensure that we detect any issues in the data collection process. It’s something our competitors have not done as well, and has been a strong differentiator for us.
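As an illustration of what a health check on a data source might look for, the sketch below flags two of the failure modes Raymond describes: dropped records and compromised timestamps. It is not Corvil's implementation. It assumes, hypothetically, that each captured record carries a sequence number and a timestamp, and the function name `check_capture_health` is invented for the example.

```python
def check_capture_health(records):
    """Flag gaps and timestamp regressions in a captured stream.

    records: list of (seq, ts) tuples in arrival order (hypothetical format),
    where `seq` should increase by exactly 1 and `ts` should never decrease.
    Returns a list of (issue, previous_seq, current_seq) tuples.
    """
    issues = []
    for (prev_seq, prev_ts), (seq, ts) in zip(records, records[1:]):
        if seq != prev_seq + 1:
            # A jump in sequence numbers suggests records were dropped.
            issues.append(("gap", prev_seq, seq))
        if ts < prev_ts:
            # Time running backwards undermines any cause-and-effect analysis.
            issues.append(("timestamp_regression", prev_seq, seq))
    return issues

# Made-up capture: record 3 is missing, and record 5 is stamped
# earlier than record 4.
capture = [(1, 0.001), (2, 0.002), (4, 0.004), (5, 0.003)]
problems = check_capture_health(capture)
print(problems)
```

The gap is the obvious kind of problem; the timestamp regression is the non-evident kind, since every record is present and only the ordering in time is wrong.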

Linda: If I am honest, we are always trying to get more of both! Every 6 months we run our customer survey to get feedback from our developers about our products and their requirements. Apart from Umeng Analytics, we also provide a “feedback SDK” that enables developers to add a feedback function to their app, so they get instant feedback from their users about it. We are far from satisfied and are always trying to push the data forward to help our users.

Having said that, quantity is nothing without quality. If the data is of poor quality then it is useless to our users. Our users demand accurate data, and we demand that of ourselves too.