We’d like to hear your story of how you got into data science. What motivated you to work in data science?
As a professional research astronomer, I have always worked with data. For nearly 20 years, I worked with a lot of data in my day jobs: I worked on large scientific data systems with NASA, which included work on the Hubble Space Telescope and the National Space Science Data Center. During those years, I began to notice the rapidly growing size of the data sets that we were collecting from different experiments. At the same time, I became aware of similar data growth patterns in other industries, organizations, and domains. As I investigated the consequences of these massive data volumes, I started to read about and research data mining methods and machine learning algorithms that could handle such data sets.
I have been interested in scientific discovery from data for as long as I can remember, and now there was a huge opportunity to discover the potential of data in all disciplines and application areas. It was incredibly fascinating and exhilarating to be at the leading edge of simultaneous major scientific and business revolutions. I had been working on these topics for several years when, by 2002, I decided to focus all of my attention and efforts in the direction of data science.
We know that enterprise data science is not just about Hadoop, dashboards, reports, ad-hoc queries, models, or algorithms; it goes beyond all that. Can you please explain what enterprise data science is all about?
For me, enterprise data science is similar to enterprise I.T. In other words, the functions, people, and processes of the enterprise activity serve the whole organization. It is not just a “side project” assigned to an R&D lab. Enterprise data science is a way to recognize the critical value and strategic asset that data brings to the organization. Organizations are now embracing a data-driven culture that learns from, makes discoveries and decisions from, and innovates with the help of the data that is being collected from all business units. Nobody would say that I.T. serves only one piece of the business – it definitely serves the whole business. As such, it is vital to the successful operation and mission of the business.
Data science is now reaching a similar status. I have authored a “data science declaration” that sums up the idea of enterprise data science for me: “Now is the time to begin thinking of Data Science as a profession not a job, as a corporate culture not a corporate agenda, as a strategy not a stratagem, as a core competency not a course, and as a way of doing things not a thing to do.” (Reference: http://rocketdatascience.org/?p=169)
How does Booz Allen Hamilton use data science?
Booz Allen Hamilton (BAH) established a NextGen Analytics and Data Science team within its Strategic Innovation Group in 2013, under the leadership of firm partners, Executive VPs, and VPs. This extraordinarily dedicated effort and focus are further strengthened by a data science team that includes over 500 data scientists, which represents less than half of all the data scientists within the firm. BAH believes that “data science is a team sport”. There are many roles, many application areas, many skills, and many skill levels across such a large team, and everyone’s contribution is important.
The uses of data science within BAH are essentially unlimited – whatever our clients need, in whatever domain they need it, and to whatever depth of mathematical or computational complexity that is required, BAH is there to serve it. It would be impossible to name all of the areas where data science is used in BAH, but the list would include: cybersecurity, healthcare, professional sports, retail, manufacturing, transportation, connected cars, internet of things, pharma, oil and gas, human capital and recruiting, human resources and organizational science, social sciences, national intelligence, geospatial modeling, quantum computing, data for societal good, satellite imagery, process mining, cognitive computing, procurement, fraud and compliance checking, and so much more. I am sure that I have missed a few, but that is just an indication of how broad and deep the expertise and uses go.
What skills should business executives have, to communicate effectively with data scientists?
Data literacy is a fundamental skill that is even more important than raw data science skills (machine learning, applied math, computational programming, and data systems technologies). Communication works in both directions: data scientists should have some domain knowledge of the business to communicate effectively with business executives, and business executives should know basic data science concepts to communicate with data scientists. These concepts include data types, modeling approaches (supervised vs unsupervised), training vs testing datasets, model accuracy vs precision, bias vs variance, descriptive vs predictive analytics, and basic statistics.
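One of the concept pairs above, model accuracy vs precision, can be made concrete with a small sketch. It assumes the classification reading of those terms (accuracy as the share of all predictions that are correct, precision as the share of positive predictions that are correct). The fraud-screening numbers are hypothetical and invented purely for illustration; they do not come from the interview.

```python
def accuracy_and_precision(y_true, y_pred):
    """Return (accuracy, precision) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    accuracy = (tp + tn) / len(pairs)
    # Precision is undefined when nothing is flagged; report 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Hypothetical data: 100 transactions, of which 5 are truly fraudulent (label 1).
y_true = [1] * 5 + [0] * 95
# Hypothetical model output: it flags 10 transactions, only 2 of them real fraud.
y_pred = [1, 1, 0, 0, 0] + [1] * 8 + [0] * 87

accuracy, precision = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

In this made-up case the model is right on 89 of 100 transactions, yet only 2 of its 10 fraud alerts are genuine. That gap is the kind of distinction an executive benefits from recognizing when a data scientist reports a single headline number.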
How is big data changing the world? Please give us interesting examples.
Big data is a concept that represents the fact that almost everything is being measured, quantified, and tracked. This includes social, mobile, and online data collection on persons, things, and processes. These numerous data sources enable the discovery of links and connections across different domains and different parts of the business. We see examples of unexpected discoveries in medicine that are found by following the links between different research studies or different patient records. We see better customer engagement and customer experiences from retailers who use their data collections to identify their customers’ interests, intents, and preferences better than ever.
We see improved manufacturing, machine operations, and supply chain efficiencies in many industries that are paying attention to signals in the data that alert the business to changing conditions, evolving demand, predictive maintenance requirements, evolving consumer sentiment, and more. We see applications of discovery from data that lead to improved delivery of medicines, natural resources, education, financial investments, and basic services to communities and people in need. Personalized health devices are generating massive amounts of personal information that are leading to improved personal health and quality of life for many persons. Massive financial savings (in the tens of billions of dollars, at least) have been accrued through better financial fraud modeling and detection in the insurance and payment industries. In short, there is almost no sector of society that is not considering (or planning for) greater operational efficiencies, significant cost savings, and better outcomes through data analytics. We will see these benefits, applications, and discoveries explode as massive streams of data from the Internet of Things begin to get exploited through fast analytics within the next decade.
What are your go-to tools for doing data science?
I use basic tools (not too many advanced tools) because my role is primarily that of an instructor, mentor, and advisor. My goal is to communicate “what to do”, “why do it”, and “which methods to use”, with a bit of “how to do it”. I use some simple scripting languages like MATLAB, and I am beginning to learn more Python – I use these for project-specific Markov modeling and Monte Carlo modeling. I also use data mining packages that are easily accessible online, plus some statistical modeling packages with specific capabilities (e.g., Bayesian modeling, regression). Since my role in my organization is more senior and advisory, that works for me. This is not typical for most data scientists.
In your TEDx talk “Big data, small world” you talked about novelty discovery, class discovery, and association discovery, but you gave a special mention to association discovery. Is there a specific reason for that?
Association discovery is one of the four basic methods of unsupervised discovery (i.e., finding patterns and trends in data without requiring labeled training data). The four methods are class discovery (i.e., clustering or segmentation), novelty discovery (i.e., outlier detection), correlation discovery (e.g., Principal Component Analysis, or Regression Modeling), and finally association discovery (i.e., link mining, or market basket analysis). I focused on the last of these (association mining) in my TEDx talk for two primary reasons: (1) I wanted to present several fun and interesting examples of association mining that work well with a general audience; and (2) my talk was specifically focused on how we can use big data to discover interesting connections, relationships, and associations among persons, places, products, and events. My talk mentioned the well-known concept of “six degrees of separation” as an illustration of how a similar concept (association discovery) in big data is producing some amazingly useful and interesting discoveries.
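To give a feel for what association discovery (market basket analysis) computes, here is a minimal sketch in plain Python. It is not the method used in the talk; it simply scores item pairs with the standard support, confidence, and lift measures. The shopping baskets, the function name `association_rules`, and the `min_support` threshold are all assumptions made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy shopping baskets, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def association_rules(transactions, min_support=0.4):
    """Score every single-item rule lhs -> rhs whose item pair is frequent enough."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        item_counts.update(basket)
        # Sort so each unordered pair is always counted under the same key.
        pair_counts.update(combinations(sorted(basket), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]       # estimate of P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # above 1 means positive association
            rules.append((lhs, rhs, support, confidence, lift))
    # Strongest associations first.
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


rules = association_rules(transactions)
for lhs, rhs, support, confidence, lift in rules[:4]:
    print(f"{lhs} -> {rhs}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items appear together more often than they would if they were bought independently, which is the sense in which an association is "interesting". Real applications typically use a dedicated frequent-itemset algorithm rather than brute-force pair counting, which is only practical here because the data set is tiny.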
Data Scientist has been called the sexiest job of the 21st century. Do you agree? What advice would you give to people aspiring to a long career in Data Science?
Yes, I agree with that label. For me, that means that data science is attracting the best and brightest young people (as well as more experienced persons) to the field. These individuals see many opportunities for data scientists and how interesting, exciting, powerful, beneficial, and diverse those options can be. The field is so much better and so much richer, thanks to these persons’ contributions. I am constantly in awe of this new breed of data scientists, who (in some cases) are younger than my own children! I encourage aspiring career data scientists to be lifelong learners, to continue to explore new ways of doing things, to have a broad vision of where their talents can deliver great benefits to others, and to focus on developing the top four C’s of data science aptitudes: being a Curious, Communicative, Creative, and Courageous problem-solver.