Time to Dive In: Leveraging public data with a data lake

18th Nov '16, 04:39 PM in Data Science

Anand Rao, PwC, Contributor

This post has been co-authored by Adrian Botta and Meera Lakhavani.

Real-time, granular consumer insights are invaluable. Companies may know their customers’ basic demographics, but what if they leveraged customer travel patterns, activities, and situational preferences? They would be able to predict purchase behavior, conduct more accurately targeted marketing, and understand their consumers better. A data lake puts business insights such as these within reach.

A data lake is a flexible repository that stores data in multiple formats and in its original state until it is needed for analysis. Leveraging a data lake structure, we collected data from numerous public sources to see what it could reveal about consumer behavior. A good description of data lake architecture and its benefits can be found in The future of big data: data lakes and The data lake: No longer a pipe dream for today’s enterprises.

The role of a data lake in a business setting is to efficiently and inexpensively provide a wealth of insight into the industry of interest. This allows for fast querying and fact checking to validate, support, and even fuel business endeavors. One of the five leading data lake practices is to take a business-centric approach (see 5 leading practices for data lakes for more details).

Using a data lake to explore urban consumer patterns

To explore the capabilities of a data lake in a business context, we defined a use case and a set of goals. The primary goal was to understand more about consumer patterns throughout a city. We pursued it by analyzing and visualizing publicly available data to create a bird’s-eye view of a city, which left us the flexibility to connect the data lake to specific applications and insights from at least one industry. After reviewing the open data sources of eight major cities across the globe, we chose to build our initial data lake from New York City’s open data because of its abundance of information.

Google Cloud Platform served as a quick and convenient data lake. The platform let us scrape, download, and store data directly in the cloud. We then processed the data and kept track of the various versions of each data set using Google Compute Engine, which allowed us to create computing instances capable of handling large data sets and moving data quickly to and from storage.

Data collection included transportation, business, health, safety, demographic, and other categories of data to help us paint a holistic view of the consumer. We chose to follow individual consumers at a very granular level by analyzing NYC taxi cab and CitiBike data instead of viewing the high-level flow of thousands of consumers from subway turnstile data. Most of our data was in CSV format; however, the location and Twitter data were in JSON and text formats.
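As a rough illustration of what mixing these formats involves, the sketch below reads a small CSV sample and a small newline-delimited JSON sample into one shared record shape, using only the Python standard library. The sample rows, field names, and the `load_taxi_csv` and `load_tweets_jsonl` helpers are assumptions made for this example; they are not the actual schemas of the NYC or Twitter data sets.

```python
import csv
import io
import json

# Hypothetical sample inputs. Field names are assumptions for illustration,
# not the real schemas of the public data sets.
TAXI_CSV = """dropoff_datetime,dropoff_latitude,dropoff_longitude
2016-05-01 08:15:00,40.7580,-73.9855
2016-05-01 09:40:00,40.7128,-74.0060
"""

TWEETS_JSONL = """{"created_at": "2016-05-01 08:20:00", "lat": 40.7590, "lon": -73.9845}
{"created_at": "2016-05-01 10:05:00", "lat": 40.7306, "lon": -73.9866}
"""


def load_taxi_csv(text):
    """Parse CSV text into records with a shared (source, timestamp, lat, lon) shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "source": "taxi",
            "timestamp": row["dropoff_datetime"],
            "lat": float(row["dropoff_latitude"]),
            "lon": float(row["dropoff_longitude"]),
        }
        for row in reader
    ]


def load_tweets_jsonl(text):
    """Parse newline-delimited JSON into the same record shape."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        records.append(
            {
                "source": "twitter",
                "timestamp": obj["created_at"],
                "lat": float(obj["lat"]),
                "lon": float(obj["lon"]),
            }
        )
    return records


# One combined list that downstream steps can treat uniformly.
records = load_taxi_csv(TAXI_CSV) + load_tweets_jsonl(TWEETS_JSONL)
```

Normalizing early to a common shape is one possible design choice; a data lake can equally keep each source in its raw form and reconcile the schemas at query time.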

We used Apache Spark, a big data computing framework, through its Python API (PySpark) as our data munging tool. This let us process hundreds of millions of rows of transportation and social media data with ease. We used Python’s pandas library on some of our smaller, more processed data sets to pull and show insights. Transportation data was linked to the rest of the data sets through reverse geocoding directly on the cloud platform.
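To make the reverse geocoding step more concrete, here is a minimal sketch that assigns a coordinate to the ZIP code with the nearest centroid. This is a simplification chosen for brevity, not necessarily the method used in the project: production reverse geocoding typically tests points against ZIP code boundary polygons or calls a geocoding service. The centroid coordinates are rough, hypothetical values, and `haversine_km` and `nearest_zip` are names introduced for this example.

```python
import math

# Hypothetical, approximate ZIP code centroids (illustration only).
ZIP_CENTROIDS = {
    "10036": (40.7590, -73.9890),  # roughly the Times Square area
    "10007": (40.7135, -74.0080),  # roughly the City Hall area
    "11201": (40.6940, -73.9900),  # roughly Downtown Brooklyn
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_zip(lat, lon, centroids=ZIP_CENTROIDS):
    """Return the ZIP code whose centroid is closest to (lat, lon)."""
    return min(
        centroids,
        key=lambda zip_code: haversine_km(lat, lon, *centroids[zip_code]),
    )


# Example: tag a taxi drop-off with a ZIP code so it can be joined
# to demographic or business data keyed by ZIP.
dropoff_zip = nearest_zip(40.7580, -73.9855)
```

Once every ride carries a ZIP code, it can be joined to any other data set keyed the same way. At the scale described here, the same lookup would be applied inside a distributed job rather than in a plain Python loop.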

While traditional location-based data sets contain data on consumers at home, we were able to create a dynamic view of how consumers with different demographics move throughout parts of the city. This allowed us to see where certain segments of customers were throughout the day, and which ZIP codes were receiving higher-income consumers over the weekends versus the weekdays. To explore a more specific application, we linked NYC hotel data to taxi rides to and from the airport to get a market share approximation of the hotels in the Times Square area. Using the data lake, we were able to take static data sets and link them to one another to create powerful, granular views of consumers throughout the day.
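One way such a market share approximation could be computed is sketched below: each taxi drop-off that originated at an airport is matched to the nearest hotel, counted only if it falls within a small radius, and the counts are converted to shares. The hotel names, coordinates, drop-off points, the 75-meter threshold, and the `distance_m` and `hotel_share` helpers are all hypothetical; the original analysis may have used a different matching rule.

```python
import math
from collections import Counter

# Hypothetical hotels and airport-originating drop-offs (illustration only).
HOTELS = {
    "Hotel A": (40.7570, -73.9860),
    "Hotel B": (40.7595, -73.9845),
}

AIRPORT_DROPOFFS = [
    (40.7571, -73.9861),
    (40.7569, -73.9859),
    (40.7570, -73.9862),
    (40.7596, -73.9846),
    (40.7500, -73.9900),  # too far from either hotel to be attributed
]


def distance_m(lat1, lon1, lat2, lon2):
    """Approximate distance in meters (equirectangular; fine at city-block scale)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6371000.0 * math.hypot(dx, dy)


def hotel_share(dropoffs, hotels, max_distance_m=75.0):
    """Attribute each drop-off to its nearest hotel within the radius, then return shares."""
    counts = Counter()
    for lat, lon in dropoffs:
        name, dist = min(
            ((n, distance_m(lat, lon, hlat, hlon)) for n, (hlat, hlon) in hotels.items()),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance_m:
            counts[name] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts[name] / total for name in hotels}


shares = hotel_share(AIRPORT_DROPOFFS, HOTELS)
```

The result is only a proxy: it captures guests who arrive by taxi from the airport, so hotels whose guests favor other modes of transport would be underrepresented.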

Taking the hospitality industry application a step further, hotels can link the data from their consumer-facing apps to the data lake to help recommend places to visit, restaurants to try, or events that the customer would be interested in. The hotel could automate a rewards structure tailored to the customer’s behavior. The potential applications of consumer behavior vectors are wide-ranging. Understanding where and why potential customers or sub-segments of customers are located during specific times can help determine attitudes, activities, and purchase preferences. This allows for more effective targeted marketing: the ability to know that every marketing dollar spent is reaching a customer who is interested in the product offering.

Creating a data lake from public data to understand general customer behaviors can help target consumers better for marketing, determine where to open a new store, or even reveal new trends in consumer behavior. But applying it to a specific industry and linking it to private data is where the lake really starts to shine.