Big Data A to Z: A glossary of Big Data terminology

4th Jul '14, 02:07 PM in Resources

Baiju NT Contributor

This is an almost complete glossary of Big Data terminology in wide use today. Let us know if you would like to add any big data terms missing from this list.

ACID test

A test applied to data for atomicity, consistency, isolation, and durability
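Atomicity, the first of the four properties, can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names and the amounts are hypothetical; the point is only that a transfer that fails halfway leaves no partial change behind.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so that a failed debit has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # rolled back as a whole
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob has already been applied when the debit is rejected, and the rollback undoes it, so the balances reflect only the first transfer.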

Aggregation

A process of searching, gathering and presenting data

Algorithm

A mathematical formula placed in software that performs an analysis on a set of data.

Anonymization

The severing of links between people in a database and their records to prevent the discovery of the source of the records.

Artificial Intelligence

Developing intelligent machines and software that are capable of perceiving the environment, taking corresponding action when required, and even learning from those actions.

Automatic identification and capture (AIDC)

Any method of automatically identifying and collecting data on items, and then storing the data in a computer system. For example, a scanner might collect data about a product being shipped via an RFID chip.

Avro

Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls

Behavioral analytics

Using data about people’s behavior to understand intent and predict future actions.

Big Data Scientist

Someone who is able to develop the algorithms to make sense out of big data.

Business Intelligence (BI)

The general term used for the identification, extraction, and analysis of data.

Cascading
 
Cascading provides a higher level of abstraction for Hadoop, allowing developers to create complex jobs quickly, easily, and in several different languages that run in the JVM, including Ruby, Scala, and more. In effect, this has shattered the skills barrier, enabling Twitter to use Hadoop more broadly.

Call Detail Record (CDR) analysis

CDRs contain data that a telecommunications company collects about phone calls, such as time and length of call. This data can be used in any number of analytical applications.

Cassandra

Cassandra is an open source distributed database designed to handle large amounts of data across commodity servers while providing a highly available service. It is a NoSQL solution that was initially developed by Facebook. Its data model is based on key-value pairs.

Cell phone data

Cell phones generate a tremendous amount of data, and much of it is available for use with analytical applications.

Clickstream Analytics

The analysis of users’ Web activity through the items they click on a page.

Classification analysis

A systematic process for obtaining important and relevant information about data, also called metadata: data about data.

Cloud computing

A distributed computing system over a network used for storing data off-premises

Clustering analysis

The process of identifying objects that are similar to each other and clustering them in order to understand the differences as well as the similarities within the data.

Cold data storage

Storing old, rarely used data on low-power servers. Retrieving the data takes longer.

Comparative analysis

A step-by-step procedure of comparisons and calculations used to detect patterns within very large data sets.

Chukwa
 
Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop distributed filesystem (HDFS) and MapReduce framework and inherits Hadoop’s scalability and robustness.  Chukwa also includes a flexible and powerful toolkit for displaying monitoring and analyzing results, in order to make the best use of this collected data.

Clojure

Clojure is a dynamic programming language based on LISP that uses the Java Virtual Machine (JVM). It is well suited for parallel data processing.

Cloud

A broad term that refers to any Internet-based application or service that is hosted remotely.

Columnar database or column-oriented database

A database that stores data by column rather than by row. In a row-based database, a row might contain a name, address, and phone number. In a column-oriented database, all names are in one column, addresses in another, and so on. A key advantage of a columnar database is faster disk access for analytical queries, because a query reads only the columns it needs.
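The difference between the two layouts can be sketched in a few lines of Python. This is an in-memory illustration with made-up records, not how any particular database stores data on disk.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"name": "Ana", "city": "Lisbon", "age": 31},
    {"name": "Ben", "city": "Oslo", "age": 45},
    {"name": "Chen", "city": "Taipei", "age": 26},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

# Column-oriented layout: all values of one field sit next to each other.
columns = to_columns(rows)

# An aggregate over one field touches a single column and ignores the rest.
average_age = sum(columns["age"]) / len(columns["age"])
print(columns["name"], average_age)
```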

Comparators

In Hadoop, there are two ways to compare keys: by implementing the WritableComparable interface or by implementing the RawComparator interface. In the former approach, you compare deserialized objects; in the latter, you compare the keys using their raw bytes.
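The two approaches can be sketched in Python rather than Hadoop's Java API. The sketch assumes keys are unsigned 32-bit integers serialized in big-endian order, an encoding chosen here because it makes byte order agree with numeric order; the function names are hypothetical.

```python
import struct
from functools import cmp_to_key

def serialize(key):
    """Encode an unsigned 32-bit integer key as 4 big-endian bytes."""
    return struct.pack(">I", key)

def compare_objects(a, b):
    """Deserialize both keys first, then compare the resulting integers."""
    x = struct.unpack(">I", a)[0]
    y = struct.unpack(">I", b)[0]
    return (x > y) - (x < y)

def compare_raw(a, b):
    """Compare the serialized bytes directly, with no deserialization."""
    return (a > b) - (a < b)

keys = [serialize(k) for k in (3, 70000, 256, 1)]
by_objects = sorted(keys, key=cmp_to_key(compare_objects))
by_raw = sorted(keys, key=cmp_to_key(compare_raw))
decoded = [struct.unpack(">I", k)[0] for k in by_raw]
print(by_objects == by_raw, decoded)
```

Comparing raw bytes avoids creating an object for every comparison, which matters when sorting very many keys. It only gives the right order when the serialized form is designed for it, as the big-endian encoding is here.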

Complex event processing (CEP)

CEP is the process of monitoring and analyzing all events across an organization’s systems and acting on them when necessary in real time.

Confabulation

The act of making an intuition-based decision appear to be data-based.

Cross-channel analytics

Analysis across multiple sales channels that can attribute sales, show average order value, or show customer lifetime value.

Data access

The act or method of viewing or retrieving stored data.

Dashboard

A graphical representation of the analyses performed by the algorithms

Data aggregation

The act of collecting data from multiple sources for the purpose of reporting or analysis.

Data architecture and design

How enterprise data is structured. The actual structure or design varies depending on the eventual end result required. Data architecture has three stages or processes: the conceptual representation of business entities, the logical representation of the relationships among those entities, and the physical construction of the system to support the functionality.

Database

A digital collection of data and the structure around which the data is organized. The data is typically entered into and accessed via a database management system (DBMS).

Database administrator (DBA)

A person, often certified, who is responsible for supporting and maintaining the integrity of the structure and content of a database.

Database as a service (DaaS)

A database hosted in the cloud and sold on a metered basis. Examples include Heroku Postgres and Amazon Relational Database Service.

Database management system (DBMS)

Software that collects and provides access to data in a structured format.

Data center

A physical facility that houses a large number of servers and data storage devices. Data centers might belong to a single organization or sell their services to many organizations.

Data cleansing

The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data, and provide more consistency.
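A minimal sketch of these steps in Python follows. The records and field names are invented, and the rules (e-mail address as the duplicate key, "unknown" as the filler for missing values) are assumptions made for the example.

```python
def cleanse(records):
    """Normalize names and e-mails, drop duplicates, fill missing countries."""
    seen_emails = set()
    cleaned = []
    for record in records:
        # Collapse stray whitespace and use consistent capitalization.
        name = " ".join(record["name"].split()).title()
        email = record["email"].strip().lower()
        if email in seen_emails:
            continue  # duplicate entry, keep only the first occurrence
        seen_emails.add(email)
        cleaned.append({
            "name": name,
            "email": email,
            "country": record.get("country") or "unknown",
        })
    return cleaned

raw = [
    {"name": "  ada   lovelace ", "email": "Ada@Example.com ", "country": "UK"},
    {"name": "Ada Lovelace", "email": "ada@example.com", "country": None},
    {"name": "alan turing", "email": "alan@example.com", "country": ""},
]
cleaned = cleanse(raw)
print(cleaned)
```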

Data collection

Any process that captures any type of data.

Data custodian

A person responsible for the database structure and the technical environment, including the storage of data.

Data-directed decision making

Using data to support making crucial decisions.

Data exhaust

The data that a person creates as a byproduct of a common activity–for example, a cell call log or web search history.

Data feed

A means for a person to receive a stream of data. Examples of data feed mechanisms include RSS or Twitter.

Data governance

A set of processes or rules that ensure the integrity of the data and that data management best practices are met.

Data integration

The process of combining data from different sources and presenting it in a single view.

Data integrity

The measure of trust an organization has in the accuracy, completeness, timeliness, and validity of the data.

Data mart

The access layer of a data warehouse used to provide data to users.

Data migration

The process of moving data between different storage types or formats, or between different computer systems.

Data mining

The process of deriving patterns or knowledge from large data sets.

Data model, data modeling

A data model defines the structure of the data for the purpose of communicating between functional and technical people to show data needed for business processes, or for communicating a plan to develop how data is stored and accessed among application development team members.

Data point

An individual item on a graph or a chart.

Data profiling

The process of collecting statistics and information about data in an existing source.

Data quality

The measure of data to determine its worthiness for decision making, planning, or operations.

Data replication

The process of sharing information to ensure consistency between redundant sources.

Data repository

The location of permanently stored data.

Data science

A recent term that has multiple definitions, but is generally accepted as a discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.

Data scientist

A practitioner of data science.

Data security

The practice of protecting data from destruction or unauthorized access.

Data set

A collection of data, typically in tabular form.

Data source

Any provider of data–for example, a database or a data stream.

Data steward

A person responsible for data stored in a data field.

Data structure

A specific way of storing and organizing data.

Data visualization

A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.

Data warehouse

A place to store data for the purpose of reporting and analysis.

De-identification

The act of removing all data that links a person to a particular piece of information.

Demographic data

Data relating to the characteristics of a human population.

Deep Thunder

IBM’s weather prediction service that provides weather data to organizations such as utilities, which use the data to optimize energy distribution.

Distributed cache

A data cache that is spread across multiple systems but works as one. It is used to improve performance.

Distributed object

A software module designed to work with other distributed objects stored on other computers.

Distributed processing

The execution of a process across multiple computers connected by a computer network.

Distributed File System

Systems that offer simplified, highly available access to storing, analysing and processing data

Document Store Databases

A document-oriented database that is especially designed to store, manage and retrieve documents, also known as semi-structured data.

Document management

The practice of tracking and storing electronic documents and scanned images of paper documents.

Drill

An open source distributed system for performing interactive analysis on large-scale datasets. It is similar to Google’s Dremel, and is managed by Apache.

Elasticsearch

An open source search engine built on Apache Lucene.

Event analytics

Shows the series of steps that led to an action.

Exabyte

One million terabytes, or 1 billion gigabytes of information.

External data

Data that exists outside of a system.

Extract, transform, and load (ETL)

A process used in data warehousing to prepare data for use in reporting or analytics.

Exploratory analysis

Finding patterns within data without standard procedures or methods. It is a means of exploring the data and finding the data set's main characteristics.

Failover

The automatic switching to another computer or node should one fail.
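A minimal sketch of the idea in Python: the caller tries each node in turn and moves on when one is unreachable. The two node functions are hypothetical stand-ins for real servers.

```python
def call_with_failover(nodes, request):
    """Send the request to the first node that responds."""
    for node in nodes:
        try:
            return node(request)
        except ConnectionError:
            continue  # this node failed, switch to the next one
    raise ConnectionError("all nodes failed")

def primary(request):
    # Simulates a node that is down.
    raise ConnectionError("primary is down")

def standby(request):
    return f"standby handled {request}"

result = call_with_failover([primary, standby], "query-1")
print(result)
```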

Flume

Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure (inside web servers, application servers and mobile devices, for example) to collect data and integrate it into Hadoop.

Grid computing

The performing of computing functions using resources from multiple distributed systems. Grid computing typically involves large files and is most often used for multiple applications. The systems that comprise a grid computing network do not have to be similar in design or in the same geographic location.

Graph Databases

They use graph structures (a finite set of ordered pairs of certain entities), with nodes, edges and properties, for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighbouring element.

Hadoop

An open source software library project administered by the Apache Software Foundation. Apache defines Hadoop as “a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.”

Hama

Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations e.g., matrix, graph and network algorithms. It’s a Top Level Project under the Apache Software Foundation.

HANA

A software/hardware in-memory computing platform from SAP designed for high-volume transactions and real-time analytics.

HBase

HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.

HCatalog

HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.

HDFS (Hadoop Distributed File System)

HDFS (Hadoop Distributed File System), the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

Hive

Hive is a Hadoop-based, data-warehouse-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.

Hue

Hue (Hadoop User Experience) is an open source web-based interface that makes Apache Hadoop easier to use. It features a file browser for HDFS, an Oozie application for creating workflows and coordinators, a job designer/browser for MapReduce, a Hive and Impala UI, a shell, a collection of Hadoop APIs, and more.

Impala

Impala (By Cloudera) provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase using the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.

In-database analytics

The integration of data analytics into the data warehouse.

In-memory database

Any database system that relies on memory for data storage.

In-memory data grid (IMDG)

The storage of data in memory across multiple servers for the purpose of greater scalability and faster access or analytics.

Internet of Things

Ordinary devices that are connected to the internet at any time, anywhere, via sensors.

Kafka

Kafka (developed by LinkedIn) is a distributed publish-subscribe messaging system capable of handling all the data flow activity on a consumer website and processing that data. This type of data (page views, searches, and other user actions) is a key ingredient in the current social web.

Key Value Stores

Key value stores allow the application to store its data in a schema-less way. The data could be stored in a datatype of a programming language or an object. Because of this, there is no need for a fixed data model.

KeyValue Databases

They store each record under a primary key that uniquely identifies it, which makes lookups easy and fast. The value stored is normally some kind of primitive of the programming language.

Latency

Any delay in a response or delivery of data from one point to another.

Linked data

As described by World Wide Web inventor Tim Berners-Lee, “Cherry-picking common attributes or languages to identify connections or relationships between disparate sources of data.”

Load balancing

The process of distributing workload across a computer network or computer cluster to optimize performance.
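One of the simplest distribution strategies is round-robin, sketched below in Python. The server names and request identifiers are hypothetical, and real load balancers usually also account for server health and current load.

```python
from collections import Counter
from itertools import cycle

def round_robin(servers, requests):
    """Assign each request to the next server in a repeating rotation."""
    rotation = cycle(servers)
    return {request: next(rotation) for request in requests}

assignment = round_robin(["node-a", "node-b", "node-c"], range(7))
load = Counter(assignment.values())
print(assignment)
print(load)
```

With seven requests and three servers, no server ends up with more than one request above any other.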

Location analytics

Location analytics brings mapping and map-driven analytics to enterprise business systems and data warehouses. It allows you to associate geospatial information with datasets.

Location data

Data that describes a geographic location.

Log file

A file that a computer, network, or application creates automatically to record events that occur during operation–for example, the time a file is accessed.

Machine-generated data

Any data that is automatically created from a computer process, application, or other non-human source.

Machine2Machine data

Data exchanged between two or more machines that are communicating with each other.

Machine learning

The use of algorithms to allow a computer to analyze data for the purpose of “learning” what action to take when a specific pattern or event occurs.

MapReduce

MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
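The two phases can be imitated on a single machine with a word count, the customary first MapReduce example. This is a plain Python sketch of the programming model, not Hadoop's API; the function names and input lines are made up, and the grouping step in the middle stands in for the shuffle that the framework performs between the phases.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: aggregate all values emitted for one key."""
    return word, sum(counts)

lines = ["big data is big", "data about data"]

# On a cluster, each line (or block of lines) would be mapped on a different node.
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```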

Mashup

The process of combining different datasets within a single application to enhance output–for example, combining demographic data with real estate listings.

Mahout

Mahout is a data mining library. It takes the most popular data mining algorithms for clustering, regression and statistical modeling and implements them using the MapReduce model.

Metadata

Data about data; gives information about what the data is about.

MongoDB

MongoDB is an open source, document-oriented NoSQL database. It stores data as JSON-like documents with a dynamic schema (in a format MongoDB calls BSON), making the integration of data in certain applications easier and faster.

MPP database

A database optimized to work in a massively parallel processing environment.

Multi-Dimensional Databases

A database optimized for online analytical processing (OLAP) applications and for data warehousing.

MultiValue Databases

They are a type of NoSQL and multidimensional database that understands three-dimensional data directly. They are primarily giant strings that are perfect for manipulating HTML and XML strings directly.

Network analysis

Viewing relationships among the nodes in terms of the network or graph theory, meaning analysing connections between nodes in a network and the strength of the ties.

NewSQL

A class of relational database systems that aims to match the scalability of NoSQL systems while keeping SQL as the query language and preserving ACID guarantees. The term is newer than NoSQL.

NoSQL

NoSQL (commonly interpreted as “not only SQL”) is a broad class of database management systems identified by non-adherence to the widely used relational database management system model. NoSQL databases are not built primarily on tables, and generally do not use SQL for data manipulation.

Object Databases

They store data in the form of objects, as used by object-oriented programming. They are different from relational or graph databases, and most of them offer a query language that allows objects to be found with a declarative programming approach.

Object-based Image Analysis

Analysing digital images can be performed with data from individual pixels, whereas object-based image analysis uses data from a selection of related pixels, called objects or image objects.

Online analytical processing (OLAP)

The process of analyzing multidimensional data using three operations: consolidation (the aggregation of available data), drill-down (the ability for users to see the underlying details), and slice and dice (the ability for users to select subsets and view them from different perspectives).
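The three operations can be sketched on a tiny, invented sales data set with three dimensions (region, product, quarter). The helper name total_by is hypothetical, and a real OLAP engine would precompute and index such aggregates rather than scan a list.

```python
from collections import defaultdict

sales = [
    {"region": "EU", "product": "laptop", "quarter": "Q1", "amount": 100},
    {"region": "EU", "product": "phone", "quarter": "Q1", "amount": 50},
    {"region": "EU", "product": "laptop", "quarter": "Q2", "amount": 70},
    {"region": "US", "product": "laptop", "quarter": "Q1", "amount": 200},
    {"region": "US", "product": "phone", "quarter": "Q2", "amount": 30},
]

def total_by(facts, dimension):
    """Sum the amounts for each value of one dimension."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact["amount"]
    return dict(totals)

# Consolidation: aggregate everything up to the region level.
by_region = total_by(sales, "region")

# Drill-down: look inside one region to see the per-product details.
eu_by_product = total_by([f for f in sales if f["region"] == "EU"], "product")

# Slice and dice: select the Q1 subset and view it by region.
q1_by_region = total_by([f for f in sales if f["quarter"] == "Q1"], "region")

print(by_region, eu_by_product, q1_by_region)
```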

Online transactional processing (OLTP)

The process of providing users with access to large amounts of transactional data in a way that they can derive meaning from it.

OpenDremel

The open source version of Google’s BigQuery Java code. It is being integrated with Apache Drill.

Open Data Center Alliance (ODCA)

A consortium of global IT organizations whose goal is to speed the migration to cloud computing.

Operational data store (ODS)

A location to gather and store data from multiple sources so that more operations can be performed on it before sending to the data warehouse for reporting.

Oozie

Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig and Hive, then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
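The underlying idea, running a job only after the jobs it depends on have finished, amounts to ordering a dependency graph. The sketch below uses Python's standard graphlib module (Python 3.9 or later) with hypothetical job names; it does not use Oozie's own workflow definition format.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must complete before it may start.
dependencies = {
    "load_logs": set(),
    "load_users": set(),
    "clean_logs": {"load_logs"},
    "hive_report": {"clean_logs", "load_users"},
}

# static_order() yields every job after all of its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```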

Parallel data analysis

Breaking up an analytical problem into smaller components and running algorithms on each of those components at the same time. Parallel data analysis can occur within the same system or across multiple systems.
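The pattern is sketched below with Python's standard concurrent.futures module: the data is split into chunks, each chunk is summed by a separate worker, and the partial results are combined. The chunk size and worker count are arbitrary. Threads are used to keep the example short; for CPU-bound work in CPython a process pool would typically be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def parallel_sum(values, chunk_size=250, workers=4):
    """Sum each chunk in its own worker, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunked(values, chunk_size)))
    return sum(partial_sums)

total = parallel_sum(list(range(1000)))
print(total)
```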

Parallel method invocation (PMI)

Allows programming code to call multiple functions in parallel.

Parallel processing

The ability to execute multiple tasks at the same time.

Parallel query

A query that is executed over multiple system threads for faster performance.

Pattern recognition

The classification or labeling of an identified pattern in the machine learning process.

Pentaho

Pentaho offers a suite of open source Business Intelligence (BI) products called Pentaho Business Analytics providing data integration, OLAP services, reporting, dashboarding, data mining and ETL capabilities

Petabyte

One million gigabytes, or 1,000 terabytes (1,024 terabytes when counted in binary units).
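The figures quoted for large storage units depend on whether decimal (powers of 1,000) or binary (powers of 1,024) prefixes are meant. The arithmetic can be checked in a few lines of Python; the dictionary names are only for this example.

```python
# Sizes in bytes under decimal (SI) prefixes.
DECIMAL = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18}

# Sizes in bytes under binary (IEC) prefixes.
BINARY = {"GiB": 2**30, "TiB": 2**40, "PiB": 2**50, "EiB": 2**60}

pb_in_gb = DECIMAL["PB"] // DECIMAL["GB"]     # one million
pb_in_tb = DECIMAL["PB"] // DECIMAL["TB"]     # one thousand
pib_in_tib = BINARY["PiB"] // BINARY["TiB"]   # 1,024
eb_in_tb = DECIMAL["EB"] // DECIMAL["TB"]     # one million
eb_in_gb = DECIMAL["EB"] // DECIMAL["GB"]     # one billion

print(pb_in_gb, pb_in_tb, pib_in_tib, eb_in_tb, eb_in_gb)
```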

Pig

Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

Predictive analytics

Using statistical functions on one or more datasets to predict trends or future events.

Predictive modeling

The process of developing a model that will most likely predict a trend or outcome.

Public data

Public information or data sets that were created with public funding

Query

Asking for information to answer a certain question

Query analysis

The process of analyzing a search query for the purpose of optimizing it for the best possible result.

R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.

Re-identification

Combining several data sets to find a certain person within anonymized data

Real-time data

Data that is created, processed, stored, analysed and visualized within milliseconds

Recommendation engine

An algorithm that analyzes a customer’s purchases and actions on an e-commerce site and then uses that data to recommend complementary products.

Reference data

Data that describes an object and its properties. The object may be physical or virtual.

Risk analysis

The application of statistical methods on one or more datasets to determine the likely risk of a project, action, or decision.

Root-cause analysis

The process of determining the main cause of an event or problem.

Routing analysis

Finding the optimized routing using many different variables for a certain means of transport in order to decrease fuel costs and increase efficiency.

Scalability

The ability of a system or process to maintain acceptable performance levels as workload or scope increases.

Schema

The structure that defines the organization of data in a database system.

Search data

Aggregated data about search terms used over time.

Semi-structured data

Data that is not structured by a formal data model, but provides other means of describing the data and hierarchies.

Sentiment analysis

The application of statistical functions on comments people make on the web and through social networks to determine how they feel about a product or company.

Server

A physical or virtual computer that serves requests for a software application and delivers those requests over a network.

Spatial analysis

It refers to analysing spatial data, such as geographic data or topological data, to identify and understand patterns and regularities within data distributed in geographic space.

SQL

A programming language for retrieving data from a relational database

Sqoop

Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

Storm

Storm is a free, open source system for distributed real-time computation that was open-sourced by Twitter. Storm makes it easy to reliably process streams of unstructured data, doing for real-time processing what Hadoop did for batch processing.

Software as a service (SaaS)

Application software that is used over the web by a thin client or web browser. Salesforce is a well-known example of SaaS.

Storage

Any means of storing data persistently.

Storm

An open-source distributed computation system designed for processing multiple data streams in real time.

Structured data

Data that is organized by a predetermined structure.

Structured Query Language (SQL)

A programming language designed specifically to manage and retrieve data from a relational database system.

Text analytics

The application of statistical, linguistic, and machine learning techniques on text-based sources to derive meaning or insight.

Transactional data

Data that changes unpredictably. Examples include accounts payable and receivable data, or data about product shipments.

Thrift

“Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.”

Unstructured data

Data that has no identifiable structure–for example, the text of email messages.

Value

All that available data will create a lot of value for organizations, societies and consumers. Big data means big business and every industry will reap the benefits from big data.

Volume

The amount of data, ranging from megabytes to brontobytes

Visualization

A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.

WebHDFS

Apache Hadoop provides native libraries for accessing HDFS. However, users often prefer to use HDFS remotely without the heavy client-side native libraries. For example, some applications need to load data in and out of the cluster, or to interact with the HDFS data externally. WebHDFS addresses these issues by providing a fully functional HTTP REST API to access HDFS.
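WebHDFS requests are ordinary HTTP calls to a URL of the form /webhdfs/v1/<path>?op=<operation>. The sketch below only builds such URLs and does not contact a cluster. The host name, port and file path are placeholders chosen for the example, and the helper webhdfs_url is hypothetical.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for one operation on one HDFS path."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Placeholder NameNode address; adjust host and port to the actual cluster.
read_url = webhdfs_url("namenode.example.com", 9870, "/user/demo/data.csv", "OPEN")
list_url = webhdfs_url("namenode.example.com", 9870, "/user/demo", "LISTSTATUS")
print(read_url)
print(list_url)
```

Any HTTP client can then issue a GET request against these URLs, which is what removes the need for the native client libraries.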

Weather data

Real-time weather data is now widely available for organizations to use in a variety of ways. For example, a logistics company can monitor local weather conditions to optimize the transport of goods. A utility company can adjust energy distribution in real time.

XML Databases

XML Databases allow data to be stored in XML format. XML databases are often linked to document-oriented databases. The data stored in an XML database can be queried, exported and serialized into any format needed.

ZooKeeper

ZooKeeper is an open source project of the Apache Software Foundation that provides centralized configuration and naming registry services for large distributed systems. ZooKeeper began as a subproject of Hadoop.
