1. NIST Big Data Public Working Group
Definition and Taxonomy Subgroup Presentation
September 30, 2013
Nancy Grady, SAIC
Natasha Balac, SDSC
Eugene Lister, R2AD
2. Overview
Objectives
Approach
Big Data Component Definitions
Data Science Component Definitions
Taxonomy
Roles
Activities
Components
Subcomponents
Templates
Next Steps
3. Objectives
Identify concepts
Focus on what is new and different
Clarify terminology
Attempt to avoid terms that have domain-specific meanings
Remain independent of specific implementations
4. Approach
Hold scope to what is different because of Big Data
Use additional concepts needed for completeness
Restrict terms to represent single concepts
Don’t stray too far from common usage
The report goes straight to Big Data and Data Science
This presentation starts from more elemental concepts
Related to cloud computing, but cloud is not required
5. Definitions
Big Data
Data Science
6. Concepts Relating to Data
Data Type (structured, semi-structured, unstructured)
Beyond our scope (and not new)
Data Lifecycle
Raw Data
Usable Information
Synthesized Knowledge
Implemented Benefit
Metadata: data about the data, the system, or the processing
Provenance: Data Lifecycle history
Complexity: dependent relationships across data elements
7. Concepts Relating to Dataset at Rest
Volume: amount of data
Variety: many data types
and also across data domains
Persistence: storing in {flat files, RDBMS, NoSQL, markup,…}
NoSQL
Big Table
Name-value
Graph
Document
Tiered storage {in-memory, cache, SSD, hard disk, …}
Distributed {local, multiple local, network-based}
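To make the persistence styles concrete, here is a minimal sketch (not tied to any particular NoSQL product; all keys and records are invented) contrasting name-value and document storage of the same record:

```python
# Name-value style: one flat key per attribute, joined by a row key.
name_value = {
    "user:42:name": "Ada",
    "user:42:city": "London",
}

# Document style: the whole record stored as one nested document.
document = {
    "user:42": {"name": "Ada", "city": "London"},
}

def get_city_nv(store, user_id):
    """Name-value lookup: address the attribute directly by key."""
    return store[f"user:{user_id}:city"]

def get_city_doc(store, user_id):
    """Document lookup: fetch the document, then navigate inside it."""
    return store[f"user:{user_id}"]["city"]
```

The same information is recoverable either way; the styles differ in how much of the record one read returns.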
8. Concepts Related to Dataset in Motion
Velocity: rate of data flow
Variability: change in the rate of data flow, and also in
Structure
Refresh rate
Accessibility: new concept of Data-as-a-Service
Transport formats (not new)
Transport protocols (not new)
9. Big Data Analogy to Parallel Computing
Improvements in individual processor speed slowed
Coordinate a loose collection of processors
Adds resource communication complexities
System clocks
Message passing
Distribution of processing code
Distribution of data for processing nodes
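The coordination pattern above can be sketched, very simply, as a single-machine toy simulation (not a real cluster): data is partitioned across "nodes", the same processing code runs on each shard, and a coordinator combines the partial results.

```python
def partition(data, n_nodes):
    """Distribute data across processing nodes (round-robin)."""
    shards = [[] for _ in range(n_nodes)]
    for i, item in enumerate(data):
        shards[i % n_nodes].append(item)
    return shards

def run_on_node(code, shard):
    """Each node runs the same distributed code on its local shard."""
    return code(shard)

data = list(range(10))
shards = partition(data, n_nodes=3)
partials = [run_on_node(sum, s) for s in shards]  # local work per node
total = sum(partials)                             # coordinator combines
```

Real systems add the clocks, message passing, and failure handling listed above on top of this basic shape.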
10. Big Data - Jan 15-17 NIST Cloud/Big Data Workshop
Big Data refers to digital data volume, velocity, and/or variety that:
Enable novel approaches to frontier questions previously inaccessible or impractical using current or conventional methods; and/or
Exceed the storage capacity or analysis capability of current or conventional methods and systems.
Differentiated by storing and analyzing full-population data rather than samples
11. Refinements are Welcome
The heart of the change is the scaling
Data seek times are increasing more slowly than Moore's Law
Data volumes are increasing faster than Moore's Law
Implies the addition of horizontal scaling to vertical scaling
The data shift is analogous to the MPP processing shift
Difficult to define as
An implication of engineering changes
Data Lifecycle process order changes
Implication of a new type of analytics
Moving the processing to the data, not the data to the processing
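A toy illustration of moving the processing to the data (node names and records are invented): each node computes a small local aggregate where the data lives, so only the aggregates, not the records, cross the network.

```python
nodes = {  # hypothetical data already resident on three nodes
    "node-a": [3, 1, 4, 1, 5],
    "node-b": [9, 2, 6],
    "node-c": [5, 3, 5],
}

def local_aggregate(records):
    """Runs where the data lives; returns only (count, sum)."""
    return len(records), sum(records)

# Only the tiny (count, sum) pairs travel, not the records themselves.
partials = [local_aggregate(recs) for recs in nodes.values()]
count = sum(c for c, _ in partials)
mean = sum(s for _, s in partials) / count
```

Shipping two numbers per node instead of every record is what makes the horizontal scaling above pay off as volumes grow.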
12. Big Data Analytics Characteristics
Analytics Characteristics are not new
Value: produced when the analytics output is put into action
Veracity: measure of accuracy and timeliness
Quality:
Well-formed data
Missing values
Cleanliness
Latency: time between measurement and availability
Data types have differing pre-analytics needs
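A minimal sketch of two of these characteristics (field names are invented for illustration): a missing-value rate as a quality measure, and latency as the gap between measurement and availability.

```python
records = [
    {"sensor": "t1", "value": 21.5, "measured_at": 100.0, "available_at": 100.4},
    {"sensor": "t1", "value": None, "measured_at": 101.0, "available_at": 101.9},
]

def missing_rate(recs, field):
    """Fraction of records with no value in `field` (a quality measure)."""
    return sum(1 for r in recs if r[field] is None) / len(recs)

def latencies(recs):
    """Latency: time between measurement and availability, per record."""
    return [r["available_at"] - r["measured_at"] for r in recs]
```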
13. Data Science as a Science Progression
Coined the “Fourth Paradigm” by the late Jim Gray
Experiment: Empirical measurement science
Theory: Causal interpretation
Explains experiments
Calculates measurements that would confirm the theoretical models
Simulation: Performing theory (model)-driven experiments that are not empirically possible
Data Science: Empirical analysis of data produced by processes
14. Data Science Analogy (simplistically)
Statistics:
precise deterministic causal analysis
over precisely collected data
Data Mining:
deterministic causal analysis
over re-purposed data that has been carefully sampled
Data Science:
Trending or correlation analysis
over existing data, typically using the bulk of the population
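As a toy contrast with the sampling-based approaches above, a correlation computed directly over the whole (tiny) population, with no sampling step (data values are invented):

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 10.0]  # roughly 2*x, with noise

def pearson(x, y):
    """Pearson correlation computed over all records, no sampling step."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

r = pearson(xs, ys)
```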
15. Data Science
Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis formulation, and hypothesis testing.
A Data Scientist is a practitioner with sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and programming expertise to manage the end-to-end scientific method process through each stage of the Big Data lifecycle (through action) to deliver value.
16. Data Science Skillsets
17. Data Science Addenda
Is not just Analytics
The end-to-end data system is the equipment
The analytics over Big Data can be
Exploratory or discovery-driven for hypothesis generation
Focused hypothesis verification
Focused on operationalization
21. System Roles
Data Provider – makes available data internal and/or external to the system
Data Consumer – uses the output of the system
System Orchestrator – governance, requirements, monitoring
Big Data Application Provider – instantiates application
Big Data Framework Provider – provides resources
22. Roles and Actors
23. Data Provider
24. System Orchestrator
25. Big Data Application Provider
26. Big Data Framework Provider
27. Data Consumer
28. Big Data Security
29. Big Data Application Provider
30. Data Lifecycle Processes
[Cycle diagram: from a Need/Goal, the lifecycle processes Collect, Curate, Analyze, and Act & Monitor transform Data into Information, Knowledge, and Benefit, with Evaluate closing the loop.]
31. Data Warehouse Template – store after curate
[Flow diagram: COLLECT – domain data lands in Staging; CURATE – ETL (extract, transform, load) cleanses and transforms it into the Warehouse; ANALYZE – an algorithm over summarized data in a Mart produces the Analytic; ACT – action is taken on the result.]
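The warehouse flow can be sketched in a few lines (all table and field names are invented; a real warehouse would use an ETL tool and a database rather than in-memory lists):

```python
staging = [                            # COLLECT: raw domain data, as received
    {"region": "east", "amount": "100"},
    {"region": "east", "amount": "50"},
    {"region": "west", "amount": None},  # dirty row
]

def etl(rows):
    """CURATE: cleanse (drop dirty rows) and transform (cast types)."""
    return [
        {"region": r["region"], "amount": int(r["amount"])}
        for r in rows
        if r["amount"] is not None
    ]

warehouse = etl(staging)

# ANALYZE: summarize into a mart that downstream action consumes.
mart = {}
for row in warehouse:
    mart[row["region"]] = mart.get(row["region"], 0) + row["amount"]
```

Note the ordering this template names: data is curated into the warehouse before it is stored for analysis.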
32. Volume Template – store raw data after collect
[Flow diagram: COLLECT – high-volume, high-complexity domain data is stored raw in a Raw Data Cluster; CURATE – cleanse and transform; ANALYZE – Map/Reduce model building and model analytics over Model Data in a Mart; ACT – delivery of a Data Product.]
33. Velocity Template – store after analytics
[Flow diagram: COLLECT – high-velocity, high-volume domain data; CURATE/ANALYZE – cleanse, transform, and analyze in stream, with Alerting; ACT – enriched data is then stored in an Enriched Data Cluster.]
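A minimal sketch of the analyze-before-store pattern (the threshold and field names are invented): each arriving event is cleansed, enriched, and checked for alerts in stream, and only then persisted.

```python
ALERT_THRESHOLD = 100.0  # illustrative alerting rule

enriched_store = []  # the "enriched data cluster"
alerts = []

def handle_event(event):
    """ANALYZE before store: cleanse, enrich, alert, then persist."""
    value = float(event["value"])            # cleanse/transform on the fly
    enriched = {**event, "value": value, "alert": value > ALERT_THRESHOLD}
    if enriched["alert"]:
        alerts.append(enriched)              # ACT: real-time alerting
    enriched_store.append(enriched)          # store only after analytics

for ev in [{"id": 1, "value": "42"}, {"id": 2, "value": "150"}]:
    handle_event(ev)
```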
34. Variety Template – Schema-on-Read
[Flow diagram: COLLECT – data with high variety and complexity is stored as collected; CURATE/ANALYZE – Map/Reduce and Query analyze the Fused Data; ACT – a Common Query serves results, with the schema applied on read.]
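A toy schema-on-read sketch (the source layouts and reader functions are invented): heterogeneous records are stored as-is, and each source's schema is applied only at query time, so one common query spans the fused data.

```python
raw = [  # stored exactly as collected, no up-front schema
    {"src": "crm", "rec": {"CustomerName": "Ada", "City": "London"}},
    {"src": "web", "rec": {"user": "Bob", "location": "Paris"}},
]

READERS = {  # per-source schema, applied on read rather than on load
    "crm": lambda r: {"name": r["CustomerName"], "city": r["City"]},
    "web": lambda r: {"name": r["user"], "city": r["location"]},
}

def common_query(store, city):
    """One query over fused data, mapping each record through its reader."""
    rows = (READERS[item["src"]](item["rec"]) for item in store)
    return [row["name"] for row in rows if row["city"] == city]
```

Adding a new source here means adding a reader, not reloading the stored data, which is the point of schema-on-read for high-variety collections.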
35. Analysis to Action Template
Seconds – Streaming Real-time Analytics
Minutes – Batch jobs of operational model
Hours – Ad-hoc analysis
Months – Exploratory analysis
36. Possible Next Steps
Refinement of the Big Data definition
Word-smithing of all definitions
Refinement of the Taxonomy mindmap for completeness
Exploration of Templates for categorization
Data distribution templates according to CAP compliance
Measures and Metrics (how big is Big Data)