NIST Big Data Working Group

Presentation Transcript

  • 1. NIST Big Data Public Working Group, Definition and Taxonomy Subgroup. Presentation, September 30, 2013. Nancy Grady, SAIC; Natasha Balac, SDSC; Eugene Lister, R2AD.
  • 2. Overview: Objectives; Approach; Big Data component definitions; Data Science component definitions; Taxonomy (roles, activities, components, subcomponents); Templates; Next steps.
  • 3. Objectives: Identify concepts, focusing on what is new and different. Clarify terminology, attempting to avoid terms that have domain-specific meanings. Remain independent of specific implementations.
  • 4. Approach: Hold the scope to what is different because of Big Data, using additional concepts only where needed for completeness. Restrict terms to represent single concepts, and don't stray too far from common usage. The report goes straight to Big Data and Data Science; this presentation starts from more elemental concepts. Big Data has a relationship to cloud computing, but cloud is not required.
  • 5. Definitions: Big Data; Data Science.
  • 6. Concepts relating to data: Data type (structured, semi-structured, unstructured) is beyond our scope (and not new). Data lifecycle: raw data, usable information, synthesized knowledge, implemented benefit. Metadata: data about the data, the system, or the processing. Provenance: the data lifecycle history. Complexity: dependent relationships across data elements.
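
A minimal sketch (Python, not from the deck) of how provenance as a "data lifecycle history" might be recorded; the class names, stage names, and actors are illustrative assumptions.

```python
# Hypothetical sketch: provenance as an append-only lifecycle history.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    stage: str       # e.g. "raw data", "usable information" (slide 6 stages)
    actor: str       # who or what performed the step
    timestamp: str

@dataclass
class Dataset:
    name: str
    provenance: list = field(default_factory=list)

    def log_stage(self, stage: str, actor: str) -> None:
        """Append one entry to the dataset's lifecycle history."""
        self.provenance.append(ProvenanceRecord(
            stage, actor, datetime.now(timezone.utc).isoformat()))

d = Dataset("sensor-feed")
d.log_stage("raw data", "ingest-service")
d.log_stage("usable information", "cleansing-job")
print([p.stage for p in d.provenance])  # ['raw data', 'usable information']
```
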
  • 7. Concepts relating to a dataset at rest: Volume: the amount of data. Variety: many data types, and also data across domains. Persistence: storage in flat files, RDBMS, NoSQL, markup, etc.; NoSQL styles include big table, name-value, graph, and document stores. Tiered storage: in-memory, cache, SSD, hard disk, etc. Distribution: local, multiple local, or network-based.
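
A minimal sketch of the "name-value" (key-value) NoSQL style named on slide 7, with an in-memory dict standing in for a distributed store; the class and key names are made up for illustration.

```python
# Hypothetical name-value store: any bytes under any key, no fixed schema.
class NameValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

store = NameValueStore()
store.put("user:42:profile", b'{"name": "Ada"}')
print(store.get("user:42:profile"))  # b'{"name": "Ada"}'
```
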
  • 8. Concepts relating to a dataset in motion: Velocity: the rate of data flow. Variability: change in the rate of data flow, and also change in structure and refresh rate. Accessibility: the new concept of Data-as-a-Service. Transport formats and transport protocols (neither is new).
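
A minimal sketch of measuring velocity (rate of flow) and variability (change in that rate) over a timestamped event stream; the arrival times and the spread measure are made-up assumptions.

```python
# Hypothetical stream metrics over fabricated arrival timestamps (seconds).
from itertools import pairwise  # Python 3.10+

arrival_times = [0.0, 0.1, 0.2, 0.3, 1.3, 2.3, 2.35, 2.4]

gaps = [b - a for a, b in pairwise(arrival_times)]
rates = [1.0 / g for g in gaps]                  # instantaneous events/sec
velocity = len(arrival_times) / (arrival_times[-1] - arrival_times[0])
variability = max(rates) - min(rates)            # crude spread of the rate

print(f"mean velocity: {velocity:.1f} events/s, rate spread: {variability:.1f}")
```
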
  • 9. Big Data is analogous to parallel computing: when processor improvements slowed, a loose collection of processors had to be coordinated, which added resource-communication complexities: system clocks, message passing, distribution of processing code, and distribution of data to the processing nodes.
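
A minimal sketch of the coordination cost slide 9 names: distributing data to worker processes and gathering partial results via message passing. The chunk size and workload are arbitrary assumptions.

```python
# Hypothetical example: ship data chunks to workers, collect partial sums.
from multiprocessing import Pool

def process_chunk(chunk):
    """Work executed on each worker; here just a sum per chunk."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000))
    chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
    with Pool(processes=4) as pool:
        partials = pool.map(process_chunk, chunks)  # data moves to workers
    print(sum(partials))  # -> 499500
```
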
  • 10. Big Data (from the January 15-17 NIST Cloud/Big Data Workshop): Big Data refers to digital data volume, velocity, and/or variety that enable novel approaches to frontier questions previously inaccessible or impractical using current or conventional methods, and/or exceed the storage capacity or analysis capability of current or conventional methods and systems. Big Data is differentiated by storing and analyzing the whole population rather than a sample.
  • 11. Refinements are welcome. The heart of the change is scaling: data seek times are increasing more slowly than Moore's Law, while data volumes are increasing faster than Moore's Law. This implies adding horizontal scaling to vertical scaling, a change in data handling analogous to the shift to MPP processing. Big Data is difficult to define because it can be framed as an implication of engineering changes, as a change in the order of the data lifecycle processes, or as the implication of a new type of analytics that moves the processing to the data rather than the data to the processing.
  • 12. Big Data analytics characteristics (the characteristics themselves are not new): Value: produced when the analytics output is put into action. Veracity: a measure of accuracy and timeliness. Quality: well-formed data (missing values, cleanliness). Latency: the time between measurement and availability. Different data types have differing pre-analytics needs.
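
A minimal sketch of two pre-analytics checks from slide 12: quality measured as the fraction of missing values, and latency as the measurement-to-availability gap. The records and field names are fabricated for illustration.

```python
# Hypothetical quality and latency checks over fabricated records.
records = [
    {"value": 3.2, "measured_at": 100.0, "available_at": 100.5},
    {"value": None, "measured_at": 101.0, "available_at": 103.0},
    {"value": 2.8, "measured_at": 102.0, "available_at": 102.2},
]

missing = sum(r["value"] is None for r in records) / len(records)
latency = max(r["available_at"] - r["measured_at"] for r in records)

print(f"missing-value rate: {missing:.0%}, worst latency: {latency:.1f}s")
```
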
  • 13. Data Science as a progression of science, coined the "Fourth Paradigm" by the late Jim Gray: Experiment: empirical measurement science. Theory: causal interpretation that explains experiments and calculates the measurements that would confirm the theoretical models. Simulation: performing theory (model)-driven experiments that are not empirically possible. Data Science: empirical analysis of data produced by processes.
  • 14. A (simplistic) Data Science analogy: Statistics is precise, deterministic, causal analysis over precisely collected data. Data mining is deterministic, causal analysis over carefully sampled, re-purposed data. Data Science is trending or correlation analysis over existing data that typically covers the bulk of the population.
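
A minimal sketch contrasting slide 14's styles: a correlation computed over the whole (toy) population rather than a designed sample. The variables and values are invented; the result is a correlation, not a causal claim.

```python
# Hypothetical population-wide correlation analysis.
from statistics import correlation  # Pearson's r, Python 3.10+

page_views = [10, 20, 30, 40, 50, 60]   # every record, not a sample
purchases  = [1,  2,  2,  4,  5,  6]

r = correlation(page_views, purchases)
print(f"r = {r:.3f}")  # a trend, with no causal interpretation attached
```
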
  • 15. Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis formulation, and hypothesis testing. A Data Scientist is a practitioner with sufficient knowledge across the overlapping regimes of business needs, domain knowledge, analytical skills, and programming expertise to manage the end-to-end scientific-method process through each stage of the Big Data lifecycle (through action) to deliver value.
  • 16. Data Science Skillsets.
  • 17. Data Science addenda: Data Science is not just analytics; the end-to-end data system is the equipment. The analytics over Big Data can be exploratory (discovery-driven, for hypothesis generation), focused on hypothesis verification, or focused on operationalization.
  • 18. Taxonomy: Actors; Roles; Activities; Components; Subcomponents.
  • 19. Big Data taxonomy: Actors; Roles; Activities; Components; Subcomponents.
  • 20. Actors: sensors; applications; software agents; individuals; organizations; hardware resources; service abstractions.
  • 21. System roles: Data Provider – makes available data internal and/or external to the system. Data Consumer – uses the output of the system. System Orchestrator – governance, requirements, monitoring. Big Data Application Provider – instantiates the application. Big Data Framework Provider – provides the computing resources.
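
A minimal sketch (an assumed structure, not NIST's) of the five roles from slide 21 expressed as Python protocols that a concrete system could implement; all method names are hypothetical.

```python
# Hypothetical interfaces for the five system roles.
from typing import Iterable, Protocol

class DataProvider(Protocol):
    def supply(self) -> Iterable[dict]: ...          # data into the system

class BigDataApplicationProvider(Protocol):
    def run(self, records: Iterable[dict]) -> Iterable[dict]: ...

class BigDataFrameworkProvider(Protocol):
    def allocate(self, nodes: int) -> None: ...      # provides resources

class DataConsumer(Protocol):
    def consume(self, results: Iterable[dict]) -> None: ...

class SystemOrchestrator(Protocol):
    def monitor(self) -> dict: ...                   # governance, monitoring
```
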
  • 22. Roles and Actors.
  • 23. Data Provider.
  • 24. System Orchestrator.
  • 25. Big Data Application Provider.
  • 26. Big Data Framework Provider.
  • 27. Data Consumer.
  • 28. Big Data Security.
  • 29. Big Data Application Provider.
  • 30. Data Lifecycle Processes (diagram): the process stages are Need, Collect, Curate, Analyze, and Act & Monitor, with an Evaluate loop; the corresponding products are Goal, Data, Information, Knowledge, and Benefit.
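
A minimal sketch of slide 30's stage ordering as a composable pipeline; the stage bodies are placeholders invented for illustration.

```python
# Hypothetical collect -> curate -> analyze -> act pipeline.
from functools import reduce

def collect(_):   return [" 5", "7 ", "oops", "3"]          # -> data
def curate(xs):   return [x.strip() for x in xs if x.strip().isdigit()]
def analyze(xs):  return sum(int(x) for x in xs)            # -> knowledge
def act(total):   return f"alert sent (total={total})"      # -> benefit

pipeline = [collect, curate, analyze, act]
print(reduce(lambda value, stage: stage(value), pipeline, None))
```
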
  • 31. Data Warehouse Template (store after curate; diagram): COLLECT: staging, with domain cleanse and transform. CURATE: ETL (extract, transform, load) into the warehouse. ANALYZE: summarized data and an algorithm feed an analytic mart. ACT: action.
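
A minimal sketch of slide 31's flow, with sqlite3 standing in for the warehouse: extract from staging, transform (cleanse), load, then summarize for the analytic mart. The table and data are fabricated.

```python
# Hypothetical ETL into an in-memory "warehouse".
import sqlite3

staging = [("widget", "10"), ("widget", "n/a"), ("gadget", "7")]   # Extract

rows = [(name, int(qty)) for name, qty in staging if qty.isdigit()]  # Transform

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE warehouse (product TEXT, qty INTEGER)")
con.executemany("INSERT INTO warehouse VALUES (?, ?)", rows)          # Load

for row in con.execute(
        "SELECT product, SUM(qty) FROM warehouse GROUP BY product"):
    print(row)   # summarized data feeding the analytic mart
```
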
  • 32. Volume Template (store raw data after collect; diagram): COLLECT: raw data lands on the cluster. CURATE: domain cleanse and transform via Map/Reduce. ANALYZE: model building over model data in a mart yields a model and analytics. ACT: a data product. Characteristics addressed: volume, complexity.
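
A minimal single-process sketch of the Map/Reduce step named on slide 32; in a real cluster the map phase runs where the data already lives, which is the "move the processing to the data" point from slide 11. The log lines are invented.

```python
# Hypothetical map/reduce: count requests per path in raw log lines.
from collections import Counter

raw = ["GET /a", "GET /b", "GET /a", "POST /a"]

mapped = [(line.split()[1], 1) for line in raw]   # map: line -> (key, 1)
reduced = Counter()
for key, count in mapped:                          # reduce: sum per key
    reduced[key] += count

print(dict(reduced))  # {'/a': 3, '/b': 1}
```
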
  • 33. Velocity Template (store after analytics; diagram): COLLECT: data streams in. CURATE: domain cleanse and transform. ANALYZE: alerting over the stream, with the enriched data stored on the cluster. ACT. Characteristics addressed: velocity, volume.
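
A minimal sketch of slide 33's pattern: analyze records as they arrive, alert on a threshold, and keep only the enriched results. The threshold, feed, and field names are assumptions.

```python
# Hypothetical streaming analysis with alerting; store after analytics.
THRESHOLD = 90.0

def stream():
    for reading in (72.1, 88.4, 95.2, 70.0):    # stand-in sensor feed
        yield {"temp": reading}

enriched = []
for rec in stream():
    rec["alert"] = rec["temp"] > THRESHOLD       # analyze in motion
    if rec["alert"]:
        print(f"ALERT: temp={rec['temp']}")
    enriched.append(rec)                         # persist enriched data only
```
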
  • 34. Variety Template (schema-on-read; diagram): across COLLECT, CURATE, ANALYZE, and ACT, raw data is fused at read time, with Map/Reduce and analysis answering a common query over the fused data. Characteristics addressed: variety, complexity.
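
A minimal sketch of schema-on-read from slide 34: raw records of differing shapes are stored as-is, and a schema is imposed only when the query runs. The records and fields are fabricated.

```python
# Hypothetical schema-on-read over raw JSON lines.
import json

raw_store = [
    '{"user": "ada", "clicks": 3}',
    '{"user": "bob", "purchases": 1}',   # a different shape: stored anyway
]

def query(field: str):
    """Apply the schema at read time: project one field, defaulting to 0."""
    for line in raw_store:
        rec = json.loads(line)
        yield rec.get("user"), rec.get(field, 0)

print(list(query("clicks")))  # [('ada', 3), ('bob', 0)]
```
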
  • 35. Analysis-to-Action Template: Seconds: streaming, real-time analytics. Minutes: batch jobs of an operational model. Hours: ad-hoc analysis. Months: exploratory analysis.
  • 36. Possible next steps: refinement of the Big Data definition; word-smithing of all definitions; refinement of the taxonomy mindmap for completeness; exploration of the templates for categorization; data distribution templates according to CAP compliance; measures and metrics (how big is Big Data?).