Hadoop For Windows
2 .Hadoop For Windows Rohit Bakhshi DBI-B335
3 .Speaker: Rohit Bakhshi, Product Manager, Hortonworks
4 .Agenda: Modern Data Architecture; Hadoop for Windows; Hortonworks Data Platform under the covers; Q&A
5 .Modern Data Architecture
6 .What Makes Up Big Data? Transactions + Interactions + Observations = BIG DATA. Axes: data volume (Megabytes, Gigabytes, Terabytes, Petabytes) against increasing data variety and complexity. Nested tiers: ERP, CRM, WEB, BIG DATA. Example data: purchase detail, purchase record, payment record; offer details, support contacts, customer touches, segmentation; web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels; user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, audio, images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates.
7 .A data architecture under pressure from new data. APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications. DATA SYSTEM REPOSITORIES: RDBMS, EDW, MPP. SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs). Data: OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment, web data; sensor, machine data; geolocation. Source: IDC: 2.8 ZB in 2012; 85% from new data types; 15x machine data by 2020; 40 ZB by 2020.
8 .Hadoop within an emerging Modern Data Architecture. APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications. DEV & DATA TOOLS: Build & Test. OPERATIONS TOOLS: Provision, Manage & Monitor. DATA SYSTEM REPOSITORIES: RDBMS, EDW, MPP. Platform capabilities: Governance & Integration, Security, Operations, Data Access, Data Management. SOURCES: OLTP, ERP, CRM systems; documents, emails; web logs, click streams; social networks; machine generated; sensor data; geolocation data.
9 .Hadoop for Windows
10 .HDP for Windows. Hortonworks Data Platform (HDP): the only completely open distribution for Apache Hadoop; fundamentally versatile, with comprehensive enterprise capabilities; wholly integrated for deep ecosystem interoperability. Hortonworks Data Platform 2.2 stack: HDFS (Hadoop Distributed File System) and YARN: Data Operating System (Cluster Resource Management). BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV engines); Tez sits under Pig, Hive, and Cascading, and Slider appears as a second shared layer. GOVERNANCE: Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, WebHDFS. SECURITY: Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger. OPERATIONS: Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie. Deployment Choice: Linux, Windows, On-Premises, Cloud.
11 .HDP: Enterprise Data Platform. HDP certifies the most recent & stable community innovation. Releases: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014). Layers: Data Management, Data Access, Governance & Integration, Security, Operations. Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Falcon, Ranger, Spark, Tez, Slider, Solr. Version numbers shown (chart order): 2.2.0, 0.12.0, 0.12.0, 2.4.0, 0.12.1, 0.13.0, 0.96.1, 0.98.0, 0.9.1, 1.4.4, 1.3.1, 1.4.0, 1.4.4, 1.5.1, 3.3.2, 4.0.0, 3.4.5, 0.4.0, 4.0.0; Falcon 0.5.0; 0.14.0, 0.14.0, 0.98.4, 4.2, 0.9.3, 1.2.0, 0.6.0, 1.4.5, 1.5.0, 1.7.0, 4.1.0, 0.5.0, 0.4.0, 2.6.0, 3.4.5; Tez 0.4.0; Slider 0.60; Solr 4.7.2, 4.10.0; 0.5.1. * Version numbers are targets and subject to change at time of general availability in accordance with ASF release process.
12 .Seamless Interoperability: integrations with Microsoft tools for native big data analysis. Diagram layers: APPLICATIONS, DEV & DATA TOOLS, OPERATIONAL TOOLS, DATA SYSTEM, SOURCES, INFRASTRUCTURE. Microsoft integrations shown: HDInsight, Azure, Power BI (new).
13 .HDP: Powered by Apache Hadoop. HDP certifies the most recent & stable community innovation. This slide repeats the component version chart from slide 11 (HDP 2.0, October 2013; HDP 2.1, April 2014; HDP 2.2, October 2014). * Version numbers are targets and subject to change at time of general availability in accordance with ASF release process.
14 .Apache Hadoop: Open Source Data Management. Scalable: linearly scale to store petabytes of data. Reliable: redundant storage protects against node failures. Flexible: store all types of data, apply flexible schemas for analysis and sharing. Economical: utilize cost-efficient commodity hardware, achieve high cluster utilization. Storage (HDFS): distributed across “nodes”, natively redundant, single file system. Processing (YARN): cluster resource manager, built-in fault tolerance, high cluster utilization.
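The "natively redundant" storage point can be made concrete with a little arithmetic. The sketch below assumes the stock HDFS defaults of a 128 MB block size and a replication factor of 3; neither figure appears on the slide, and the function name `hdfs_footprint` is purely illustrative.

```python
import math

BLOCK_SIZE_MB = 128  # assumed: default dfs.blocksize in Hadoop 2
REPLICATION = 3      # assumed: default dfs.replication


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, replicas, raw_mb) for one file stored in HDFS."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies its actual length, so raw usage is
    # the file size multiplied by the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

With these assumed defaults, each block exists on three different nodes, so losing one node still leaves two copies of every block it held.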
15 .YARN: Data Operating System. Diagram: a ResourceManager (with Scheduler) coordinates twelve NodeManagers whose containers run mixed workloads side by side: Batch (map 1.1, map 1.2, reduce 1.1), Interactive SQL (vertex 1.1.1, vertex 1.1.2, vertex 1.2.1, vertex 1.2.2), and Real-Time (nimbus 0, nimbus 1, nimbus 2).
16 .Right Tool for the Right Usage. Systems placed along SCALE (storage & processing): Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Platform. Comparison (traditional database vs Hadoop platform): schema: required on write vs required on read. speed: reads are fast vs writes are fast. governance: standards and structured vs loosely structured. processing: limited, no data processing vs processing coupled with data. data types: structured vs multi and unstructured. best fit use: interactive OLAP analytics, complex ACID transactions, operational data store vs data discovery, processing unstructured data, massive storage/processing.
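The "schema required on read" row is the least intuitive entry in the comparison, so a small illustration may help. In this sketch the raw records, the two schemas, and the helper `read_with_schema` are all invented for the example; the point is only that the stored bytes stay untyped and each reader imposes its own structure when it reads them.

```python
import csv
import io

# Raw records are stored as-is; nothing validates them on write.
RAW_EVENTS = "2014-10-01,WA,19.99\n2014-10-02,CA,5.00\n2014-10-02,WA,7.50\n"


def read_with_schema(raw_text, schema):
    """Apply (name, converter) pairs to each raw record at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, record)})
    return rows


# Two readers interpret the same stored data differently.
sales_schema = [("day", str), ("state", str), ("price", float)]
audit_schema = [("day", str), ("region", str), ("amount_text", str)]

sales = read_with_schema(RAW_EVENTS, sales_schema)
audit = read_with_schema(RAW_EVENTS, audit_schema)

total_wa = sum(row["price"] for row in sales if row["state"] == "WA")
print(round(total_wa, 2))
```

A schema-on-write system would instead reject or convert each record when it is loaded, which is why its reads are fast and its writes are comparatively expensive.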
17 .Maximize Hadoop Deployment Choice. Hortonworks Data Platform (HDP) for Windows: 100% Apache open source Hadoop software for Windows Server. Microsoft Azure HDInsight: Hadoop-based managed service in the cloud via Microsoft Azure. Microsoft Analytics Platform System (APS): scale-out appliance with data warehousing and Hadoop in one box. All offerings co-engineered by Hortonworks and Microsoft. Enjoy seamless interoperability across on-premises and cloud.
18 .HDP under the covers
19 .Data Operating System of Hadoop. Single cluster, shared data set, multiple workloads. Support a range of access patterns. Shared operational services. HDP 2.2 Core Platform: DATA MANAGEMENT: HDFS (Hadoop Distributed File System) and YARN: Data Operating System, spanning nodes 1 to N. DATA ACCESS: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines).
20 .Flexible Ingest into HDP. Ingest paths into the Hortonworks Data Platform (HDP) for Windows: Sqoop, Flume, RPC, REST (HTTP), and C (LibHDFS).
21 .SQL Access: Stinger Initiative. Next-generation SQL-based interactive query in Hadoop. Speed: interactive Hive query response. Scale: queries that scale from TB to PB. SQL: broadest range of SQL semantics for analytic applications. Stack: Business Analytics and Custom Apps on Apache Hive (SQL), running on Apache Tez or Apache MapReduce, on Apache YARN, over HDFS (Hadoop Distributed File System). Apache Hive contribution, an open community at its finest: 1,672 Jira tickets closed; 145 developers; 44 companies; ~390,000 lines of code added (2x); 13 months.
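Of the ingest paths listed, REST (HTTP) is the one that needs no Hadoop client libraries, because WebHDFS exposes file operations as plain URLs. The sketch below only builds such a URL and does not contact a cluster. The host name and file path are placeholders, port 50070 is assumed as the Hadoop 2 NameNode HTTP default, and `webhdfs_url` is a hypothetical helper rather than part of any Hadoop client.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL; host, port, and path are placeholders."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Step 1 of a file upload: an HTTP PUT to the NameNode with op=CREATE.
# The NameNode answers with a redirect to a DataNode, which receives the bytes.
create_url = webhdfs_url("namenode.example.com", "/data/raw/events.csv",
                         "CREATE", overwrite="true")
print(create_url)
```

Any HTTP client on Windows or elsewhere can then issue the PUT, follow the redirect, and send the file body.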
22 .Apache Tez (“Speed”). Replaces MapReduce as the primitive for Hive, Pig, etc. A Tez task has a pluggable Input, Processor, and Output: Tez Task = <Input, Processor, Output>.
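The <Input, Processor, Output> triple is easier to picture as three interchangeable callables. The following is a conceptual sketch in Python only; it is not the Tez API (Tez itself is a Java framework), and the `Task` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    """Hypothetical stand-in for a task with pluggable input, processor, output."""
    read_input: Callable[[], Iterable[str]]
    process: Callable[[Iterable[str]], Iterable[str]]
    write_output: Callable[[Iterable[str]], None]

    def run(self) -> None:
        # Data flows input -> processor -> output without the task
        # knowing where the rows come from or where they end up.
        self.write_output(self.process(self.read_input()))


sink: List[str] = []

task = Task(
    read_input=lambda: ["wa", "ca", "wa"],
    process=lambda rows: (row.upper() for row in rows),
    write_output=sink.extend,
)
task.run()
print(sink)
```

Swapping any one of the three pieces, for example writing to the next task's memory instead of to a file, leaves the other two untouched, which is the flexibility the slide attributes to the task model.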
23 .Hive with Tez as execution engine. Query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state. Plan steps: SELECT a.state, c.itemId; SELECT c.price; SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*), AVG(c.price). Hive-MR: the plan runs as a chain of separate map (M) and reduce (R) jobs, with intermediate results written to HDFS between jobs. Hive-Tez: the same joins and aggregation run as a single graph of M and R tasks. Tez avoids unneeded writes to HDFS.
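To see what the example query returns, independent of which engine executes it, it can be run against a few rows. This sketch uses Python's built-in SQLite as a stand-in for Hive, with the average written as AVG; the table contents are invented, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (itemId INTEGER, price REAL);
    INSERT INTO a VALUES (1, 10, 'WA'), (2, 11, 'WA'), (3, 10, 'CA');
    INSERT INTO b VALUES (1), (2), (3);
    INSERT INTO c VALUES (10, 4.0), (11, 8.0);
""")

# Same shape as the slide's query: two joins, then a grouped aggregate.
rows = conn.execute("""
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state
    ORDER BY a.state
""").fetchall()
print(rows)
```

The result is the same under MapReduce or Tez; what changes is how many intermediate datasets are materialized on the way to it.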
24 .Hive: Enhanced SQL Semantics. Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop. Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR. Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in FROM clause; ROLLUP and CUBE; UNION; windowing functions (OVER, RANK, etc.); custom Java UDFs; standard aggregation (SUM, AVG, etc.); advanced UDFs (ngram, XPath, URL); sub-queries for IN/NOT IN, HAVING; expanded JOIN syntax; INTERSECT / EXCEPT. Legend: Hive 0.11; Hive 0.12 (HDP 2.0); Hive 0.13 (HDP 2.1); SQL Compliance.
25 .Stream Processing: Apache Storm. Real-time event processing for sensor and business activity monitoring. Scale: ingest millions of events per second. Fast query on petabytes of data. Implement new real-time business cases with your Hadoop platform. http://storm.incubator.apache.org/
26 .NoSQL Database: HBase. Store and process petabytes of data. Scale out on commodity servers. High performance. Highly available. Integrated with YARN. SQL interface. Diagram: NoSQL (HBase) on YARN: Data Operating System, over HDFS (Permanent Data Storage), spanning nodes 1 to N.
27 .HDP Search: Apache Solr. High-performance indexing and simple UI for advanced search applications. Flow: raw files (HTML, PDF, Word, XML, logs, …) in HDFS (Hadoop Distributed File System) are processed by a MapReduce indexing job into indexed documents served by Solr (built on Lucene); a search web app sends queries and receives responses.
28 .All Processing on Shared Infrastructure. YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System). Engines on YARN: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase), NoSQL (Accumulo), In-Memory (Spark), Kafka, other engines, other ISV engines. Tez and Slider appear as shared layers beneath several of these engines.
29 .YARN: Next Generation Hadoop. 1st Gen of Hadoop: single-use system for batch apps; MapReduce (cluster resource management & data processing) on HDFS (redundant, reliable storage). 2nd Gen of Hadoop: multi-use data platform for batch, interactive, online, streaming, …; redundant, reliable storage (HDFS); efficient cluster resource management & shared services (YARN); flexible data processing (Hive, Pig, others…) across Batch (MapReduce), Batch & Interactive (Tez), Online Data Processing (HBase, Accumulo), Stream Processing (Storm), others…. Also labeled on the slide: Classic Hadoop Apps.