--- Apache Flume
--- ---
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
--- ---
--- Apache Flink
--- ---
Apache Flink is an open-source distributed data processing platform developed by the Apache Software Foundation for big data applications, primarily for analyzing data stored in Hadoop clusters. It supports a combination of in-memory and disk-based processing.
Flink, which means fast or nimble in German, is an open-source platform for distributed stream and batch data processing. Originally named Stratosphere (2010) before it entered the Apache Incubator (2014), Flink was developed at the Technical University of Berlin as a replacement for Hadoop MapReduce (a software framework for distributed computation over large data sets), using a directed-graph approach that removes the need for separate map and reduce steps.
--- ---
--- Kinesis Stream
--- ---
* Fast log and data feed intake and processing
* Real time metrics and reporting
* Real time data analytics
* Complex stream processing
Kinesis Data Streams does not provide data transformation and record format conversion
* Real time aggregation of data
* Loading the aggregate data into a data warehouse/map reduce cluster
* Durability and Elasticity
* Parallel application readers
Methods to load/get data
* KPL (Kinesis Producer Library)
* KCL (get data)
* Kinesis Agent
* Kinesis Rest API
Where Kinesis Data Streams sends data:
* S3 (with Java-based connector library)
* EMR (with Java-based connector library)
* DynamoDB (with Java-based connector library)
* RedShift (with Java-based connector library)
* ElasticSearch (with Java-based connector library)
* Lambda (reads from the stream and can send data to S3, DynamoDB, RedShift)
Single shard capacity
* 1 MB/sec data input
* 2 MB/sec data output
* 5 transactions/sec for reads
* 1000 records/sec for writes
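The per-shard limits above translate directly into a shard-count estimate. A minimal sketch, assuming the workload numbers are known up front; the function name `estimate_shards` and its parameters are made up for illustration and are not part of any AWS SDK:

```python
import math

# Per-shard limits taken from the list above.
SHARD_WRITE_MB_PER_SEC = 1
SHARD_WRITE_RECORDS_PER_SEC = 1000
SHARD_READ_MB_PER_SEC = 2


def estimate_shards(write_mb_per_sec, write_records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_write_records = math.ceil(write_records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / SHARD_READ_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


if __name__ == "__main__":
    # 4.5 MB/s in, 3000 records/s in, 9 MB/s out
    print(estimate_shards(4.5, 3000, 9))
```

The estimate ignores the 5 reads/sec limit, which depends on how consumers poll rather than on data volume.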
Max size of a data record (data blob payload) = 1 MB
Retention period:
* default = 24 hours
* max = 7 days
KPL antipatterns:
* when the producer application cannot tolerate any added delay
KPL batching:
* Aggregation (combine multiple user records into a single stream record)
* Collection (combine multiple stream records into a single HTTP request)
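The two batching levels can be pictured with a toy model. This is only an illustration of the idea, not the KPL's real wire format or API: the function names are hypothetical, and the defaults (1 MB per stream record, 500 records per request) are assumptions chosen to mirror the record-size limit above and the PutRecords batch limit.

```python
def aggregate(user_records, max_bytes=1024 * 1024):
    """Pack user records (bytes) into stream records of at most max_bytes.

    Each stream record is modelled as a list of user records. A single
    user record larger than max_bytes is still emitted on its own.
    """
    stream_records, current, size = [], [], 0
    for rec in user_records:
        if current and size + len(rec) > max_bytes:
            stream_records.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        stream_records.append(current)
    return stream_records


def collect(stream_records, max_per_request=500):
    """Group stream records into batches, one batch per HTTP request."""
    return [stream_records[i:i + max_per_request]
            for i in range(0, len(stream_records), max_per_request)]


if __name__ == "__main__":
    records = aggregate([b"x" * 400] * 5, max_bytes=1000)
    requests = collect(records, max_per_request=2)
    print(len(records), len(requests))
```

Batching is also where the delay mentioned under antipatterns comes from: records wait in a buffer until a batch fills or a timer fires.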
Kinesis Agent features:
* convert multi-line records to single-line records
* convert delimited records to JSON format
* convert records from log format to JSON format
* uses CloudWatch
Kinesis Connector Library
This is a pre-built library that helps you easily integrate Amazon Kinesis Data Streams with other AWS services and third-party tools. Amazon Kinesis Client Library (KCL) is required for using this library.
Can connect with
* DynamoDB
* Redshift
* S3
* Elasticsearch
Amazon Kinesis Data Streams can automatically encrypt sensitive data as a producer enters it into a stream. Kinesis Data Streams uses AWS KMS master keys for encryption.
Client-side encryption does not protect data inside Kinesis Data Streams; protection at rest can only be enabled through server-side encryption. Server-side encryption automatically encrypts data before it is at rest by using an AWS KMS customer master key (CMK) you specify. Data is encrypted before it is written to the Kinesis stream storage layer and decrypted after it is retrieved from storage, so it is encrypted at rest within the Kinesis Data Streams service. This lets you meet strict regulatory and data management requirements and enhances the security of your data.
--- ---
--- KCL
--- ---
* Uses DynamoDB for checkpointing
* Creates a DynamoDB table named after the application
* The DynamoDB table is created with 10 RCU and 10 WCU
When you start a KCL application, it calls the KCL to instantiate a worker. This call provides the KCL with configuration information for the application, such as the stream name and AWS credentials.
The KCL performs the following tasks:
* Connects to the stream
* Enumerates the shards
* Coordinates shard associations with other workers (if any)
* Instantiates a record processor for every shard it manages
* Pulls data records from the stream
* Pushes the records to the corresponding record processor
* Checkpoints processed records
* Balances shard-worker associations when the worker instance count changes
* Balances shard-worker associations when shards are split or merged
--- ---
--- Kinesis Firehose
--- ---
* Amazon Kinesis Agent
* S3
* RedShift
* ElasticSearch
* Splunk
Max record size = 1MB
Buffer size from 1MB to 128MB
Buffer interval from 60s to 900s
* S3: depends on buffer size
* ElasticSearch: depends on buffer size
* RedShift: depends on how fast RedShift finishes the COPY command; Firehose launches a new COPY command automatically
* S3: retry delivery for up to 24h
* RedShift: you can specify retry duration 0-7200 seconds from S3
* ElasticSearch: you can specify retry duration 0-7200 seconds from S3
Can do:
* Record format conversion
* Data transformation(Lambda blueprints)
Kinesis Data Firehose provides the following Lambda blueprints that you can use to create a Lambda function for data transformation:
* General Firehose Processing - Contains the data transformation and status model (each returned record carries a record ID, a result status, and the transformed data). Use this blueprint for any custom transformation logic.
* Apache Log to JSON - Parses and converts Apache log lines to JSON objects, using predefined JSON field names.
* Apache Log to CSV - Parses and converts Apache log lines to CSV format.
* Syslog to JSON - Parses and converts Syslog lines to JSON objects, using predefined JSON field names.
* Syslog to CSV - Parses and converts Syslog lines to CSV format.
* Kinesis Data Firehose Process Record Streams as source - Accesses the Kinesis Data Streams records in the input and returns them with a processing status.
* Kinesis Data Firehose CloudWatch Logs Processor - Parses and extracts individual log events from records sent by CloudWatch Logs subscription filters.
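A custom transformation function follows the same shape as the blueprints: it receives a batch of base64-encoded records and returns each one with its `recordId`, a `result` status (`Ok`, `Dropped`, or `ProcessingFailed`) and the transformed `data`. A minimal sketch; the transformation itself (upper-casing a `message` field) is a made-up example, not one of the blueprints:

```python
import base64
import json


def lambda_handler(event, context):
    """Upper-case the 'message' field of each JSON record (illustrative only)."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["message"] = str(payload.get("message", "")).upper()
            # Newline-delimit the output so downstream files stay line-oriented.
            encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            })
        except (ValueError, AttributeError, TypeError):
            # Not valid base64/JSON, or not a JSON object: hand the record
            # back unchanged and mark it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming `recordId` must appear in the response, which is why failed records are returned rather than skipped.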
--- ---
--- Kinesis Analytics
--- ---
Streaming Sources:
* Amazon Kinesis Data Streams
* Amazon Kinesis Data Firehose
Destinations:
* Amazon Kinesis Data Firehose (Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk)
* AWS Lambda
* Amazon Kinesis Data Streams
Stagger Windows
Stagger windows are a windowing method suited for analyzing groups of data that arrive at inconsistent times. They are well suited for any time-series analytics use case, such as a set of related sales or log records.
For example, VPC Flow Logs have a capture window of approximately 10 minutes. But they can have a capture window of up to 15 minutes if you're aggregating data on the client. Stagger windows are ideal for aggregating these logs for analysis.
Stagger windows address the issue of related records not falling into the same time-restricted window, such as when tumbling windows were used.
Tumbling Windows (Aggregations Using GROUP BY)
When a windowed query processes each window in a non-overlapping manner, the window is referred to as a tumbling window. In this case, each record on an in-application stream belongs to a specific window. It is processed only once (when the query processes the window to which the record belongs).
Tumbling window query: a query that aggregates data using distinct time-based windows that open and close at regular intervals.
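The non-overlapping property is easy to see in plain Python. This is only a sketch of the windowing logic, not Kinesis Analytics SQL; the function name and the `(timestamp_seconds, key)` event shape are assumptions for illustration:

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key).

    events is an iterable of (timestamp_seconds, key) pairs. Integer
    division assigns each event to exactly one window, so no event is
    counted twice.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    events = [(0, "a"), (59, "a"), (60, "a"), (61, "b"), (125, "a")]
    for window_key, count in sorted(tumbling_window_counts(events, 60).items()):
        print(window_key, count)
```

Note how the events at 59 and 60 seconds land in different windows even though they are one second apart; that boundary effect is the issue stagger windows address.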
Sliding Windows
Sliding window query: a query that aggregates data continuously, using a fixed time or rowcount interval.
Continuous Queries
A continuous query is a query over a stream that executes continuously over streaming data. This continuous execution enables scenarios such as applications continuously querying a stream and generating alerts.
--- ---
--- SQS
--- ---
Max message size = 256 KB
Max retention period = 14 days
--- ---
--- IoT
--- ---
IoT Rule Engine can send data to:
* Elasticsearch
* Kinesis Firehose
* Kinesis Stream
* DynamoDB
* ML
* CloudWatch
* S3
* Lambda
AWS IoT Authentication
AWS IoT supports four types of identity principals for authentication:
* X.509 certificates
* IAM users, groups, and roles
* Amazon Cognito identities
* Federated identities
Typically, AWS IoT devices use X.509 certificates, while mobile applications use Amazon Cognito identities. Web and desktop applications use IAM or federated identities.
--- ---
--- Data Pipeline
--- ---
Move data across regions
AWS Data Pipeline supports the following types of databases:
1) JdbcDatabase - A JDBC database
2) RdsDatabase - An Amazon RDS database
3) RedshiftDatabase - An Amazon Redshift database
--- ---
--- DynamoDB
--- ---
Use cases:
* Mobile
* Web
* Gaming
* IoT
* Live online voting
* Session management
* Store S3 object metadata
MAX item size = 400KB
1 WCU (Write Capacity Unit) = one write per second of up to 1 KB. Sizes round up to the next 1 KB (a 500-byte write takes 1 WCU)
1 RCU (Read Capacity Unit) = one read per second of up to 4 KB. Sizes round up to the next 4 KB (a 2 KB read takes 1 RCU)
Strongly Consistent Read = 1 RCU
Eventually Consistent Read = 0.5 RCU
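The rounding rules above can be turned into a small calculator. A minimal sketch; the function names are made up for illustration, and a provisioned total would normally be rounded up to a whole number of units:

```python
import math


def write_capacity_units(item_size_bytes, writes_per_sec=1):
    """WCU needed: item size rounds up to the next 1 KB block."""
    blocks = max(1, math.ceil(item_size_bytes / 1024))
    return blocks * writes_per_sec


def read_capacity_units(item_size_bytes, reads_per_sec=1, strongly_consistent=True):
    """RCU needed: item size rounds up to the next 4 KB block.

    Eventually consistent reads cost half as much.
    """
    blocks = max(1, math.ceil(item_size_bytes / 4096))
    units = blocks * reads_per_sec
    return units if strongly_consistent else units / 2


if __name__ == "__main__":
    print(write_capacity_units(1500, writes_per_sec=10))
    print(read_capacity_units(6 * 1024, reads_per_sec=10))
    print(read_capacity_units(6 * 1024, reads_per_sec=10, strongly_consistent=False))
```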
Partition Key = Hash Key
Sort Key = Range Key
Primary Key = Partition Key only, or Partition Key + Sort Key
* RedShift - direct COPY from DynamoDB
* EMR - DynamoDB integrated with Hive(read/write)
* S3 - export/import from/to DynamoDB via Data Pipeline
* Lambda - can be triggered from DynamoDB
* Kinesis Stream - KCL from Stream to DynamoDB
Single partition limits:
* 10 GB of data max
* 3000 RCU max
* 1000 WCU max
A new partition is created when a partition exceeds 10 GB, 3000 RCU, or 1000 WCU
Total partitions for desired performance = (Desired RCU / 3000 RCU) + (Desired WCU / 1000 WCU)
Total partitions for desired storage = Desired capacity in GB / 10 GB
Total partitions = MAX(Total partitions for desired performance, Total partitions for desired storage)
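The partition formulas above as a worked example. A minimal sketch; rounding each result up to a whole partition is an assumption (the formulas as written do not state it), and the function names are illustrative:

```python
import math


def partitions_for_throughput(rcu, wcu):
    """(Desired RCU / 3000) + (Desired WCU / 1000), rounded up."""
    return math.ceil(rcu / 3000 + wcu / 1000)


def partitions_for_storage(size_gb):
    """Desired capacity in GB / 10 GB, rounded up."""
    return math.ceil(size_gb / 10)


def total_partitions(rcu, wcu, size_gb):
    """The larger of the two estimates, and never fewer than one partition."""
    return max(1, partitions_for_throughput(rcu, wcu), partitions_for_storage(size_gb))


if __name__ == "__main__":
    # 7500 RCU and 3000 WCU over 45 GB: throughput needs 6, storage needs 5
    print(total_partitions(7500, 3000, 45))
```

Provisioned throughput is split evenly across partitions, so a table that grows to many partitions for storage reasons gets less throughput per partition.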
--- ---
--- EMR HDFS
--- ---
Default block size = 64 MB
Block size can be set per file
Default replication factor:
* <= 3 nodes = 1
* 4-9 nodes = 2
* >= 10 nodes = 3
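The defaults above can be combined to estimate how much raw disk a file consumes. A minimal sketch using only the numbers listed here; the function names are illustrative and the cluster may be configured differently:

```python
import math

DEFAULT_BLOCK_SIZE_MB = 64


def default_replication_factor(node_count):
    """Default replication factor by cluster size, per the list above."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if node_count <= 3:
        return 1
    if node_count <= 9:
        return 2
    return 3


def block_count(file_size_mb, block_size_mb=DEFAULT_BLOCK_SIZE_MB):
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, node_count):
    """Disk space used across the cluster once replicas are included."""
    return file_size_mb * default_replication_factor(node_count)


if __name__ == "__main__":
    print(block_count(1000), raw_storage_mb(1000, node_count=5))
```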
--- ---
--- EMR Hive
--- ---
It is a data warehouse infrastructure built on top of Hadoop
It lets you summarize, query and analyze very large data sets
Uses a SQL-like interface (HiveQL)
Use cases:
* Process and analyze logs
* Join very large tables
* Batch jobs
* Ad-hoc interactive queries over HDFS/S3 data
You can connect to Hive via JDBC/ODBC drivers
Hive on EMR (compared to classic Apache Hive) has additional integrations:
* S3
* * read and write data to/from S3
* * Partitioning in Hive. Partitioning supported with S3. (s3://
* DynamoDB(via EMR DynamoDB connector)
* * Join Hive and DynamoDB tables using HiveQL
* * Query data in DynamoDB tables using HiveQL
* * Copy data from DynamoDB tables to HDFS and vice versa
* * Copy data from DynamoDB to S3
* * Copy data from S3 to DynamoDB
* Kinesis Streams
* It is like predefined/custom expression for data format conversion, something like that
--- ---
--- EMR Tez
--- ---
It is an alternative to traditional MapReduce that lets jobs meet demands for fast response times and extreme throughput at petabyte scale.
It is the default execution engine in EMR 5.x and later (instead of MapReduce)
It works on top of YARN
--- ---
--- EMR EC2 choosing
--- ---
* batch oriented
* M3 or M4 instance types
* scale horizontally
Machine Learning:
* P2 instance types
* C3 or C4 instance types
* R3 or R4 instance types
HDFS/MapReduce with high IO performance or high IOPS
* I2 or I3
* D2
Master node:
* <= 50 slave nodes: m3.xlarge or m4.xlarge
* > 50 slave nodes: m3.2xlarge or m4.2xlarge
--- ---
--- EMR Oozie
--- ---
Scheduler. Workflow management system to manage hadoop jobs
--- ---
--- EMR Sqoop
--- ---
Database migration tool, from relational databases to hadoop
--- ---
--- EMR HBase
--- ---
Integrated with:
* Apache Hadoop
* Apache Hive
* Apache Phoenix
Use cases:
* Massive data storage
* High speed of data processing
When to use HBase:
* Large amount of data - 100s of GBs to PBs
* High write throughput and update rates
* NoSQL, flexible schema
* Fast access to data, random and real-time.
* Fault tolerance in non-relational environment
When DO NOT USE HBase:
* Transactional applications
* Relational database type features
* Small amount of data
--- ---
--- EMR Phoenix
--- ---
SQL and JDBC API for HBase. OLAP and OLTP
--- ---
--- EMR Presto
--- ---
* Open-source in-memory distributed fast SQL query engine
* Run interactive analytic queries against a variety of data sources with sizes ranging from GBs to PBs.
* Faster than Hive
Advantages of using Presto:
* You can query different types of data sources: RDBMS, NoSQL stores, frameworks like Hive, and stream processing platforms like Kafka (Cassandra, Hive, Kafka, MongoDB, MS SQL, PostgreSQL, Redis)
* High concurrency, runs thousands of queries per day
* In-memory processing helps avoid unnecessary I/O, leading to low latency
* Queries run directly on the Presto engine; they do not need an interpreter layer like Hive does (Tez)
Metadata of Presto on EMR is stored in Glue Data Catalog
When not to use:
* Not a database and not designed for OLTP
* Joining very large tables (100M plus rows) requires optimisation (use Hive instead)
* Batch processing
--- ---
--- EMR Spark
--- ---
Use cases:
* Interactive analytics (faster than running queries on Hive)
* Flexibility in terms of languages (Scala, Python etc)
* Flexibility in data sources(JDBC/ODBC; Files ORC,Parquet, AVRO, JSON; Hive tables etc) with data joins between them
* Run query against live data(Spark 2.0)
* * Structured Streaming
* Stream Processing
* * Disparate data sources
* * Small sizes
* * Process in real time for analysis
* Machine learning (MLlib)
* * Recommendation engine
* * Fraud detection
* * Customer segmentation
* * Security
* Data integration
* * ETL
When not to use:
* Not a database and not for OLTP
* Batch processing
* Many concurrent requests (avoid for large multi-user reporting environments with high concurrency)
* * Run ETL in Spark and copy the data to typical reporting database
* * Run batch jobs in Hive instead
Spark SQL:
* Run low latency interactive SQL queries against structured data
* RDD and DataFrame API for accessing data sources using Scala, Python, R or SQL
* Avro, Parquet, ORC, JSON
* JOIN across data sources
* Support querying Hive tables using HiveQL
* Query databases via JDBC/ODBC
Spark Streaming:
* Sources:
* * Kafka
* * Flume
* * HDFS/S3
* * Kinesis
* Targets:
* * HDFS
* * Databases
* * Dashboards
Spark MLLib algorithms:
* Classification: logistic regression, naive Bayes,...
* Regression: generalized linear regression, survival regression,...
* Decision trees, random forests, and gradient-boosted trees
* Recommendation: alternating least squares (ALS)
* Clustering: K-means, Gaussian mixtures (GMMs),...
* Topic modeling: latent Dirichlet allocation (LDA)
* Frequent itemsets, association rules, and sequential pattern mining
Spark on EMR integrates with services:
* DynamoDB(EMR-DynamoDB connector)
* RDS(JDBC data source with Spark SQL)
* ElasticSearch(elasticsearch connector)
* RedShift(Amazon RedShift Copy from HDFS)
* Kafka(streaming data connector)
* Kinesis(streaming data connector)
--- ---
--- EMR File Storage and Compression
--- ---
Algorithm | Splittable ? | Compression Ratio | Compress/Decompress Speed |
GZIP | No | High | Medium |
bzip2 | Yes | Very High | Slow |
LZO | Yes | Low | Fast |
Snappy | No | Low | Very fast |
Files format supported by EMR
* Text (csv, tsv)
* Parquet (Columnar oriented file format)
* ORC (Optimized Row Columnar file format)
* Sequence (Flat files consisting of binary key/value pairs)
* Avro (Data serialization framework)
File sizes:
* GZIP files are not splittable, keep them in the 1-2 GB range
* Avoid using smaller files (100 MB or less), plan for fewer files
* S3DistCp can be used to combine small files into larger files
--- ---
--- EMR Security
--- ---
At-rest data encryption for ec2 instances(instance store volumes and EBS volumes):
* open-source HDFS encryption
* LUKS encryption(Linux unified key setup)
Two files will be updated:
* /etc/hadoop/conf/core-site.xml
* * hadoop.rpc.protection privacy
* /etc/hadoop/conf/hdfs-site.xml
* * dfs.encrypt.data.transfer true
EMRFS encryption at-rest:
* SSE-S3
* CSE-Custom
--- ---
--- HCatalog
--- ---
Allows you to access Hive metastore tables from big data tools:
* Pig
* SparkSQL
* Custom MapReduce applications
--- ---
--- AWS Glue
--- ---
Fully managed ETL service:
* Categorize, clean and enrich your data
* Move data between various data sources
* Provide data catalog
--- ---
--- RedShift
--- ---
Block size = 1 MB
Max concurrent user connections = 500
Total concurrency level for all user-defined queues = 50
Max number of user-defined queues = 8
Superuser queue = 1
Default concurrency per queue = 5
Zone maps = min and max value for each block
HSM encryption is not supported for DC2 node types.
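The zone map noted above (min and max value for each block) is what lets a range filter skip blocks without reading them. A toy sketch of the idea, not of RedShift internals; the data layout and function names are assumptions for illustration:

```python
def build_zone_map(blocks):
    """Record the (min, max) value of every block."""
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map, low, high):
    """Indices of blocks whose [min, max] range overlaps the filter [low, high]."""
    return [
        index
        for index, (block_min, block_max) in enumerate(zone_map)
        if block_max >= low and block_min <= high
    ]


if __name__ == "__main__":
    # Sorted data: each block covers a narrow, distinct value range.
    blocks = [[1, 5, 9], [10, 14, 19], [20, 25, 29]]
    zone_map = build_zone_map(blocks)
    print(blocks_to_scan(zone_map, 12, 21))
```

With unsorted data every block tends to span the full value range, so nothing can be skipped; this is why the choice of sort key matters for query performance.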
Use cases:
* Data warehouse
* BI application, usually used by users in business hours
Data nodes types:
* dc2.large 2 vCPU, 15 GB RAM, 0.16 TB SSD, 1-32 nodes, Total capacity = 5.12 TB, 2 slices
* dc2.8xlarge 32 vCPU, 244 GB RAM, 2.56 TB SSD, 2-128 nodes, Total capacity = 327.68 TB, 16 slices (dc1.8xlarge - 32 slices)
* ds2.xlarge 4 vCPU, 31 GB RAM, 2 TB HDD, 1-32 nodes, Total capacity = 64 TB, 2 slices
* ds2.8xlarge 36 vCPU, 244 GB RAM, 16 TB HDD, 2-128 nodes, Total capacity = 2 PB (2048 TB), 16 slices
Integrates with:
* DynamoDB(COPY to RedShift)
* EMR(COPY to RedShift)
* EC2(COPY to RedShift via SSH)
* AWS Data Pipeline(Jobs on Data Pipeline)
* Lambda(COPY files, manage snapshots etc)
* QuickSight
* AWS Database Migration Service
* Kinesis(Streams, Firehose, Analytics)
* Amazon Machine Learning
When not to use a columnar database:
* small amount of data (less than 100 GB)
* Binary Large Objects
Sort key types:
* single column sort key
* compound:
* * made up of all columns listed in the sort key definition
* * it is default sort key
* * PARTITION BY, ORDER BY window function
* * table sorted by columns listed in sort key
* * poor performance if the query does not include primary sort column
* interleaved
* * equal weight for each column in the sortkey
* * multiple queries/different filters
* * table maintenance operations - data load / vacuum are slower
* * use only with very large tables - 100 million plus rows
* * not good for data loaded in sort order
Recommendations for sort key columns:
* TIMESTAMP column for recent data
* Columns used in BETWEEN conditions and equality operations
* Columns involved in a JOIN - the same column can be used for the Distribution Key and for the Sort Key
WLM Types(WorkLoad Management)
* User group - user groups assigned to WLM Group
* Query group - query assigned to WLM Group
Direct COPY to RedShift accessible from:
* S3
* EC2
* DynamoDB
Copy to RedShift via S3:
* Kinesis Firehose
* Kinesis enabled application
* Database Migration Service
Amazon RedShift Database Loader - uses Lambda to load data from S3 on an event-driven basis
File formats for the COPY command:
* csv
* delimited (data separated by "|"), often *.tbl
* fixed width
* json
* Avro
Loading encrypted data:
* SSE-S3(auto recognised by COPY)
* SSE-KMS(auto recognised by COPY)
* client-side encryption using a client-side symmetric master key
Not supported by COPY:
* SSE-C (server-side encryption with customer-provided key)
* Client-side encryption using a KMS-managed customer master key
* Client-side encryption using a customer-provided asymmetric master key
UNLOAD command:
* unload to S3
* automatically creates files using SSE-S3
* you can specify SSE-KMS
* you can specify client-side encryption with a customer-managed key (CSE-CMK)
UNLOAD does not support:
* SSE-C - S3 server side encryption using customer-supplied key
Enhanced VPC Routing - avoid traffic from S3 to RedShift via internet(you will use VPC endpoint or NAT Gateway)
VACUUM utility is not recommended for tables over 700Gb, use deep copy instead.
Automatic snapshots are taken every 8 hours or after every 5 GB per node of data changes by default.
To enable cross region KMS encrypted snapshot copy you need to create snapshot copy grant in the destination region
Encryption at rest:
* HSM (CloudHSM or On-Prem HSM)
* Encrypts data blocks, system metadata and snapshots
* Encryption is immutable. To create an unencrypted cluster from an encrypted one you need to create a new cluster and reload the data with UNLOAD/COPY
Encryption at rest - KMS:
4-tier hierarchy of encryption keys:
* master key
* cluster encryption key
* database encryption key
* data encryption key
STL system tables are generated from Amazon Redshift log files to provide a history of the system.
STV tables are actually virtual system tables that contain snapshots of the current system data.
Amazon Redshift database security is distinct from other types of Amazon Redshift security. In addition to database security, which is described in this section, Amazon Redshift provides these features to manage security:
* Sign-in credentials — Access to your Amazon Redshift Management Console is controlled by your AWS account privileges. For more information, see Sign-In Credentials.
* Access management — To control access to specific Amazon Redshift resources, you define AWS Identity and Access Management (IAM) accounts. For more information, see Controlling Access to Amazon Redshift Resources.
* Cluster security groups — To grant other users inbound access to an Amazon Redshift cluster, you define a cluster security group and associate it with a cluster. For more information, see Amazon Redshift Cluster Security Groups.
* VPC — To protect access to your cluster by using a virtual networking environment, you can launch your cluster in an Amazon Virtual Private Cloud (VPC). For more information, see Managing Clusters in Virtual Private Cloud (VPC).
* Cluster encryption — To encrypt the data in all your user-created tables, you can enable cluster encryption when you launch the cluster. For more information, see Amazon Redshift Clusters.
* SSL connections — To encrypt the connection between your SQL client and your cluster, you can use secure sockets layer (SSL) encryption. For more information, see Connect to Your Cluster Using SSL.
* Load data encryption — To encrypt your table load data files when you upload them to Amazon S3, you can use either server-side encryption or client-side encryption. When you load from server-side encrypted data, Amazon S3 handles decryption transparently. When you load from client-side encrypted data, the Amazon Redshift COPY command decrypts the data as it loads the table. For more information, see Uploading Encrypted Data to Amazon S3.
* Data in transit — To protect your data in transit within the AWS cloud, Amazon Redshift uses hardware accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup, and restore operations.
--- ---
--- Redshift Spectrum
--- ---
Redshift Spectrum supports the following structured and semistructured data formats:
Optimized row columnar (ORC)
--- ---
--- Machine Learning
--- ---
Use cases:
* Fraud detection
* Customer Service
* Litigation/Legal
* Security
* Healthcare
* Sports predictions
Unsupervised Learning
* Unlabeled data
* No knowledge of output
* "Self-guided" learning algorithm
* Aim: Figure out the data patterns and groupings
* Not supported by Amazon ML
* You can use EMR Spark/MLlib, which supports unsupervised learning
Supervised Learning
* Labeled data
* Desired output is known
* Providing the algorithm training data to learn from
* Aim: Predictive analytics
* Amazon Machine Learning
Datasources for Amazon ML:
* S3(csv)
* RedShift
Multiclass classification model:
* Use to generate prediction for multiple classes
* Average F1 score ranges from 0 to 1; a higher F1 score means better model quality
Binary classification model:
* true/false predictions
* AUC metric - Area Under the Curve
* AUC ranges from 0 to 1; the closer to 1, the better the model
* AUC closer to 1 - good model
* AUC 0.5 - model is no better than random choice
* AUC closer to 0 - you have an issue with data
* You can use your model if AUC > 0.5
Regression model:
* Predict numeric value
* RMSE number: root mean square error
* Lower RMSE - better model quality (should be lower than the baseline)
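Both metrics are simple to compute by hand, which helps when interpreting the numbers. A minimal sketch with made-up function names; the AUC here uses the pairwise definition (the chance that a random positive example scores higher than a random negative one), which is fine for small samples but slow for large ones:

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    labels are 1 (positive) or 0 (negative); both classes must be present.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # perfectly separated scores
    print(auc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1]))  # one positive ranked too low
    print(rmse([3.0, 5.0], [2.0, 7.0]))
```

An AUC well below 0.5 means the model ranks negatives above positives, which usually points at a labelling or data problem rather than a weak model.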
--- ---
--- ElasticSearch
--- ---
Use cases:
* search
* logging and analysis
* distributed document store
* real time application monitoring
* clickstream weblog ingestion
IoT -> ElasticSearch
S3 -> Lambda -> ElasticSearch
DynamoDB -> Lambda -> ElasticSearch
Kinesis Stream -> Lambda -> ElasticSearch
Kinesis Firehose -> Lambda -> ElasticSearch
CloudWatch -> Lambda -> ElasticSearch
AWS recommends 3 dedicated master nodes for each ES service domain
--- ---
--- Athena
--- ---
Analyze data on S3 using standard SQL
* Full standard SQL support using Presto
* Hive is supported by Athena for DDL. Create/delete/modification of tables/partitions.
* Metadata is stored in Glue
Data format supported:
* CSV, TSV, JSON, Textfile
* Parquet
Compression supported:
* Snappy
* ZLib
--- ---
--- QuickSight
--- ---
* Build visualisations
* Ad-hoc analysis
Supported data sources:
* Redshift
* Aurora
* Athena
* RDS (MariaDB 10.0+, SQL Server 2012+, MySQL 5.1+, PostgreSQL 9.3.1+)
* Databases on ec2 or on-prem(SQL Server, MySQL, PostgreSQL)
* Files (S3 or on-prem)
* * CSV
* * TSV
* * Common log format (*.clf)
* * Extended log format (*.elf)
* Software as a service
* * Salesforce
* * * Enterprise
* * * Unlimited
* * * Developer
Spice(Super-fast in-memory calculation engine)
* 10Gb per user by default at start
Data preparation:
* change field names
* add calculating fields
* sql queries
* join tables (from same data source)
* change data types
* data set checks (skip unusable rows that cannot be interpreted properly)
20 visuals per analysis
Visual Types:
* AutoGraph
* Bar charts, horizontal and vertical
* * single-measure - one measure for one dimension
* * multi-measure - 2 or more measures for one dimension
* * clustered - values for a dimension grouped by a related dimension
* Line charts (compare changes over a period of time) - gross sales by month, gross sales and net sales by month, measures for a dimension over a period of time
* Area line chart - same as a line chart, but the area under the line is filled with color
* Pivot table - way to summarize data. You can change row sort order, apply statistical functions
* Scatter plot
* * two or three measures for a dimension
* * each bubble represents an item in the dimension
* Tree map
* Pie chart
* Heat map (identify trends and outliers)
--- ---
--- Zeppelin
--- ---
Use cases:
* Zeppelin, SparkSQL, MLlib on EMR can be used together for exploratory data science and for recommendation engines
* Kinesis Streams, Zeppelin and Spark Streaming can be used together for analyzing real-time data
Zeppelin for Spark
Jupyter for Python
--- ---
--- D3.js
--- ---
Read data from csv, tsv, or json
--- ---
--- Microstrategy
--- ---
Business intelligence and analytics software
Helps businesses make better decisions
Dashboards, scorecards, reports, queries
Web, desktop, mobile
* Athena
* Aurora
* Redshift
--- ---
--- Apache Impala
--- ---
Apache Impala is the open source, native analytic database for Apache Hadoop
Kinesis Data Firehose Destinations:
* S3
* Redshift
* Amazon ElasticSearch
* Splunk
