2019-09-17

AWS BigData Specialty

###########################################
# Certification info
###########################################

https://aws.amazon.com/ru/certification/certified-big-data-specialty/

###########################################
# FAQ/Limits
###########################################

Amazon Elasticsearch Service
https://aws.amazon.com/ru/elasticsearch-service/faqs/

Amazon Kinesis Data Firehose
https://aws.amazon.com/kinesis/firehose/faqs/

Amazon Kinesis Data Streams
https://aws.amazon.com/ru/kinesis/data-streams/faqs/
https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html

Amazon Redshift
https://aws.amazon.com/ru/redshift/faqs/

###########################################
# Whitepapers
###########################################

https://aws.amazon.com/ru/whitepapers/#database
https://aws.amazon.com/ru/whitepapers/#analytics

###########################################
# STACK
###########################################

What I should know about each:
 * restrictions
 * relations with others
 * main functions
 * best practices
 * whitepapers
 * use cases
 * anti patterns

Athena
 * AVRO
 * Apache ORC
 * Apache Parquet
 * Snappy
Cognito
Data Pipeline
DynamoDB
ElasticSearch
EMR
 * Flink
 * Flume
 * HBase
 * HDFS
 * HCatalog
 * Hive
 * Ganglia
 * Impala
 * Mahout
 * MLLib
 * S3DistCp
 * Spark
 * Spark Streaming
 * Sqoop
 * Pig
 * Presto
 * Tez
 * Oozie
Glue
IoT
Kinesis Data Analytics
Kinesis Data Firehose
Kinesis Data Streams
Lambda
Machine Learning
 * RANDOM_CUT_FOREST
 * MXNet
 * Tensorflow
RDS
RedShift
 * Amazon Redshift Database Loader
RedShift Spectrum
S3
SageMaker
SQS
QuickSight
VPC
 * VPC Enhanced Routing
 * VPC Flow Logs

###########################################
# Small
###########################################

--- ---
--- Apache Flume
--- ---

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

--- ---
--- Apache Flink
--- ---

Apache Flink is an open-source distributed data processing platform developed by the Apache Software Foundation for big data applications, primarily for analyzing data stored in Hadoop clusters. It supports a combination of in-memory and disk-based processing.
Flink, which means "quick" or "nimble" in German, is an open-source platform for distributed stream and batch data processing. Originally named Stratosphere (2010) before it joined the Apache Incubator (2014), Flink was developed at the Technical University of Berlin as a replacement for Hadoop MapReduce (a software framework for distributed computation over large data sets), using a directed graph approach that removes the need for separate map and reduce phases.

--- ---
--- Kinesis Stream
--- ---

Scenarios:
* Fast log and data feed intake and processing
* Real time metrics and reporting
* Real time data analytics
* Complex stream processing

Kinesis Data Streams does not provide data transformation and record format conversion

Benefits:
* Real time aggregation of data
* Loading the aggregate data into a data warehouse/map reduce cluster
* Durability and Elasticity
* Parallel application readers

Methods to load/get data
* KPL (Kinesis Producer Library)
* KCL (get data)
* Kinesis Agent
* Kinesis Rest API

Where Kinesis Data Streams can send data:
* S3(with Java based connector library)
* EMR(with Java based connector library)
* DynamoDB(with Java based connector library)
* RedShift(with Java based connector library)
* ElasticSearch(with Java based connector library)
* Lambda(use on the stream and can send data to S3, DynamoDB, RedShift)

Single shard capacity
* 1 MB/sec data input
* 2 MB/sec data output
* 5 transactions/sec for read
* 1000 records/sec for write
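The per-shard limits above can be turned into a rough shard estimate. This is only a sizing sketch: the function name and the sample workload are made up for illustration, and the read limit of 5 transactions/sec is not modelled.

```python
import math

# Per-shard limits taken from the list above.
SHARD_WRITE_MB_PER_SEC = 1
SHARD_READ_MB_PER_SEC = 2
SHARD_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(write_mb_per_sec, read_mb_per_sec, records_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC),
        math.ceil(read_mb_per_sec / SHARD_READ_MB_PER_SEC),
        math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC),
    )


# Hypothetical workload: 5 MB/s in, 12 MB/s out, 3500 records/s.
print(shards_needed(write_mb_per_sec=5, read_mb_per_sec=12, records_per_sec=3500))  # 6
```

In this example the read side (12 MB/s at 2 MB/s per shard) is the bottleneck, so it drives the shard count.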

Max size of data record(data blob payload) = 1MB

Retention period:
* default = 24hours
* max = 7days

KPL anti-patterns:
* when the producer application cannot tolerate additional delay (KPL batching buffers records before sending)

KPL batching:
* Aggregation (combine multiple user records into a single stream record)
* Collection (multiple stream records combined into a single HTTP request)
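To make the aggregation idea concrete, here is a toy sketch that greedily packs user records into groups, each group standing in for one stream record. It is not how the KPL is implemented (the real library has its own record format and buffering settings); the `aggregate` function and the 1 MB default are assumptions for illustration, based on the record size limit noted in this section.

```python
MAX_STREAM_RECORD_BYTES = 1024 * 1024  # 1 MB record limit from these notes


def aggregate(user_records, limit=MAX_STREAM_RECORD_BYTES):
    """Greedily pack user records (bytes) into lists.

    Each inner list stands in for one aggregated stream record.
    """
    batches, current, current_size = [], [], 0
    for record in user_records:
        if len(record) > limit:
            raise ValueError("user record exceeds the stream record limit")
        # Start a new stream record when this one would overflow.
        if current and current_size + len(record) > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += len(record)
    if current:
        batches.append(current)
    return batches


# Five 400-byte user records with a small 1000-byte limit for readability.
groups = aggregate([b"x" * 400] * 5, limit=1000)
print([len(g) for g in groups])  # [2, 2, 1]
```

Five user records end up in three stream records, which is why aggregation helps stay under the per-shard records/sec write limit.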

Kinesis Agent features:
* convert a multi-line record to a single-line record
* convert delimited records to JSON format
* convert records from log format to JSON format
* publishes agent metrics to CloudWatch

Kinesis Connector Library
    This is a pre-built library that helps you easily integrate Amazon Kinesis Data Streams with other AWS services and third-party tools. Amazon Kinesis Client Library (KCL) is required for using this library.
Can connect with
* DynamoDB
* Redshift
* S3
* Elasticsearch

Amazon Kinesis Data Streams can automatically encrypt sensitive data as a producer enters it into a stream. Kinesis Data Streams uses AWS KMS master keys for encryption.

Encryption
    Kinesis Data Streams does not encrypt data on the client for you; protection at rest inside the service is enabled through server-side encryption. Server-side encryption automatically encrypts data before it's at rest by using an AWS KMS customer master key (CMK) you specify. Data is encrypted before it's written to the Kinesis stream storage layer, and decrypted after it's retrieved from storage. As a result, your data is encrypted at rest within the Kinesis Data Streams service, which lets you meet strict regulatory and data management requirements and enhance the security of your data.

--- ---
--- KCL
--- ---

* Uses DynamoDB for checkpointing
* Creates DynamoDB table with name = application name
* The DynamoDB table is created with 10 RCU and 10 WCU

When you start a KCL application, it calls the KCL to instantiate a worker. This call provides the KCL with configuration information for the application, such as the stream name and AWS credentials.
The KCL performs the following tasks:
* Connects to the stream
* Enumerates the shards
* Coordinates shard associations with other workers (if any)
* Instantiates a record processor for every shard it manages
* Pulls data records from the stream
* Pushes the records to the corresponding record processor
* Checkpoints processed records
* Balances shard-worker associations when the worker instance count changes
* Balances shard-worker associations when shards are split or merged

--- ---
--- Kinesis Firehose
--- ---

Input:
* Amazon Kinesis Agent
* AWS SDK

Output:
* S3
* RedShift
* ElasticSearch
* Splunk

Max record size = 1MB
Buffer size from 1MB to 128MB
Buffer interval from 60s to 900s
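A toy simulation of how the two buffer settings interact: data is delivered when either the buffer size or the buffer interval is reached, whichever comes first. This is a simplified model, not Firehose's actual implementation; in particular it assumes the interval timer starts when the first record enters an empty buffer, and `flush_points` is a hypothetical helper.

```python
def flush_points(records, buffer_size_bytes, buffer_interval_sec):
    """Simulate buffer flushes.

    records: iterable of (arrival_time_sec, size_bytes), sorted by time.
    Returns a list of (flush_time_sec, reason, bytes_flushed).
    Data still buffered after the last record is not flushed here.
    """
    flushes = []
    buffered = 0
    window_start = None  # time the first record entered the current buffer
    for arrival, size in records:
        # The interval expired before this record arrived.
        if window_start is not None and arrival - window_start >= buffer_interval_sec:
            flushes.append((window_start + buffer_interval_sec, "interval", buffered))
            buffered, window_start = 0, None
        if window_start is None:
            window_start = arrival
        buffered += size
        # The size threshold was reached by this record.
        if buffered >= buffer_size_bytes:
            flushes.append((arrival, "size", buffered))
            buffered, window_start = 0, None
    return flushes


events = [(0, 400), (10, 400), (20, 400), (100, 100), (200, 100)]
print(flush_points(events, buffer_size_bytes=1000, buffer_interval_sec=60))
# [(20, 'size', 1200), (160, 'interval', 100)]
```

The first flush is triggered by size (a burst of records), the second by the interval (a trickle of records).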

Frequency:
* S3: depends on buffer size
* ElasticSearch: depends on buffer size
* RedShift: depends on how fast RedShift finishes the COPY command; Firehose launches a new COPY command automatically

Failures:
* S3: retry delivery for up to 24h
* RedShift: you can specify retry duration 0-7200 seconds from S3
* ElasticSearch: you can specify retry duration 0-7200 seconds from S3

Can do:
* Record format conversion
* Data transformation(Lambda blueprints)

Kinesis Data Firehose provides the following Lambda blueprints that you can use to create a Lambda function for data transformation.
    General Firehose Processing — Contains the data transformation and status model described in the previous section. Use this blueprint for any custom transformation logic.
    Apache Log to JSON — Parses and converts Apache log lines to JSON objects, using predefined JSON field names.
    Apache Log to CSV — Parses and converts Apache log lines to CSV format.
    Syslog to JSON — Parses and converts Syslog lines to JSON objects, using predefined JSON field names.
    Syslog to CSV — Parses and converts Syslog lines to CSV format.
    Kinesis Data Firehose Process Record Streams as source — Accesses the Kinesis Data Streams records in the input and returns them with a processing status.
    Kinesis Data Firehose CloudWatch Logs Processor — Parses and extracts individual log events from records sent by CloudWatch Logs subscription filters.

--- ---
--- Kinesis Analytics
--- ---

Streaming Sources:
* Amazon Kinesis Data Streams
* Amazon Kinesis Data Firehose

Destination:
* Amazon Kinesis Data Firehose (Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk)
* AWS Lambda
* Amazon Kinesis Data Streams

Stagger Windows
    Using stagger windows is a windowing method that is suited for analyzing groups of data that arrive at inconsistent times. It is well suited for any time-series analytics use case, such as a set of related sales or log records.
    For example, VPC Flow Logs have a capture window of approximately 10 minutes. But they can have a capture window of up to 15 minutes if you're aggregating data on the client. Stagger windows are ideal for aggregating these logs for analysis.
    Stagger windows address the issue of related records not falling into the same time-restricted window, such as when tumbling windows were used.
Tumbling Windows (Aggregations Using GROUP BY)
    When a windowed query processes each window in a non-overlapping manner, the window is referred to as a tumbling window. In this case, each record on an in-application stream belongs to a specific window. It is processed only once (when the query processes the window to which the record belongs).
    A tumbling windows query aggregates data using distinct time-based windows that open and close at regular intervals.
Sliding Windows
    A sliding windows query aggregates data continuously, using a fixed time or rowcount interval.
Continuous Queries
    A continuous query is a query that executes continuously over streaming data. This continuous execution enables scenarios such as applications that continuously query a stream and generate alerts.
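The difference between tumbling and sliding windows is easier to see on a tiny data set. The sketch below is plain Python, not Kinesis Data Analytics SQL; the function names and the sample events are made up for illustration.

```python
from collections import defaultdict


def tumbling_counts(events, window_sec):
    """Count events per (window_start, key); each event lands in exactly one window."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_sec) * window_sec
        counts[(window_start, key)] += 1
    return dict(counts)


def sliding_counts(events, window_sec):
    """For each event, count same-key events in the window_sec ending at that event."""
    events = sorted(events)
    result = []
    for i, (ts, key) in enumerate(events):
        n = sum(1 for t, k in events[: i + 1] if k == key and ts - t < window_sec)
        result.append((ts, key, n))
    return result


events = [(1, "a"), (59, "a"), (61, "a"), (62, "b")]
print(tumbling_counts(events, 60))
# {(0, 'a'): 2, (60, 'a'): 1, (60, 'b'): 1}
print(sliding_counts(events, 60))
# [(1, 'a', 1), (59, 'a', 2), (61, 'a', 2), (62, 'b', 1)]
```

The records at t=59 and t=61 are only two seconds apart but fall into different tumbling windows, while the sliding window sees them together. That boundary effect is the problem stagger windows are meant to address.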

--- ---
--- SQS
--- ---

MAX message size 256 KB
MAX retention period 14 days

--- ---
--- IoT
--- ---

IoT Rule Engine can send data to:
* Elasticsearch
* Kinesis Firehose
* Kinesis Stream
* DynamoDB
* ML
* CloudWatch
* S3
* SQS
* SNS
* Lambda

AWS IoT Authentication
AWS IoT supports four types of identity principals for authentication:
    X.509 certificates
    IAM users, groups, and roles
    Amazon Cognito identities
    Federated identities

Typically, AWS IoT devices use X.509 certificates, while mobile applications use Amazon Cognito identities. Web and desktop applications use IAM or federated identities.

--- ---
--- Data Pipeline
--- ---

Move data across regions

AWS Data Pipeline supports the following types of databases:
1) JdbcDatabase - A JDBC database
2) RdsDatabase - An Amazon RDS database
3) RedshiftDatabase - An Amazon Redshift database

--- ---
--- DynamoDB
--- ---

Use cases:
* Mobile
* Web
* Gaming
* IoT
* Live online voting
* Session management
* Store S3 object metadata

MAX item size = 400KB

1 WCU (Write Capacity Unit) = one write per second of up to 1 KB. Item size is rounded up to the next 1 KB (a 500-byte write takes 1 WCU)
1 RCU (Read Capacity Unit) = one read per second of up to 4 KB. Item size is rounded up to the next 4 KB (a 2 KB read takes 1 RCU)

Strong Consistent Read = 1 RCU
Eventually Consistent Read  = 0.5 RCU
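A small calculator for the rounding rules above. The function names are hypothetical; the arithmetic follows the notes (1 KB write blocks, 4 KB read blocks, eventually consistent reads at half cost), with the eventually consistent total rounded up to a whole unit as an assumption.

```python
import math


def write_capacity_units(item_size_bytes, writes_per_sec):
    """WCU needed: item size rounded up to 1 KB blocks, times writes per second."""
    return math.ceil(item_size_bytes / 1024) * writes_per_sec


def read_capacity_units(item_size_bytes, reads_per_sec, strongly_consistent=True):
    """RCU needed: item size rounded up to 4 KB blocks, times reads per second.

    Eventually consistent reads cost half as much.
    """
    units = math.ceil(item_size_bytes / 4096) * reads_per_sec
    return units if strongly_consistent else math.ceil(units / 2)


print(write_capacity_units(1536, 10))                             # 20 (1.5 KB rounds to 2 KB)
print(read_capacity_units(6 * 1024, 10))                          # 20 (6 KB rounds to 8 KB)
print(read_capacity_units(6 * 1024, 10, strongly_consistent=False))  # 10
```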

Partition Key = Hash Key
Sort Key = Range Key
Primary Key = Partition Key only, or Partition Key + Sort Key

Integration:
* RedShift - direct COPY from DynamoDB
* EMR - DynamoDB integrated with Hive(read/write)
* S3 - export/import from/to DynamoDB via Data Pipeline
* Lambda - can be triggered from DynamoDB
* Kinesis Stream - KCL from Stream to DynamoDB

Partition:
* Can store 10Gb of data MAXIMUM
* 3000 RCU MAX
* 1000 WCU MAX

A new partition is created when >10GB or >3000RCU or >1000WCU

Total partitions for desired performance = (Desired RCU / 3000 RCU) + (Desired WCU / 1000 WCU), rounded up
Total partitions for desired storage = Desired capacity in GB / 10GB, rounded up
Total partitions = MAX(Total partitions for desired performance, Total partitions for desired storage)
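The partition formulas can be checked with a few lines of Python. A sketch only: `dynamodb_partitions` is a hypothetical helper, and it assumes fractional results are rounded up to whole partitions.

```python
import math


def dynamodb_partitions(desired_rcu, desired_wcu, storage_gb):
    """Estimate partition count from throughput and storage, per the formulas in these notes."""
    by_throughput = math.ceil(desired_rcu / 3000 + desired_wcu / 1000)
    by_storage = math.ceil(storage_gb / 10)
    return max(by_throughput, by_storage, 1)


# 7500 RCU and 3000 WCU -> 2.5 + 3 = 5.5 -> 6 partitions; 25 GB -> 3 partitions.
print(dynamodb_partitions(desired_rcu=7500, desired_wcu=3000, storage_gb=25))  # 6
```

Whichever of the two estimates is larger wins, so a small but very busy table and a large but quiet table can end up with the same number of partitions.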

--- ---
--- EMR HDFS
--- ---

Default block size = 64Mb
We can set block size per file

Default replication factor:
* <= 3 nodes = 1
* 4-9 nodes = 2
* >= 10 nodes = 3
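The replication table above as a lookup function, handy for estimating usable HDFS capacity. The function name and the capacity example are illustrative assumptions.

```python
def default_replication_factor(node_count):
    """Default HDFS replication factor on EMR for a given number of nodes (per the table above)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if node_count <= 3:
        return 1
    if node_count <= 9:
        return 2
    return 3


# Usable capacity is roughly raw capacity divided by the replication factor.
nodes, raw_tb_per_node = 6, 2
print(nodes * raw_tb_per_node / default_replication_factor(nodes))  # 6.0
```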

--- ---
--- EMR Hive
--- ---

It is a data warehouse infrastructure built on top of Hadoop
It allows you to summarize, query and analyze very large data sets
Uses a SQL-like interface (HiveQL)

Use cases:
* Process and analyze logs
* Join very large tables
* Batch jobs
* Ad-hoc interactive queries over HDFS/S3 data

You can connect to Hive via JDBC/ODBC drivers

Hive on EMR (compared to classic Apache Hive) has additional integrations:
* S3
* * read and write data to/from S3
* * EMRFS
* * Partitioning in Hive. Partitioning supported with S3. (s3://///
/somelogfilenames.log)
* DynamoDB(via EMR DynamoDB connector)
* * Join Hive and DynamoDB tables using HiveQL
* * Query data in DynamoDB tables using HiveQL
* * Copy data from DynamoDB tables to HDFS and vice versa
* * Copy data from DynamoDB to S3
* * Copy data from S3 to DynamoDB
* Kinesis Streams

SerDe (Serializer/Deserializer)
* Tells Hive how to read records from, and write records to, a particular data format; you can use a built-in SerDe or write a custom one

--- ---
--- EMR Tez
--- ---

It is an alternative to traditional MapReduce that lets jobs meet requirements for fast response times and extreme throughput at petabyte scale.

It is the default execution engine in EMR 5.x and later (instead of MapReduce)
It runs on top of YARN

--- ---
--- EMR EC2 choosing
--- ---

MapReduce:
* batch oriented
* M3 or M4 instance types
* scale horizontally

Machine Learning:
* P2 instance types
* C3 or C4 instance types

Spark:
* R3 or R4 instance types

HDFS/MapReduce with high IO performance or high IOPS
* I2 or I3
* D2

Master node:
* <= 50 slave nodes: m3.xlarge or m4.xlarge
* > 50 slave nodes: m3.2xlarge or m4.2xlarge

--- ---
--- EMR Oozie
--- ---

Scheduler. Workflow management system to manage hadoop jobs

--- ---
--- EMR Sqoop
--- ---

Database migration tool, from relational databases to hadoop

--- ---
--- EMR HBase
--- ---

Integrated with:
* Apache Hadoop
* Apache Hive
* Apache Phoenix
* HDFS
* EMRFS

Use cases:
* Massive data storage
* High speed of data processing

When to use HBase:
* Large amount of data - 100s of GBs to PBs
* High write throughput and update rates
* NoSQL, flexible schema
* Fast access to data, random and real-time.
* Fault tolerance in non-relational environment

When DO NOT USE HBase:
* Transactional applications
* Relational database type features
* Small amount of data

--- ---
--- EMR Phoenix
--- ---

SQL and JDBC API for HBase. OLAP and OLTP

--- ---
--- EMR Presto
--- ---

* Open-source in-memory distributed fast SQL query engine
* Run interactive analytic queries against a variety of data sources with sizes ranging from GBs to PBs.
* Faster than Hive

Advantages of using Presto:
* You can query different types of data sources, from RDBMS and NoSQL stores to frameworks like Hive and stream processing platforms like Kafka: Cassandra, Hive, Kafka, MongoDB, MS SQL Server, PostgreSQL, Redis
* High concurrency, runs thousands of queries per day
* In-memory processing helps avoid unnecessary I/O, leading to low latency
* Queries run directly on the Presto engine; they do not need an interpreter layer as Hive does (Tez)

Metadata of Presto on EMR is stored in Glue Data Catalog

When not to use:
* Not database and not designed for OLTP
* Joining very large tables (100M plus rows) requires optimization (use Hive instead)
* Batch processing

--- ---
--- EMR Spark
--- ---

Use cases:
* Interactive Analytics(Faster than run queries on Hive)
* Flexibility in terms of languages (Scala, Python etc)
* Flexibility in data sources(JDBC/ODBC; Files ORC,Parquet, AVRO, JSON; Hive tables etc) with data joins between them
* Run query against live data(Spark 2.0)
* * Structured Streaming
* Stream Processing
* * Disparate data sources
* * Small sizes
* * Process in real time for analysis
* Machine learning(MLLib)
* * Recommendation engine
* * Fraud detection
* * Customer segmentation
* * Security
* Data integration
* * ETL

When not to use:
* Not a database and not for OLTP
* Batch processing
* Many concurrent requests (avoid for large multi-user reporting environments with high concurrency)
* * Run ETL in Spark and copy the data to typical reporting database
* * Run batch jobs in Hive instead

Spark SQL:
* Run low latency interactive SQL queries against structured data
* RDD and DataFrame API for access datasources using Scala, Python, R or SQL
* Avro, Parquet, ORC, JSON
* JOIN across data sources
* Support querying Hive tables using HiveQL
* Query databases via JDBC/ODBC

Spark Streaming:
* Sources:
* * Kafka
* * Flume
* * HDFS/S3
* * Kinesis
* * Twitter
* Targets:
* * HDFS
* * Databases
* * Dashboards

Spark MLLib algorithms:
    Classification: logistic regression, naive Bayes,...
    Regression: generalized linear regression, survival regression,...
    Decision trees, random forests, and gradient-boosted trees
    Recommendation: alternating least squares (ALS)
    Clustering: K-means, Gaussian mixtures (GMMs),...
    Topic modeling: latent Dirichlet allocation (LDA)
    Frequent itemsets, association rules, and sequential pattern mining

Spark on EMR integrates with services:
* DynamoDB(EMR-DynamoDB connector)
* RDS(JDBC data source with Spark SQL)
* ElasticSearch(elasticsearch connector)
* RedShift(Amazon RedShift Copy from HDFS)
* Kafka(streaming data connector)
* Kinesis(streaming data connector)
* S3(EMRFS)

--- ---
--- EMR File Storage and Compression
--- ---

Algorithm | Splittable ? | Compression Ratio | Compress/Decompress Speed |
----------|--------------|-------------------|---------------------------|
GZIP      | No           | High              | Medium                    |
bzip2     | Yes          | Very High         | Slow                      |
LZO       | Yes          | Low               | Fast                      |
Snappy    | No           | Low               | Very fast                 |

Files format supported by EMR
* Text (csv, tsv)
* Parquet (Columnar oriented file format)
* ORC (Optimized Row Columnar file format)
* Sequence (Flat files consisting of binary key/value pairs)
* Avro (Data serialization framework)

File sizes:
* GZIP files are not splittable, keep them in 1-2Gb Range
* Avoid using smaller files (100MB or less), plan for fewer files
* S3DistCp can be used to combine small files into larger files

--- ---
--- EMR Security
--- ---

At-rest data encryption for ec2 instances(instance store volumes and EBS volumes):
* open-source HDFS encryption
* LUKS encryption(Linux unified key setup)
Two files will be updated:
* /etc/hadoop/conf/core-site.xml
* * hadoop.rpc.protection   privacy
* /etc/hadoop/conf/hdfs-site.xml
* * dfs.encrypt.data.transfer   true

EMRFS encryption at-rest:
* SSE-S3
* SSE-KMS
* CSE-KMS
* CSE-Custom

--- ---
--- HCatalog
--- ---

Allows you to access Hive metastore tables with BigData tools:
* Pig
* SparkSQL
* Custom MapReduce applications

--- ---
--- AWS Glue
--- ---

Fully managed ETL service:
* Categorize, clean and enrich your data
* Move data between various data sources
* Provide data catalog

--- ---
--- RedShift
--- ---

Block size = 1Mb
Max concurrent user connections = 500
Total concurrency level for all user-defined queues = 50
Max number of user-defined queues = 8
Super user queue = 1
Default concurrency per queue = 5
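The WLM limits above can be expressed as a quick sanity check for a planned queue layout. This is a hypothetical helper for study purposes, not a Redshift API; it only encodes the three numbers from these notes.

```python
MAX_USER_QUEUES = 8
MAX_TOTAL_CONCURRENCY = 50
DEFAULT_QUEUE_CONCURRENCY = 5


def validate_wlm(queue_concurrency):
    """Check a planned WLM layout against the limits in these notes.

    queue_concurrency: one entry per user-defined queue; None means "use the default".
    Returns a list of problems (empty when the layout fits).
    """
    slots = [DEFAULT_QUEUE_CONCURRENCY if c is None else c for c in queue_concurrency]
    problems = []
    if len(slots) > MAX_USER_QUEUES:
        problems.append(f"{len(slots)} queues exceeds the maximum of {MAX_USER_QUEUES}")
    if sum(slots) > MAX_TOTAL_CONCURRENCY:
        problems.append(f"total concurrency {sum(slots)} exceeds {MAX_TOTAL_CONCURRENCY}")
    return problems


print(validate_wlm([None] * 8))  # [] : 8 queues x 5 slots = 40, within both limits
print(validate_wlm([10] * 6))    # ['total concurrency 60 exceeds 50']
```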
Zone maps = min and max value for each block

HSM encryption is not supported for DC2 node types.

Use cases:
* Data warehouse
* BI application, usually used by users in business hours

Data nodes types:
* dc2.large   2vCPU,  15Gb RAM   0.16Tb SSD, 1-32 nodes,  Total capacity = 5.12Tb,      2 slices
* dc2.8xlarge 32vCPU, 244Gb RAM  2.56Tb SSD, 2-128 nodes, Total capacity = 327.68Tb,    16 slices(dc1.8xlarge - 32 slices)
* ds2.large   4vCPU,  31Gb RAM   2Tb HDD,    1-32 nodes,  Total capacity = 64Tb,        2 slices
* ds2.8xlarge 36vCPU, 244Gb RAM  16Tb HDD,   2-128 nodes, Total capacity = 2Pb(2048Tb), 16 slices

Integrates with:
* S3(COPY and UNLOAD)
* DynamoDB(COPY to RedShift)
* EMR(COPY to RedShift)
* EC2(COPY to RedShift via SSH)
* AWS Data Pipeline(Jobs on Data Pipeline)
* Lambda(COPY files, manage snapshots etc)
* QuickSight
* AWS Database Migration Service
* Kinesis(Streams, Firehose, Analytics)
* Amazon Machine Learning

When not to use a columnar database
* small amount of data (less than 100Gb)
* OLTP
* Binary Large Objects (BLOBs)

Sort key types:
* single column sort key
* compound:
* * uses all columns listed in the sort key definition, in the order they are listed
* * it is default sort key
* * JOINS
* * ORDER BY, GROUP BY
* * PARTITION BY, ORDER BY window function
* * table sorted by columns listed in sort key
* * poor performance if the query does not include primary sort column
* interleaved
* * equal weight for each column in the sortkey
* * multiple queries/different filters
* * table maintenance operations - data load / vacuum are slower
* * use only with very large tables - 100 million plus rows
* * not good for data loaded in sort order

Recommendations for choosing sort key columns:
* TIMESTAMP column when recent data is queried most often
* Columns that are used in BETWEEN conditions and equality operations
* Columns that are involved in a JOIN - the same column can be used for the Distribution Key and for the Sort Key

WLM Types(WorkLoad Management)
* User group - user groups assigned to WLM Group
* Query group - query assigned to WLM Group

Direct COPY to RedShift accessible from:
* S3
* EC2
* EMR
* DynamoDB

Copy to RedShift via S3:
* Kinesis Firehose
* Kinesis enabled application
* Database Migration Service

Amazon RedShift Database Loader - uses Lambda to load data from S3 on an event-driven basis

File formats for the COPY command:
* csv
* delimited (data separated by "|"), often *.tbl
* fixed width
* json
* Avro

Loading encrypted data:
* SSE-S3(auto recognised by COPY)
* SSE-KMS(auto recognised by COPY)
* client-side encryption using a client-side symmetric master key
Not supported by COPY:
* SSE-C (server-side encryption with a customer-provided key)
* Client-side encryption using a KMS-managed customer master key
* Client-side encryption using a customer-provided asymmetric master key

UNLOAD command:
* unload to S3
* automatically creates files using SSE-S3
* you can specify SSE-KMS
* you can specify client-side encryption with a customer-managed key (CSE-CMK)
UNLOAD not support:
* SSE-C - S3 server side encryption using customer-supplied key

Enhanced VPC Routing - avoid traffic from S3 to RedShift via internet(you will use VPC endpoint or NAT Gateway)

VACUUM utility is not recommended for tables over 700Gb, use deep copy instead.

Automatic snapshots are taken every 8 hours or after every 5 GB of data changes per node, by default.
To enable cross region KMS encrypted snapshot copy you need to create snapshot copy grant in the destination region

Encryption at rest:
* KMS
* HSM (CloudHSM or On-Prem HSM)
* Encrypts data blocks, system metadata and snapshots
* Encryption is immutable. To get an unencrypted cluster from an encrypted one, you need to create a new cluster and reload the data with UNLOAD/COPY

Encryption at rest - KMS:
4-tier hierarchy of encryption keys:
* master key
* cluster encryption key
* database encryption key
* data encryption key

STL system tables are generated from Amazon Redshift log files to provide a history of the system.
STV tables are actually virtual system tables that contain snapshots of the current system data.

Amazon Redshift database security is distinct from other types of Amazon Redshift security. In addition to database security, which is described in this section, Amazon Redshift provides these features to manage security:
* Sign-in credentials — Access to your Amazon Redshift Management Console is controlled by your AWS account privileges. For more information, see Sign-In Credentials.
* Access management — To control access to specific Amazon Redshift resources, you define AWS Identity and Access Management (IAM) accounts. For more information, see Controlling Access to Amazon Redshift Resources.
* Cluster security groups — To grant other users inbound access to an Amazon Redshift cluster, you define a cluster security group and associate it with a cluster. For more information, see Amazon Redshift Cluster Security Groups.
* VPC — To protect access to your cluster by using a virtual networking environment, you can launch your cluster in an Amazon Virtual Private Cloud (VPC). For more information, see Managing Clusters in Virtual Private Cloud (VPC).
* Cluster encryption — To encrypt the data in all your user-created tables, you can enable cluster encryption when you launch the cluster. For more information, see Amazon Redshift Clusters.
* SSL connections — To encrypt the connection between your SQL client and your cluster, you can use secure sockets layer (SSL) encryption. For more information, see Connect to Your Cluster Using SSL.
* Load data encryption — To encrypt your table load data files when you upload them to Amazon S3, you can use either server-side encryption or client-side encryption. When you load from server-side encrypted data, Amazon S3 handles decryption transparently. When you load from client-side encrypted data, the Amazon Redshift COPY command decrypts the data as it loads the table. For more information, see Uploading Encrypted Data to Amazon S3.
* Data in transit — To protect your data in transit within the AWS cloud, Amazon Redshift uses hardware accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup, and restore operations.

--- ---
--- Redshift Spectrum
--- ---

Redshift Spectrum supports the following structured and semistructured data formats:
    AVRO
    PARQUET
    TEXTFILE
    SEQUENCEFILE
    RCFILE
    RegexSerDe
    Optimized row columnar (ORC)
    Grok
    OpenCSV
    Ion
    JSON

--- ---
--- Machine Learning
--- ---

Use cases:
* Fraud detection
* Customer Service
* Litigation/Legal
* Security
* Healthcare
* Sports predictions

Unsupervised Learning
* Unlabeled data
* No knowledge of output
* "Self-guided" learning algorithm
* Aim: Figure out the data patterns and groupings
* Not supported by Amazon ML
* You can use EMR Spark/MLLib, which supports unsupervised learning

Supervised Learning
* Labeled data
* Desired output is known
* Providing the algorithm training data to learn from
* Aim: Predictive analytics
* Amazon Machine Learning

Datasources for Amazon ML:
* S3(csv)
* RDS (MySQL)
* RedShift

Multiclass classification model:
* Use to generate prediction for multiple classes
* Average F1 score ranges from 0 to 1; a higher F1 score means better ML model quality
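A pure-Python sketch of an average F1 score for a multiclass model. It assumes the "average" is the macro average (the unweighted mean of the per-class F1 scores); the function name and the sample labels are made up.

```python
def macro_f1(actual, predicted):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(actual) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


actual = ["a", "a", "b", "b", "c", "c"]
predicted = ["a", "a", "b", "c", "c", "c"]
print(round(macro_f1(actual, predicted), 3))  # 0.822
```

Per-class scores here are 1.0 for "a", about 0.667 for "b" and 0.8 for "c", so one misclassified "b" pulls the average down to roughly 0.82.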

Binary classification model:
* true/false predictions
* AUC metric - Area Under the (ROC) Curve
* AUC ranges from 0 to 1; the closer AUC is to 1, the better the model
* AUC 0.500 - model is no better than random choice
* AUC closer to 0 - you have an issue with the data
* You can use your model if AUC > 0.5
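AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair; it is a simple O(n*m) illustration, and `auc` is a hypothetical helper, not an Amazon ML call.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # 1.0  (perfect ranking)
print(auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5  (no better than random)
print(auc([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1]))   # 0.0  (ranking is inverted)
```

The last case illustrates the "issue with the data" bullet: an AUC near 0 usually means the labels or scores are flipped.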

Regression model:
* Predict numeric value
* RMSE number: root mean square error
* Lower RMSE means a better ML model (it should be lower than the baseline RMSE)
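A short sketch of RMSE and the baseline comparison. It assumes the baseline is a model that always predicts the mean of the target values; both function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def beats_baseline(actual, predicted):
    """True when the model's RMSE is lower than that of always predicting the mean."""
    mean = sum(actual) / len(actual)
    baseline = rmse(actual, [mean] * len(actual))
    return rmse(actual, predicted) < baseline


print(beats_baseline([1, 2, 3], [1.1, 2.0, 2.9]))  # True
print(beats_baseline([1, 2, 3], [3, 2, 1]))        # False
```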

--- ---
--- ElasticSearch
--- ---

Use cases:
* search
* logging and analysis
* distributed document store
* real time application monitoring
* clickstream weblog ingestion

Integration:
IoT              -> ElasticSearch
S3               -> Lambda -> ElasticSearch
DynamoDB         -> Lambda -> ElasticSearch
Kinesis Stream   -> Lambda -> ElasticSearch
Kinesis Firehose -> Lambda -> ElasticSearch
CloudWatch       -> Lambda -> ElasticSearch

AWS recommends having 3 dedicated master nodes for each Amazon ES domain

--- ---
--- Athena
--- ---

Analyze data on S3 using standard SQL
* Full standard SQL support using Presto
* Hive is supported by Athena for DDL. Create/delete/modification of tables/partitions.
* Metadata is stored in Glue

Data format supported:
* CSV, TSV, JSON, Textfile
* ORC
* Parquet

Compression supported:
* Snappy
* ZLib
* LZO
* GZIP

--- ---
--- QuickSight
--- ---

* Build visualisations
* Ad-hoc analysis

Supported data sources:
* Redshift
* Aurora
* Athena
* RDS(MariaDB 10.0 >, SQL Server 2012 >, MySQL 5.1 >, PostgreSQL 9.3.1 >)
* Databases on ec2 or on-prem(SQL Server, MySQL, PostgreSQL)
* Files (S3 or on-prem)
* * CSV
* * TSV
* * Common log format (*.clf)
* * Extended log format (*.elf)
* Software as a service
* * Salesforce
* * * Enterprise
* * * Unlimited
* * * Developer

Spice(Super-fast in-memory calculation engine)
* 10Gb per user by default at start

Data preparation:
* change field names
* add calculating fields
* sql queries
* join tables (from same data source)
* change data types
* data set checks (skip unused rows that cannot be interpreted properly)

20 visuals per analysis

Visual Types:
* AutoGraph
* Bar charts, horizontal and vertical
* * single-measure - one measure for one dimension
* * multi-measure - 2 or more measures for one dimension
* * clustered - values for a dimension grouped by a related dimension
* Line charts (compare changes over a period of time) - gross sales by month, gross sales and net sales by month, measures for a dimension over a period of time
* Area line chart - same as a line chart, but the area under the line is filled with color
* Pivot table - a way to summarize data. You can change row sort order and apply statistical functions
* Scatter plot
* * two or three measures for a dimension
* * each bubble represents an item in the dimension
* Tree map
* Pie chart
* Heat map (identify trends and outliers)

--- ---
--- Zeppelin
--- ---

Use cases:
* Zeppelin, SparkSQL, MLLib on EMR can be used together for exploratory data science and for recommendation engines
* Kinesis Streams, Zeppelin and Spark Streaming can be used together for analyzing realtime data

Zeppelin for Spark
Jupyter for Python

--- ---
--- D3.js
--- ---

Read data from csv, tsv, or json

--- ---
--- Microstrategy
--- ---

business intelligence and analytics software
Help business to make better decisions
Dashboards, scorecards, reports, queries
Web, desktop, mobile

Sources:
* Athena
* Aurora
* EMR
* Redshift

--- ---
--- Apache Impala
--- ---

Apache Impala is the open source, native analytic database for Apache Hadoop

###########################################
# Relations
###########################################

Kinesis Data Firehose Destinations:
    * S3
    * Redshift
    * Amazon ElasticSearch
    * Splunk

###########################################
# Limits
###########################################

Kinesis Data Firehose
https://docs.aws.amazon.com/firehose/latest/dev/limits.html

###########################################
# LinuxAcademy
###########################################

###########################################
# Udemy
###########################################

###########################################
# ACloudGuru
###########################################

INGESTED: Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming
INGESTED: https://aws.amazon.com/ru/blogs/big-data/querying-amazon-kinesis-streams-directly-with-sql-and-spark-streaming/

INGESTED: Implementing Efficient and Reliable Producers with the Amazon Kinesis Producer Library
INGESTED: https://aws.amazon.com/ru/blogs/big-data/implementing-efficient-and-reliable-producers-with-the-amazon-kinesis-producer-library/

INGESTED: Amazon Kinesis Firehose Data Transformation with AWS Lambda
INGESTED: https://aws.amazon.com/ru/blogs/compute/amazon-kinesis-firehose-data-transformation-with-aws-lambda/

INGESTED: Core Tenets of IoT
INGESTED: https://d0.awsstatic.com/whitepapers/core-tenets-of-iot1.pdf

INGESTED: Amazon QuickSight Update – Geospatial Visualization, Private VPC Access, and More
INGESTED: https://aws.amazon.com/ru/blogs/aws/amazon-quicksight-update-geospatial-visualization-private-vpc-access-and-more/

INGESTED: Store, Protect, Optimize Your Healthcare Data with AWS: Part 1
INGESTED: https://aws.amazon.com/ru/blogs/architecture/store-protect-optimize-your-healthcare-data-with-aws/

INGESTED: Scaling Writes on Amazon DynamoDB Tables with Global Secondary Indexes
INGESTED: https://aws.amazon.com/ru/blogs/big-data/scaling-writes-on-amazon-dynamodb-tables-with-global-secondary-indexes/

Improving Data Access with Secondary Indexes
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html

DynamoDB Cross-Region Replication
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.CrossRegionRepl.html

Best Practices for Amazon EMR
https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf

Best Practices for Securing Amazon EMR
https://aws.amazon.com/ru/blogs/big-data/best-practices-for-securing-amazon-emr/

View Web Interfaces Hosted on Amazon EMR Clusters
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html

Comparing the Use of Amazon DynamoDB and Apache HBase for NoSQL
https://d1.awsstatic.com/whitepapers/AWS_Comparing_the_Use_of_DynamoDB_and_HBase_for_NoSQL.pdf

Access HBase Tables with Hive
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase-access-hive.html

Combine NoSQL and Massively Parallel Analytics Using Apache HBase and Apache Hive on Amazon EMR
https://aws.amazon.com/ru/blogs/big-data/combine-nosql-and-massively-parallel-analytics-using-apache-hbase-and-apache-hive-on-amazon-emr/

Analyze Your Data on Amazon DynamoDB with Apache Spark
https://aws.amazon.com/ru/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/

Analyze Realtime Data from Amazon Kinesis Streams Using Zeppelin and Spark Streaming
https://aws.amazon.com/ru/blogs/big-data/analyze-realtime-data-from-amazon-kinesis-streams-using-zeppelin-and-spark-streaming/

Optimize Spark-Streaming to Efficiently Process Amazon Kinesis Streams
https://aws.amazon.com/ru/blogs/big-data/optimize-spark-streaming-to-efficiently-process-amazon-kinesis-streams/

Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
https://aws.amazon.com/ru/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/

Using Spark SQL for ETL
https://aws.amazon.com/ru/blogs/big-data/using-spark-sql-for-etl/

A Zero-Administration Amazon Redshift Database Loader
https://aws.amazon.com/ru/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/

Ad Hoc Big Data Processing Made Simple with Serverless MapReduce
https://aws.amazon.com/ru/blogs/compute/ad-hoc-big-data-processing-made-simple-with-serverless-mapreduce/

Using AWS Lambda for Event-driven Data Processing Pipelines
https://aws.amazon.com/ru/blogs/big-data/using-aws-lambda-for-event-driven-data-processing-pipelines/

Using AWS Lambda with Amazon DynamoDB
https://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html

Amazon Kinesis Firehose Data Transformation with AWS Lambda
https://aws.amazon.com/ru/blogs/compute/amazon-kinesis-firehose-data-transformation-with-aws-lambda/

Using AWS Lambda with Amazon Kinesis
https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html

Implementing a Serverless AWS IoT Backend with AWS Lambda and Amazon DynamoDB
https://aws.amazon.com/ru/blogs/compute/implementing-a-serverless-aws-iot-backend-with-aws-lambda-and-amazon-dynamodb/

Indexing Metadata in Amazon Elasticsearch Service Using AWS Lambda and Python
https://aws.amazon.com/ru/blogs/database/indexing-metadata-in-amazon-elasticsearch-service-using-aws-lambda-and-python/

Data Warehousing on AWS
https://d0.awsstatic.com/whitepapers/enterprise-data-warehousing-on-aws.pdf

Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization
https://aws.amazon.com/ru/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/

Optimizing for Star Schemas and Interleaved Sorting on Amazon Redshift
https://aws.amazon.com/ru/blogs/big-data/optimizing-for-star-schemas-and-interleaved-sorting-on-amazon-redshift/

Loading Encrypted Data Files from Amazon S3
https://docs.aws.amazon.com/redshift/latest/dg/c_loading-encrypted-files.html

Unloading Encrypted Data Files
https://docs.aws.amazon.com/redshift/latest/dg/t_unloading_encrypted_files.html

Amazon Redshift Clusters
https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html

Analyze & Vacuum Schema Utility
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AnalyzeVacuumUtility

Performing a Deep Copy
https://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html

Amazon Redshift Snapshot Manager
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/SnapshotManager

Redshift Advance Monitoring
https://github.com/awslabs/amazon-redshift-monitoring

Factors Affecting Query Performance
https://docs.aws.amazon.com/redshift/latest/dg/c-query-performance.html

Building a Multi-Class ML Model with Amazon Machine Learning
https://aws.amazon.com/ru/blogs/big-data/building-a-multi-class-ml-model-with-amazon-machine-learning/

Building a Binary Classification Model with Amazon Machine Learning and Amazon Redshift
https://aws.amazon.com/ru/blogs/big-data/building-a-binary-classification-model-with-amazon-machine-learning-and-amazon-redshift/

Building a Numeric Regression Model with Amazon Machine Learning
https://aws.amazon.com/ru/blogs/big-data/building-a-numeric-regression-model-with-amazon-machine-learning/

How to Visualize and Refine Your Network’s Security by Adding Security Group IDs to Your VPC Flow Logs
https://aws.amazon.com/ru/blogs/security/how-to-visualize-and-refine-your-networks-security-by-adding-security-group-ids-to-your-vpc-flow-logs/#more-3559

Managing Amazon Elasticsearch Service Domains
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains.html

Run Jupyter Notebook and JupyterHub on Amazon EMR
https://aws.amazon.com/ru/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/

Hadoop: Encrypted Shuffle
https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html

Amazon Redshift Database Encryption
https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-db-encryption.html


###########################################
# Whizlabs
###########################################

INGESTED: Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift
INGESTED: https://aws.amazon.com/ru/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

INGESTED: Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR
INGESTED: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMRforDynamoDB.html

INGESTED: Amazon Redshift Snapshots
INGESTED: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html

INGESTED: Strategies for Reducing Your Amazon EMR Costs
INGESTED: https://aws.amazon.com/ru/blogs/big-data/strategies-for-reducing-your-amazon-emr-costs/

INGESTED: Best Practices for Storing Large Items and Attributes
INGESTED: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-use-s3-too.html

INGESTED: Rules for AWS IoT
INGESTED: https://docs.aws.amazon.com/iot/latest/developerguide/iot-rules.html

INGESTED: Choosing the Right DynamoDB Partition Key
INGESTED: https://aws.amazon.com/ru/blogs/database/choosing-the-right-dynamodb-partition-key/

INGESTED: Processing DynamoDB Data With Apache Hive on Amazon EMR
INGESTED: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html

INGESTED: Data Lakes and Analytics on AWS
INGESTED: https://aws.amazon.com/ru/big-data/datalakes-and-analytics/

INGESTED: Amazon Redshift Can Now COPY from Parquet and ORC File Formats
INGESTED: https://aws.amazon.com/ru/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/

How to access and analyze on-premises data stores using AWS Glue
https://aws.amazon.com/ru/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/

INGESTED: Optimizing Performance for Amazon EMR Operations in DynamoDB
INGESTED: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMR_Hive_Optimizing.html

EMR Kinesis
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-kinesis.html

Processing VPC Flow Logs with Amazon EMR
https://aws.amazon.com/ru/blogs/big-data/processing-vpc-flow-logs-with-amazon-emr/

Encryption of Data at Rest for Amazon Elasticsearch Service
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/encryption-at-rest.html

Best Practices for Querying and Scanning Data
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-query-scan.html
