Just some quick notes from this sharing session.

Big Data at Google

Speaker: Apurva Desai

Apurva Desai has over 20 years of experience building software and managing teams, at companies ranging from early-stage startups to large enterprises. For the last 10+ years he has focused on cloud, big data, and distributed computing, starting at Yahoo, where his team managed 20k+ Hadoop nodes and provided solutions for internal projects migrating to the Hadoop ecosystem. At Pivotal Software, a spinoff of EMC, his team built and commercialized Pivotal's Hadoop distribution. Most recently he managed Motorola's mobile phone experiences, powered by a mobile cloud backend running on GCP. Apurva earned his Bachelor of Engineering from the University of Mumbai, India, and his Master of Engineering from Simon Fraser University, Canada.

What does Cloud 3.0 look like?

  • Cloud 2.0 (Assembly Required)
    • VMs
    • Object Store
    • Databases
    • Networking
  • Cloud 3.0
    • Containers
    • Messaging
    • NoSQL
    • Big Data

Complexities of Big Data Processing

  • Programming
  • Resource provisioning
  • Handling growing scale
  • Reliability
  • Deployment & Configuration
  • Utilization improvements
  • Performance tuning
  • Monitoring

But what you should be doing is focusing on the programming and digging into your data, not on the items above.

10+ Years of Tackling Big Data Problems

  • GFS (2002 ~ 2004)
  • MapReduce (2004 ~ 2005)
    • Dataflow (GCP)
  • BigTable (2005 ~ 2006)
    • Apache HBase
    • Hadoop
    • Bigtable (GCP)
  • Dremel (2006 ~ 2008)
    • Apache Drill
    • BigQuery (GCP)
  • PubSub (2008 ~ 2010)
    • Pub/Sub (GCP)
  • FlumeJava (2010 ~ 2012)
    • Apache Crunch
  • MillWheel (2012 ~ 2014)
  • TensorFlow (2014 ~ now)

"Google is living a few years in the future and sending the rest of us messages." - Doug Cutting, Hadoop Co-Creator

Bridging the Waves

  • Capture
    • Cloud 3.0
      • Pub/Sub
      • Logs, App Engine
      • BigQuery streaming
    • Cloud 2.0
      • RabbitMQ
      • Kafka
  • Store
    • Cloud 3.0
      • Cloud Storage (objects)
      • BigQuery Storage (structured)
      • Cloud Bigtable (NoSQL, HBase-compatible)
      • Cloud Datastore (NoSQL)
    • Cloud 2.0
      • Cassandra
      • HBase
      • MongoDB
  • Process
    • Cloud 3.0
      • Cloud Dataflow (stream and batch)
      • Cloud Dataproc
    • Cloud 2.0
      • Hadoop & Ecosystem
        • Spark
        • Hive
  • Analyze
    • Cloud 3.0
      • BigQuery (large scale SQL)
      • Cloud Machine Learning
    • Cloud 2.0
      • Hadoop & Ecosystem
        • Spark
        • Hive
  • Visualize
    • Cloud 3.0
      • Cloud DataLab (Python/Jupyter Notebook)
    • Cloud 2.0
      • Tableau
      • Qlik

Reference Architecture

Apache Beam and Google Cloud Dataflow

  • Dataflow
    • API Interface (SDK)
      • Dataflow model / Beam model
      • You can write your own SDK in any language
  • Apache Beam (a minimal pipeline sketch follows below)
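
To make the model concrete, here is a minimal word-count pipeline using the Beam Python SDK. This is only a sketch, not code from the talk; the gs:// paths are hypothetical, and with no options given it runs locally on the DirectRunner.

```python
# Minimal Apache Beam pipeline (Python SDK) -- a sketch; the gs:// paths are hypothetical.
import apache_beam as beam

with beam.Pipeline() as p:  # no options given, so this uses the local DirectRunner
    (p
     | "Read"   >> beam.io.ReadFromText("gs://my-bucket/input.txt")
     | "Words"  >> beam.FlatMap(lambda line: line.split())
     | "Pair"   >> beam.Map(lambda word: (word, 1))
     | "Count"  >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: "{}: {}".format(kv[0], kv[1]))
     | "Write"  >> beam.io.WriteToText("gs://my-bucket/wordcount/output"))
```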

Beam vs Spark

  • Classic Batch Processing
    • Similar in both Beam and Spark
  • Windowing
    • Beam
      • Window
      • Sum
    • Spark
      • Window & Sum
  • Late Data
    • Beam makes late data easier to handle, via triggers and allowed lateness (see the windowing sketch after this list)
  • Sessions
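
As a rough illustration of the "Window then Sum" style above, here is what fixed windowing with late-data handling looks like in the Beam Python SDK. It is a sketch under the assumption that `events` is an existing PCollection of (key, value) pairs with event-time timestamps attached.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

# Assumes `events` is a PCollection of (key, value) pairs with event-time timestamps.
windowed_sums = (
    events
    | "Window" >> beam.WindowInto(
        window.FixedWindows(60),                     # 1-minute fixed windows
        trigger=AfterWatermark(late=AfterCount(1)),  # re-fire whenever late data shows up
        accumulation_mode=AccumulationMode.ACCUMULATING,
        allowed_lateness=3600)                       # keep window state around for an hour
    | "Sum" >> beam.CombinePerKey(sum))
```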

Google Cloud Dataflow

  • Fully-managed and auto-configured
  • Auto graph-optimized for best execution path
  • Autoscaling mid-job
  • Dynamic Work Rebalancing mid-job
  • Fault-tolerant execution of Beam pipelines (a sketch of the runner options follows below)
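
A sketch of what "fully managed" means in practice: the same Beam pipeline is handed to the Dataflow service just by changing the runner options. The project, bucket, and region below are hypothetical placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project/bucket names -- the pipeline code itself does not change.
options = PipelineOptions(
    runner="DataflowRunner",             # run on the managed Dataflow service
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # staging/temp files for the service
    job_name="wordcount-example")

# beam.Pipeline(options=options) then builds the same pipeline as before,
# while Dataflow handles provisioning, autoscaling, and work rebalancing.
```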

Dataproc - Managed Hadoop + Spark

  1. Start a massive Hadoop or Spark cluster in 90 seconds
  2. Preemptible VMs at ~30% of the cost, plus Custom VMs
  3. Per-minute billing
  4. Separation of Storage + Compute
  5. Incredibly fast networking

Ephemeral clusters - jobs before clusters

  • Deploy Cluster
  • Submit jobs (a sketch of this workflow follows below)
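
A sketch of the "jobs before clusters" workflow, driving the standard gcloud commands from Python. The cluster name, region, and job path are hypothetical; the point is create, submit, delete.

```python
# Ephemeral Dataproc cluster: create it for the job, then throw it away.
# Names and paths below are hypothetical.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

cluster, region = "ephemeral-wordcount", "us-central1"

# 1. Deploy a cluster only when there is work to do
run(["gcloud", "dataproc", "clusters", "create", cluster,
     "--region", region, "--num-workers", "2"])

# 2. Submit the job(s); input/output live in Cloud Storage, not on the cluster
run(["gcloud", "dataproc", "jobs", "submit", "pyspark",
     "gs://my-bucket/jobs/wordcount.py",
     "--cluster", cluster, "--region", region])

# 3. Tear the cluster down; compute billing stops, the data stays put
run(["gcloud", "dataproc", "clusters", "delete", cluster,
     "--region", region, "--quiet"])
```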

Separation of Storage and Compute

  • Choose based on your use cases, e.g. keep data in Cloud Storage so compute clusters can be ephemeral (see the sketch below)
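
A tiny PySpark sketch of what the separation looks like in practice: the data lives in Cloud Storage (gs://), so the cluster doing the compute can come and go. Bucket and paths are hypothetical; the GCS connector is preinstalled on Dataproc clusters.

```python
# Storage (Cloud Storage) separated from compute (a disposable Dataproc cluster).
# Bucket/paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-compute-separation").getOrCreate()

events = spark.read.json("gs://my-bucket/events/2017/")   # read straight from GCS
counts = events.groupBy("event_type").count()
counts.write.mode("overwrite").csv("gs://my-bucket/reports/event-counts")  # results land in GCS too
```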

BigQuery

Fun BigQuery Stats

  • Largest query by rows => 10.5 Trillion rows
  • Largest query by data size => 2.1 PB
  • Largest storage customer => 62 PB
  • Streaming per second => 4.5 million

BigQuery - explained

  1. We just rented ~9000 cores from Google for ~30 seconds
  2. We only paid $20
  3. Most importantly, all of this is hidden from end users.
  4. Users do not think about cores.

What is BigQuery?

  • Serverless, Fully Managed, No-Ops Data Warehouse
  • Petabyte-Scale and Fast
  • Convenience of SQL
  • Externalization of Google Dremel (a small client-library query sketch follows below)
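
A small sketch of the "convenience of SQL" point, using the BigQuery Python client against one of Google's public sample datasets; the thousands of cores behind the scan are invisible to the caller.

```python
# Query BigQuery from Python -- the caller never thinks about cores or clusters.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project/credentials

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():  # blocks until the job finishes
    print(row.word, row.total)
```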

Cloud Pub/Sub - Asynchronous Messaging

  1. A FULLY-MANAGED GLOBAL Publish and Subscribe service (a many-to-many queue)
  2. Seamlessly scales to 1,000,000+ QPS
  3. Guaranteed durable at-least-once delivery
  4. 7-day message acknowledgement window
  5. Simple REST API makes it portable (a publish/subscribe sketch with the Python client follows below)
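
A sketch of the publish and subscribe sides with the Python client library; the project, topic, and subscription names are hypothetical and assumed to already exist.

```python
# Publish to and pull from Cloud Pub/Sub -- names below are hypothetical.
from google.cloud import pubsub_v1

project_id, topic_id, subscription_id = "my-gcp-project", "events", "events-sub"

# Publisher side: messages are byte strings, optionally with attributes.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, b"user signed up", source="web")
print("published message id:", future.result())

# Subscriber side: ack within the deadline, or the message is redelivered
# (at-least-once delivery, so consumers should be idempotent).
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    print("received:", message.data)
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result(timeout=30)  # listen for half a minute, then stop
except Exception:
    streaming_pull_future.cancel()
```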

Case study: Spotify's Event Delivery System

Speaker: Jelena Pješivac-Grbović

Dr. Jelena Pješivac-Grbović is a staff software engineer in Cloud at Google, in Mountain View, CA. She is the lead for MapReduce and one of the leads of the Google Cloud Dataflow project. Jelena's research interests include large-scale data processing, distributed computing, and cloud computing. She is an active member of IEEE, ACM, and SWE.

Introduction

This part basically just walked briefly through the three articles listed in the References below.

  • Event Delivery System
    • High QPS
      • ~700K events/sec at peak

References


Attaching an obligatory "I was here" photo (?)

GCP VIP Customer Workshop



Donation

If this blog post happens to be helpful to you, besides leaving a reply, you might also consider buying me a cup of coffee to support me. Whatever the amount, I would really appreciate it, and it would encourage me to keep writing articles that are helpful to you.

