Strata Data Conference 2018

Outline:

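What is Strata?

What is Strata?

Strata is a data conference hosted by O’Reilly and Cloudera, focused on putting big data, cutting-edge data science, and new business fundamentals to use. It is held about three times a year.

This time, I attended the Strata held in San Jose.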

Conference highlights

My timeline

TUESDAY EVENTS (3/6)
- Tutorial: Modern Real Time Streaming Architectures
- Tutorial: Deep Learning Based Search and Recommendation Systems Using TensorFlow
  - code
- Ignite—Join us for a fun, high-energy evening of five-minute talks—all aspiring to live up to the Ignite motto: Enlighten us, but make it quick.
WEDNESDAY EVENTS (3/7)
- Keynote
- Sessions
- Booth Crawl
- Data After Dark: Night at the Market—Join us at San Pedro Square Market for an exciting evening filled with cocktails, food, and live entertainment! Be sure to bring your badge.
THURSDAY EVENTS (3/8)
- Speed Networking—Enjoy casual conversation while meeting fellow attendees.
- Keynote
- Sessions
- Booth Crawl

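Impressions

Most sessions were practical rather than theoretical.
There were many sessions from Netflix, Pinterest, LinkedIn, Uber, and similar companies, and they were popular.
Many sessions covered real-world cases of stream processing, real-time systems, Apache Kafka, Spark, and the like.
Overall, things seemed to split into "data scientist" and "data engineering" camps, both among the attendees and in the session content.

Selected sessions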

Stream processing with Kafka
- Speaker: Tim Berglund (Confluent)
- Level: Beginner
- Audience: Developers who want to use Kafka Streams
- Learn: Understand Kafka architecture
- Slide: here
20 Netflix-style principles and practices to get the most out of your data platform
- Speaker: Kurt Brown (Netflix)
- Level: Intermediate
- Audience: Anyone who manages or interacts with (big) data infrastructure
- Learn: Explore 20 principles and practices to get the most out of your data infrastructure
- Slide: here
- Video: here
- What’s new (for me):
  - Genie, open source distributed job orchestration engine developed by Netflix
    - Genie provides REST-ful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop and more.
    - It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them.
    - demo
Modern real-time streaming architectures
- Speaker: Karthik Ramasamy (Streamlio, (Twitter)), Arun Kejariwal (MZ, (Twitter))
- Level: Beginner
- Audience: Software engineers and engineering managers
- Learn:
  - Understand stream processing fundamental concepts
  - Explore the different types of streaming architectures along with their pros and cons
- Slide: here
- What’s new (for me):
  - Heron, a realtime, distributed, fault-tolerant stream processing engine from Twitter.
  - Apache Pulsar, an open-source distributed pub-sub messaging system originally created at Yahoo and now part of the Apache Software Foundation.
    - about pulsar slide
  - Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads.
  - Data Sketches, for analyzing big data quickly with sketch algorithms.
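To get a feel for what a sketch algorithm does, here is a minimal example of my own. It is not taken from the talk or from the Data Sketches library. It is a K-Minimum-Values (KMV) estimator that approximates the number of distinct items in a stream while keeping only k hash values in memory. The function names, the use of SHA-1, and the default k = 256 are assumptions made for illustration.

```python
import hashlib
import heapq


def hash_to_unit_interval(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using SHA-1."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k: int = 256) -> float:
    """Estimate the number of distinct items with a K-Minimum-Values sketch."""
    heap = []        # negated hashes, so heap[0] is the largest of the k smallest
    members = set()  # hash values currently held in the sketch
    for item in items:
        h = hash_to_unit_interval(item)
        if h in members:
            continue  # duplicate of a value already in the sketch
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(heap))
    kth_smallest = -heap[0]
    # Standard KMV estimator: (k - 1) divided by the k-th smallest hash value.
    return (k - 1) / kth_smallest


if __name__ == "__main__":
    # Hypothetical stream: 50,000 events drawn from 10,000 distinct user ids.
    stream = (f"user-{i % 10000}" for i in range(50000))
    print(round(kmv_estimate(stream)))
```

The relative standard error of KMV is roughly 1/sqrt(k), so a larger k trades memory for accuracy.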
Deep learning-based search and recommendation systems using TensorFlow
- Speaker: Abhishek Kumar (SapientRazorfish), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
- Level: Intermediate
- Audience: Data scientists, data engineers, data architects, and CxOs
- Learn: Gain an end-to-end view of deep learning-based recommendation and learning-to-rank systems using TensorFlow
- Slide: here
- Code: here
- What’s new (for me):
  - JupyterHub, a multi-user Hub, spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server. JupyterHub can be used to serve notebooks to a class of students, a corporate data science group, or a scientific research group.
Accelerating development velocity of production ML systems with Docker
- Speaker: Kinnary Jangla (Pinterest)
- Level: Intermediate
- Audience: Machine learning engineers, data scientists, managers working with ML, and site reliability engineers
- Learn: Explore how Pinterest dockerized the services powering its home feed to accelerate development and decrease operational complexity
- Slide: here
- What’s new (for me):
The secret sauce behind LinkedIn’s self-managing Kafka clusters
- Speaker: Jiangjie Qin (LinkedIn)
- Level: Intermediate
- Audience: Kafka users and distributed system developers and administrators
- Learn:
  - Learn how LinkedIn automates its Kafka operation at scale
  - Discover how to model a workload and balance a stateful distributed system at a fine granularity
- Slide: here
- What’s new (for me):
  - Cruise Control is the first of its kind to fully automate dynamic workload rebalancing and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
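To make "workload rebalancing" concrete, here is a toy example of my own. It is not Cruise Control's actual algorithm or API. It assumes each partition has a single numeric load and greedily moves partitions from the most loaded broker to the least loaded one. The function name `rebalance`, the broker and partition names, and the load values are all hypothetical.

```python
def rebalance(broker_partitions, max_moves=100):
    """Greedily move partitions from the most loaded broker to the least loaded one.

    broker_partitions maps a broker name to {partition name: load}. The mapping
    is modified in place; the list of (partition, source, target) moves is returned.
    """
    moves = []
    if len(broker_partitions) < 2:
        return moves  # nothing to balance against
    for _ in range(max_moves):
        totals = {b: sum(p.values()) for b, p in broker_partitions.items()}
        source = max(totals, key=totals.get)
        target = min(totals, key=totals.get)
        gap = totals[source] - totals[target]
        # A move narrows the gap only if the partition's load is smaller than the gap.
        candidates = {
            name: load
            for name, load in broker_partitions[source].items()
            if 0 < load < gap
        }
        if not candidates:
            break  # no single move improves the balance any further
        # Prefer the partition whose load is closest to half the gap.
        partition = min(candidates, key=lambda name: abs(candidates[name] - gap / 2))
        broker_partitions[target][partition] = broker_partitions[source].pop(partition)
        moves.append((partition, source, target))
    return moves


if __name__ == "__main__":
    # Hypothetical cluster: one hot broker, one lightly loaded, one empty.
    cluster = {
        "b1": {"t-0": 50, "t-1": 30, "t-2": 20},
        "b2": {"t-3": 10},
        "b3": {},
    }
    for move in rebalance(cluster):
        print(move)
    print({b: sum(p.values()) for b, p in cluster.items()})
```

A real system has to balance several resources at once (for example disk, network, and CPU) and respect replica placement constraints, none of which this single-metric heuristic models.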
Enough data engineering for a data scientist; or, How I learned to stop worrying and love the data scientists
- Speaker: Stephen O’Sullivan (Data Whisperers)
- Level: Intermediate
- Audience: Data scientists and data scientists in training
- Learn: Gain an understanding of data engineering to improve productivity and the relationship between data scientists and data engineers
- Slide: here
- What’s new (for me):
  - Data Formats: Parquet, ORC
Lyft’s analytics pipeline: From Redshift to Apache Hive and Presto
- Speaker: Shenghu Yang (Lyft)
- Level: Intermediate
- Audience: Data engineers, analysts, and data scientists
- Learn: Explore the evolution of Lyft’s data pipeline, from AWS Redshift clusters to Apache Hive and Presto
- Slide: here
- What’s new (for me):
  - Druid is a high-performance, column-oriented, distributed data store.
Detecting time series anomalies at Uber scale with recurrent neural networks
- Speaker: Andrea Pasqua (Uber), Anny Chen (Uber)
- Level: Intermediate
- Audience: Data scientists, product managers, and executives
- Learn: Learn how Uber applies recurrent neural networks to time series analysis
- Slide: here
- What’s new (for me):
Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously
- Speaker:
- Level: Intermediate
- Audience: Data engineers, software engineers, architects, project managers, machine learning engineers, data scientists, and data users
- Learn: Learn how Pinterest solved the problem of moving hundreds of terabytes of MySQL data offline on a daily basis to power continuous computation
- Slide: here
- What’s new (for me):
Big data analytics and machine learning techniques to drive and grow business
- Speaker: Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)
- Level: Beginner
- Audience: Business leaders, researchers, and practitioners
- Learn:
  - Understand the big data analytics lifecycle
  - Learn how to utilize state-of-the-art techniques to drive and grow business
- Slide: here
- What’s new (for me):