Outline:

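What is Strata?

Strata is a data conference organized by O’Reilly and Cloudera, focused on making use of large-scale data, cutting-edge data science, and new business fundamentals. It is held about three times a year.

This time I attended the Strata held in San Jose.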

Conference highlights

My timeline

Impressions

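  • There were more practical sessions than theoretical ones
  • There were many sessions from Netflix, Pinterest, LinkedIn, Uber, and similar companies, and they were popular
  • Sessions with real-world examples of stream processing, real-time systems, Apache Kafka, Spark, and so on
  • Overall, things seem to split into "data scientist" and "data engineering" camps, in both the attendees and the session content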

Selected sessions

  • Stream processing with Kafka
    • Speaker: Tim Berglund (Confluent)
    • Level: Beginner
    • Audience: Developers who want to use Kafka Streams
    • Learn: Understand Kafka architecture
    • Slide: here
  • 20 Netflix-style principles and practices to get the most out of your data platform
    • Speaker: Kurt Brown (Netflix)
    • Level: Intermediate
    • Audience: Anyone who manages or interacts with (big) data infrastructure
    • Learn: Explore 20 principles and practices to get the most out of your data infrastructure
    • Slide: here
    • Video: here
    • What’s new (for me):
      • Genie, an open-source distributed job orchestration engine developed by Netflix
        • Genie provides RESTful APIs to run a variety of big data jobs such as Hadoop, Pig, Hive, Spark, Presto, Sqoop, and more.
        • It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them.
        • demo
  • Modern real-time streaming architectures
    • Speaker: Karthik Ramasamy (Streamlio, (Twitter)), Arun Kejariwal (MZ, (Twitter))
    • Level: Beginner
    • Audience: Software engineers and engineering managers
    • Learn:
      • Understand stream processing fundamental concepts
      • Explore the different types of streaming architectures along with their pros and cons
    • Slide: here
    • What’s new (for me):
      • Heron, a real-time, distributed, fault-tolerant stream processing engine from Twitter.
      • Apache Pulsar, an open-source distributed pub-sub messaging system originally created at Yahoo and now part of the Apache Software Foundation.
      • Apache BookKeeper, a scalable, fault-tolerant, low-latency storage service optimized for real-time workloads.
      • DataSketches, a library of sketch algorithms for analyzing big data quickly.
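To make the idea of a sketch algorithm a little more concrete, here is a minimal K-Minimum-Values (KMV) distinct-count sketch in plain Python. It is only an illustration of the general technique under my own assumptions: the class `KMVSketch`, the helper `_hash01`, and the parameter `k` are hypothetical names, and this is not the API or implementation of the DataSketches library.

```python
import hashlib
import heapq


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using 64 bits of SHA-1."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Hypothetical K-Minimum-Values sketch: keeps only the k smallest hashes."""

    def __init__(self, k: int = 256) -> None:
        self.k = k
        self._heap = []        # negated hashes, so heap[0] is the largest kept hash
        self._members = set()  # hashes currently kept, used to skip duplicates

    def add(self, item: str) -> None:
        h = _hash01(item)
        if h in self._members:
            return  # already counted
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        """Estimate the number of distinct items seen so far."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


if __name__ == "__main__":
    sketch = KMVSketch(k=256)
    for i in range(20000):
        sketch.add(f"event-{i}")
    print(f"estimated distinct count: {sketch.estimate():.0f}")
```

The memory used is bounded by `k` no matter how many items are streamed in, and the estimate is approximate once more than `k` distinct items have been seen. That trade of exactness for speed and bounded memory is what makes sketches attractive for real-time workloads.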
  • Deep learning-based search and recommendation systems using TensorFlow
    • Speaker: Abhishek Kumar (SapientRazorfish), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
    • Level: Intermediate
    • Audience: Data scientists, data engineers, data architects, and CxOs
    • Learn: Gain an end-to-end view of deep learning-based recommendation and learning-to-rank systems using TensorFlow
    • Slide: here
    • Code: here
    • What’s new (for me):
      • JupyterHub, a multi-user hub that spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server. JupyterHub can be used to serve notebooks to a class of students, a corporate data science group, or a scientific research group.
  • Accelerating development velocity of production ML systems with Docker
    • Speaker: Kinnary Jangla (Pinterest)
    • Level: Intermediate
    • Audience: Machine learning engineers, data scientists, managers working with ML, and site reliability engineers
    • Learn: Explore how Pinterest dockerized the services powering its home feed to accelerate development and decrease operational complexity
    • Slide: here
    • What’s new (for me):
  • The secret sauce behind LinkedIn’s self-managing Kafka clusters
    • Speaker: Jiangjie Qin (LinkedIn)
    • Level: Intermediate
    • Audience: Kafka users and distributed system developers and administrators
    • Learn:
      • Learn how LinkedIn automates its Kafka operation at scale
      • Discover how to model a workload and balance a stateful distributed system at a fine granularity
    • Slide: here
    • What’s new (for me):
      • Cruise Control, presented as the first architecture of its kind to fully automate dynamic workload rebalancing and self-healing of a Kafka cluster. It simplifies the operation of Kafka clusters.
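As a rough illustration of what "modeling a workload and balancing it at a fine granularity" can mean, the toy sketch below treats each partition as a unit of load and greedily moves partitions from the most loaded broker to the least loaded one. This is my own simplification and not Cruise Control's actual algorithm; the function `rebalance`, its parameters, and the broker and partition names are all hypothetical.

```python
from typing import Dict, List, Tuple


def rebalance(
    assignment: Dict[str, Dict[str, float]],
    max_moves: int = 100,
) -> List[Tuple[str, str, str]]:
    """Greedily even out broker load, one partition move at a time.

    `assignment` maps broker -> {partition: load} and is modified in place.
    Returns the list of moves as (partition, source_broker, destination_broker).
    """
    moves: List[Tuple[str, str, str]] = []
    for _ in range(max_moves):
        loads = {broker: sum(parts.values()) for broker, parts in assignment.items()}
        src = max(loads, key=loads.get)  # most loaded broker
        dst = min(loads, key=loads.get)  # least loaded broker
        gap = loads[src] - loads[dst]
        # Only a partition lighter than the gap narrows it when moved.
        candidates = {p: load for p, load in assignment[src].items() if 0 < load < gap}
        if not candidates:
            break  # no single move improves the balance any further
        # Prefer the partition whose load is closest to half the gap.
        part = min(candidates, key=lambda p: abs(candidates[p] - gap / 2))
        assignment[dst][part] = assignment[src].pop(part)
        moves.append((part, src, dst))
    return moves


if __name__ == "__main__":
    cluster = {
        "broker-1": {"p0": 5.0, "p1": 4.0, "p2": 3.0},
        "broker-2": {"p3": 1.0},
        "broker-3": {},
    }
    for partition, source, destination in rebalance(cluster):
        print(f"move {partition}: {source} -> {destination}")
```

A real system would have to weigh several resources at once (for example CPU, disk, and network), respect replica placement constraints, and account for the cost of moving data, none of which this sketch attempts.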
  • Enough data engineering for a data scientist; or, How I learned to stop worrying and love the data scientists
    • Speaker: Stephen O’Sullivan (Data Whisperers)
    • Level: Intermediate
    • Audience: Data scientists and data scientists in training
    • Learn: Gain an understanding of data engineering to improve productivity and the relationship between data scientists and data engineers
    • Slide: here
    • What’s new (for me):
  • Lyft’s analytics pipeline: From Redshift to Apache Hive and Presto
    • Speaker: Shenghu Yang (Lyft)
    • Level: Intermediate
    • Audience: Data engineers, analysts, and data scientists
    • Learn: Explore the evolution of Lyft’s data pipeline, from AWS Redshift clusters to Apache Hive and Presto
    • Slide: here
    • What’s new (for me):
      • Druid is a high-performance, column-oriented, distributed data store.
  • Detecting time series anomalies at Uber scale with recurrent neural networks
    • Speaker: Andrea Pasqua (Uber), Anny Chen (Uber)
    • Level: Intermediate
    • Audience: Data scientists, product managers, and executives
    • Learn: Learn how Uber applies recurrent neural networks to time series analysis
    • Slide: here
    • What’s new (for me):
  • Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously
    • Speaker:
    • Level: Intermediate
    • Audience: Data engineers, software engineers, architects, project managers, machine learning engineers, data scientists, and data users
    • Learn: Learn how Pinterest solved the problem of moving hundreds of terabytes of MySQL data offline on a daily basis to power continuous computation
    • Slide: here
    • What’s new (for me):
  • Big data analytics and machine learning techniques to drive and grow business
    • Speaker: Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)
    • Level: Beginner
    • Audience: Business leaders, researchers, and practitioners
    • Learn:
      • Understand the big data analytics lifecycle
      • Learn how to utilize state-of-the-art techniques to drive and grow business
    • Slide: here
    • What’s new (for me):