Just Enough Data Weekly Newsletter 11
Technical Webinar with Demo - Apache Iceberg + Dremio on ADLS
Data lakes are designed to store vast amounts of data and seamlessly make it available to users across the organization. However, many organizations run into challenges when trying to use that data and run analytics directly on the lake.
Data lakes need to hide the complexity of underlying data structures and physical storage from end users in order to maximize value.
Lightning-fast SQL Queries + Transactions directly on the Data Lake
SCHEDULE (times in PDT, San Francisco Bay Area):
6:50 Join Zoom; register here to receive the Zoom link: https://www.aicamp.ai/event/eventdetails/W2021102519
7:00 SFbayACM intro, upcoming events, and speaker introduction
7:10 Presentation starts (~60 min with Q&A)
8:10 - 8:30 Wrap-up
TALK DESCRIPTION:
Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of more and more data. A key capability needed to enable more users is the ability to hide the complexity of underlying data structures and physical data storage. The de-facto standard has been the Hive table format, released by Facebook in 2009, which addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg. The Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
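To make that abstraction concrete ahead of the demo, here is a minimal PySpark sketch of creating and querying an Iceberg table, assuming Spark 3.x with the matching iceberg-spark-runtime package on the classpath. The catalog name `demo`, the ADLS warehouse path, and the table `demo.db.events` are illustrative, not from the talk.

```python
# Minimal sketch: an Iceberg table on ADLS via PySpark (names and paths are illustrative).
# Requires the iceberg-spark-runtime package matching your Spark version on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-adls-demo")
    # Register an Iceberg catalog named "demo" backed by a warehouse path.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse",
            "abfss://lake@myaccount.dfs.core.windows.net/warehouse")  # hypothetical ADLS path
    .getOrCreate()
)

# The table format tracks schema and data files in metadata, so users query a
# logical table instead of reasoning about the physical directory layout.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")

# Each write commits a new table snapshot atomically: transactions directly on the lake.
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM demo.db.events").show()
```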
All about Azure from the community experts!
Casting the spell: Druid in practice
Thursday, October 28, 2021, 8:40 AM - 9:10 AM PDT
In this talk, we will share some of the best practices and tips we’ve gathered over the years. We will cover the following topics (see the sketch after this list):
* Data modeling
* Ingestion
* Retention and deletion
* Query optimization
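For a taste of the query side, here is a minimal sketch of issuing Druid SQL over the HTTP API from Python, assuming a Druid router or broker listening on localhost:8888. The `events` datasource and the query itself are hypothetical, not from the talk.

```python
# Minimal sketch: Druid SQL over the HTTP API (POST /druid/v2/sql).
# Assumes a Druid router/broker at localhost:8888; the "events" datasource is hypothetical.
import requests

response = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={
        "query": (
            "SELECT channel, COUNT(*) AS edits "
            "FROM events "
            "GROUP BY channel "
            "ORDER BY edits DESC "
            "LIMIT 5"
        )
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # a list of JSON objects, one per result row
```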
Data Observability 101: Everything You Need to Know to Get Started
What is data observability and does it make sense for your stack? Here’s your go-to guide to starting on the path towards data trust at scale.
One of the most common questions I get from customers is: “How do I get started with data observability?” And for good reason. 🙂
Scheduling and Timetables in Airflow
One of the most fundamental features of Apache Airflow is the ability to schedule jobs. Historically, Airflow users could schedule their DAGs by specifying a schedule_interval with a cron expression, a timedelta, or a preset Airflow schedule.
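As a quick illustration (not from the article), here is a minimal Airflow 2.x DAG showing those three classic schedule_interval forms; Airflow 2.2's custom timetables, the subject of the guide, generalize these. The DAG id and bash task are placeholders.

```python
# Minimal sketch: the three classic schedule_interval forms in Airflow 2.x.
# The DAG id and bash task are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_report",
    start_date=datetime(2021, 10, 1),
    schedule_interval="0 2 * * *",  # cron expression: every day at 02:00
    # schedule_interval=timedelta(hours=24),  # equivalent timedelta form
    # schedule_interval="@daily",             # preset Airflow schedule
    catchup=False,
) as dag:
    BashOperator(task_id="run_report", bash_command="echo 'running report'")
```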
Democratizing the Data Stack—Airflow for Business Workflows
Learn how Hightouch drives action in marketing & sales teams with Reverse ETL, SQL, and Apache Airflow
DataHub Features Overview
DataHub is a modern data catalog built to enable end-to-end data discovery, data observability, and data governance. This extensible metadata platform is built for developers to tame the complexity of their rapidly evolving data ecosystems, and for data practitioners to leverage the full value of data within their organization.
Here’s an overview of DataHub’s current functionality. Curious about what’s to come? Check out our roadmap.
Production is called production for a reason, so be certain before you apply any patch.
To keep you alert: https://www.blef.fr/data-deleted-from-production/