Just Enough Data Weekly Newsletter 11
Technical Webinar with Demo - Apache Iceberg + Dremio on ADLS
Data lakes are designed to store vast amounts of data and seamlessly make it available to users across the organization. However, many organizations run into challenges when trying to use that data and run analytics directly on the lake.
Data lakes need to hide the complexity of underlying data structures and physical storage from end users in order to maximize value.
Lightning-fast SQL Queries + Transactions directly on the Data Lake
SCHEDULE (times in PDT, San Francisco Bay Area):
6:50 Join Zoom; register here to receive the Zoom link: https://www.aicamp.ai/event/eventdetails/W2021102519
7:00 SFbayACM intro, upcoming events, and speaker introduction
7:10 Presentation starts (~60 min with Q&A)
8:10 - 8:30 Wrap-up
TALK DESCRIPTION:
Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of more and more data. A key capability needed to enable more users is the ability to hide the complexity of underlying data structures and physical data storage. The de-facto standard has been the Hive table format, released by Facebook in 2009, which addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg. The Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
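To make that abstraction concrete ahead of the demo, here is a minimal PySpark sketch of creating and querying an Iceberg table, assuming Spark 3.x with the matching iceberg-spark-runtime package on the classpath. The catalog name `demo`, the ADLS warehouse path, and the table `demo.db.events` are illustrative, not from the talk.

```python
# Minimal sketch: an Iceberg table on ADLS via PySpark (names and paths are illustrative).
# Requires the iceberg-spark-runtime package matching your Spark version on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-adls-demo")
    # Register an Iceberg catalog named "demo" backed by a warehouse path.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse",
            "abfss://lake@myaccount.dfs.core.windows.net/warehouse")  # hypothetical ADLS path
    .getOrCreate()
)

# The table format tracks schema and data files in metadata, so users query a
# logical table instead of reasoning about the physical directory layout.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")

# Each write commits a new table snapshot atomically: transactions directly on the lake.
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM demo.db.events").show()
```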
All about Azure from the community experts!
Casting the spell: Druid in practice
Thursday, October 28, 2021, 8:40 AM - 9:10 AM PDT
In this talk, we will share some of the best practices and tips we’ve gathered over the years. We will cover the following topics (see the sketch after this list):
* Data modeling
* Ingestion
* Retention and deletion
* Query optimization
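For a taste of the query side, here is a minimal sketch of issuing Druid SQL over the HTTP API from Python, assuming a Druid router or broker listening on localhost:8888. The `events` datasource and the query itself are hypothetical, not from the talk.

```python
# Minimal sketch: Druid SQL over the HTTP API (POST /druid/v2/sql).
# Assumes a Druid router/broker at localhost:8888; the "events" datasource is hypothetical.
import requests

response = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={
        "query": (
            "SELECT channel, COUNT(*) AS edits "
            "FROM events "
            "GROUP BY channel "
            "ORDER BY edits DESC "
            "LIMIT 5"
        )
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # a list of JSON objects, one per result row
```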
Data Observability 101: Everything You Need to Know to Get Started
What is data observability and does it make sense for your stack? Here’s your go-to guide to starting on the path towards data trust at scale.
One of the most common questions I get from customers is: “How do I get started with data observability?” And for good reason. 🙂
Scheduling and Timetables in Airflow
One of the most fundamental features of Apache Airflow is the ability to schedule jobs. Historically, Airflow users could schedule their DAGs by specifying a schedule_interval with a cron expression, a timedelta, or a preset Airflow schedule.
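As a quick illustration (not from the article), here is a minimal Airflow 2.x DAG showing those three classic schedule_interval forms; Airflow 2.2's custom timetables, the subject of the guide, generalize these. The DAG id and bash task are placeholders.

```python
# Minimal sketch: the three classic schedule_interval forms in Airflow 2.x.
# The DAG id and bash task are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_report",
    start_date=datetime(2021, 10, 1),
    schedule_interval="0 2 * * *",  # cron expression: every day at 02:00
    # schedule_interval=timedelta(hours=24),  # equivalent timedelta form
    # schedule_interval="@daily",             # preset Airflow schedule
    catchup=False,
) as dag:
    BashOperator(task_id="run_report", bash_command="echo 'running report'")
```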
Democratizing the Data Stack—Airflow for Business Workflows
Learn how Hightouch drives action in marketing & sales teams with Reverse ETL, SQL, and Apache Airflow
DataHub Features Overview
DataHub is a modern data catalog built to enable end-to-end data discovery, data observability, and data governance. This extensible metadata platform is built for developers to tame the complexity of their rapidly evolving data ecosystems, and for data practitioners to leverage the full value of data within their organization.
Here’s an overview of DataHub’s current functionality. Curious about what’s to come? Check out our roadmap.
Production is called production for a reason, so be certain before you apply any patch.
To keep you alert: https://www.blef.fr/data-deleted-from-production/