Just Enough Data Weekly Newsletter 14
Hello Everyone!
We are in the middle of a crisis, and the job market is very difficult. With layoffs happening all around the world, I want to do my bit for the community.
Recently I came across one of the coolest interview guides to help you prepare for your next challenge.
software-engineering-interview-guide
lakeFS 💛 DuckDB: Why is DuckDB all the rage in the Data Community?
https://www.meetup.com/lets-talk-data-sf/events/292111441/
DuckDB is coming to SF! Join us as we explore the possibilities of how lakeFS and DuckDB can improve your database management!
--------
🤝Organizer : DuckDB & lakeFS
📍Location: Trellis Co-working & Events (981 Mission St, San Francisco)
🍕Catering: Pizza & Drinks
Different Choices to Orchestrate Your Pipelines
You may have seen a lot of discussion lately around alternative solutions to Airflow.
Framing the topic as "an alternative to Airflow" would be wrong, though. Every business has its own requirements and goals, and we can't always fit the same solution to every problem.
Put another way: there can be multiple valid solutions to the same problem.
So, have you heard about Mage?
Here is my medium blog talking about it.
Batch Processing vs Stream Processing
You heard it right. The batch vs. stream debate is still not over.
One of the most important decisions organizations make when it comes to data processing is whether to use stream or batch processing. Stream processing is quickly becoming the go-to option for many companies because of its ability to provide real-time insights and immediate actionable results. With the right stream processing platform, companies can easily unlock the value of their data and use it to gain a competitive edge. This article explores why stream processing is taking over, including its advantages over batch processing, such as scalability, cost-effectiveness, and flexibility.
And here is the blog which talks more about it.
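To make the distinction concrete, here is a toy contrast in plain Python (no framework, just an illustration): the same running-total computation done once over the full dataset versus emitted one event at a time.

```python
def batch_total(events):
    """Batch: wait until the whole dataset has arrived, then compute once."""
    return sum(events)

def stream_totals(events):
    """Stream: emit an updated result after every event arrives."""
    total = 0
    for e in events:
        total += e
        yield total  # an actionable, up-to-date answer per event

events = [3, 1, 4, 1, 5]
print(batch_total(events))          # 14
print(list(stream_totals(events)))  # [3, 4, 8, 9, 14]
```

The batch version gives one answer with the lowest bookkeeping cost; the streaming version gives an answer after every event, which is exactly the "real-time insight" trade-off the article discusses.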
The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse
Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that provide the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain.
Introducing Quix Streams, an open source library for telemetry data streaming
Lightweight, powerful, no JVM and no need for separate clusters of orchestrators. Here’s a look at our next-gen streaming library for C# and Python developers including feature summaries, code samples, and a sneak peek into our roadmap.
Spark On AWS Lambda
Apache Spark on AWS Lambda is a standalone installation of Spark running on AWS Lambda. Spark is packaged in a Docker container, and AWS Lambda executes the image along with the PySpark script. Currently, using Apache Spark for event-driven pipelines or streams of smaller files requires heavier engines like Amazon EMR, AWS Glue, or Amazon EMR Serverless. When processing smaller files under 10 MB per payload, these engines incur resource overhead costs and operate more slowly (even slower than pandas). This container-based strategy processes the data on a single node, lowering the overhead of spinning up numerous nodes. According to customers, Apache Spark on AWS Lambda is worth considering for event-based pipelines with smaller files if you're looking for a less expensive option.
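A rough sketch of what such a Lambda handler could look like. This is an illustration, not the project's actual code: the function names and the `s3a://` read are assumptions, and PySpark is imported lazily because it only exists inside the container image that bundles Spark.

```python
def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 trigger event payload."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]

def handler(event, context):
    # Imported lazily: pyspark is only available inside the Lambda
    # container image that packages Spark (hypothetical setup).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # single-node Spark inside the Lambda sandbox
        .appName("spark-on-lambda")
        .getOrCreate()
    )
    for bucket, key in parse_s3_event(event):
        df = spark.read.json(f"s3a://{bucket}/{key}")
        df.show()  # placeholder for the real transformation
    spark.stop()
    return {"statusCode": 200}
```

The key idea is that Spark runs in `local[*]` mode on the single Lambda node, so there is no cluster to provision or tear down per event.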
Serverless Spark CI/CD on AWS with GitHub Actions
Interested in combining Amazon EMR Serverless and GitHub Actions to continuously test and deploy your PySpark code to AWS? In this session, Damon will talk about how to configure unit tests when a push is made, run quick data quality checks when a pull request is opened, and automatically package and deploy tagged releases to Amazon S3 for use in your Spark jobs.
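The flow Damon describes could be sketched as a workflow file along these lines. Every job name, bucket, and path below is a placeholder of mine, not something from the session:

```yaml
# Hypothetical workflow sketch: test on push, deploy tagged releases to S3.
name: pyspark-ci
on:
  push:
  release:
    types: [published]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install pyspark pytest
      - run: pytest tests/        # unit tests on every push
  deploy:
    if: github.event_name == 'release'
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: aws s3 cp jobs/ s3://my-emr-artifacts/ --recursive  # placeholder bucket
```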
Twitter has open-sourced its tweet recommendation algorithm
As repeatedly promised by Twitter CEO Elon Musk, Twitter has opened a portion of its source code to public inspection, including the algorithm it uses to recommend tweets in users’ timelines.
You are the only one who knows your worth. You do not need validation from anyone.
Ajith Shetty
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.
Medium Subscribe✉️ ||More blogs📝||LinkedIn📊||Profile Page📚||Git Repo👓