Building Better Docs - Automating Jekyll Builds and Link Checking for PRs
One of the most important ways that a project can help its developers is providing them with good documentation. Actually, scratch that: great documentation.
java.lang.ClassNotFoundException: delta.DefaultSource
No great insights in this post, just something for folks who Google this error after me and don’t want to waste three hours chasing their tails… 😄
Here’s a neat little trick you can use with DuckDB to convert a CSV file into a Parquet file:
```sql
COPY (SELECT *
      FROM read_csv('~/data/source.csv', AUTO_DETECT=TRUE))
TO '~/data/target.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
```
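To sanity-check the result, you can query the Parquet file straight back in DuckDB (the path here is the target from the example above):

```sql
-- Confirm the Parquet file is readable and see how many rows it holds
SELECT count(*) FROM read_parquet('~/data/target.parquet');
```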
It all started with a tweet.
What do you do when you want to query over multiple parquet files but the schemas don’t quite line up? Let’s find out 👇🏻
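One approach worth knowing here (a sketch, using DuckDB and made-up file paths for illustration) is `read_parquet` with `union_by_name`, which aligns columns by name across files and fills any column missing from a given file with NULLs:

```sql
-- Query multiple Parquet files whose schemas don't quite line up:
-- union_by_name matches columns by name rather than by position,
-- padding missing columns with NULLs. (Paths are hypothetical.)
SELECT *
FROM read_parquet(['~/data/day1.parquet', '~/data/day2.parquet'],
                  union_by_name = TRUE);
```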
As we enter December and 2022 draws to a close, so does a significant chapter in my working career—later this month I’ll be leaving Confluent and moving on to pastures new.
It’s nearly six years since I wrote a 'moving on' blog entry, and as well as sharing what I’ll be working on next (and why), I also want to reflect on how much I’ve benefited from my time at Confluent and particularly the people with whom I worked.
In my quest to bring myself up to date with where the data & analytics engineering world is at nowadays, I’m going to build on my exploration of the storage and access technologies and look at the tools we use for loading and transforming data.
I started my dbt journey by poking and pulling at the pre-built jaffle_shop demo running with DuckDB as its data store. Now I want to see if I can put it to use myself to wrangle the session feedback data that came in from Current 2022. I’ve analysed this already, but it struck me that a particular part of it would benefit from some tidying up - and be a good excuse to see what it’s like using dbt to do so.
I’ve been wanting to try out dbt for some time now, and a recent long-haul flight seemed like the obvious opportunity to do so. Except many of the dbt tutorials I found were based on using dbt Cloud, and airplane WiFi is generally sucky or non-existent. Then I found the DuckDB-based demo of dbt, which seemed to fit the bill (🦆 geddit?!) perfectly, since DuckDB runs locally. In addition, DuckDB had appeared on my radar recently and I was keen to check it out.
At Current 2022 the audience was given the option to submit ratings. Here’s some analysis I’ve done on the raw data. It’s interesting to poke about it, and it also gave me an excuse to try using DuckDB in a notebook!
This is one of those you had to be there moments. If you come into the world of data and analytics engineering today, ELT is just what it is and is pretty much universally understood. But if you’ve been around for …waves hands… longer than that, you might be confused by what people are calling ELT and ETL. Well, I was ✋.
At Current 22 a few of us will be going for an early run on Tuesday morning. Everyone is very welcome!
With my foray into the current world of data engineering I wanted to get my hands dirty with some of the tools and technologies I’d been reading about. The vehicle for this was trying to understand more about LakeFS, but along the way dabbling with PySpark and S3 (MinIO) too.
I’d forgotten how amazingly useful notebooks are. It’s six years since I wrote about them last (and the last time I tried my hand at PySpark). This blog is basically the notebook, with some more annotations.
As I’ve been reading and exploring the current world of data engineering I’ve been adding links to my Raindrop.io collection, so check that out. In addition, below are some specific resources that I’d recommend.
In this article I look at where we store our analytical data, how we organise it, and how we enable access to it. I’m considering here potentially large volumes of data for access throughout an organisation. I’m not looking at data stores that are used for specific purposes (caches, low-latency analytics, graphs, etc.).
The article is part of a series in which I explore the world of data engineering in 2022 and how it has changed from when I started my career in data warehousing 20+ years ago. Read the introduction for more context and background.
For the past 5.5 years I’ve been head-down in the exciting area of stream processing and events, and I realised recently that the world of data and analytics that I worked in up to 2017, which was changing significantly back then (Big Data, y’all!), has evolved and, dare I say it, matured somewhat—and I’ve not necessarily kept up with it. In this series of posts you can follow along as I start to reacquaint myself with where it’s got to these days.
Airtable is a rather wonderful tool. It powers the program creation backend process for Kafka Summit and Current. It does, however, have a few frustrating limitations—often where it feels like a feature was built on a Friday afternoon and they didn’t get the chance to finish it before knocking off to head to the pub.
If you’ve ever been to a conference, particularly as a speaker who’s submitted a paper that may or may not have been accepted, you might wonder quite how conferences choose the talks that get accepted.
I had the privilege of chairing the program committee for Current and Kafka Summit this year and curating the final program for both. Here’s a glimpse behind the curtains of how we built the program for Current 2022. It was originally posted as a thread on Twitter.
Lightning talks are generally 5-10 minutes. As the name implies - they are quick!
A good lightning talk is not just your breakout talk condensed into a shorter time frame. You can’t simply deliver the same material faster, or the same material at a higher level, or the same material with a few bits left out.
Building the program for any conference is not an easy task. There will always be a speaker disappointed that their talk didn’t get in—or perhaps an audience who are disappointed that a particular talk did get in. As the chair of the program committee for Current 22, one of the things that I’ve found really useful in building out the program this time round is the comments that the program committee left against submissions as they reviewed them.
There were some common patterns I saw, and I thought it would be useful to share these here. Perhaps you’re an aspiring conference speaker looking to understand what mistakes to avoid. Maybe you’re an existing speaker whose abstracts don’t get accepted as often as you’d like. Or perhaps you’re just curious as to what goes on behind the curtains :)