Duckit

I’ve been throughly duckpilled. I’ve spent the better part of the last year wrangling a LOT of CSVs and let me tell you, duckdb has been right there with me, helping me along the way.

man running from his house yelling 'AHHH IM INSANE WITH ANGER
me, talking about duckdb to anyone who will listen

And after a year of doing this DAY AFTER DAY AFTER DAY

❯ duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D create table t as select * from read_csv('ome_more_csv');

I decided that enough was enough and I made DUCKIT.

BEHOLD MY MONSTROUS CREATION!

duckit() {
  csv=$1

  tmp_db=$(mktemp).db
  cat $1 | duckdb $tmp_db "create table t as select * from read_csv('/dev/stdin')" && duckdb $tmp_db
}

Does it clean up after itself? Of course not!

Does it check inputs? Psh.

What happens if duckdb isn’t installed? You’ll probably die who can tell?

But it lets me do this, which I’ve been dying to do for months now:

❯ duckit free_company_dataset.csv
100% ▕████████████████████████████████████████████████████████████▏ 
v1.1.3 19864453f7
Enter ".help" for usage hints.
D summarize t;
┌──────────────┬─────────────┬─────────────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────┬───────────────┬────────────────────┬────────────────────┬─────────┬─────────┬─────────┬──────────┬─────────────────┐
│ column_name  │ column_type │                                 min                                 │                                      max                                       │ approx_unique │        avg         │        std         │   q25   │   q50   │   q75   │  count   │ null_percentage │
│   varchar    │   varchar   │                               varchar                               │                                    varchar                                     │     int64     │      varchar       │      varchar       │ varchar │ varchar │ varchar │  int64   │  decimal(9,2)   │
├──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────┼───────────────┼────────────────────┼────────────────────┼─────────┼─────────┼─────────┼──────────┼─────────────────┤
│ country      │ VARCHAR     │ afghanistan                                                         │ åland islands                                                                  │           283 │                    │                    │         │         │         │ 22734322 │            9.98 │
│ founded      │ BIGINT      │ 10012024937 │ 2006.3161400677536 │ 24.825956869579304 │ 20032013201822734322 │           60.79 │
│ id           │ VARCHAR     │ 000053ktcMRCSIbGGboE0QlBPQlh                                        │ zzzzgepytuhECxjtjLcHyAGn554K                                                   │      22169331 │                    │                    │         │         │         │ 22734322 │            0.00 │
│ industry     │ VARCHAR     │ "glass                                                              │ writing and editing                                                            │           164 │                    │                    │         │         │         │ 22734322 │           23.83 │
│ linkedin_url │ VARCHAR     │  ceramics & concrete"                                               │ linkedin.com/company/   ashiraj-education-overseas-consultants-private-limited │      26144844 │                    │                    │         │         │         │ 22734322 │            0.00 │
│ locality     │ VARCHAR     │ "aeroporto \"b\""                                                   │ ’aïn el melh                                                                   │        248569 │                    │                    │         │         │         │ 22734322 │           32.66 │
│ name         │ VARCHAR     │  el"                                                                │ 🫧sl-wash🫧 spécialiste station de lavage auto                                 │      24204406 │                    │                    │         │         │         │ 22734322 │            0.13 │
│ region       │ VARCHAR     │                                 registered estate surveyors & val…  │ 🧚♀️maison sérénité opôno  🪷                                                   │        139478 │                    │                    │         │         │         │ 22734322 │           22.99 │
│ size         │ VARCHAR     │ 1-10                                                                │ žilinský                                                                       │          2558 │                    │                    │         │         │         │ 22734322 │            0.38 │
│ website      │ VARCHAR     │ "google.com/maps/place/28°36'36.4\"n+77°01'43.6\"e/@28.610122.77.…  │ 👁👄👁.fm                                                                        │      20771265 │                    │                    │         │         │         │ 22734322 │           30.49 │
├──────────────┴─────────────┴─────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────┴───────────────┴────────────────────┴────────────────────┴─────────┴─────────┴─────────┴──────────┴─────────────────┤
│ 10 rows                                                                                                                                                                                                                                                                                     12 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D select count(0) from t;
┌──────────┐
│ count(0) │
│  int64   │
├──────────┤
│ 22734322 │
└──────────┘

note: don’t scroll all the way to the right, the table got all janked

It only took me 20 years

… to understand how xargs works.

Well, it’s pretty presumptuous of me to say that I understand how xargs works. More accurately, I have a good enough understanding of xargs to be able to use it without immediately running to Google or ChatGPT. Hopefully, dear reader, this will help.

Let’s say I wanna upload a bunch of files that match a regular expression to S3. These were the files I wanted to upload:

❯ ls -al successes-*
-rw-r--r-- 1 mando staff 289743 Jan 16 17:53 successes-00.csv
-rw-r--r-- 1 mando staff 290386 Jan 16 17:52 successes-01.csv
-rw-r--r-- 1 mando staff 287589 Jan 16 17:53 successes-02.csv
-rw-r--r-- 1 mando staff 291275 Jan 16 17:53 successes-03.csv
-rw-r--r-- 1 mando staff 290009 Jan 16 17:53 successes-04.csv
-rw-r--r-- 1 mando staff 149875 Jan 16 17:53 successes-05.csv

Now you’re probably gonna tell me about the --include and --exclude flags to aws s3 cp to which I’ll say “phooey!”.

translation: I tried it and couldn't make it work

Enter: xargs

Here’s the thing I never got that’s helped me better understand what’s going on: all xargs does is take whatever you send it and append them to the end of the command you tell it to call. That’s a terrible explanation but maybe an example will help:

❯ ls success* | xargs echo
successes-00.csv successes-01.csv successes-02.csv successes-03.csv successes-04.csv successes-05.csv successes.csv

All it’s doing is taking the output from ls success* and sending it to echo, turning the command into:

❯ echo successes-00.csv  successes-01.csv  successes-02.csv  successes-03.csv  successes-04.csv  successes-05.csv  successes.csv
successes-00.csv successes-01.csv successes-02.csv successes-03.csv successes-04.csv successes-05.csv successes.csv

Useful I suppose, but doesn’t do a lot for me at the moment. But in all the examples for xargs people be saying -I a lot - what does that do?

❯ ls success* | xargs -I echo
successes-00.csv
successes-01.csv
successes-02.csv
successes-03.csv
successes-04.csv
successes-05.csv
successes.csv

AHA! Instead of taking the inputs and turning them into a list of arguments to the xargs command (echo in this case), it iterates over them one at a time and calls the command with each input in turn. If you toss in -p (for prompt) you’ll see it even clearer:

❯ ls success* | xargs -p -I echo
/bin/echo successes-00.csv?...y
successes-00.csv
/bin/echo successes-01.csv?...y
successes-01.csv
/bin/echo successes-02.csv?...

So let’s try it with aws s3 cp:

❯ ls success* | xargs -p -I aws s3 cp ????????? s3://brainstorm-data/dev/mando/

Astute readers will see the problem: we gotta get the filename where the ????????? is, not at the end of the command like we did with echo.

But fear not! Our new best friend -I is here to help! We can pass some placeholder text to -I and then use that placeholder text in the command to substitute in the argument.

Again, that’s some terrible words but hopefully an example will help:

❯ ls success* | xargs -p -I {} aws s3 cp {} s3://brainstorm-data/dev/mando/
aws s3 cp successes-00.csv s3://brainstorm-data/dev/mando/?...y
upload: ./successes-00.csv to s3://brainstorm-data/dev/mando/successes-00.csv
aws s3 cp successes-01.csv s3://brainstorm-data/dev/mando/?...

When we call xargs like this:

xargs -p -I {} aws s3 cp {} s3://brainstorm-data/dev/mando/

It takes the input (in this case our filename) and replaces the {} with the input! Add in -p as a sanity check and review before your computer friend actually runs the command, and now you’re xargsing.

dfimage and you

Every so often, I’ll find myself with a docker image and no accompanying Dockerfile. When this happens, I usually just panic and try to judo my way around the problem (which is often that the docker container needs to be updated or modified).

Not anymore, friends!

Enter - github.com/LanikSJ/d…

Check the README, but basically you give it a docker image id and it’ll do a pretty decent job of providing a Dockerfile that’ll generate said image. It’s not perfect, but it’ll give you a good enough starting point that probably won’t ruin everything.

👨🏽‍💻