Snowflake Load Performance

· 702 words · 4 minute read

We are using the Singer target transferwise-target-snowflake to load data into our warehouse. It works well, but is slow out of the box, so I wanted to check out some config options. There are some variables I want to adjust, and a few others that I'll define as outside the scope of this test and keep static.

The static factors

For this test, there are a few things that we won’t change:

  • We are loading data from local CSV files, using tap-spreadsheets-anywhere.
  • We are using a small Azure VM (Standard DS1 v2: 1 vCPU, 3.5 GiB memory) running Ubuntu.
  • We will only use the transferwise variant this time.

We have 20 CSV files, each 250 MB in size (very uniformly so, since they are synthetic), totalling 5 GB and 5 242 880 lines. Each file has 10 columns of random text (looking kind of like md5 hashes).
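The post doesn't include the generator, but a minimal sketch of how such files could be produced might look like this (the function name, column names, and row counts below are my own assumptions, not the original script):

```python
import csv
import hashlib
import os

def write_synthetic_csv(path, rows, cols=10):
    """Write a CSV of md5-like random hex strings: a hypothetical
    stand-in for the synthetic test data described above."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"col_{i}" for i in range(cols)])
        for _ in range(rows):
            # Hash 16 random bytes to get a 32-char hex string per cell.
            writer.writerow(
                [hashlib.md5(os.urandom(16)).hexdigest() for _ in range(cols)]
            )

# Tiny demo; the real files would have 262 144 rows each (5 242 880 / 20).
write_synthetic_csv("demo.csv", rows=100)
```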

Because of the way this test is set up, there are no duplicates, so any difference between updates and inserts isn't measured.

The Snowflake account and the VM are used by me alone, so it is unlikely that any foreign processes will randomly consume resources and skew the results.

The possibly non-static factors

There are a number of things whose performance impact I want to check:

  • Batch size: This target uses a table stage, and by default each batch is 100 000 rows. What happens if we increase this?
  • Warehouse size: I usually only run the smallest kind, will we save by upgrading?
  • Ingest file type: A new, experimental feature lets us use Parquet files with an external stage. Rumor has it this has a noticeable effect.
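Assuming a Meltano project drives the pipeline (the post doesn't say which runner is used), the batch size knob would be set roughly like this in `meltano.yml`. The `batch_size_rows` option name is from memory of the pipelinewise-style target and worth double-checking against the target's README:

```yaml
# Hedged sketch; plugin and option names are assumptions.
plugins:
  loaders:
    - name: target-snowflake
      variant: transferwise
      config:
        batch_size_rows: 500000   # default is 100000
```

Warehouse size is not a target setting; it is changed on the Snowflake side, e.g. `ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';` (where `my_wh` is a placeholder name).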

Establishing a baseline

It isn't feasible to run this baseline multiple times: it takes quite some time, and I want a little volume here, so we will have to make do with high volume and a single iteration. I use the time command to measure total time; I have not set up a more elaborate test harness.

The baseline will be the default, out-of-the-box setup: a CSV file stage, a batch size of 100k, and an XS warehouse.

The iterations

Ideally we'd want to test combinations of three different things:

  • Warehouse size: possible values range from XS to 6XL; due to cost we'll do XS and M.
  • Batch size: We don’t have all that many rows in total, so we’ll go for 100k, 500k and 1M.
  • Stage file type: csv or parquet.

In total, this is 2 × 3 × 2 = 12 iterations. We might have to be a little selective.
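As a sanity check on the arithmetic, the full grid can be enumerated (the variable names here are mine):

```python
from itertools import product

warehouses = ["XS", "M"]
batch_sizes = [100_000, 500_000, 1_000_000]
stage_file_types = ["csv", "parquet"]

# Every combination of the three non-static factors.
grid = list(product(warehouses, batch_sizes, stage_file_types))
print(len(grid))  # 2 x 3 x 2 = 12 combinations
```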

File stage

As mentioned, we use the time command to time the runs. time reports real, user, and sys times; the times reported here are real times.

| Warehouse size \ batch size | 100k | 500k | 1M |
| --------------------------- | ---- | ---- | -- |

The 1M load failed on file #4, which is roughly where the run would first accumulate 1M rows and have to write.
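The arithmetic checks out: with 262 144 rows per file (5 242 880 / 20), the 1M threshold is first crossed during file #4:

```python
rows_per_file = 5_242_880 // 20   # 262_144 rows per file
batch_size = 1_000_000

# Walk through the files until the batch buffer would first flush.
total = 0
for file_no in range(1, 21):
    total += rows_per_file
    if total >= batch_size:
        print(f"first flush during file #{file_no}")  # file #4
        break
```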

Parquet stage

It turns out that Parquet stages are only available with AWS S3, and since I am not on AWS, I will skip this one for now.

| Warehouse size \ batch size | 100k | 500k | 1M |
| --------------------------- | ---- | ---- | -- |

Bonus: the meltanolabs variant

Meltanolabs has a different version of the target-snowflake loader, with some other ways of loading. It does not use table stages, so it has no batch_size option. It does have a "batch" mode, which requires a tap that can emit batch messages as well; that should be quick, but I don't think I have a compatible tap that can read local CSV files.

Real time: 127m48.398s

Pretty much exactly twice as long as transferwise, in other words. Is there a sad trombone emoji around here somewhere?

Conclusion

Basically, adjusting the knobs on the transferwise version of target-snowflake, or the warehouse size, won't do much, but rumor has it that if you are on AWS, an external stage with the new experimental Parquet option improves load time.

And if performance is key, the meltanolabs variant is not for you. If you are able to emit batch messages, on the other hand… it will probably fly.