Exploring correlations in large datasets with SQL in Node.js

How quickly we can uncover interesting insights by executing analytics queries with the DuckDB embedded SQL engine, just using Node.js?

Pieter Raubenheimer
3 min readDec 30, 2020

Project Setup

Set up a new Node.js package:

$ mkdir nyc-taxi-correlations && cd nyc-taxi-correlations
$ yarn init -y

Add basic dependencies:

$ yarn add -D typescript ts-node nodemon

Add DuckDB, the star of the show, using DeepCrawl’s node-duckdb:

$ yarn add node-duckdb
✨ Done in 11.66s.

Download data files from TLC Trip Record Data page into a data directory, and start coding (using pre-installed VSCode):

$ mkdir data src
$ code .

Create the basic boiler plate in src/run.ts and start it:

$ yarn nodemon -e ts --exec ts-node src/run

Now, as we save changes to the code, it will re-run and output to the terminal.

Getting to Work

The Correlation Function

A quick search brought me back to the Pearson Correlation Coefficient, which we learned about in Stats 101. It can take a range between -1 and 1, where 0 indicates a weak correlation between the variables and -1 and 1 respectively indicates a strong negative or positive correlation.

I put together a TypeScript function for rendering the calculation:

const r = (x: string, y: string) => 
`covar_pop(${x}, ${y}) / (stddev_pop(${x}) * stddev_pop(${y}))`;

Used in our SELECT statement:

const result = await connection.executeIterator(
`SELECT count(),
${r("tip_amount", "trip_distance")} as tip_distance
FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')`);

Which results in:

{
'count()': 7667792n,
tip_distance: 0.539269014536253,
}

So, it seems there’s at least some positive correlation, albeit not particularly strong, between tip amount and trip distance. I then added in calculations for checking the correlation coefficient with fare (excl. tip) and with number of passengers, which clearly seems to have a very weak correlation:

...
tip_fare: 0.026611783378787134,
tip_passengers: 0.001088787150935159
...

Dataset Performance

We ran this on a single CSV file representing 7.7m trips in one month and it took about 5 seconds to run. What would happen if we add in more CSVs?

With one more CSV:

...
FROM (
SELECT * FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')
UNION
SELECT * FROM read_csv_auto('data/yellow_tripdata_2019-02.csv')
) as trips

We get:

{
'count()': 14687122n,
tip_distance: 0.5327676857033936,
tip_fare: 0.027202256434925805,
tip_passengers: 0.0008883969400021263
}

That covers about 14.7m rows and 1.337GB of CSV data read and calculated in 17.5s, and the correlation coefficients (r-value) still look fairly consistent with what we’ve seen before.

Not the Whole Story

After adding 4 more CSVs, now covering January-July, we notice an inconsistency with previous r-values:

{
'count()': 44459089n,
tip_distance: 0.032890025637241575,
tip_fare: 0.6334371687025606,

tip_passengers: -0.0001533928725559738
}

Where did this difference come from? Perhaps it makes sense to group the data by month.

...
WHERE extract('month' FROM tpep_pickup_datetime) < 7
GROUP BY extract('month' FROM tpep_pickup_datetime)

(We had to add the WHERE clause because with groups we noticed that the dataset contains some odd data for other months too.)

This results in consistent r-values for January, February and April, but with bigger differences for March, June and July:

...
{
month: 3n,
'count()': 7832325n,
tip_distance: 0.028248351449362007,
tip_fare: 0.9713848900417407,
tip_passengers: -0.0001662104560847138
},
...
{
month: 5n,
'count()': 7565359n,
tip_distance: 0.5258277351357648,
tip_fare: 0.5614612919769764,
tip_passengers: -0.002357654989447674
},
{
month: 6n,
'count()': 6940628n,
tip_distance: 0.11414505917858382,
tip_fare: 0.06219687943806697,
tip_passengers: -0.001546597640316774
}
...

What happened with the data in these months? Could tipping behaviour really have been so much different, with an r-value of nearly 1 for tipping by fare? I suggest we explore this further…

Code is available at https://github.com/jupiter/nyc-taxi-correlations/

--

--