Exploring correlations in large datasets with SQL in Node.js
How quickly can we uncover interesting insights by executing analytics queries with the DuckDB embedded SQL engine, using just Node.js?
--
Project Setup
Set up a new Node.js package:
$ mkdir nyc-taxi-correlations && cd nyc-taxi-correlations
$ yarn init -y
Add basic dependencies:
$ yarn add -D typescript ts-node nodemon
Add DuckDB, the star of the show, using DeepCrawl’s node-duckdb:
$ yarn add node-duckdb
✨ Done in 11.66s.
Download data files from the TLC Trip Record Data page into a data directory, and start coding (using VSCode):
$ mkdir data src
$ code .
Create the basic boilerplate in src/run.ts and start it:
$ yarn nodemon -e ts --exec ts-node src/run
Now, as we save changes to the code, it will re-run and output to the terminal.
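The boilerplate itself can stay minimal. Here's a sketch of what src/run.ts might contain, based on node-duckdb's DuckDB and Connection classes (check the library's README for the exact API in your version):

import { Connection, DuckDB } from "node-duckdb";

async function run(): Promise<void> {
  // An in-memory database is enough for ad-hoc analytics
  const db = new DuckDB();
  const connection = new Connection(db);

  // Placeholder query; our real queries go here
  const result = await connection.executeIterator("SELECT 1 as ready");
  console.log(result.fetchRow());

  // Release the native resources when done
  connection.close();
  db.close();
}

run().catch(console.error);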
Getting to Work
The Correlation Function
A quick search brought me back to the Pearson Correlation Coefficient, which we learned about in Stats 101. It takes values between -1 and 1, where 0 indicates no linear correlation between the variables, and -1 or 1 indicates a perfect negative or positive correlation.
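In its population form, the coefficient is simply the covariance of the two variables divided by the product of their standard deviations:

r = cov(X, Y) / (σ(X) · σ(Y))

which maps one-to-one onto DuckDB's covar_pop and stddev_pop aggregate functions.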
I put together a TypeScript function for rendering the calculation:
const r = (x: string, y: string) =>
`covar_pop(${x}, ${y}) / (stddev_pop(${x}) * stddev_pop(${y}))`;
Used in our SELECT statement:
const result = await connection.executeIterator(
  `SELECT count(),
      ${r("tip_amount", "trip_distance")} as tip_distance
    FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')`
);
Which results in:
{
  'count()': 7667792n,
  tip_distance: 0.539269014536253
}
So, it seems there's at least some positive correlation between tip amount and trip distance, albeit not a particularly strong one. I then added calculations for the correlation of tip with fare (excl. tip) and with number of passengers, the latter of which clearly has a very weak correlation.
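The extra expressions slot straight into the SELECT list. A sketch, assuming the standard TLC column names fare_amount and passenger_count for the fare and passenger columns:

const result = await connection.executeIterator(
  `SELECT count(),
      ${r("tip_amount", "trip_distance")} as tip_distance,
      ${r("tip_amount", "fare_amount")} as tip_fare,
      ${r("tip_amount", "passenger_count")} as tip_passengers
    FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')`
);

This added the following to the output: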
...
tip_fare: 0.026611783378787134,
tip_passengers: 0.001088787150935159
...
Dataset Performance
We ran this on a single CSV file representing 7.7m trips in one month, and it took about 5 seconds. What would happen if we added more CSVs?
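(The rough timings here come from wrapping each query with Node's built-in console.time/console.timeEnd:)

console.time("query");
const result = await connection.executeIterator(query); // query: the SQL string from above
console.log(result.fetchRow());
console.timeEnd("query"); // logs the elapsed time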
With one more CSV:
...
FROM (
  SELECT * FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')
  UNION
  SELECT * FROM read_csv_auto('data/yellow_tripdata_2019-02.csv')
) as trips
We get:
{
  'count()': 14687122n,
  tip_distance: 0.5327676857033936,
  tip_fare: 0.027202256434925805,
  tip_passengers: 0.0008883969400021263
}
That covers about 14.7m rows and 1.337GB of CSV data, read and calculated in 17.5s, and the correlation coefficients (r-values) still look fairly consistent with what we saw before.
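Each additional month is just one more branch of the UNION, so rather than hand-writing them, a small helper can build the subquery from a list of files (unionCsvs is a hypothetical helper, not part of node-duckdb):

// Builds a UNION of read_csv_auto() selects for the given CSV paths
const unionCsvs = (files: string[]): string =>
  files
    .map((f) => `SELECT * FROM read_csv_auto('${f}')`)
    .join(" UNION ");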
Not the Whole Story
After adding 4 more CSVs, now covering January-June, we notice an inconsistency with the previous r-values:
{
  'count()': 44459089n,
  tip_distance: 0.032890025637241575,
  tip_fare: 0.6334371687025606,
  tip_passengers: -0.0001533928725559738
}
Where did this difference come from? Perhaps it makes sense to group the data by month.
...
WHERE extract('month' FROM tpep_pickup_datetime) < 7
GROUP BY extract('month' FROM tpep_pickup_datetime)
(We had to add the WHERE clause because, once grouped, we noticed that the dataset contains some odd data from other months too.)
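Putting it together, the grouped query might look like this (a sketch, reusing the hypothetical unionCsvs helper from above; the month expression also goes into the SELECT list so that it appears in the output):

const month = "extract('month' FROM tpep_pickup_datetime)";
const files = [
  "data/yellow_tripdata_2019-01.csv",
  // ...one entry per downloaded month
];
const result = await connection.executeIterator(
  `SELECT ${month} as month,
      count(),
      ${r("tip_amount", "trip_distance")} as tip_distance,
      ${r("tip_amount", "fare_amount")} as tip_fare,
      ${r("tip_amount", "passenger_count")} as tip_passengers
    FROM (${unionCsvs(files)}) as trips
    WHERE ${month} < 7
    GROUP BY ${month}
    ORDER BY month`
);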
This results in consistent r-values for January, February and April, but with bigger differences for March, May and June:
...
{
  month: 3n,
  'count()': 7832325n,
  tip_distance: 0.028248351449362007,
  tip_fare: 0.9713848900417407,
  tip_passengers: -0.0001662104560847138
},
...
{
  month: 5n,
  'count()': 7565359n,
  tip_distance: 0.5258277351357648,
  tip_fare: 0.5614612919769764,
  tip_passengers: -0.002357654989447674
},
{
  month: 6n,
  'count()': 6940628n,
  tip_distance: 0.11414505917858382,
  tip_fare: 0.06219687943806697,
  tip_passengers: -0.001546597640316774
}
...
What happened with the data in these months? Could tipping behaviour really have been so different, with an r-value of nearly 1 for tipping by fare? I suggest we explore this further…
Code is available at https://github.com/jupiter/nyc-taxi-correlations/