# Exploring correlations in large datasets with SQL in Node.js

## How quickly we can uncover interesting insights by executing analytics queries with the DuckDB embedded SQL engine, just using Node.js?

# Project Setup

Set up a new Node.js package:

`$ mkdir nyc-taxi-correlations && cd nyc-taxi-correlations`

$ yarn init -y

Add basic dependencies:

`$ yarn add -D typescript ts-node nodemon`

Add DuckDB, the star of the show, using DeepCrawl’s node-duckdb:

`$ yarn add node-duckdb`

✨ Done in 11.66s.

Download data files from TLC Trip Record Data page into a data directory, and start coding (using pre-installed VSCode):

`$ mkdir data src`

$ code .

Create the basic boiler plate in `src/run.ts`

and start it:

`$ yarn nodemon -e ts --exec ts-node src/run`

Now, as we save changes to the code, it will re-run and output to the terminal.

# Getting to Work

## The Correlation Function

A quick search brought me back to the Pearson Correlation Coefficient, which we learned about in Stats 101. It can take a range between -1 and 1, where 0 indicates a weak correlation between the variables and -1 and 1 respectively indicates a strong negative or positive correlation.

I put together a TypeScript function for rendering the calculation:

`const r = (x: string, y: string) => `

`covar_pop(${x}, ${y}) / (stddev_pop(${x}) * stddev_pop(${y}))`;

Used in our `SELECT`

statement:

`const result = await connection.executeIterator(`

`SELECT count(),

${r("tip_amount", "trip_distance")} as tip_distance

FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')`);

Which results in:

`{`

'count()': 7667792n,

tip_distance: 0.539269014536253,

}

So, it seems there’s at least some positive correlation, albeit not particularly strong, between tip amount and trip distance. I then added in calculations for checking the correlation coefficient with fare (excl. tip) and with number of passengers, which clearly seems to have a very weak correlation:

`...`

tip_fare: 0.026611783378787134,

tip_passengers: 0.001088787150935159

...

## Dataset Performance

We ran this on a single CSV file representing 7.7m trips in one month and it took about 5 seconds to run. What would happen if we add in more CSVs?

With one more CSV:

`...`

FROM (

SELECT * FROM read_csv_auto('data/yellow_tripdata_2019-01.csv')

UNION

SELECT * FROM read_csv_auto('data/yellow_tripdata_2019-02.csv')

) as trips

We get:

`{`

'count()': 14687122n,

tip_distance: 0.5327676857033936,

tip_fare: 0.027202256434925805,

tip_passengers: 0.0008883969400021263

}

That covers about 14.7m rows and 1.337GB of CSV data read and calculated in 17.5s, and the correlation coefficients (r-value) still look fairly consistent with what we’ve seen before.

## Not the Whole Story

After adding 4 more CSVs, now covering January-July, we notice an inconsistency with previous r-values:

`{`

'count()': 44459089n,

** tip_distance: 0.032890025637241575,**

tip_fare: 0.6334371687025606,

tip_passengers: -0.0001533928725559738

}

Where did this difference come from? Perhaps it makes sense to group the data by month.

`...`

WHERE extract('month' FROM tpep_pickup_datetime) < 7

GROUP BY extract('month' FROM tpep_pickup_datetime)

(We had to add the `WHERE`

clause because with groups we noticed that the dataset contains some odd data for other months too.)

This results in consistent r-values for January, February and April, but with bigger differences for March, June and July:

`...`

{

month: 3n,

'count()': 7832325n,

tip_distance: 0.028248351449362007,

tip_fare: 0.9713848900417407,

tip_passengers: -0.0001662104560847138

},

...

{

month: 5n,

'count()': 7565359n,

tip_distance: 0.5258277351357648,

tip_fare: 0.5614612919769764,

tip_passengers: -0.002357654989447674

},

{

month: 6n,

'count()': 6940628n,

tip_distance: 0.11414505917858382,

tip_fare: 0.06219687943806697,

tip_passengers: -0.001546597640316774

}

...

What happened with the data in these months? Could tipping behaviour really have been so much different, with an r-value of nearly 1 for tipping by fare? I suggest we explore this further…

Code is available at https://github.com/jupiter/nyc-taxi-correlations/