Quantcast
Channel: csv | ThatJeffSmith
Viewing all articles
Browse latest Browse all 27

Hours wasted watching Grey’s Anatomy? Finding out with SQL!

$
0
0

I have kids. Our family has had a family Netflix account since 2017. My daughter decided to jump into 18 Seasons of Grey’s Anatomy last year, and has been binging it ever since.

So when I put together a post last week talking about using SQL to analyze Spotify streaming data, my brain quietly whispered to itself, I bet Netflix data would be more fun.

So let’s do it. Let’s move this data into a database, and use SQL to figure out how much time she’s invested in Dr. Meredith Grey, and her friends.

More fun examples, using your own data to learn SQL, build REST APIs, etc.

Untappd || iTunes || Twitter || Strava || Netflix || Spotify


Step 0: Download your Netflix data.

You request it from your online account. Do that in your browser, not on your Netflix app.

If you know where to look, it’s easy to find.

You’ll have to confirm the request. They say it will take up to 30 days. It took only 1 day for me.

Then you’ll get an email saying it’s ready. You’ll following the link, provide your password (again), and download a Zip file.

They provide A LOT of your personal data. From what I can tell, ALL of it. Kudos, Netflix!

We’re going to import that ‘ViewingActivity.csv’ file as a new table in an Oracle Database, so you’ll need one of those. I’ve talked about that before, a lot.

If you don’t have SQL Developer Web, if you DO have an Oracle Database and a copy of Oracle SQL Developer, I’ve shown how to do this same table import here.

Step 1: Import your data

I’m going to use my Always Free Oracle Autonomous Database, and the Database Actions (SQL Developer Web!) interface in my browser to do it.

This will open a wizard, and on the 2nd page, you can define the columns. I’ve left the data type defaults, but have changed the precision of the text fields.

We’re going to treat these ‘duration’ bits of data as INTERVAL types, later.

Note that the wizard has recognized that ‘START_TIME’ is a date, and it’s recommending we bring it in as a TIMESTAMP, and has even recognized the proper format mask to read those values in.

We can click to the end of the wizard, and hit ‘Finish.’

When it’s done processing my 2.5MB of data, I can see it’s imported more than 14,000 rows. Is that sad? I don’t know, the four of us watch a TON of online content.

I simply added this data as a table called ‘NETFLIX’ –

It’s the entire history, not just the last 365 days that Spotify gave me for my request.

Step 2: Start doing SQL

Again, kudos to Netflix for including a ‘data dictionary’ in their personal data package. It’s quite nice –

I’ve highlighted the bits of data I’m going to need in my SQL to answer my ‘question.’
  • Profile – my daughter
  • Duration – how long did she spend watching it
  • Title – anything Grey’s Anatomy
  • Supplemental Video Type – just episodes, not teasers, trailers, bonus content, etc

Starting Simpler, what have I seen recently?

I know MY data, so let’s just make sure this CSV dump is accurate. So I’m going to ask what I’ve watched recently.

Yup, this sounds about right.
SELECT
    START_TIME,
    DURATION,
    TITLE
FROM
    NETFLIX
WHERE
        PROFILE_NAME = 'Jefferson'
    AND SUPPLEMENTAL_VIDEO_TYPE IS NULL
ORDER BY
    START_TIME DESC

DURATION was brought in as a VARCHAR2, or a string. So doing MATHs on this column later will get a bit tricksy. However, the results seem ‘right’ to me. I gave up on re-watching the original Fletch about 30 minutes into it. I just wasn’t feeling it. So yeah, it looks good.

Let’s find my daughter’s ‘raw’ data for Grey’s

This is already really, really fun data. Disclaimer: this is totally a coincidence.
SELECT 
      START_TIME,
      DURATION,
      TITLE
FROM
    NETFLIX
WHERE
        PROFILE_NAME = 'Daughter'
    AND SUPPLEMENTAL_VIDEO_TYPE IS NULL
    AND TITLE LIKE 'Grey''s%';

The only tricky thing here is the TITLE predicate clause. Grey’s Anatomy has an apostrophe, and that’s also the character we use to enclose strings in Oracle SQL. So the ‘trick’ is you have to escape it. One way is to just use an extra quote.

Tip: Perhaps better way to deal with this is to use our built-in function q[].
Docs Link.
LiveLabs Tutorial.

This is very handy, especially when you have a mix of single and double quotes.
SELECT 
      START_TIME,
      DURATION,
      TITLE
FROM
    NETFLIX
WHERE
        PROFILE_NAME = 'Daughter'
    AND SUPPLEMENTAL_VIDEO_TYPE IS NULL
    AND TITLE LIKE q'[Grey's Anatomy%]');

OK, now how do I sum up the duration?

That string is representing what Oracle basically considers an INTERVAL YEAR TO SECOND piece of data. Except in this case it’s only HOURS:MINUTES:SECONDS.

So what I’m going to do is create a VIRTUAL COLUMN that uses this data to create an actual INTERVAL interpretation of this data.

Let’s walk and chew some gum…add the VIRTUAL COLUMN and query the ‘new’ data.

We’re not physically storing new information here. VIRTUAL COLUMNs in an Oracle table allow for that data to be derived from other values in the row.

In this case we have

DURATION 0:13:3

and we want

DURATION 00 0:13:3

ALTER TABLE NETFLIX ADD 
    ( 
     DURATION_INTERVAL INTERVAL DAY TO SECOND
        generated always AS
          (
              TO_DSINTERVAL('00 ' || duration)
          ) virtual
    );

The TYPE of the new VIRTUAL COLUMN is ‘INTERVAL DAY TO SECOND.’ And the way it’s being derived is taking the value of the DURATION column and prefixing it with ’00 ‘

Then with that string being ‘computed’, I’m then sending that as an input to the TO_DSINTERVAL function, which returns an INTERVAL value.

Ok, so now I can just do a SUM on DURATION_INTERVAL, right?

Well, not quite. Unfortunately the database doesn’t give us a SUM() function for the INTERVAL type. So we need to do something else.

From the INTERVAL value, we’re going to extract the total number of minutes. And from that value, we’re going to total that up, and divide by ’60’ to get the total number of hours.

So, ~ 200 hours, 20 minutes.

I asked my wife, does that sound about right? And yes, about 8 straight days of binging over the last 18 months sounded about right.

Ok, let’s look at that code, esp the the SELECT

-- Hours spent watching Grey's Anatomy
SELECT
    SUM(EXTRACT(MINUTE FROM duration_interval))/60 time_wasted
    --title, duration_interval
FROM
    NETFLIX
WHERE
        PROFILE_NAME = 'Daughter'
    AND SUPPLEMENTAL_VIDEO_TYPE IS NULL
    AND TITLE LIKE 'Grey''s%';

It’s always easier to break the nested function calls into the various pieces, so let’s do that.

  • extract(minute from duration_interval) – pull out the minutes portion of time
  • sum sum(extract…) – add this number up
  • /60 – take the sum and divide to get ‘hours’

Wait a second, those pesky seconds really add up, right?

Now, simply writing this post and the description has forced me into a different level of thinking. What about the SECONDS? Over 666 viewings, surely the fractional number of minutes will be significant?

Let’s find out.

Crap, that’s an EXTRA 5 hours.

So if I ignored the seconds, I’d be off ~2.5% in my answer. Is that significant to worry about? Probably not. But you need to KNOW what you’re asking, when you’re writing the SQL. Assumptions DIG HUGE HOLES that you might have to crawl out of later.

So to answer my original question, my daughter has spent ~ 205 hours watching Grey’s Anatomy since May 22nd, 2022.

Kudos to my fellow product managers here on the Database team at Oracle. They helped me a bit with this SQL, especially Michelle with the Extract function().

Connor reminds me that this gets much easier if you take advantage of SQL Macros (introduced in 21c) and even EASIER EASIER once 23c is released.

Disclaimer: I’ve probably spent at least 50 hours watching this show with her. Enough to know the writing got EXTREMELY lazy when it came to killing off the main cast of characters.

Disclaimer 2: if you’re going to leave a nasty comment about my kid’s taste in pop culture entertainment or my parenting skills, just don’t. Feel free to bash my data modeling and SQL skills, as always.

DDL for my TABLE

CREATE TABLE NETFLIX
   (
    PROFILE_NAME VARCHAR2(10), 
    START_TIME TIMESTAMP (6), 
    DURATION VARCHAR2(20), 
    ATTRIBUTES VARCHAR2(256), 
    TITLE VARCHAR2(256), 
    SUPPLEMENTAL_VIDEO_TYPE VARCHAR2(256), 
    DEVICE_TYPE VARCHAR2(256), 
    BOOKMARK VARCHAR2(20), 
    LATEST_BOOKMARK VARCHAR2(20), 
    COUNTRY VARCHAR2(256), 
    DURATION_INTERVAL INTERVAL DAY (2) TO SECOND (6) GENERATED ALWAYS AS (TO_DSINTERVAL('00 '||"DURATION")) VIRTUAL 
   ) ;

Working with Oracle SQL and Temporal Data, The Movie

Watch this, it’s great. Chris is our SQL evangelist, and he’s been working with Connor to answer these types of questions for years.

Pure slice of gold.

The post Hours wasted watching Grey’s Anatomy? Finding out with SQL! first appeared on ThatJeffSmith.


Viewing all articles
Browse latest Browse all 27

Trending Articles