As part of the third wave of the Canadian branch of the JRP project, we’re figuring out what types of AI tools might help us automate the data capture of the Canadian news websites included in our sites of study.
And we’re going to share the development of our process, our plan, and our code right here on this blog. Ideally, we might even be able to create a network of people working on similar projects who can all support each other.
Our goal is to capture a specific section of the homepages of our sites of study at specific times of day on each of our 14 data collection days in 2026, and automatically populate a Google Drive folder with those homepage captures. We also want to automatically generate PDFs with working links for all of the news stories those homepages display, also loaded into a Google folder (these stories will be the ones we use for content analysis). And, then, we want to generate a spreadsheet that automatically populates with the headlines, authors/bylines, and whether or not there is any mention of the use of AI in all of the news stories that are captured.
Our lead developer/code writer is Nujaimah Ahmed, who is majoring in Computer Engineering with a specialization in Artificial Intelligence at TMU – most of what you read below comes from her. But there will also be input from her project supervisor, TMU prof Ang Misri, and less technical/more process-related input from Dr. Nicole Blanchett, also a prof at TMU, who wrote this introduction.
The blog is linear – so what happened first will always be at the top, and you’ll have to scroll down to see the latest iterations of the AI tools being developed/tested/used.
Follow along on GitHub and with our documentation of our process below, and let us know if you have a suggestion or question!
PROCESS UPDATE OCTOBER 20, 2025
First Attempts at CBC Capture
- Coded a Python script that captures a screenshot of the CBC news page and saves it to a local directory/Google Drive folder (see the sketch after this list)
- Tested the same process using an existing automated tool (Browse AI):
- Implements the same process, with a UI element to visualize what is happening during the capture
- Can only capture entire page, visible part of page, or certain selections on the page
- Each capture costs 1 credit
- 50 monthly credits for free
- Different subscription tiers available if more credits are needed
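Here's a rough sketch of what that kind of capture script can look like. We're using Playwright with Chromium as one plausible choice (the library, output path, and viewport size here are assumptions for illustration, not necessarily the exact setup):

```python
# Minimal homepage-capture sketch (assumed library: Playwright).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Headless mode is blocked by CBC (see the update log below),
    # so the browser is launched with a visible window.
    browser = p.chromium.launch(headless=False)
    # A fresh browser context behaves like an incognito session;
    # the desktop viewport size is an assumption.
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    page.goto("https://www.cbc.ca/news", wait_until="networkidle")
    # full_page=True captures the whole scrollable page, not just the viewport.
    page.screenshot(path="cbc-capture.png", full_page=True)
    browser.close()
```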
| | Python Script | Browse AI |
|---|---|---|
| Pros | More control over what is captured: how far down the page to go, which components to exclude, etc. | No code, so no script to run. Centralized (all captures can be done on one account and will be saved under that account). UI component with a robot assistant that visually shows which page is being loaded and captured |
| Cons | Need to run the script. Could run into some automation limitations | Can’t set a boundary for where the screen capture should stop. Maximum of 50 captures per month |
Update Log:
| Attempt | Problem | Solution |
|---|---|---|
| 1 | Errors from website protections not allowing screen captures in “headless” mode | Adjusted browser settings to work around them without a “headless” browser |
| 2 | Only the desktop view of the news page was being captured (the visible part) | Change settings to capture the full scrollable page |
| 3 | The full page was being captured, including “My Local” and ads | Find location of “My Local” on page and create page boundaries |
| 4 | The sidebar of “Popular Now” topics wasn’t showing up | Wait longer for page to fully load, and make the image capture area wider |
| 5 | .png file was being captured instead of .pdf | Implement same script for a pdf capture |
| 6 | Browser opens on computer during each capture due to issues with running “headless” mode (CBC detects automated browsing and blocks loading of content) | Keep “headless” mode off but minimize the window immediately after launching so the browser runs in the background, or run the script on a virtual machine |
| 7 | PDF capture contains the full page content, including “My Local” and ads (the viewport crop function is only available for PNG capture, not PDF) | Manually delete the pages that are not required |
| 8 | Google Cloud is restricted for TMU accounts, making APIs unusable | Use a personal non-TMU account to store the captures |
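The fix from attempts 3 and 4 (cropping the capture above “My Local”) could look something like the sketch below: locate the section heading, then clip the screenshot to everything above it. The locator here is a guess; the real one depends on inspecting CBC’s actual markup.

```python
# Sketch of cropping the capture above the "My Local" section
# (assumed library: Playwright; the locator is hypothetical).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page(viewport={"width": 1920, "height": 1080})
    page.goto("https://www.cbc.ca/news", wait_until="networkidle")
    # Hypothetical locator for the "My Local" section heading.
    my_local = page.get_by_text("My Local").first
    box = my_local.bounding_box()  # page coordinates of the heading, or None
    if box:
        # Clip from the top of the page down to where "My Local" begins.
        page.screenshot(
            path="cbc-capture.png",
            clip={"x": 0, "y": 0, "width": 1920, "height": box["y"]},
        )
    else:
        # Fall back to the full page if the heading isn't found.
        page.screenshot(path="cbc-capture.png", full_page=True)
    browser.close()
```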
Final Workflow:
- Launch Chromium browser
- Create an incognito tab
- Navigate to https://www.cbc.ca/news and wait for page to load
- Find location of “My Local” on news page and create boundaries for capture
- Capture web page as png and pdf file
- Save files to a Google Drive folder “cbc-captures”
Things done:
- Screenshot of news page up until “My Local” section
- Incognito mode
- PDF and PNG capture
Things to do next:
- Automate daily captures so the script doesn’t have to be run manually each time
  - Automation is different for Windows and Mac and needs to be set up on each device (Windows: Task Scheduler; Mac: macOS Launch Agents)
- Save images to a Google Drive folder (see the upload sketch after this list)
  - Restricted access to Google Cloud on TMU accounts (unable to use the Google Drive API) -> need to use a personal account
  - All images saved to a personal Google Drive folder “cbc-capture”
- Do headless capturing (no browser opens during screen capture; it happens in the background)
  - Headless capturing not possible due to CBC restrictions on automation tasks (automated browsers are blocked from accessing content) -> automatically minimize the browser immediately after launch and close it after completion
- PDF instead of PNG
  - PDF captures the full page instead of stopping at “My Local” -> a few pages need manual deletion
  - PDF capture takes more time than PNG (approx. 2 minutes longer)
  - Kept both options for now
- Create a code-free automated solution: a cloud web platform any user can easily access (no need to run the script on your own computer)
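For the Google Drive step, a minimal upload sketch using the official google-api-python-client might look like this. It assumes OAuth has already been set up on a personal (non-TMU) account; the credentials file and folder ID below are hypothetical placeholders:

```python
# Sketch of uploading a capture to Google Drive (personal, non-TMU account).
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# Assumes a token.json produced by an earlier OAuth consent flow.
creds = Credentials.from_authorized_user_file(
    "token.json", scopes=["https://www.googleapis.com/auth/drive.file"]
)
drive = build("drive", "v3", credentials=creds)

media = MediaFileUpload("cbc-capture.png", mimetype="image/png")
drive.files().create(
    body={"name": "cbc-capture.png", "parents": ["CBC_CAPTURES_FOLDER_ID"]},
    media_body=media,
    fields="id",
).execute()
```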
Ideas:
- Central google account for captures only
- Web platform hosted on cloud
Screenshot Example:

Newspaper Python package
- Multi-threaded article download framework
- News url identification
- Text extraction from html
- Top image extraction from html
- All image extraction from html
- Author extraction from text
- Google trending terms extraction
- Works in 10+ languages (English, Chinese, German, Arabic, …)
- NLP feature for keyword and summary extraction
Tests:
- Extract article urls on main page
- Extract the category that each article belongs to
- Extract title of each article
- Author and published date errors due to CBC news article metadata
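The tests above follow the package’s standard pattern: build a news “source” from the homepage, then download and parse each article. A condensed sketch (the 10-article cap is just to keep a test run short):

```python
# Sketch of the Newspaper (newspaper3k) test run described above.
import newspaper
from newspaper import Article

# Build the article list from the CBC news homepage.
cbc = newspaper.build("https://www.cbc.ca/news", memoize_articles=False)

for entry in cbc.articles[:10]:  # limit to 10 articles for a quick test
    article = Article(entry.url)
    article.download()
    article.parse()
    print("Title:", article.title)
    print("Authors:", article.authors)
    print("Published Date:", article.publish_date)
    print("—")
```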
Example Output:
```
Title: Canada’s bet on an AI boom
Authors: ['Cbc News']
Published Date: None
—
Title: Democracies must remember their own values when tackling borders, says U.S. official
Authors: ['Cbc Radio']
Published Date: None
—
Title: Think Stonehenge rocks? Ken Follett’s new novel is for you
Authors: ['Cbc Books']
Published Date: None
—
Title: None
Authors: []
Published Date: None
—
Title: Mad at Dad by Janie Hao wins CBC Kids Reads 2025
Authors: ['Cbc Books']
Published Date: None
—
Title: Restaurants are bringing the heat as spicy dishes attract trend-chasing customers
Authors: []
Published Date: None
—
Title: Are you polychronic or monochronic? Struggling to manage your time could be due to your ‘time personality’
Authors: ['Catherine Zhu Is A Writer', 'Associate Producer For Cbc Radio. Her Reporting Interests Include Science', 'Arts', 'Culture', "Social Justice. She Holds A Master'S Degree In Journalism The University Of British Columbia. You Can Reach Her At Catherine.Zhu Cbc.Ca."]
Published Date: None
—
Title: What was girlhood like in the early 2000s? Read these graphic memoirs to find out
Authors: ['Bridget Raymundo Is A Multimedia Journalist', 'Producer Currently Working At Cbc Books. You Can Reach Her At Bridget.Raymundo Cbc.Ca']
Published Date: None
—
Title: Librarian fired after refusing to censor 2SLGBTQ+ books wins $700K US settlement
Authors: []
Published Date: None
—
Title: 50 Years of Quirks & Quarks and half a century of science
Authors: ["Bob Mcdonald Is The Host Of Cbc Radio'S Award-Winning Weekly Science Program", 'Quirks', 'Quarks. He Is Also A Science Commentator For Cbc News Network', "Cbc Tv'S The National. He Has Received Honorary Degrees", 'Is An Officer Of The Order Of Canada.', "Bob Mcdonald'S Recent Columns"]
Published Date: None
```
PROCESS UPDATE OCTOBER 30, 2025
Individual Story Capture and Spreadsheet Generation
So, this week we ran into a few stumbling blocks with our AI story-capture tests.
Our goal was to capture PDFs with live links, including video and audio links. We were able to make this work on the homepage PDFs, but not the individual stories. Text links work on the individual stories – but not the links to video and audio clips.
When Nujaimah checked the HTML of the story webpages, it turned out that, unlike the homepage, they don’t have the video links embedded within the actual article. Instead, the individual story pages have reference links for video and audio, which are not clickable in the exported PDFs of individual stories. To get around this, Nujaimah plans to write/run a script that goes through all the articles on the homepage, captures all playback video links, and then copies them into the spreadsheet.
The goal is to have them populate the same spreadsheet that will automatically capture headlines and authors of stories. If that works out, the video/audio links will automatically be matched up with the appropriate story in a column/row on the spreadsheet.
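Here’s a sketch of what that link-collection step might look like, assuming the playback links can be found by pattern-matching hrefs (the selector is a guess; the real one would come from inspecting CBC’s story HTML, and the URL is a hypothetical placeholder):

```python
# Sketch of collecting playback audio/video links from one story page
# (assumed library: Playwright; URL and selector are hypothetical).
from playwright.sync_api import sync_playwright

story_url = "https://www.cbc.ca/news/example-story"  # hypothetical story URL

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto(story_url, wait_until="networkidle")
    # Guess: reference links pointing at CBC's media player pages.
    media_links = page.eval_on_selector_all(
        "a[href*='/player/']", "els => els.map(e => e.href)"
    )
    print(media_links)  # these would be copied into the spreadsheet
    browser.close()
```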
And there WAS some good news on the spreadsheet front. Nujaimah’s code generated a fantastic looking spreadsheet with a test capture. You can see a screen grab of it in the image below.

Other columns that might be added into this include any email addresses or social handles given for the reporter; whether the story was created by a third party/news agency; and if there is any indication AI was used in the development/writing of the story.
For more details about Nujaimah’s process, read her workflow below and check out her latest code on GitHub.
Nujaimah’s Workflow for Individual Story Capture and Automated Generation of a Google Spreadsheet
Story Capture
Capture Process:
- Incorporate a loop in the original homepage script that gathers the links of all the stories on the homepage
- Iterate through each individual story page and capture a PDF screenshot
- Save captures to a Google Drive folder
Story Data Process:
- During the capture of each individual story, gather the title, author, date published, and links to the story’s audio/video
- Copy the data into a Google Sheet, one row per story (see the sketch after this list)
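One way to do that spreadsheet step is with gspread (the library here is an assumption, and the sheet name, service-account file, and row values are placeholders):

```python
# Sketch of appending one row per story to a Google Sheet via gspread.
import gspread

# Assumes a Google service account with access to the target sheet.
gc = gspread.service_account(filename="service-account.json")
sheet = gc.open("CBC Story Captures").sheet1

# One row per story: title, author(s), date published, media links.
sheet.append_row([
    "Example headline",                    # title
    "Example Author",                      # author(s)
    "2025-10-30",                          # date published
    "https://www.cbc.ca/player/example",   # hypothetical playback link
])
```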
Update Log:
| Attempt | Problem | Solution |
|---|---|---|
| 1 | PDF capture not capturing links of audio/videos included in individual stories | Iterate through each individual story, locate all links to playback audio/video in the HTML, and copy them to the spreadsheet, matched with the story link |
| 2 | Newspaper Python package not extracting authors (author name is not included as part of the CBC metadata) | Individually extract story data by locating HTML tags used for authors in the web page |
| 3 | Third-party authors not being extracted (e.g. Associated Press, CBC News, Thomson Reuters) | In progress |
Updated Workflow (for homepage and individual story captures):
- Launch Chromium browser
- Create incognito tab
- Navigate to https://www.cbc.ca/news and wait for page to load
- Capture pdf screenshot of homepage
- Save screenshot to a Google Drive folder
- Retrieve links to all articles appearing on https://www.cbc.ca/news
- Iteratively create separate tabs for each article
- Capture a pdf screenshot and extract data (title, author, date posted, etc.)
- Save screenshots of all articles to the same Google Drive folder
- Save all extracted data to a Google Sheet
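Condensed into code, the homepage-to-stories loop in that workflow might look like the sketch below. The story-link selector is an assumption, and note that Playwright’s page.pdf() only works in headless Chromium, so this sketch saves a full-page PNG per story as a stand-in rather than showing the project’s actual PDF route:

```python
# Sketch of the updated workflow: homepage capture, then each story.
# (Assumed library: Playwright; the link selector is hypothetical.)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://www.cbc.ca/news", wait_until="networkidle")

    # Hypothetical selector: collect links to individual story pages.
    hrefs = page.eval_on_selector_all(
        "a.card-link", "els => els.map(e => e.href)"
    )

    for i, href in enumerate(sorted(set(hrefs))):
        story = browser.new_page()  # a separate tab per article
        story.goto(href, wait_until="networkidle")
        # page.pdf() requires headless Chromium, so save a PNG instead here.
        story.screenshot(path=f"story-{i}.png", full_page=True)
        story.close()

    browser.close()
```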
PROCESS UPDATE NOVEMBER 10, 2025
Spreadsheet generation, story capture, and collaboration
We’re making headway with the automated generation of our spreadsheets for the stories we capture, including the fix of one glitch.
We successfully – and by we I mean Nujaimah 🙂 – generated information about third-party authorship in the spreadsheet. By third-party, I mean content that is posted on one news site but was actually produced by another media organization or news agency.
However, the third-party information replaced the authors in the spreadsheet. So, now, we’re working towards making sure that both the authors and the organization the content comes from are listed if a third party is involved.
Nujaimah has also fixed an issue that came up in testing, where only one author name was being recorded when there were multiple authors in the byline. You can see what she did to rectify that problem in her working notes below and in her code, which is available on GitHub.
Our next steps include finishing up the process/methodology for capturing links for audio and video, and figuring out how we’re going to try and automate the identification of whether AI tools were used to help write/develop the story.
One issue that complicates that goal is that the terminology for identifying the use of AI is not consistent across our sites of study, or news organizations in general. For example, AI can be referred to as robots, bots, or a specific tool (e.g., ChatGPT), or less specifically as content coming from a particular department that uses AI without directly identifying it in the story.
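As a starting point, a naive keyword scan could flag stories for manual review, with all the caveats above (it will both over- and under-match, and the keyword list below is illustrative, not our settled methodology):

```python
# Naive sketch: flag possible mentions of AI use in story text.
import re

# Illustrative keyword list; real terminology varies across outlets.
AI_PATTERNS = re.compile(
    r"\b(artificial intelligence|AI|ChatGPT|chatbot|bot|robot|automated)\b",
    re.IGNORECASE,
)

def mentions_ai(text: str) -> bool:
    """Return True if the text contains any AI-related keyword."""
    return bool(AI_PATTERNS.search(text))

print(mentions_ai("This story was produced with the help of ChatGPT."))  # True
```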
We have also identified that housing all of our captured stories is going to be a bit less straightforward than expected. Institutional firewalls prevent us from directly uploading the stories scraped with AI into any Google Drive housed by TMU. So, we’re going to pay for some extra storage using our JRP Canada account, and that is where all of the stories we collect through AI capture will first land.
Another next step will be seeing how the current capture methodology works on our French site of study, La Presse. Nujaimah is testing that now.
And we’ll end with some exciting news! Our JRP colleagues in the U.K., working out of Bournemouth University, and in Argentina, working out of Universidad Torcuato Di Tella, are going to start experimenting with our code to see if it works/how they can modify it for their own data captures.
Once they’ve done some tests, they’ll share their adventures in the comments so we can build our network of knowledge for researchers developing capture methods.
Nujaimah’s Workflow to capture authors and third-party contributions in the spreadsheet
| Attempt | Problem | Solution |
|---|---|---|
| 1 | Third-party authors not being extracted (e.g. Associated Press, CBC News, Thomson Reuters) | Modify the HTML tag used for extracting to include the full contents |
| 2 | Articles with multiple authors, only extracting the first name | Modify the HTML tag used for extracting to include the full contents |
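The multiple-author fix in the table above amounts to collecting the text of every byline element rather than just the first match. A sketch, where the CSS selector is a guess at CBC’s markup and the URL is a placeholder:

```python
# Sketch of extracting all authors from a story page, not just the first.
# (Assumed library: Playwright; selector and URL are hypothetical.)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://www.cbc.ca/news/example-story")  # hypothetical URL
    # Hypothetical selector matching every byline element on the page.
    authors = page.eval_on_selector_all(
        ".authorText", "els => els.map(e => e.textContent.trim())"
    )
    print(", ".join(authors))
    browser.close()
```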
Data Capture Workflow (for individual story captures):
- Navigate to individual story links
- Extract author(s), title, date posted, as well as additional author information (if available for third party authors)
- Populate extracted data in a Google Sheet
