As part of the third wave of the Canadian branch of the JRP project, we’re figuring out what types of AI tools might help us automate the data capture of the Canadian news websites included in our sites of study.
And we’re going to share the development of our process, our plan, and our code right here on this blog. Ideally, we might even be able to create a network of people working on similar projects who can all support each other.
Our goal is to capture a specific section of the homepages of our sites of study at specific times of day on each of our 14 data collection days in 2026, and automatically populate a Google Drive folder with those homepage captures. We also want to automatically generate PDFs with working links for all of the news stories those homepages display, also loaded into a Google folder (these stories will be the ones we use for content analysis). Finally, we want to generate a spreadsheet that automatically populates with the headline, author/byline, and any mention of the use of AI for every news story that is captured.
Our lead developer/code writer is Nujaimah Ahmed, who is majoring in Computer Engineering with a specialization in Artificial Intelligence at TMU – most of what you read below comes from her. But there will also be input from her project supervisor, TMU prof Ang Misri, and less technical/more process-related input from Dr. Nicole Blanchett, also a prof at TMU, who wrote this introduction.
The blog is linear – so what happened first will always be at the top, and you’ll have to scroll down to see the latest iterations of the AI tools being developed/tested/used.
Follow along on GitHub and with our documentation of our process below, and let us know if you have a suggestion or question!
PROCESS UPDATE OCTOBER 20, 2025
First Attempts at CBC Capture
- Coded a Python script that captures a screenshot of the CBC news page and saves it to a local folder or a Google Drive folder
- Tested same process using already existing automated tool (Browse AI):
- Runs the same capture process, with a UI element that visualizes what is happening during the capture
- Can only capture entire page, visible part of page, or certain selections on the page
- Each capture costs 1 credit
- 50 monthly credits for free
- Different subscription tiers available if more credits are needed
Aspect | Python Script | Browse AI |
---|---|---|
Pros | More control over what is captured: how far down the page the capture extends, which components are excluded, etc. | No code and no script to run; centralized (all captures are made from, and saved under, one account); UI component with a robot assistant that visually shows which page is being loaded and captured |
Cons | The script has to be run; could run into some automation limitations | Cannot set how far down the page the capture extends; maximum of 50 captures per month on the free tier |
Update Log:
Attempt | Problem | Solution |
---|---|---|
1 | Errors from website protections not allowing screen captures in “headless” mode | Adjusted browser settings to work around them without a “headless” browser |
2 | Only the desktop view of the news page was being captured (the visible part) | Change settings to capture the full scrollable page |
3 | The full page was being captured, including “My Local” and ads | Find location of “My Local” on page and create page boundaries |
4 | The sidebar of “Popular Now” topics wasn’t showing up | Wait longer for page to fully load, and make the image capture area wider |
5 | .png file was being captured instead of .pdf | Implement same script for a pdf capture |
6 | Browser opens on computer during each capture due to issues with running “headless” mode (CBC detects automated browsing and blocks loading of content) | Keep “headless” mode off but minimize the window immediately after launching so the browser runs in the background, or run the script on a virtual machine |
7 | PDF capture contains the full page, including “My Local” and ads (the viewport crop function is only available for PNG capture, not PDF) | Manually delete the pages that are not required |
8 | Google Cloud is restricted for TMU accounts, making APIs unusable | Use a personal non-TMU account to store the captures |
Final Workflow:
- Launch Chromium browser
- Create an incognito tab
- Navigate to https://www.cbc.ca/news and wait for page to load
- Find location of “My Local” on news page and create boundaries for capture
- Capture web page as png and pdf file
- Save files to a Google Drive folder “cbc-captures”
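The PNG part of the workflow above could be sketched roughly as follows. This is a minimal sketch, not the project's actual script: it assumes the Playwright library (the post does not name the browser automation tool used), and the `clip_above` and `capture` names, the 1600-pixel width, the 5-second wait, and the idea of locating the section by its visible text are all illustrative assumptions. The PDF step and the Google Drive upload are left out.

```python
from pathlib import Path

URL = "https://www.cbc.ca/news"


def clip_above(marker_box, page_width):
    """Return a screenshot clip covering the page from the top down to the marker."""
    # marker_box is a dict with x, y, width, height (the shape returned by
    # Playwright's bounding_box()); everything above its top edge is kept.
    return {"x": 0, "y": 0, "width": page_width, "height": marker_box["y"]}


def capture(out_dir="cbc-captures", page_width=1600):
    """Save a PNG of the news page, cut off where "My Local" begins."""
    # Imported here so clip_above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        # Headed mode, since the update log reports that headless runs are blocked.
        browser = p.chromium.launch(headless=False)
        # A fresh browser context is isolated, much like an incognito window.
        context = browser.new_context(viewport={"width": page_width, "height": 900})
        page = context.new_page()
        page.goto(URL, wait_until="load")
        page.wait_for_timeout(5000)  # extra time for late-loading sections (a guess)
        # Assumption: the section can be found by its visible text. The page is
        # not scrolled, so the box's y value is also its offset from the page top.
        box = page.get_by_text("My Local").first.bounding_box()
        if box is None:
            raise RuntimeError("'My Local' marker is not visible on the page")
        page.screenshot(
            path=str(out / "cbc-news.png"),
            full_page=True,
            clip=clip_above(box, page_width),
        )
        browser.close()
```

Keeping the boundary calculation in its own small function means it can be checked without opening a browser, and a wider `page_width` is one way to bring a sidebar into the capture area.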
Things done:
- Screenshot of news page up until “My Local” section
- Incognito mode
- PDF and PNG capture
Things to do next:
- Automate daily captures without having to constantly run the script
  - Automation is set up differently on Windows and Mac and needs to be configured on each device
  - Windows: Task Scheduler, Mac: macOS Launch Agents
- Save images to a Google Drive
  - Restricted access to Google Cloud on TMU account (unable to use Google Drive API) -> need to use a personal account
  - All images saved to a personal Google Drive folder “cbc-capture”
- Do headless capturing (no browser opens during screen capture; it happens in the background)
  - Headless capturing not possible due to CBC restrictions on automation (users are blocked from accessing data with automated browsers) -> automate minimizing the browser immediately after launch and closing it automatically after completion
- PDF instead of png
  - PDF captures the full page instead of stopping at “My Local” -> needs manual deletion of a few pages
  - PDF capture takes more time than png (approx. 2 minutes longer)
  - Kept both options for now
- Create a code-free automated solution: cloud web platform for any user to easily access (don’t need the script to run on your own computer)
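For the Mac side of the scheduling item above, a Launch Agent is described by a small property list file. The sketch below builds one with Python's standard `plistlib`; the label, file paths, and run times are made-up placeholders, not the project's real settings.

```python
import plistlib


def launch_agent_plist(label, python_path, script_path, times):
    """Build a macOS LaunchAgent definition that runs a script at fixed times."""
    agent = {
        "Label": label,
        "ProgramArguments": [python_path, script_path],
        # One entry per daily run; launchd uses the machine's local time.
        "StartCalendarInterval": [{"Hour": h, "Minute": m} for h, m in times],
    }
    return plistlib.dumps(agent)


# Hypothetical values for illustration only.
plist_bytes = launch_agent_plist(
    "ca.example.cbc-capture",
    "/usr/bin/python3",
    "/Users/me/capture_cbc.py",
    [(8, 0), (17, 30)],
)
print(plist_bytes.decode("utf-8"))
```

The resulting text would be saved under `~/Library/LaunchAgents/` and loaded with `launchctl load`. On Windows, Task Scheduler plays the same role and is configured separately.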
Ideas:
- Central google account for captures only
- Web platform hosted on cloud
Screenshot Example:

Newspaper Python package
- Multi-threaded article download framework
- News url identification
- Text extraction from html
- Top image extraction from html
- All image extraction from html
- Author extraction from text
- Google trending terms extraction
- Works in 10+ languages (English, Chinese, German, Arabic, …)
- NLP feature for keyword and summary extraction
Tests:
- Extract article urls on main page
- Extract the category that each article belongs to
- Extract title of each article
- Author and published date errors due to CBC news article metadata
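Tests like the ones listed above might look roughly like this with the `newspaper` package. It is a sketch under assumptions: the helper names and the `limit` parameter are illustrative, and reading the category from the URL path is a guess about how CBC structures its links rather than a feature of the package.

```python
from urllib.parse import urlparse


def category_from_url(url):
    """Guess an article's category from its URL path (assumed URL layout)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        return None  # a section front or homepage, not an article
    if parts[0] == "news" and len(parts) > 2:
        return parts[1]  # e.g. /news/politics/<story>
    return parts[0]  # e.g. /books/<story>


def format_record(title, authors, publish_date):
    """Render one article in the Title / Authors / Published Date layout."""
    return f"Title: {title}\nAuthors: {authors}\nPublished Date: {publish_date}"


def scrape_front_page(url="https://www.cbc.ca/news", limit=10):
    """Download and parse the first few articles linked from the page."""
    # Imported lazily so the helpers above work without the package installed.
    import newspaper

    # memoize_articles=False makes repeat runs return every article again.
    site = newspaper.build(url, memoize_articles=False)
    records = []
    for article in site.articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception:  # download and parse failures vary by package version
            continue
        records.append(
            format_record(article.title, article.authors, article.publish_date)
        )
    return records
```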
Example Output:
Title: Canada’s bet on an AI boom
Authors: [‘Cbc News’]
Published Date: None
—
Title: Democracies must remember their own values when tackling borders, says U.S. official
Authors: [‘Cbc Radio’]
Published Date: None
—
Title: Think Stonehenge rocks? Ken Follett’s new novel is for you
Authors: [‘Cbc Books’]
Published Date: None
—
Title: None
Authors: []
Published Date: None
—
Title: Mad at Dad by Janie Hao wins CBC Kids Reads 2025
Authors: [‘Cbc Books’]
Published Date: None
—
Title: Restaurants are bringing the heat as spicy dishes attract trend-chasing customers
Authors: []
Published Date: None
—
Title: Are you polychronic or monochronic? Struggling to manage your time could be due to your ‘time personality’
Authors: [‘Catherine Zhu Is A Writer’, ‘Associate Producer For Cbc Radio. Her Reporting Interests Include Science’, ‘Arts’, ‘Culture’, “Social Justice. She Holds A Master’S Degree In Journalism The University Of British Columbia. You Can Reach Her At Catherine.Zhu Cbc.Ca.”]
Published Date: None
—
Title: What was girlhood like in the early 2000s? Read these graphic memoirs to find out
Authors: [‘Bridget Raymundo Is A Multimedia Journalist’, ‘Producer Currently Working At Cbc Books. You Can Reach Her At Bridget.Raymundo Cbc.Ca’]
Published Date: None
—
Title: Librarian fired after refusing to censor 2SLGBTQ+ books wins $700K US settlement
Authors: []
Published Date: None
—
Title: 50 Years of Quirks & Quarks and half a century of science
Authors: [“Bob Mcdonald Is The Host Of Cbc Radio’S Award-Winning Weekly Science Program”, ‘Quirks’, ‘Quarks. He Is Also A Science Commentator For Cbc News Network’, “Cbc Tv’S The National. He Has Received Honorary Degrees”, ‘Is An Officer Of The Order Of Canada.’, “Bob Mcdonald’S Recent Columns”]
Published Date: None
—
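Several of the author lists above look like a reporter's biography that has been split on commas. One possible cleanup, shown here only as a heuristic sketch, is to keep the text before the first "Is A / Is An / Is The" phrase and drop the rest of the list. The `clean_authors` name and the marker pattern are assumptions based on the sample output, and a story with two bylined reporters would lose the second name.

```python
import re

# Phrase that tends to start a biography sentence right after a name (heuristic).
BIO_MARKER = re.compile(r"\s+Is\s+(?:A|An|The)\s+", re.IGNORECASE)


def clean_authors(raw_authors):
    """Reduce a biography-polluted author list to the leading name."""
    for entry in raw_authors:
        parts = BIO_MARKER.split(entry, maxsplit=1)
        if len(parts) == 2:
            # The list is a biography split on commas; keep only the name.
            return [parts[0].strip()]
    # No biography marker found: leave the list as it was.
    return list(raw_authors)


print(clean_authors(["Catherine Zhu Is A Writer", "Arts", "Culture"]))
print(clean_authors(["Cbc News"]))
```

Organizational bylines such as "Cbc News" pass through unchanged, so telling those apart from named reporters would still need a separate rule.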