Home Scraping Together a Triathlon Training Plan
Post
Cancel

Scraping Together a Triathlon Training Plan

I signed up to compete in a half-Ironman(tm) triathlon on September 8, 2024. Completing a half-Ironman(tm) involves:

  • 1.2 mile swim
  • 56 mile bike
  • 13.1 mile run

It’s nice to have a solid training plan to prepare for such a demanding event, and many kind folks have compiled and published these online for public reference. Rather than constantly referring to a cluttered website, though, I wanted to put all of the recommended workouts on my calendar. I like my calendar.

By the power of python, beautifulsoup, and pandas, I’ve done just that - scraped my selected training plan from the web and re-arranged the information into a format which can be imported into Google Calendar, as I prefer.

How do?

Retrieve and parse the HTML, referring to my browser’s Inspect utility to analyse relevant page structure.

1
2
3
4
5
6
7
8
9
10
11
12
import requests
import bs4
import pandas as pd

URL = 'https://www.triathlete.com/training/super-simple-ironman-70-3-triathlon-training-plan/'
RACE_DAY = '9/8/2024'
OUT_FILE = '~/Desktop/sc.csv'

response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text)
tables = soup.find_all('table')
len(tables)
1
17

Prep data frame and append a new row for each day in each week-table (all tables except the first table).

1
2
3
4
5
6
7
8
9
10
11
12
13
df = pd.DataFrame({'dow': [], 'wo1': [], 'wo2': [], 'wo3': []})

for i, t in enumerate(tables):
    if i == 0: continue
    # print(f'Week {i}')
    
    trs = t.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        # print(' | '.join(td.text for td in tds))
        df.loc[df.shape[0]] = [td.text for td in tds] + ([''] if i < 16 else [])
        
df.head()
dowwo1wo2wo3
0MondayRest
1TuesdayBike 40 minutes moderate with 4 x 30-second sp...
2WednesdaySwim 800 yards total. Main set: 8 x 25 yards, ...Run 4 miles moderate + 2 x 10-second hill spri...
3ThursdayBike 40 minutes moderate.
4FridaySwim 800 yards total. Main set: 3 x 100 yards ...Run 4 miles moderate.

Given the race day as the end date, back-fill to assign dates until training day 1.

1
2
df['start date'] = pd.date_range(end=RACE_DAY, periods=df.shape[0])
df.head()
dowwo1wo2wo3start date
0MondayRest2024-05-20
1TuesdayBike 40 minutes moderate with 4 x 30-second sp...2024-05-21
2WednesdaySwim 800 yards total. Main set: 8 x 25 yards, ...Run 4 miles moderate + 2 x 10-second hill spri...2024-05-22
3ThursdayBike 40 minutes moderate.2024-05-23
4FridaySwim 800 yards total. Main set: 3 x 100 yards ...Run 4 miles moderate.2024-05-24

I want each workout to have its own event, e.g. multiple workouts in one day should result in multiple events that day.

So, melt the multiple wo workout columns into just one, in a long format with repeated dates. Then, drop rows with empty descriptions, so days where there are fewer workouts don’t have extra empty events.

1
2
3
4
5
6
7
pivot = df.melt(
    id_vars='start date', 
    value_name='description', 
    value_vars=['wo1', 'wo2', 'wo3']
)
pivot_reduced = pivot.drop('variable', axis=1).drop(pivot[pivot['description'] == ''].index)
pivot_reduced.head()
start datedescription
02024-05-20Rest
12024-05-21Bike 40 minutes moderate with 4 x 30-second sp...
22024-05-22Swim 800 yards total. Main set: 8 x 25 yards, ...
32024-05-23Bike 40 minutes moderate.
42024-05-24Swim 800 yards total. Main set: 3 x 100 yards ...

Use the first word of each workout description as its event subject/title, and re-sort by date for visual review.

1
2
pivot_reduced['subject'] = [x[0].upper() for x in pivot_reduced['description'].str.split()]
pivot_reduced.sort_values('start date')
start datedescriptionsubject
02024-05-20RestREST
12024-05-21Bike 40 minutes moderate with 4 x 30-second sp...BIKE
22024-05-22Swim 800 yards total. Main set: 8 x 25 yards, ...SWIM
1142024-05-22Run 4 miles moderate + 2 x 10-second hill spri...RUN
32024-05-23Bike 40 minutes moderate.BIKE
............
2212024-09-06Run 3 miles easy.RUN
1102024-09-07Swim 10 minutes easy with 4 x 30 seconds at ra...SWIM
2222024-09-07Bike 10 minutes with 4 x 30 seconds fast.BIKE
3342024-09-07Run 10 minutes with 4 x 20 seconds at 90 perce...RUN
1112024-09-08RACE DAYRACE

161 rows × 3 columns

Write dataframe to CSV for calendar import.

1
pivot_reduced.sort_values('start date').to_csv(OUT_FILE, index=False)

Huzzah

This post is licensed under CC BY 4.0 by the author.