I signed up to compete in a half-Ironman(tm) triathlon on September 8, 2024. Completing a half-Ironman(tm) involves:
- 1.2 mile swim
- 56 mile bike
- 13.1 mile run
It’s nice to have a solid training plan to prepare for such a demanding event, and many kind folks have compiled and published these online for public reference. Rather than constantly referring to a cluttered website, though, I wanted to put all of the recommended workouts on my calendar. I like my calendar.
By the power of python, beautifulsoup, and pandas, I’ve done just that - scraped my selected training plan from the web and re-arranged the information into a format which can be imported into Google Calendar, as I prefer.
How do?
Retrieve and parse the HTML, referring to my browser’s Inspect
utility to analyse relevant page structure.
1
2
3
4
5
6
7
8
9
10
11
12
import requests
import bs4
import pandas as pd
URL = 'https://www.triathlete.com/training/super-simple-ironman-70-3-triathlon-training-plan/'
RACE_DAY = '9/8/2024'
OUT_FILE = '~/Desktop/sc.csv'
response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text)
tables = soup.find_all('table')
len(tables)
1
17
Prep data frame and append a new row for each day in each week-table (all tables except the first table).
1
2
3
4
5
6
7
8
9
10
11
12
13
df = pd.DataFrame({'dow': [], 'wo1': [], 'wo2': [], 'wo3': []})
for i, t in enumerate(tables):
if i == 0: continue
# print(f'Week {i}')
trs = t.find_all('tr')
for tr in trs:
tds = tr.find_all('td')
# print(' | '.join(td.text for td in tds))
df.loc[df.shape[0]] = [td.text for td in tds] + ([''] if i < 16 else [])
df.head()
dow | wo1 | wo2 | wo3 | |
---|---|---|---|---|
0 | Monday | Rest | ||
1 | Tuesday | Bike 40 minutes moderate with 4 x 30-second sp... | ||
2 | Wednesday | Swim 800 yards total. Main set: 8 x 25 yards, ... | Run 4 miles moderate + 2 x 10-second hill spri... | |
3 | Thursday | Bike 40 minutes moderate. | ||
4 | Friday | Swim 800 yards total. Main set: 3 x 100 yards ... | Run 4 miles moderate. |
Given the race day as the end date, back-fill to assign dates until training day 1.
1
2
df['start date'] = pd.date_range(end=RACE_DAY, periods=df.shape[0])
df.head()
dow | wo1 | wo2 | wo3 | start date | |
---|---|---|---|---|---|
0 | Monday | Rest | 2024-05-20 | ||
1 | Tuesday | Bike 40 minutes moderate with 4 x 30-second sp... | 2024-05-21 | ||
2 | Wednesday | Swim 800 yards total. Main set: 8 x 25 yards, ... | Run 4 miles moderate + 2 x 10-second hill spri... | 2024-05-22 | |
3 | Thursday | Bike 40 minutes moderate. | 2024-05-23 | ||
4 | Friday | Swim 800 yards total. Main set: 3 x 100 yards ... | Run 4 miles moderate. | 2024-05-24 |
I want each workout to have its own event, e.g. multiple workouts in one day should result in multiple events that day.
So, melt the multiple wo
workout columns into just one, in a long format with repeated dates. Then, drop rows with empty descriptions, so days where there are fewer workouts don’t have extra empty events.
1
2
3
4
5
6
7
pivot = df.melt(
id_vars='start date',
value_name='description',
value_vars=['wo1', 'wo2', 'wo3']
)
pivot_reduced = pivot.drop('variable', axis=1).drop(pivot[pivot['description'] == ''].index)
pivot_reduced.head()
start date | description | |
---|---|---|
0 | 2024-05-20 | Rest |
1 | 2024-05-21 | Bike 40 minutes moderate with 4 x 30-second sp... |
2 | 2024-05-22 | Swim 800 yards total. Main set: 8 x 25 yards, ... |
3 | 2024-05-23 | Bike 40 minutes moderate. |
4 | 2024-05-24 | Swim 800 yards total. Main set: 3 x 100 yards ... |
Use the first word of each workout description as its event subject/title, and re-sort by date for visual review.
1
2
pivot_reduced['subject'] = [x[0].upper() for x in pivot_reduced['description'].str.split()]
pivot_reduced.sort_values('start date')
start date | description | subject | |
---|---|---|---|
0 | 2024-05-20 | Rest | REST |
1 | 2024-05-21 | Bike 40 minutes moderate with 4 x 30-second sp... | BIKE |
2 | 2024-05-22 | Swim 800 yards total. Main set: 8 x 25 yards, ... | SWIM |
114 | 2024-05-22 | Run 4 miles moderate + 2 x 10-second hill spri... | RUN |
3 | 2024-05-23 | Bike 40 minutes moderate. | BIKE |
... | ... | ... | ... |
221 | 2024-09-06 | Run 3 miles easy. | RUN |
110 | 2024-09-07 | Swim 10 minutes easy with 4 x 30 seconds at ra... | SWIM |
222 | 2024-09-07 | Bike 10 minutes with 4 x 30 seconds fast. | BIKE |
334 | 2024-09-07 | Run 10 minutes with 4 x 20 seconds at 90 perce... | RUN |
111 | 2024-09-08 | RACE DAY | RACE |
161 rows × 3 columns
Write dataframe to CSV for calendar import.
1
pivot_reduced.sort_values('start date').to_csv(OUT_FILE, index=False)
Huzzah