This time I decided to publish the whole notebook, as this project was real fun, a great challenge, and I learned lots.
If you’re interested only in the stats I managed to retrieve, just scroll to the bottom of this post.
Similarly, if you’re into the coding side of things and you spot something that could’ve been done more easily, I’d be very happy to get any comment or advice!
So without further ado, put your skis on, check your rifle and let’s hit the track...
Ready - steady - Biathlon!
For those who are not yet familiar with the best sport in the world: biathlon rules vary slightly with every race format, but put simply it consists of two equally important parts, cross-country skiing and shooting. The races are run on a closed loop, and between laps the competitors stop at a shooting range with 5 targets. For each target missed they get a penalty that depends on the race format (relay, time trial, sprint, pursuit, …).
For my analysis I chose a women’s mass start from 17th Jan 2021. Subjectively, it was the best race yet this season. In this race 30 of the world’s best biathletes skied 5 laps of 2.5km each, with a shooting stage between laps: 2 prone followed by 2 standing. For each missed target there is a 150m penalty loop to ski before they can get back on the track.
I used this PDF from the official IBU page: https://ibu.blob.core.windows.net/docs/2021/BT/SWRL/CP06/SWMS/C77D_v1.pdf.
As you can see, there’s a lot going on there and it was real fun to hammer it into shape.
Importing libraries
Importing all the libraries I need. My favourite Seaborn plot, distplot, is being deprecated, so I also muted the warning messages.
import requests
import pdfplumber
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
#ignore warnings re distplot deprecated
import warnings
warnings.filterwarnings("ignore")
Thanks to Pythonic Accountant (YouTube) for the requests and pdfplumber tutorial.
# defining function for downloading the file
def download_file(url):
    local_filename = url.split('/')[-1]  # name of the file after the last slash '/'
    with requests.get(url) as r:
        r.raise_for_status()  # stop here if the download failed
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    return local_filename
file_url = 'https://ibu.blob.core.windows.net/docs/2021/BT/SWRL/CP06/SWMS/C77D_v1.pdf'
#run the function
file = download_file(file_url)
Now I have my pdf.
Next I'll use a while loop to go page by page and convert the PDF to a string using pdfplumber.
I found out afterwards that a for loop might be better for this, and used one later in this notebook.
with pdfplumber.open(file) as pdf: #file is a variable set to the name of my pdf file
    i=0
    text=''
    while i<len(pdf.pages): #while loop scans page by page
        page = pdf.pages[i]
        text += (page.extract_text())
        i+=1
text = text.split('\n')[5:] #splitting into lines and getting rid of the header (race info)
#make sure to run only once!!!!!!!
headers = text[:4] #taking the headers
text = text[:-4] #getting rid of the last footnote
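The same page-by-page extraction can be written as a for loop. This is only a minimal sketch: the helper name `pages_to_lines` is mine, not part of pdfplumber, and it assumes nothing about a page beyond an `extract_text()` method, which pdfplumber pages have. Pages that yield no text are skipped.

```python
def pages_to_lines(pages):
    """Collect the text of every page as one flat list of lines."""
    lines = []
    for page in pages:  # e.g. pdf.pages inside `with pdfplumber.open(file) as pdf:`
        page_text = page.extract_text()
        if page_text:  # skip a page that yields no text
            lines.extend(page_text.split('\n'))
    return lines
```

Because each page is split on its own, the last line of one page is not glued to the first line of the next, which is what happens when the page strings are concatenated without a newline in between.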
Here's what the first few lines look like.
Notice it's a list of strings, with each string being 1 line from the original PDF file. That will come in very handy.
text[:6] # showing first 6 lines
['Rank Bib Name Nat T', 'Loop 1 Loop 2 Loop 3 Loop 4 Lap 5', 'Result Behind Rk', 'Time Rk Time Rk Time Rk Time Rk Time Rk', '1 15 SIMON Julia FRA 3 40:11.1 0.0 1', 'Cumulative Time 8:03.3 +0.9 2 16:40.6 +25.7 12 25:25.8 +57.8 11 33:29.0 0.0 1 40:11.1 0.0 1']
Now I'll get rid of the headers and the footnotes that keep repeating on each page.
aux=[] #empty list for text without headers and footnotes
for s in text:
    if s in headers:
        pass
    elif s.startswith('BTHW12'): #that's the footnotes
        pass
    else:
        aux.append(s)
That's what the first few lines look like now.
aux[:6]
['1 15 SIMON Julia FRA 3 40:11.1 0.0 1', 'Cumulative Time 8:03.3 +0.9 2 16:40.6 +25.7 12 25:25.8 +57.8 11 33:29.0 0.0 1 40:11.1 0.0 1', 'Loop Time 8:03.3 +0.9 2 8:37.3 +45.0 20 8:45.2 +41.7 18 8:03.2 0.0 1 6:42.1 +8.6 5', 'Shooting 0 24.6 +1.7 4 1 29.7 +6.5 10 2 21.9 +2.3 3 0 21.4 0.0 1 3 1:37.8 +5.5 2', 'Range Time 45.5 +1.8 3 50.4 +4.2 8 43.1 +1.8 3 42.0 0.0 1 3:01.0 +1.9 2', 'Course Time 7:12.2 +2.3 11 7:16.7 +16.8 21 7:08.0 +2.6 3 7:15.3 +7.6 4 6:42.1 +8.6 5 35:34.3 +11.9 6']
Now I'm creating a nested list that I imprecisely called 'sublist' for lack of imagination.
It groups the data for each athlete into one item in the list.
sublist = [] #will create nested lists with data for each athlete grouped together
#scanning for the last line for each athlete (Penalty)
#probably better solution than while loop?
i=0
while i<len(aux):
    if aux[i].startswith('Penalty'): #Penalty is the last line per athlete
        sublist.append(aux[:i+1])
        del aux[:i+1]
        i=0
    else:
        i+=1
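One possible answer to the "better solution than a while loop?" question is a single pass with a for loop. This is a sketch, not the notebook's code: `group_by_athlete` and its `last_line_prefix` parameter are names I made up, and the demo lines are toy data.

```python
def group_by_athlete(lines, last_line_prefix='Penalty'):
    """Split a flat list of lines into one sub-list per athlete."""
    groups, current = [], []
    for line in lines:
        current.append(line)
        if line.startswith(last_line_prefix):  # 'Penalty' closes an athlete's block
            groups.append(current)
            current = []
    return groups  # lines after the last 'Penalty' line are left out

# toy input, just to show the shape of the result
demo = ['1 15 A', 'Loop Time x', 'Penalty Time 1', '2 6 B', 'Penalty Time 2']
print(group_by_athlete(demo))
```

Unlike the while loop, this version leaves the input list untouched, so the cell can be re-run safely.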
This is one item in the nested list. It just so happened to be a Czech athlete.
sublist[9]
['10 9 DAVIDOVA Marketa CZE 3 40:44.1 +33.0 10', 'Cumulative Time 8:06.2 +3.8 5 16:19.6 +4.7 4 25:01.3 +33.3 3 34:07.6 +38.6 13 40:44.1 +33.0 10', 'Loop Time 8:06.2 +3.8 5 8:13.4 +21.1 10 8:41.7 +38.2 16 9:06.3 +1:03.1 22 6:36.5 +3.0 2', 'Shooting 0 27.9 +5.0 12 0 31.7 +8.5 15 1 31.7 +12.1 27 2 33.2 +11.8 24 3 2:04.7 +32.4 21', 'Range Time 49.4 +5.7 15 52.5 +6.3 =14 52.2 +10.9 26 53.4 +11.4 23 3:27.5 +28.4 22', 'Course Time 7:10.3 +0.4 3 7:13.2 +13.3 15 7:18.3 +12.9 12 7:19.8 +12.1 8 6:36.5 +3.0 2 35:38.1 +15.7 7', 'Penalty Time 6.5 7.6 31.2 53.1 1:38.5']
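A quick sanity check on this block: the final time should be the sum of course time, range time and penalty time. The sketch below uses the values from the output above; reading the last value on each line as the race total is my interpretation of the PDF layout, and `to_seconds` is a helper name of my own.

```python
def to_seconds(stamp):
    """Convert 'M:SS.s' or 'SS.s' (optionally prefixed with '+') to seconds."""
    seconds = 0.0
    for part in stamp.lstrip('+').split(':'):
        seconds = seconds * 60 + float(part)
    return seconds

# values copied from the DAVIDOVA block shown above
total_time = to_seconds('40:44.1')    # first line: final time
course_time = to_seconds('35:38.1')   # last value on the 'Course Time' line
range_time = to_seconds('3:27.5')     # last value on the 'Range Time' line
penalty_time = to_seconds('1:38.5')   # last value on the 'Penalty Time' line

parts = course_time + range_time + penalty_time
print(round(parts, 1), round(total_time, 1))  # both come out as 2444.1 seconds
```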
Now I'm about to split the first line into words rather than one long string.
In the first line I see that the ranking is displayed at both ends.
I'll get rid of one and I'll also merge name and surname into one string.
I had to create a little if condition as some athletes use a middle name.
#MAKE SURE TO RUN ONCE ONLY
for s in sublist:
    if len(s[0].split()) == 9: #name is SURNAME + one further name
        s[0] = s[0].split()[:-1] #getting rid of copy of rank
        s[0][2] = s[0][2] + ' ' + s[0][3] #concat surname+name
        del s[0][3] #delete the name that was just merged in
    elif len(s[0].split()) == 10: #name is SURNAME + two further names
        s[0] = s[0].split()[:-1] #getting rid of copy of rank
        s[0][2] = s[0][2] + ' ' + s[0][3] + ' ' + s[0][4] #concat surname+both further names
        del s[0][3:5] #delete the two names that were just merged in
sublist[6][0] #first line
['7', '1', 'ROEISELAND Marte Olsbu', 'NOR', '4', '40:41.4', '+30.3']
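The two-branch if condition can be avoided by counting fields from both ends of the line. This is a sketch under one assumption: every first line starts with rank and bib and ends with nationality, missed targets, time, behind and the repeated rank, as all the lines shown here do. The function name `parse_first_line` is hypothetical.

```python
def parse_first_line(line):
    """Split an athlete's first line into 7 fields, whatever the name length."""
    tokens = line.split()
    rank, bib = tokens[0], tokens[1]
    nat, missed, time, behind = tokens[-5:-1]  # tokens[-1], the repeated rank, is dropped
    name = ' '.join(tokens[2:-5])              # everything in between is the name
    return [rank, bib, name, nat, missed, time, behind]

# first line of the winner, as printed earlier in this post
print(parse_first_line('1 15 SIMON Julia FRA 3 40:11.1 0.0 1'))
```

Because it returns a new list instead of editing `sublist` in place, it does not need the "run once only" warning.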
Next I find things that interest me for my analysis, retrieve them from the sublist and create dataframes.
athletes = []
for s in sublist:
    athletes+=s[0] #getting first row from sublist
athletes = np.reshape(athletes, (30,7)) #rearranging to fit into a df
athletes_df = pd.DataFrame(athletes, columns=('Rank Bib Name Nat T Time Behind').split())
#convert behind into seconds and float
Behind = [s.lstrip('+').split(':') for s in athletes_df['Behind']] #split into min and sec
Behind = [float(i[0]) if len(i)==1 else float(i[0])*60+float(i[1]) for i in Behind] #calculate behind time in s
athletes_df['Behind']=Behind
#convert rank,bib,T(targets missed) to int
athletes_df[['Rank','Bib','T']]=athletes_df[['Rank','Bib','T']].astype(int)
#now course time
#course time is [-2] in sublist and I only need the final time and behind[-3:-1]
course_time = []
for s in sublist:
    course_time += s[-2].split()[-3:-1]
course_time = np.reshape(course_time, (30,2))
course_time_df = pd.DataFrame(course_time, columns=['Course Time','CT Behind'])
CTbehind = [s.lstrip('+').split(':') for s in course_time_df['CT Behind']] #split into min and sec
CTbehind = [float(i[0]) if len(i)==1 else float(i[0])*60+float(i[1]) for i in CTbehind] #calculate behind time in s
course_time_df['CT Behind'] = CTbehind
#finding fastest lap excluding shooting
#split string and find indices with lap times: 2,5,8,11,14
lap_times=[]
for s in sublist:
    laps = [s[-2].split()[i] for i in [2,5,8,11,14]]
    laps = [float(i.split(':')[0])*60 + float(i.split(':')[1]) for i in laps] #converting to seconds
    lap_times.append(laps)
lap_times_df = pd.DataFrame(lap_times, columns=['Lap 1', 'Lap 2','Lap 3','Lap 4','Lap 5'])
lap_times_df['Fastest lap'] = lap_times_df.min(axis=1) #creating Fastest lap column
#calculate fastest lap behind time
lap_times_df['FL Behind (s)'] = lap_times_df['Fastest lap']-lap_times_df['Fastest lap'].min()
# and convert the fastest lap back to a min:sec string
lap_times_df['Fastest lap']=[str(t/60).split('.')[0] +':'+ str(round(t%60, 1)) for t in lap_times_df['Fastest lap']]
#range time
range_times =[]
for s in sublist:
    range_time = [(s[-3].split()[i]) for i in [2,5,8,11]] # 2,5,8,11 indices of interest
    range_times += range_time
#range times to seconds and float
range_times = [float(s.split(':')[0])*60 + float(s.split(':')[1]) if len(s.split(':')) == 2 else float(s) for s in range_times]
range_df=pd.DataFrame(np.reshape(range_times, (30,4)), columns=['Range 1', 'Range 2', 'Range 3', 'Range 4'])
range_df['Total range'] = range_df.sum(axis=1)
#shooting times
shooting_times =[]
for s in sublist:
    shooting_time = [float(s[3].split()[i]) for i in [2,6,10,14]] #2,6,10,14 indices of interest
    #converting straight to float
    shooting_times += shooting_time
shooting_df=pd.DataFrame(np.reshape(shooting_times, (30,4)), columns=['Shoot 1', 'Shoot 2', 'Shoot 3', 'Shoot 4'])
shooting_df['Total shooting'] = shooting_df.sum(axis=1)
#missed targets
missed_targets=[]
for s in sublist:
    targets = [int(s[3].split()[i]) for i in [1,5,9,13]] #1,5,9,13 = targets missed per shooting, stored as int
    missed_targets.append(targets)
missed_targets_df = pd.DataFrame(missed_targets, columns=['Shoot 1', 'Shoot 2', 'Shoot 3', 'Shoot 4'])
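One detail about the 'Fastest lap' string: the expression used above does not zero-pad the seconds, so 425.3 s comes out as '7:5.3' rather than '7:05.3'. A small formatter, sketched here under the hypothetical name `format_mmss`, pads the seconds and rounds before splitting so that the seconds part can never show as '60.0'.

```python
def format_mmss(seconds):
    """Format a duration in seconds as 'M:SS.s' with zero-padded seconds."""
    minutes, rest = divmod(round(seconds, 1), 60)
    return f"{int(minutes)}:{rest:04.1f}"

print(format_mmss(425.3))  # 7:05.3
print(format_mmss(393.5))  # 6:33.5
```

It could be applied to the column with something like `lap_times_df['Fastest lap'].map(format_mmss)` while the column still holds seconds.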
Now I can concatenate all the dfs into one big dataframe as the index will be the same for each of them.
df = pd.concat((athletes_df,
                course_time_df,
                lap_times_df[['Fastest lap','FL Behind (s)']],
                shooting_df['Total shooting'],
                range_df['Total range']), axis=1)
This is the data for the top 5:
- Rank: what it says on the box, a position
- Bib: bib number
- Name, Nationality
- T: Missed targets (out of 20)
- Time: Total Time
- Behind: Time behind the winner in seconds
- Course Time: Time excluding shooting range and penalty loops
- CT Behind: Time behind the fastest Course Time in seconds
- Fastest lap: Fastest lap of each athlete, excluding shooting range and penalty loops
- FL Behind: Athlete's fastest lap compared to the best lap overall (seconds)
- Total shooting: Total shooting time in seconds
- Total range: Total time spent on the range in seconds
df.head()
| | Rank | Bib | Name | Nat | T | Time | Behind | Course Time | CT Behind | Fastest lap | FL Behind (s) | Total shooting | Total range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 15 | SIMON Julia | FRA | 3 | 40:11.1 | 0.0 | 35:34.3 | 11.9 | 6:42.1 | 8.6 | 97.6 | 181.0 |
1 | 2 | 6 | PREUSS Franziska | GER | 2 | 40:15.0 | 3.9 | 35:49.4 | 27.0 | 6:43.3 | 9.8 | 103.7 | 191.5 |
2 | 3 | 3 | OEBERG Hanna | SWE | 3 | 40:22.8 | 11.7 | 35:42.0 | 19.6 | 6:38.9 | 5.4 | 101.0 | 184.2 |
3 | 4 | 13 | TANDREVOLD Ingrid Landmark | NOR | 2 | 40:24.1 | 13.0 | 35:44.5 | 22.1 | 6:46.3 | 12.8 | 121.4 | 203.2 |
4 | 5 | 20 | BRORSSON Mona | SWE | 1 | 40:26.0 | 14.9 | 36:21.1 | 58.7 | 6:49.5 | 16.0 | 112.1 | 197.8 |
Best shooters, 2 or fewer targets missed:
df[df['T']<=2].sort_values('T')
| | Rank | Bib | Name | Nat | T | Time | Behind | Course Time | CT Behind | Fastest lap | FL Behind (s) | Total shooting | Total range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 5 | 20 | BRORSSON Mona | SWE | 1 | 40:26.0 | 14.9 | 36:21.1 | 58.7 | 6:49.5 | 16.0 | 112.1 | 197.8 |
1 | 2 | 6 | PREUSS Franziska | GER | 2 | 40:15.0 | 3.9 | 35:49.4 | 27.0 | 6:43.3 | 9.8 | 103.7 | 191.5 |
3 | 4 | 13 | TANDREVOLD Ingrid Landmark | NOR | 2 | 40:24.1 | 13.0 | 35:44.5 | 22.1 | 6:46.3 | 12.8 | 121.4 | 203.2 |
5 | 6 | 5 | WIERER Dorothea | ITA | 2 | 40:39.9 | 28.8 | 36:24.0 | 61.6 | 7:5.3 | 31.8 | 92.1 | 179.1 |
17 | 18 | 22 | HETTICH Janina | GER | 2 | 41:25.5 | 74.4 | 36:36.5 | 74.1 | 7:7.3 | 33.8 | 132.1 | 213.7 |
23 | 24 | 14 | KNOTTEN Karoline Offigstad | NOR | 2 | 42:25.9 | 134.8 | 37:47.9 | 145.5 | 7:14.1 | 40.6 | 105.3 | 193.1 |
Fastest shooters. Notice that D. Wierer has the fastest shooting time and is also one of the most accurate!
df.sort_values('Total shooting').head()
| | Rank | Bib | Name | Nat | T | Time | Behind | Course Time | CT Behind | Fastest lap | FL Behind (s) | Total shooting | Total range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 5 | WIERER Dorothea | ITA | 2 | 40:39.9 | 28.8 | 36:24.0 | 61.6 | 7:5.3 | 31.8 | 92.1 | 179.1 |
0 | 1 | 15 | SIMON Julia | FRA | 3 | 40:11.1 | 0.0 | 35:34.3 | 11.9 | 6:42.1 | 8.6 | 97.6 | 181.0 |
2 | 3 | 3 | OEBERG Hanna | SWE | 3 | 40:22.8 | 11.7 | 35:42.0 | 19.6 | 6:38.9 | 5.4 | 101.0 | 184.2 |
1 | 2 | 6 | PREUSS Franziska | GER | 2 | 40:15.0 | 3.9 | 35:49.4 | 27.0 | 6:43.3 | 9.8 | 103.7 | 191.5 |
6 | 7 | 1 | ROEISELAND Marte Olsbu | NOR | 4 | 40:41.4 | 30.3 | 35:33.2 | 10.8 | 6:46.5 | 13.0 | 105.0 | 187.1 |
Athletes who spent the least time on the range.
This differs from the pure shooting time as it also includes approaching the targets, getting the breath under control, etc.
But yes, it is the same 5 ladies.
df.sort_values('Total range').head()
| | Rank | Bib | Name | Nat | T | Time | Behind | Course Time | CT Behind | Fastest lap | FL Behind (s) | Total shooting | Total range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 5 | WIERER Dorothea | ITA | 2 | 40:39.9 | 28.8 | 36:24.0 | 61.6 | 7:5.3 | 31.8 | 92.1 | 179.1 |
0 | 1 | 15 | SIMON Julia | FRA | 3 | 40:11.1 | 0.0 | 35:34.3 | 11.9 | 6:42.1 | 8.6 | 97.6 | 181.0 |
2 | 3 | 3 | OEBERG Hanna | SWE | 3 | 40:22.8 | 11.7 | 35:42.0 | 19.6 | 6:38.9 | 5.4 | 101.0 | 184.2 |
6 | 7 | 1 | ROEISELAND Marte Olsbu | NOR | 4 | 40:41.4 | 30.3 | 35:33.2 | 10.8 | 6:46.5 | 13.0 | 105.0 | 187.1 |
1 | 2 | 6 | PREUSS Franziska | GER | 2 | 40:15.0 | 3.9 | 35:49.4 | 27.0 | 6:43.3 | 9.8 | 103.7 | 191.5 |
Fastest skiers overall.
df.sort_values('CT Behind').head()
| | Rank | Bib | Name | Nat | T | Time | Behind | Course Time | CT Behind | Fastest lap | FL Behind (s) | Total shooting | Total range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | 15 | 11 | HERRMANN Denise | GER | 4 | 40:56.5 | 45.4 | 35:22.4 | 0.0 | 6:43.3 | 9.8 | 127.8 | 210.4 |
8 | 9 | 12 | BRAISAZ-BOUCHET Justine | FRA | 4 | 40:43.7 | 32.6 | 35:23.2 | 0.8 | 6:45.1 | 11.6 | 120.6 | 205.3 |
10 | 11 | 25 | MIRONOVA Svetlana | RUS | 4 | 40:44.4 | 33.3 | 35:26.9 | 4.5 | 6:33.5 | 0.0 | 115.5 | 201.2 |
7 | 8 | 2 | ECKHOFF Tiril | NOR | 4 | 40:43.2 | 32.1 | 35:30.3 | 7.9 | 6:37.6 | 4.1 | 112.7 | 195.8 |
6 | 7 | 1 | ROEISELAND Marte Olsbu | NOR | 4 | 40:41.4 | 30.3 | 35:33.2 | 10.8 | 6:46.5 | 13.0 | 105.0 | 187.1 |
Comparing the fastest laps. Go team Czech Republic, woohoo!!!
df.sort_values('FL Behind (s)').head()
| | Rank | Bib | Name | Nat | T | Time | Behind | Course Time | CT Behind | Fastest lap | FL Behind (s) | Total shooting | Total range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 11 | 25 | MIRONOVA Svetlana | RUS | 4 | 40:44.4 | 33.3 | 35:26.9 | 4.5 | 6:33.5 | 0.0 | 115.5 | 201.2 |
9 | 10 | 9 | DAVIDOVA Marketa | CZE | 3 | 40:44.1 | 33.0 | 35:38.1 | 15.7 | 6:36.5 | 3.0 | 124.5 | 207.5 |
7 | 8 | 2 | ECKHOFF Tiril | NOR | 4 | 40:43.2 | 32.1 | 35:30.3 | 7.9 | 6:37.6 | 4.1 | 112.7 | 195.8 |
2 | 3 | 3 | OEBERG Hanna | SWE | 3 | 40:22.8 | 11.7 | 35:42.0 | 19.6 | 6:38.9 | 5.4 | 101.0 | 184.2 |
0 | 1 | 15 | SIMON Julia | FRA | 3 | 40:11.1 | 0.0 | 35:34.3 | 11.9 | 6:42.1 | 8.6 | 97.6 | 181.0 |
How many athletes per nationality in the race.
df['Nat'].value_counts()
GER    4
SWE    4
FRA    4
NOR    4
AUT    3
ITA    2
RUS    2
BLR    2
CAN    1
USA    1
EST    1
CZE    1
KOR    1
Name: Nat, dtype: int64
Now let's get visual
I learned to love distribution plots as they show so much more than a simple mean.
This first one shows how much time it took athletes to fire their 5 shots at each of the 4 shootings.
plt.figure(figsize=(12,7))
for column in shooting_df.columns.drop('Total shooting'): #each column except for the SUM total
    ax = sns.distplot(shooting_df[column], label=column, axlabel='Time (s)', kde_kws={'lw':4},
                      hist_kws={'alpha':0.3,'label':column})
ax.set_title('Time per shoot')
ax.set(xlim=(min(shooting_times)-0.5, max(shooting_times)+0.5))
plt.legend()
Now the same for the time spent at the range
plt.figure(figsize=(12,7))
for column in range_df.columns.drop('Total range'): #each column except for the SUM total
    ax = sns.distplot(range_df[column], label=column, axlabel='Time (s)', kde_kws={'lw':4},
                      hist_kws={'alpha':0.3,'label':column})
ax.set_title('Time spent on range')
ax.set(xlim=(min(range_times)-0.5, max(range_times)+0.5))
plt.legend()
Targets missed at each shooting
Notice that shootings 3 and 4 (the standing ones) are distributed slightly more towards the right, which means more mistakes.
Compare that to, say, the blue shoot 1 with mostly 0 or 1 mistakes.
plt.figure(figsize=(12,7))
for column in missed_targets_df.columns:
    ax=sns.distplot(missed_targets_df[column], axlabel='Targets missed', kde_kws={'lw':4},
                    hist_kws={'alpha':0.3,'label':column})
ax.set_title('Distribution of missed targets per shoot')
ax.set_xlim(-0.5,5.5)
plt.legend()
Ski speed
And finally, let's look at the distribution of time per lap.
Here you can see the narrow blue curve of lap 1, as the athletes were mostly together.
Then the curves progressively flatten as differences in physical condition and stamina start to show.
And finally the purple lap 5 is clearly the fastest lap of them all, as each athlete just pushed hard towards the finish.
plt.figure(figsize=(12,7))
for column in lap_times_df.columns.drop(['Fastest lap','FL Behind (s)']): #each lap column, without the two fastest-lap columns
    ax = sns.distplot(lap_times_df[column], label=column,
                      axlabel='Time (mm:ss)', kde_kws={'lw':4},
                      hist_kws={'alpha':0.3,'label':column})
ax.set(xlim=(400,485))
ax.set_title('Time distribution per lap (Excl. shooting)')
xticks = ax.get_xticks()
#convert x ticks from seconds to m:ss.s, zero-padding the seconds
xticks = [f"{int(tick // 60)}:{tick % 60:04.1f}" for tick in xticks]
ax.set_xticklabels(xticks)
plt.legend()
And what a finish that was! You can watch it on the official IBU TV here: https://www.eurovisionsports.tv/ibu#AO7BBV3KHP