biathlon

This time I decided to publish the whole notebook, as this project was real fun and a great challenge, and I learned a lot.
If you’re only interested in the stats I managed to retrieve, just scroll to the bottom of this post.
Similarly, if you’re into the coding side of things and you spot something that could have been done more easily, I’m very happy to get any comment or advice!

So without further ado, put your skis on, check your rifle and let’s hit the track...

Ready - steady - Biathlon!¶

For those who are not yet familiar with the best sport in the world: biathlon rules vary slightly with every race format, but put simply it consists of two equally important parts, cross-country skiing and shooting. The races are run on a closed loop, and between laps the competitors stop at a shooting range with 5 targets. For each target missed, they get a penalty that depends on the race format (relay, time trial, sprint, pursuit, …).

For my analysis I chose the women’s mass start from 17th Jan 2021. Subjectively, it was the best race of the season so far. In this race 30 of the world’s best biathletes skied 5 laps of 2.5km each, with a shooting between laps: 2 prone followed by 2 standing. For each missed target there is a 150m penalty loop before they can get back on the track.

I used this pdf from the official IBU page: https://ibu.blob.core.windows.net/docs/2021/BT/SWRL/CP06/SWMS/C77D_v1.pdf.

As you can see, there’s a lot going on there and it was real fun to hammer it into shape.

Importing libraries¶

Importing all the libraries I need. My favourite Seaborn plot (distplot) is being deprecated, so I also muted the warning messages.

In [1]:

import requests
import pdfplumber
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

#ignore warnings re distplot deprecated
import warnings
warnings.filterwarnings("ignore")

Thanks to Pythonic Accountant (YouTube) for the requests and pdfplumber tutorial.

In [2]:

# defining function for downloading the file
def download_file(url):
    local_filename = url.split('/')[-1] #name of the file after the last slash '/'
    
    with requests.get(url) as r:
        r.raise_for_status() #stop here if the download failed instead of saving an error page
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    
    return local_filename

file_url = 'https://ibu.blob.core.windows.net/docs/2021/BT/SWRL/CP06/SWMS/C77D_v1.pdf'
#run the function
file = download_file(file_url)

Now I have my pdf. Next I'll use a while loop to go page by page and convert the pdf to a string using pdfplumber.
I later found out that a for loop might be better for this, and used one later in this notebook.

In [3]:

with pdfplumber.open(file) as pdf: #file is a variable set to the name of my pdf file
    i=0
    text=''
    
    while i<len(pdf.pages): #while loop scans page by page
        page = pdf.pages[i]
        text += (page.extract_text())
        i+=1

text = text.split('\n')[5:] #splitting into lines and getting rid of the header (race info)
#make sure to run only once!!!!!!!

headers = text[:4] #taking the headers
text = text[:-4] #getting rid of the last footnote
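
The for loop mentioned above could look something like the sketch below. `join_pages` is a hypothetical helper, not part of pdfplumber; it only assumes that each page object has an `extract_text()` method, as pdfplumber pages do. Unlike the cell above it puts a newline between pages, so the header and footer filtering further down may need adjusting if you swap it in.

```python
def join_pages(pages):
    """Concatenate the text of every page, one page after another."""
    parts = []
    for page in pages:  # iterate directly, no manual index needed
        page_text = page.extract_text()
        if page_text:  # skip pages where nothing could be extracted
            parts.append(page_text)
    return '\n'.join(parts)


# Intended use with the notebook's file (assumes pdfplumber is installed):
# with pdfplumber.open(file) as pdf:
#     text = join_pages(pdf.pages)
```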

Here's what the first few lines look like.
Notice it's a list of strings, with each string being 1 line from the original pdf file. That will come in very handy.

In [4]:

text[:6] # showing first 6 lines

Out[4]:

['Rank Bib Name Nat T',
 'Loop 1 Loop 2 Loop 3 Loop 4 Lap 5',
 'Result Behind Rk',
 'Time Rk Time Rk Time Rk Time Rk Time Rk',
 '1 15 SIMON Julia FRA 3 40:11.1 0.0 1',
 'Cumulative Time 8:03.3 +0.9 2 16:40.6 +25.7 12 25:25.8 +57.8 11 33:29.0 0.0 1 40:11.1 0.0 1']

Now I'll get rid of the headers and the footnotes that keep repeating on each page.

In [5]:

aux=[] #empty list for text without headers and footnotes 
for s in text:
    if s in headers:
        pass
    elif s.startswith('BTHW12'): #that's the footnotes
        pass
    else:
        aux.append(s)
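
The same filter also fits in a list comprehension. Below is a small sketch on a few sample lines; `drop_headers_and_footers` is a made-up helper name, the 'BTHW12' prefix is the one used above, and the footer text in the sample is a placeholder, not the real footer.

```python
def drop_headers_and_footers(lines, headers, footer_prefix='BTHW12'):
    """Keep only the lines that are neither a repeated header nor a footer."""
    return [
        line for line in lines
        if line not in headers and not line.startswith(footer_prefix)
    ]


sample_headers = ['Rank Bib Name Nat T', 'Result Behind Rk']
sample_lines = [
    'Rank Bib Name Nat T',                    # repeated page header, dropped
    '1 15 SIMON Julia FRA 3 40:11.1 0.0 1',   # data line, kept
    'BTHW12 (placeholder footer text)',       # footer, dropped
]
cleaned = drop_headers_and_footers(sample_lines, sample_headers)
print(cleaned)
```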

That's what the first few lines look like now.

In [6]:

aux[:6]

Out[6]:

['1 15 SIMON Julia FRA 3 40:11.1 0.0 1',
 'Cumulative Time 8:03.3 +0.9 2 16:40.6 +25.7 12 25:25.8 +57.8 11 33:29.0 0.0 1 40:11.1 0.0 1',
 'Loop Time 8:03.3 +0.9 2 8:37.3 +45.0 20 8:45.2 +41.7 18 8:03.2 0.0 1 6:42.1 +8.6 5',
 'Shooting 0 24.6 +1.7 4 1 29.7 +6.5 10 2 21.9 +2.3 3 0 21.4 0.0 1 3 1:37.8 +5.5 2',
 'Range Time 45.5 +1.8 3 50.4 +4.2 8 43.1 +1.8 3 42.0 0.0 1 3:01.0 +1.9 2',
 'Course Time 7:12.2 +2.3 11 7:16.7 +16.8 21 7:08.0 +2.6 3 7:15.3 +7.6 4 6:42.1 +8.6 5 35:34.3 +11.9 6']

Now I'm creating a nested list that, for lack of imagination, I imprecisely called 'sublist'.
It groups the data for each athlete into one item in the list.

In [7]:

sublist = []   #will create nested lists with data for each athlete grouped together
#scanning for the last line for each athlete (Penalty)
#probably better solution than while loop?
i=0
while i<len(aux):
    if aux[i].startswith('Penalty'): #Penalty is the last line per athlete
        sublist.append(aux[:i+1])
        del aux[:i+1] 
        i=0
    else:
        i+=1
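
On the question in the comment: one alternative is a single pass with a for loop that collects lines until it reaches a 'Penalty' line. This is only a sketch; `group_by_athlete` is a hypothetical name, the first 'Penalty' line in the sample is a placeholder, and unlike the while loop it leaves its input list untouched, so it can be re-run safely.

```python
def group_by_athlete(lines, last_line_prefix='Penalty'):
    """Split a flat list of lines into one sub-list per athlete."""
    groups, current = [], []
    for line in lines:
        current.append(line)
        if line.startswith(last_line_prefix):  # last line of an athlete's block
            groups.append(current)
            current = []
    return groups


sample = [
    '1 15 SIMON Julia FRA 3 40:11.1 0.0 1',
    'Penalty Time (placeholder, values omitted)',
    '10 9 DAVIDOVA Marketa CZE 3 40:44.1 +33.0 10',
    'Penalty Time 6.5 7.6 31.2 53.1 1:38.5',
]
groups = group_by_athlete(sample)
print(len(groups))
```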

This is one item in the nested list. It just so happened to be a Czech athlete.

In [8]:

sublist[9]

Out[8]:

['10 9 DAVIDOVA Marketa CZE 3 40:44.1 +33.0 10',
 'Cumulative Time 8:06.2 +3.8 5 16:19.6 +4.7 4 25:01.3 +33.3 3 34:07.6 +38.6 13 40:44.1 +33.0 10',
 'Loop Time 8:06.2 +3.8 5 8:13.4 +21.1 10 8:41.7 +38.2 16 9:06.3 +1:03.1 22 6:36.5 +3.0 2',
 'Shooting 0 27.9 +5.0 12 0 31.7 +8.5 15 1 31.7 +12.1 27 2 33.2 +11.8 24 3 2:04.7 +32.4 21',
 'Range Time 49.4 +5.7 15 52.5 +6.3 =14 52.2 +10.9 26 53.4 +11.4 23 3:27.5 +28.4 22',
 'Course Time 7:10.3 +0.4 3 7:13.2 +13.3 15 7:18.3 +12.9 12 7:19.8 +12.1 8 6:36.5 +3.0 2 35:38.1 +15.7 7',
 'Penalty Time 6.5 7.6 31.2 53.1 1:38.5']

Now I'll split the first line into words rather than keeping it as one long string. In the first line I see that the rank is displayed at both ends.
I'll get rid of one copy and I'll also merge surname and given name into one string.
I had to add a small if condition as some athletes' names have three parts.

In [9]:

#MAKE SURE TO RUN ONCE ONLY
for s in sublist:
    if len(s[0].split()) == 9: #name is SURNAME + given name
        s[0] = s[0].split()[:-1] #getting rid of the duplicate rank
        s[0][2] = s[0][2] + ' ' + s[0][3] #concat surname + given name
        del s[0][3] #delete the now-merged given name
    
    elif len(s[0].split()) == 10: #name has three parts
        s[0] = s[0].split()[:-1] #getting rid of the duplicate rank
        s[0][2] = s[0][2] + ' ' + s[0][3] + ' ' + s[0][4] #concat all three name parts
        del s[0][3:5] #delete the two now-merged name parts
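
One way to avoid both the token counting and the run-once caveat is to slice from both ends of the line, since only the name has a variable number of words. A sketch under that assumption; `parse_first_line` is a hypothetical helper that returns a new list instead of modifying `sublist` in place.

```python
def parse_first_line(line):
    """Return [rank, bib, name, nat, T, time, behind] from an athlete's first line."""
    tokens = line.split()
    rank_bib = tokens[:2]             # rank and bib number
    name = ' '.join(tokens[2:-5])     # however many words the name has
    rest = tokens[-5:-1]              # nat, T, time, behind (duplicate rank dropped)
    return rank_bib + [name] + rest


print(parse_first_line('10 9 DAVIDOVA Marketa CZE 3 40:44.1 +33.0 10'))
```

Because the input string is never overwritten, calling it twice on the same line gives the same result.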

In [10]:

sublist[6][0] #first line

Out[10]:

['7', '1', 'ROEISELAND Marte Olsbu', 'NOR', '4', '40:41.4', '+30.3']

Next I find things that interest me for my analysis, retrieve them from the sublist and create dataframes.

In [11]:

athletes = []
for s in sublist:    
    athletes+=s[0]    #getting first row from sublist

athletes = np.reshape(athletes, (30,7)) #rearranging to fit into a df

athletes_df = pd.DataFrame(athletes, columns=('Rank Bib Name Nat T Time Behind').split())

#convert behind into seconds and float
Behind = [s.lstrip('+').split(':') for s in athletes_df['Behind']] #split into min and sec
Behind = [float(i[0]) if len(i)==1 else float(i[0])*60+float(i[1]) for i in Behind] #calculate behind time in s
athletes_df['Behind']=Behind

#convert rank,bib,T(targets missed) to int
athletes_df[['Rank','Bib','T']]=athletes_df[['Rank','Bib','T']].astype(int)

In [12]:

#now course time
#course time is [-2] in sublist and I only need the final time and behind[-3:-1]

course_time = []
for s in sublist:
    course_time += s[-2].split()[-3:-1]

course_time = np.reshape(course_time, (30,2))

course_time_df = pd.DataFrame(course_time, columns=['Course Time','CT Behind'])

CTbehind = [s.lstrip('+').split(':') for s in course_time_df['CT Behind']] #split into min and sec
CTbehind = [float(i[0]) if len(i)==1 else float(i[0])*60+float(i[1]) for i in CTbehind] #calculate behind time in s

course_time_df['CT Behind'] = CTbehind
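
The same conversion to seconds appears in several cells, so it could live in one small function. A sketch; `to_seconds` is a made-up name, and it assumes the only formats are `SS.s` and `M:SS.s`, optionally prefixed with '+', as in the lines shown above.

```python
def to_seconds(value):
    """Convert 'SS.s', 'M:SS.s' or '+M:SS.s' to seconds as a float."""
    parts = value.lstrip('+').split(':')
    if len(parts) == 1:               # seconds only, e.g. '+33.0'
        return float(parts[0])
    minutes, seconds = parts          # minutes and seconds, e.g. '+1:03.1'
    return float(minutes) * 60 + float(seconds)


for raw in ['0.0', '+33.0', '+1:03.1', '35:38.1']:
    print(raw, '->', to_seconds(raw))
```

Applied to a column of raw strings, this would be something like `course_time_df['CT Behind'].map(to_seconds)`.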

In [13]:

#finding fastest lap excluding shooting
#split string and find indices with lap times: 2,5,8,11,14
lap_times=[]
for s in sublist:
    laps = [s[-2].split()[i] for i in [2,5,8,11,14]]
    laps = [float(i.split(':')[0])*60 + float(i.split(':')[1]) for i in laps] #converting to seconds
    lap_times.append(laps)


lap_times_df = pd.DataFrame(lap_times, columns=['Lap 1', 'Lap 2','Lap 3','Lap 4','Lap 5'])
lap_times_df['Fastest lap'] = lap_times_df.min(axis=1) #creating Fastest lap column
#calculate fastest lap behind time
lap_times_df['FL Behind (s)'] = lap_times_df['Fastest lap']-lap_times_df['Fastest lap'].min()
# and convert fastest lap back to an M:SS string
lap_times_df['Fastest lap']=[str(t/60).split('.')[0] +':'+ str(round(t%60, 1)) for t in lap_times_df['Fastest lap']]
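
One thing to watch in the last line: it does not zero-pad the seconds, so 425.3 s is rendered as '7:5.3' rather than '7:05.3'. A possible padded formatter is sketched below; `format_lap` is a hypothetical helper, and it rounds to tenths first so that a value like 419.97 does not end up as '6:60.0'.

```python
def format_lap(seconds):
    """Format a duration in seconds as M:SS.s with zero-padded seconds."""
    tenths = int(round(seconds * 10))     # work in whole tenths of a second
    minutes, rest = divmod(tenths, 600)   # 600 tenths per minute
    return f'{minutes}:{rest / 10:04.1f}'


print(format_lap(425.3))
print(format_lap(402.1))
```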

In [14]:

#range time

range_times =[]
for s in sublist:
    range_time = [(s[-3].split()[i]) for i in [2,5,8,11]] # 2,5,8,11 indices of interest
    range_times += range_time

    #range times to seconds and float
range_times = [float(s.split(':')[0])*60 + float(s.split(':')[1]) if len(s.split(':')) == 2 else float(s) for s in range_times]

range_df=pd.DataFrame(np.reshape(range_times, (30,4)), columns=['Range 1', 'Range 2', 'Range 3', 'Range 4'])
range_df['Total range'] = range_df.sum(axis=1)

In [15]:

#shooting times
shooting_times =[]
for s in sublist:
    shooting_time = [float(s[3].split()[i]) for i in [2,6,10,14]] #2,6,10,14 indices of interest
                        #converting straight to float
    shooting_times += shooting_time
    

shooting_df=pd.DataFrame(np.reshape(shooting_times, (30,4)), columns=['Shoot 1', 'Shoot 2', 'Shoot 3', 'Shoot 4'])
shooting_df['Total shooting'] = shooting_df.sum(axis=1)

In [16]:

#missed targets
missed_targets=[]
for s in sublist:
    targets = [int(s[3].split()[i]) for i in [1,5,9,13]] #1,5,9,13 are the miss counts; int so they are numbers, not strings
    missed_targets.append(targets)

missed_targets_df = pd.DataFrame(missed_targets, columns=['Shoot 1', 'Shoot 2', 'Shoot 3', 'Shoot 4'])

Now I can concatenate all the dfs into one big dataframe as the index will be the same for each of them.

In [17]:

df = pd.concat((athletes_df,
                course_time_df,
                lap_times_df[['Fastest lap','FL Behind (s)']],
                shooting_df['Total shooting'],
                range_df['Total range']), axis=1)

This is the data for the top 5:¶

- Rank: what it says on the tin, the finishing position  
- Bib: bib number  
- Name, Nationality  
- T: Missed targets (out of 20)  
- Time: Total Time  
- Behind: Time behind the winner in seconds  
- Course Time: Time excluding shooting range and penalty loops  
- CT Behind: Time behind the fastest Course Time in seconds 
- Fastest lap: Fastest lap of each athlete, excluding shooting range and penalty loops  
- FL Behind (s): Athlete's fastest lap compared to the best lap overall (seconds)  
- Total shooting: Total shooting time in seconds  
- Total range: Total time spent on the range in seconds
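
The relationship between these columns can be checked on the numbers printed earlier for Marketa Davidova: course time, range time and penalty time should add up to the total time. A small worked check, with the values copied from her `sublist[9]` entry above (`to_s` is just a local helper for this example):

```python
def to_s(value):
    """'M:SS.s' or 'SS.s' -> seconds."""
    *minutes, seconds = value.split(':')
    return (float(minutes[0]) * 60 if minutes else 0.0) + float(seconds)


course_time = to_s('35:38.1')    # last time on the 'Course Time' line
range_time = to_s('3:27.5')      # last time on the 'Range Time' line
penalty_time = to_s('1:38.5')    # last time on the 'Penalty Time' line
total = course_time + range_time + penalty_time

# Compare with the result time 40:44.1, rounding away float noise.
print(round(total, 1), round(to_s('40:44.1'), 1))
```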

In [18]:

df.head()

Out[18]:

	Rank	Bib	Name	Nat	T	Time	Behind	Course Time	CT Behind	Fastest lap	FL Behind (s)	Total shooting	Total range
0	1	15	SIMON Julia	FRA	3	40:11.1	0.0	35:34.3	11.9	6:42.1	8.6	97.6	181.0
1	2	6	PREUSS Franziska	GER	2	40:15.0	3.9	35:49.4	27.0	6:43.3	9.8	103.7	191.5
2	3	3	OEBERG Hanna	SWE	3	40:22.8	11.7	35:42.0	19.6	6:38.9	5.4	101.0	184.2
3	4	13	TANDREVOLD Ingrid Landmark	NOR	2	40:24.1	13.0	35:44.5	22.1	6:46.3	12.8	121.4	203.2
4	5	20	BRORSSON Mona	SWE	1	40:26.0	14.9	36:21.1	58.7	6:49.5	16.0	112.1	197.8

Best shooters, 2 or fewer targets missed:

In [19]:

df[df['T']<=2].sort_values('T')

Out[19]:

	Rank	Bib	Name	Nat	T	Time	Behind	Course Time	CT Behind	Fastest lap	FL Behind (s)	Total shooting	Total range
4	5	20	BRORSSON Mona	SWE	1	40:26.0	14.9	36:21.1	58.7	6:49.5	16.0	112.1	197.8
1	2	6	PREUSS Franziska	GER	2	40:15.0	3.9	35:49.4	27.0	6:43.3	9.8	103.7	191.5
3	4	13	TANDREVOLD Ingrid Landmark	NOR	2	40:24.1	13.0	35:44.5	22.1	6:46.3	12.8	121.4	203.2
5	6	5	WIERER Dorothea	ITA	2	40:39.9	28.8	36:24.0	61.6	7:5.3	31.8	92.1	179.1
17	18	22	HETTICH Janina	GER	2	41:25.5	74.4	36:36.5	74.1	7:7.3	33.8	132.1	213.7
23	24	14	KNOTTEN Karoline Offigstad	NOR	2	42:25.9	134.8	37:47.9	145.5	7:14.1	40.6	105.3	193.1

Fastest shooters. Notice that D. Wierer had the fastest shooting time and was also one of the most accurate!

In [20]:

df.sort_values('Total shooting').head()

Out[20]:

	Rank	Bib	Name	Nat	T	Time	Behind	Course Time	CT Behind	Fastest lap	FL Behind (s)	Total shooting	Total range
5	6	5	WIERER Dorothea	ITA	2	40:39.9	28.8	36:24.0	61.6	7:5.3	31.8	92.1	179.1
0	1	15	SIMON Julia	FRA	3	40:11.1	0.0	35:34.3	11.9	6:42.1	8.6	97.6	181.0
2	3	3	OEBERG Hanna	SWE	3	40:22.8	11.7	35:42.0	19.6	6:38.9	5.4	101.0	184.2
1	2	6	PREUSS Franziska	GER	2	40:15.0	3.9	35:49.4	27.0	6:43.3	9.8	103.7	191.5
6	7	1	ROEISELAND Marte Olsbu	NOR	4	40:41.4	30.3	35:33.2	10.8	6:46.5	13.0	105.0	187.1

Athletes who spent the least time on the range.
This differs from the pure shooting time as it also includes approaching the targets,
getting the breath under control, etc.
But yes, it is the same 5 ladies.

In [21]:

df.sort_values('Total range').head()

Out[21]:

	Rank	Bib	Name	Nat	T	Time	Behind	Course Time	CT Behind	Fastest lap	FL Behind (s)	Total shooting	Total range
5	6	5	WIERER Dorothea	ITA	2	40:39.9	28.8	36:24.0	61.6	7:5.3	31.8	92.1	179.1
0	1	15	SIMON Julia	FRA	3	40:11.1	0.0	35:34.3	11.9	6:42.1	8.6	97.6	181.0
2	3	3	OEBERG Hanna	SWE	3	40:22.8	11.7	35:42.0	19.6	6:38.9	5.4	101.0	184.2
6	7	1	ROEISELAND Marte Olsbu	NOR	4	40:41.4	30.3	35:33.2	10.8	6:46.5	13.0	105.0	187.1
1	2	6	PREUSS Franziska	GER	2	40:15.0	3.9	35:49.4	27.0	6:43.3	9.8	103.7	191.5

Fastest skiers overall.

In [22]:

df.sort_values('CT Behind').head()

Out[22]:

	Rank	Bib	Name	Nat	T	Time	Behind	Course Time	CT Behind	Fastest lap	FL Behind (s)	Total shooting	Total range
14	15	11	HERRMANN Denise	GER	4	40:56.5	45.4	35:22.4	0.0	6:43.3	9.8	127.8	210.4
8	9	12	BRAISAZ-BOUCHET Justine	FRA	4	40:43.7	32.6	35:23.2	0.8	6:45.1	11.6	120.6	205.3
10	11	25	MIRONOVA Svetlana	RUS	4	40:44.4	33.3	35:26.9	4.5	6:33.5	0.0	115.5	201.2
7	8	2	ECKHOFF Tiril	NOR	4	40:43.2	32.1	35:30.3	7.9	6:37.6	4.1	112.7	195.8
6	7	1	ROEISELAND Marte Olsbu	NOR	4	40:41.4	30.3	35:33.2	10.8	6:46.5	13.0	105.0	187.1

Comparing the fastest laps. Go team Czech Republic, woohoo!!!

In [23]:

df.sort_values('FL Behind (s)').head()

Out[23]:

	Rank	Bib	Name	Nat	T	Time	Behind	Course Time	CT Behind	Fastest lap	FL Behind (s)	Total shooting	Total range
10	11	25	MIRONOVA Svetlana	RUS	4	40:44.4	33.3	35:26.9	4.5	6:33.5	0.0	115.5	201.2
9	10	9	DAVIDOVA Marketa	CZE	3	40:44.1	33.0	35:38.1	15.7	6:36.5	3.0	124.5	207.5
7	8	2	ECKHOFF Tiril	NOR	4	40:43.2	32.1	35:30.3	7.9	6:37.6	4.1	112.7	195.8
2	3	3	OEBERG Hanna	SWE	3	40:22.8	11.7	35:42.0	19.6	6:38.9	5.4	101.0	184.2
0	1	15	SIMON Julia	FRA	3	40:11.1	0.0	35:34.3	11.9	6:42.1	8.6	97.6	181.0

How many athletes per nationality in the race.

In [24]:

df['Nat'].value_counts()

Out[24]:

GER    4
SWE    4
FRA    4
NOR    4
AUT    3
ITA    2
RUS    2
BLR    2
CAN    1
USA    1
EST    1
CZE    1
KOR    1
Name: Nat, dtype: int64

Now let's get visual¶

I learned to love distribution plots as they show so much more than a simple mean value.

This first one shows how much time it took athletes to fire their 5 shots at each of the 4 shootings.¶

In [25]:

plt.figure(figsize=(12,7))

for column in shooting_df.columns.drop('Total shooting'): #each column except for the total
    ax = sns.distplot(shooting_df[column], label=column, axlabel='Time (s)', kde_kws={'lw':4},
                     hist_kws={'alpha':0.3,'label':column})

ax.set_title('Time per shoot')
ax.set(xlim=(min(shooting_times)-0.5, max(shooting_times)+0.5))
plt.legend()

Out[25]:

<matplotlib.legend.Legend at 0x7eff4434acd0>

Now the same for the time spent at the range¶

In [26]:

plt.figure(figsize=(12,7))

for column in range_df.columns.drop('Total range'): #each column except for the total
    ax = sns.distplot(range_df[column], label=column, axlabel='Time (s)', kde_kws={'lw':4},
                     hist_kws={'alpha':0.3,'label':column})
    
ax.set_title('Time spent on range')
ax.set(xlim=(min(range_times)-0.5, max(range_times)+0.5))
plt.legend()

Out[26]:

<matplotlib.legend.Legend at 0x7eff443292b0>

Targets missed at each shooting¶

Notice that shootings 3 and 4 are distributed slightly more towards the right, i.e. more misses,
compared to, say, the blue shoot 1 with mostly 0 and 1 misses.

In [27]:

plt.figure(figsize=(12,7))
for column in missed_targets_df.columns:
    ax=sns.distplot(missed_targets_df[column], axlabel='Targets missed', kde_kws={'lw':4},
                     hist_kws={'alpha':0.3,'label':column})

ax.set_title('Distribution of missed targets per shoot')
ax.set_xlim(-0.5,5.5)
plt.legend()

Out[27]:

<matplotlib.legend.Legend at 0x7eff43da2cd0>
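
Since the number of misses only takes whole values from 0 to 5, a plain count per shooting is a useful companion to the density curves. The sketch below uses only two of the 'Shooting' lines printed earlier (Simon and Davidova) rather than the full field; `misses_per_shoot` is a hypothetical helper, and the indices 1, 5, 9, 13 are the ones used in cell 16.

```python
from collections import Counter

shooting_lines = [
    'Shooting 0 24.6 +1.7 4 1 29.7 +6.5 10 2 21.9 +2.3 3 0 21.4 0.0 1 3 1:37.8 +5.5 2',
    'Shooting 0 27.9 +5.0 12 0 31.7 +8.5 15 1 31.7 +12.1 27 2 33.2 +11.8 24 3 2:04.7 +32.4 21',
]


def misses_per_shoot(lines, indices=(1, 5, 9, 13)):
    """Return one Counter per shooting: {number of misses: number of athletes}."""
    counters = [Counter() for _ in indices]
    for line in lines:
        tokens = line.split()
        for counter, i in zip(counters, indices):
            counter[int(tokens[i])] += 1
    return counters


for n, counter in enumerate(misses_per_shoot(shooting_lines), start=1):
    print(f'Shoot {n}:', dict(sorted(counter.items())))
```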

Ski speed¶

And finally let's look at the distribution of time per lap.
Here you can see the narrow blue curve of lap 1, as the athletes were mostly together.
Then the curves progressively flatten as differences in physical condition and stamina start to show.
And finally the purple lap 5 is clearly the fastest lap of them all, as each athlete pushed hard towards the finish.

In [28]:

plt.figure(figsize=(12,7))

for column in lap_times_df.columns.drop(['Fastest lap','FL Behind (s)']): #only the five lap columns
    ax = sns.distplot(lap_times_df[column], label=column,
                     axlabel='Time (mm:ss)', kde_kws={'lw':4},
                     hist_kws={'alpha':0.3,'label':column})

ax.set(xlim=(400,485))
ax.set_title('Time distribution per lap (Excl. shooting)')
xticks = ax.get_xticks()
#convert x ticks to m:ss.s, zero-padding the seconds (420 -> 7:00.0, not 7:0.0)
xticks = [f'{int(tick // 60)}:{tick % 60:04.1f}' for tick in xticks]
ax.set_xticklabels(xticks)
    
plt.legend()

Out[28]:

<matplotlib.legend.Legend at 0x7eff44265400>

And what a finish that was! You can watch it on the official IBU TV here: https://www.eurovisionsports.tv/ibu#AO7BBV3KHP

Data Potato

Friday, 22 January 2021