fighter.df <- readRDS("~/ADM/fighters.RDS")

Scraping the data

I pulled this data from ufcstats.com using rvest and the chrome SelectorGadget extension. See fighter_scrape.R for the code. The data set only contains fighters that have fought in the UFC, which is generally considered the premier mixed martial arts promotion in the world. The data was scraped on Sunday May 5th, and thus does not contain any changes that have been posted after that point (the last event that was in the dataset was Fight Night 151 Cowboy vs Iaquinta).

Columns

I collected 24 columns of information - all thirteen stats formally listed on the ufcstats fighter pages, as well as their wins and losses from their record, total fights, win ratio, the number of fights they fought in the UFC, ufc wins, losses, and win ratio, and finally number of UFC wins by Knockout or Technical Knockout, number by submission, and number by Decision. See any fighter page for abbreviated column heading descriptions. The data initially included 3256 fighters, however that is trimmed down to 1180 when removing incomplete cases. For all of my analysis this will be further trimmed down to 793 fighters who have 5 or more fights in the UFC. One small problem is that I am not able to tell if a fighter is male or female based on the data (though if they are in a weight class of 155lb or higher they are male, and if they are in the 115lb weight class they are female - shared weight classes are 125,35,45).

Exploratory Data Analysis

dim(fighter.df)
## [1] 1180   24
ufc.df <- fighter.df[fighter.df$nUfcFights>=5,]
nrow(ufc.df)
## [1] 793
weight_classes <- ufc.df %>%
  group_by(weight)%>%
  summarise(count= length(weight))

weight_classes
## # A tibble: 9 x 2
##   weight count
##    <dbl> <int>
## 1    115    22
## 2    125    46
## 3    135    93
## 4    145    81
## 5    155   151
## 6    170   150
## 7    185   109
## 8    205    78
## 9    265    63

Interestingly, the 155lb and 170lb weight classes are vastly larger than any others. I would attribute this to two causes: men of average height are most likely to fall in this weight range, and fighers will fight in the lowest weight class they can to try to have a size advantage on their opponent.

As somewhat of an afterthought to this analysis (I’ve done this after pretty much everything else is done), I realized it might be interesting to record reach advantage compared to weight class average rather than pure reach. Another way to approach this could be reach minus height (also known as ape index). However, I prefer to compare reach to weight class average.

average.reach <- ufc.df %>%
  group_by(weight) %>%
  summarize(weight.reach = mean(reach))

average.height <- ufc.df %>%
  group_by(weight) %>%
  summarize(weight.reach = mean(height))

average.reach
## # A tibble: 9 x 2
##   weight weight.reach
##    <dbl>        <dbl>
## 1    115         63.9
## 2    125         66.4
## 3    135         68.3
## 4    145         70.2
## 5    155         71.2
## 6    170         73.2
## 7    185         74.7
## 8    205         75.9
## 9    265         77.4
average.height
## # A tibble: 9 x 2
##   weight weight.reach
##    <dbl>        <dbl>
## 1    115         63.6
## 2    125         65.8
## 3    135         66.8
## 4    145         68.6
## 5    155         69.5
## 6    170         71.3
## 7    185         72.8
## 8    205         73.7
## 9    265         74.9
avg.reach <- data.matrix(average.reach)
rownames(avg.reach) <- t(average.reach[,1])

indiv.reach <- apply(data.matrix(ufc.df$weight),1,function(x) avg.reach[which(x==avg.reach[,1]),2])
head(indiv.reach)
## [1] 77.36508 74.74312 63.90909 74.74312 73.22000 68.27957
ufc.df$reach.diff <- ufc.df$reach-indiv.reach

As expected, the most dominant fighter of all time, Jon Jones, has the longest reach differential, 8.1 inches!!!

ufc.df$name[which.max(ufc.df$reach.diff)]
## [1] "Jon Jones"
ufc.df$reach[which.max(ufc.df$reach.diff)]
## [1] 84
ufc.df$reach.diff[which.max(ufc.df$reach.diff)]
## [1] 8.102564

winningness by reach a large reach advantage?

large.reach <- ufc.df$reach.diff > 2
summary(lm(overallWinRatio ~ large.reach ,data=ufc.df))
## 
## Call:
## lm(formula = overallWinRatio ~ large.reach, data = ufc.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36633 -0.06059 -0.00204  0.06145  0.28367 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.716330   0.003706 193.306   <2e-16 ***
## large.reachTRUE 0.011775   0.008436   1.396    0.163    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09375 on 791 degrees of freedom
## Multiple R-squared:  0.002457,   Adjusted R-squared:  0.001196 
## F-statistic: 1.948 on 1 and 791 DF,  p-value: 0.1632

Not really…

PCA and K-Means Clustering

Let’s go ahead and do PCA with all our numeric responses and see why this is not ideal. First we’ll remove non-numeric columns and scale the data.

names(ufc.df)
##  [1] "name"            "wins"            "losses"         
##  [4] "totalfights"     "overallWinRatio" "height"         
##  [7] "weight"          "reach"           "stance"         
## [10] "dob"             "slpm"            "str.acc."       
## [13] "sapm"            "str.def"         "tdavg."         
## [16] "tdacc."          "tddef."          "sub.avg."       
## [19] "nUfcFights"      "ufcWins"         "ufcWinRatio"    
## [22] "byKO"            "bySub"           "byDec"
ufc.df1 <- ufc.df[,c(-1,-9)]
ufc.df1 <- data.frame(scale(ufc.df1))
rownames(ufc.df1) <- ufc.df$name
ufc.pca <- prcomp(ufc.df1)
fviz_pca_var(ufc.pca)

So what’s wrong here? Well, a lot of the constructed variables are non-unique, that is they share some information with other columns. For instance number of UFC wins is included in total wins, which is included in total fights, which is included in overallWinRatio. Another example is that height, reach, and weight are all closely related. So, lets trim down the columns to avoid this. I will remove wins and losses, overallWinRatio, weight, height & reach,, and make byKO, bySub and byDec percentages relative to total ufc wins. I will also remove nUfcFights, ufcWins, and ufcWinRatio. Hopefuly in doing so we can get groupings that are true to the fighters fighting style rather than influenced by winningness, or weight class.

names(ufc.df)
##  [1] "name"            "wins"            "losses"         
##  [4] "totalfights"     "overallWinRatio" "height"         
##  [7] "weight"          "reach"           "stance"         
## [10] "dob"             "slpm"            "str.acc."       
## [13] "sapm"            "str.def"         "tdavg."         
## [16] "tdacc."          "tddef."          "sub.avg."       
## [19] "nUfcFights"      "ufcWins"         "ufcWinRatio"    
## [22] "byKO"            "bySub"           "byDec"          
## [25] "reach.diff"
ufc.df2 <- ufc.df[,c(-1,-4,-5,-7:-9)]
ufc.df2$byKO <- ufc.df2$byKO / ufc.df2$ufcWins
ufc.df2$bySub <- ufc.df2$bySub / ufc.df2$ufcWins
ufc.df2$byDec <- ufc.df2$byDec / ufc.df2$ufcWins
ufc.df2 <- ufc.df2[,c(-13:-15)]
ufc.df2$totalfights <- ufc.df2$wins+ufc.df2$losses
ufc.df2 <- ufc.df2[c(-1,-2,-4,-14)]

names(ufc.df2)
##  [1] "height"      "slpm"        "str.acc."    "sapm"        "str.def"    
##  [6] "tdavg."      "tdacc."      "tddef."      "sub.avg."    "byKO"       
## [11] "byDec"       "reach.diff"  "totalfights"
ufc.df2 <- ufc.df2[,-1]
ufc.df2 <- data.frame(scale(ufc.df2))
rownames(ufc.df2) <- ufc.df$name

Now let’s do PCA on this reduced dataset with less overlapping data.

ufc.pca <- prcomp(ufc.df2)
fviz_pca_var(ufc.pca)

As a domain expert, the two directions I would pull out of this plot follow the axes in this case (that’s nice!). This is a little bit hard to tell because some of the variables contribute in two directions. This will be explained below. The x-axis represents preferred method of fighting, with stand-up fighters on the right, represented bystrikes landed and absorbed, wins by KO, take down defence in this direction. In the negative x-direction is ground fighters, represented by wins by submission and takedown. On the y-axis is how fighters win, by finish (a finish is KO/TKO or submission) in the positive y-direction, represented again by submission average, percent wins by KO, strike accuracy, and interestingly reach differential (which does make sense). In the negative y-direction are fighters than win by decision prominently, and those with good take down and strike defence, as well as those with many takedowns.

rots <- ufc.pca$rotation
ufc.pca1<- as.matrix(ufc.df2) %*% rots
ufc.pca1 <- data.frame(ufc.pca1)
dim(ufc.pca1)
## [1] 793  12
pca1 <- ufc.pca1[,1]
pca2 <- ufc.pca1[,2]
pca3 <- ufc.pca1[,3]
k.max <- 15
wss <- sapply(1:k.max, function(x){ kmeans(ufc.df2,x,nstart=30,iter.max=20)$tot.withinss } )
data.frame(x=1:k.max,y=wss)%>%
  ggplot(aes(x=x,y=y))+geom_point()+geom_line()

K <- 4

In my opinion the optimal number of groups looks to be between 4 and 8. We will go with 4 groups to be conservative.

ufc.km <- kmeans(ufc.df2,K,iter.max=20,nstart=30)
ufc.km1 <- kmeans(ufc.pca1,K,iter.max=20,nstart=30)
table(ufc.km$cluster,ufc.km1$cluster)
##    
##       1   2   3   4
##   1   0   0   0 195
##   2   0   0 244   0
##   3 197   0   0   0
##   4   0 157   0   0

With and without PCA the two clusterings are in total agreement. Perfect!

ufc.df2$cluster.km<- ufc.km$cluster
ufcCluster.df <- data.frame(pca1,pca2,cluster=factor(ufc.df2$cluster.km),name=ufc.df$name,weight=as.factor(ifelse(ufc.df$weight<155,1,ifelse(ufc.df$weight<205,2,3))))

Here is a plot of all fighters based on these directions, with stand-outs names marked.

gp2 <- ggplot(ufcCluster.df,aes(pca1,pca2,color=cluster,shape=weight))+
  geom_point(size=1)+
  #geom_text_repel(aes(label=name),size=3)+
  geom_text_repel(data = subset(ufcCluster.df, pca1 < -4), aes(label = name))+
  geom_text_repel(data = subset(ufcCluster.df, pca1 > 4), aes(label = name))+
  geom_text_repel(data = subset(ufcCluster.df, pca2 > 4.25), aes(label = name))+
  geom_text_repel(data = subset(ufcCluster.df, pca2 < -3.5), aes(label = name))+
  ggtitle("UFC Cluster Analsys and PCA")

gp2

Let’s facet wrap by weight and see if we really did achieve a model that is invariant of weight classes.

ufcCluster.df$weight <- ufc.df$weight

ggplot(ufcCluster.df,aes(pca1,pca2,color=cluster))+
  geom_point(size=1)+
  facet_wrap(~ weight, ncol=3)+
  ggtitle("UFC Cluster Analsys and PCA")

Looks like there is still some bias between weight classes, but perhaps this is to be expected. There tend to be more knockout punchers in heavier weight classes which makes sense because weight tends to transfer well to one punch knockout power in MMA, with Derrick “The Black Beast” Lewis being a very good example of this.

Doing the K-means cluster analysis using PCA, we can see that the data relatively splits along the four quadrants, where, in order of quadrants we have (1) hard-hitting knockout artists, (2) grapplers and submission artists, (3) wrestlers who dominate on top for decision victories and (4) evasive strikers who go to decision. Let’s plot this with top 10 fighters and other notables in each division to see if this theory continues to hold.

notable_fighters <- read.csv("~/ADM/notable_fighters.txt")
notable_fighters <- apply(notable_fighters,2,as.character)

nf.df <- ufc.df2[notable_fighters,]

indices <- as.numeric(lapply(notable_fighters, function(n) which(ufc.df$name==n))) %>% na.omit

nf.df <- nf.df[complete.cases(nf.df),]
nrow(nf.df)
## [1] 70

The fighters whose names are in the notable fighters documents are more well known recent fighters (particularly who I am familiar with so that I can interpret this more accurately). They are all currently or formerly ranked in the top 10 of their weight division. This set includes all division champions and most top 5 with a few others thrown in. Below let’s try reclustering this group on it’s own, and seeing how this compares to how they were clustered with all the others.

nf.df$pca1 <- pca1[indices]
nf.df$pca2 <- pca2[indices]

nf.df$name <- ufc.df$name[indices]
weights <- ufc.df[indices,]$weight
nf.df$weight <- weights

gp1 <- ggplot(nf.df,aes(pca1,pca2,color=as.factor(cluster.km)))+
  geom_point(size=1)+
  geom_text_repel(aes(label=name),size=3)+
  guides(color=F)

gp1 + ggtitle("UFC Cluster Analysis and PCA Select Fighter Clustering")