`fighter.df <- readRDS("~/ADM/fighters.RDS")`

I pulled this data from ufcstats.com using rvest and the chrome SelectorGadget extension. See fighter_scrape.R for the code. The data set only contains fighters that have fought in the UFC, which is generally considered the premier mixed martial arts promotion in the world. The data was scraped on Sunday May 5th, and thus does not contain any changes that have been posted after that point (the last event that was in the dataset was Fight Night 151 Cowboy vs Iaquinta).

I collected 24 columns of information - all thirteen stats formally listed on the ufcstats fighter pages, as well as their wins and losses from their record, total fights, win ratio, the number of fights they fought in the UFC, ufc wins, losses, and win ratio, and finally number of UFC wins by Knockout or Technical Knockout, number by submission, and number by Decision. See any fighter page for abbreviated column heading descriptions. The data initially included 3256 fighters, however that is trimmed down to 1180 when removing incomplete cases. For all of my analysis this will be further trimmed down to 793 fighters who have 5 or more fights in the UFC. One small problem is that I am not able to tell if a fighter is male or female based on the data (though if they are in a weight class of 155lb or higher they are male, and if they are in the 115lb weight class they are female - shared weight classes are 125,35,45).

`dim(fighter.df)`

`## [1] 1180 24`

```
ufc.df <- fighter.df[fighter.df$nUfcFights>=5,]
nrow(ufc.df)
```

`## [1] 793`

```
weight_classes <- ufc.df %>%
group_by(weight)%>%
summarise(count= length(weight))
weight_classes
```

```
## # A tibble: 9 x 2
## weight count
## <dbl> <int>
## 1 115 22
## 2 125 46
## 3 135 93
## 4 145 81
## 5 155 151
## 6 170 150
## 7 185 109
## 8 205 78
## 9 265 63
```

Interestingly, the 155lb and 170lb weight classes are vastly larger than any others. I would attribute this to two causes: men of average height are most likely to fall in this weight range, and fighers will fight in the lowest weight class they can to try to have a size advantage on their opponent.

As somewhat of an afterthought to this analysis (I’ve done this after pretty much everything else is done), I realized it might be interesting to record reach advantage compared to weight class average rather than pure reach. Another way to approach this could be reach minus height (also known as ape index). However, I prefer to compare reach to weight class average.

```
average.reach <- ufc.df %>%
group_by(weight) %>%
summarize(weight.reach = mean(reach))
average.height <- ufc.df %>%
group_by(weight) %>%
summarize(weight.reach = mean(height))
average.reach
```

```
## # A tibble: 9 x 2
## weight weight.reach
## <dbl> <dbl>
## 1 115 63.9
## 2 125 66.4
## 3 135 68.3
## 4 145 70.2
## 5 155 71.2
## 6 170 73.2
## 7 185 74.7
## 8 205 75.9
## 9 265 77.4
```

`average.height`

```
## # A tibble: 9 x 2
## weight weight.reach
## <dbl> <dbl>
## 1 115 63.6
## 2 125 65.8
## 3 135 66.8
## 4 145 68.6
## 5 155 69.5
## 6 170 71.3
## 7 185 72.8
## 8 205 73.7
## 9 265 74.9
```

```
avg.reach <- data.matrix(average.reach)
rownames(avg.reach) <- t(average.reach[,1])
indiv.reach <- apply(data.matrix(ufc.df$weight),1,function(x) avg.reach[which(x==avg.reach[,1]),2])
head(indiv.reach)
```

`## [1] 77.36508 74.74312 63.90909 74.74312 73.22000 68.27957`

`ufc.df$reach.diff <- ufc.df$reach-indiv.reach`

As expected, the most dominant fighter of all time, Jon Jones, has the longest reach differential, 8.1 inches!!!

`ufc.df$name[which.max(ufc.df$reach.diff)]`

`## [1] "Jon Jones"`

`ufc.df$reach[which.max(ufc.df$reach.diff)]`

`## [1] 84`

`ufc.df$reach.diff[which.max(ufc.df$reach.diff)]`

`## [1] 8.102564`

winningness by reach a large reach advantage?

```
large.reach <- ufc.df$reach.diff > 2
summary(lm(overallWinRatio ~ large.reach ,data=ufc.df))
```

```
##
## Call:
## lm(formula = overallWinRatio ~ large.reach, data = ufc.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36633 -0.06059 -0.00204 0.06145 0.28367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.716330 0.003706 193.306 <2e-16 ***
## large.reachTRUE 0.011775 0.008436 1.396 0.163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09375 on 791 degrees of freedom
## Multiple R-squared: 0.002457, Adjusted R-squared: 0.001196
## F-statistic: 1.948 on 1 and 791 DF, p-value: 0.1632
```

Not really…

Let’s go ahead and do PCA with all our numeric responses and see why this is not ideal. First we’ll remove non-numeric columns and scale the data.

`names(ufc.df)`

```
## [1] "name" "wins" "losses"
## [4] "totalfights" "overallWinRatio" "height"
## [7] "weight" "reach" "stance"
## [10] "dob" "slpm" "str.acc."
## [13] "sapm" "str.def" "tdavg."
## [16] "tdacc." "tddef." "sub.avg."
## [19] "nUfcFights" "ufcWins" "ufcWinRatio"
## [22] "byKO" "bySub" "byDec"
```

```
ufc.df1 <- ufc.df[,c(-1,-9)]
ufc.df1 <- data.frame(scale(ufc.df1))
rownames(ufc.df1) <- ufc.df$name
```

```
ufc.pca <- prcomp(ufc.df1)
fviz_pca_var(ufc.pca)
```

So what’s wrong here? Well, a lot of the constructed variables are non-unique, that is they share some information with other columns. For instance number of UFC wins is included in total wins, which is included in total fights, which is included in overallWinRatio. Another example is that height, reach, and weight are all closely related. So, lets trim down the columns to avoid this. I will remove wins and losses, overallWinRatio, weight, height & reach,, and make byKO, bySub and byDec percentages relative to total ufc wins. I will also remove nUfcFights, ufcWins, and ufcWinRatio. Hopefuly in doing so we can get groupings that are true to the fighters fighting style rather than influenced by winningness, or weight class.

`names(ufc.df)`

```
## [1] "name" "wins" "losses"
## [4] "totalfights" "overallWinRatio" "height"
## [7] "weight" "reach" "stance"
## [10] "dob" "slpm" "str.acc."
## [13] "sapm" "str.def" "tdavg."
## [16] "tdacc." "tddef." "sub.avg."
## [19] "nUfcFights" "ufcWins" "ufcWinRatio"
## [22] "byKO" "bySub" "byDec"
## [25] "reach.diff"
```

```
ufc.df2 <- ufc.df[,c(-1,-4,-5,-7:-9)]
ufc.df2$byKO <- ufc.df2$byKO / ufc.df2$ufcWins
ufc.df2$bySub <- ufc.df2$bySub / ufc.df2$ufcWins
ufc.df2$byDec <- ufc.df2$byDec / ufc.df2$ufcWins
ufc.df2 <- ufc.df2[,c(-13:-15)]
ufc.df2$totalfights <- ufc.df2$wins+ufc.df2$losses
ufc.df2 <- ufc.df2[c(-1,-2,-4,-14)]
names(ufc.df2)
```

```
## [1] "height" "slpm" "str.acc." "sapm" "str.def"
## [6] "tdavg." "tdacc." "tddef." "sub.avg." "byKO"
## [11] "byDec" "reach.diff" "totalfights"
```

```
ufc.df2 <- ufc.df2[,-1]
ufc.df2 <- data.frame(scale(ufc.df2))
rownames(ufc.df2) <- ufc.df$name
```

Now let’s do PCA on this reduced dataset with less overlapping data.

```
ufc.pca <- prcomp(ufc.df2)
fviz_pca_var(ufc.pca)
```

As a domain expert, the two directions I would pull out of this plot follow the axes in this case (that’s nice!). This is a little bit hard to tell because some of the variables contribute in two directions. This will be explained below. The x-axis represents preferred method of fighting, with stand-up fighters on the right, represented bystrikes landed and absorbed, wins by KO, take down defence in this direction. In the negative x-direction is ground fighters, represented by wins by submission and takedown. On the y-axis is how fighters win, by finish (a finish is KO/TKO or submission) in the positive y-direction, represented again by submission average, percent wins by KO, strike accuracy, and interestingly reach differential (which does make sense). In the negative y-direction are fighters than win by decision prominently, and those with good take down and strike defence, as well as those with many takedowns.

```
rots <- ufc.pca$rotation
ufc.pca1<- as.matrix(ufc.df2) %*% rots
ufc.pca1 <- data.frame(ufc.pca1)
dim(ufc.pca1)
```

`## [1] 793 12`

```
pca1 <- ufc.pca1[,1]
pca2 <- ufc.pca1[,2]
pca3 <- ufc.pca1[,3]
```

```
k.max <- 15
wss <- sapply(1:k.max, function(x){ kmeans(ufc.df2,x,nstart=30,iter.max=20)$tot.withinss } )
data.frame(x=1:k.max,y=wss)%>%
ggplot(aes(x=x,y=y))+geom_point()+geom_line()
```

`K <- 4`

In my opinion the optimal number of groups looks to be between 4 and 8. We will go with 4 groups to be conservative.

```
ufc.km <- kmeans(ufc.df2,K,iter.max=20,nstart=30)
ufc.km1 <- kmeans(ufc.pca1,K,iter.max=20,nstart=30)
table(ufc.km$cluster,ufc.km1$cluster)
```

```
##
## 1 2 3 4
## 1 0 0 0 195
## 2 0 0 244 0
## 3 197 0 0 0
## 4 0 157 0 0
```

With and without PCA the two clusterings are in total agreement. Perfect!

```
ufc.df2$cluster.km<- ufc.km$cluster
ufcCluster.df <- data.frame(pca1,pca2,cluster=factor(ufc.df2$cluster.km),name=ufc.df$name,weight=as.factor(ifelse(ufc.df$weight<155,1,ifelse(ufc.df$weight<205,2,3))))
```

Here is a plot of all fighters based on these directions, with stand-outs names marked.

```
gp2 <- ggplot(ufcCluster.df,aes(pca1,pca2,color=cluster,shape=weight))+
geom_point(size=1)+
#geom_text_repel(aes(label=name),size=3)+
geom_text_repel(data = subset(ufcCluster.df, pca1 < -4), aes(label = name))+
geom_text_repel(data = subset(ufcCluster.df, pca1 > 4), aes(label = name))+
geom_text_repel(data = subset(ufcCluster.df, pca2 > 4.25), aes(label = name))+
geom_text_repel(data = subset(ufcCluster.df, pca2 < -3.5), aes(label = name))+
ggtitle("UFC Cluster Analsys and PCA")
gp2
```

Let’s facet wrap by weight and see if we really did achieve a model that is invariant of weight classes.

```
ufcCluster.df$weight <- ufc.df$weight
ggplot(ufcCluster.df,aes(pca1,pca2,color=cluster))+
geom_point(size=1)+
facet_wrap(~ weight, ncol=3)+
ggtitle("UFC Cluster Analsys and PCA")
```

Looks like there is still some bias between weight classes, but perhaps this is to be expected. There tend to be more knockout punchers in heavier weight classes which makes sense because weight tends to transfer well to one punch knockout power in MMA, with Derrick “The Black Beast” Lewis being a very good example of this.

Doing the K-means cluster analysis using PCA, we can see that the data relatively splits along the four quadrants, where, in order of quadrants we have (1) hard-hitting knockout artists, (2) grapplers and submission artists, (3) wrestlers who dominate on top for decision victories and (4) evasive strikers who go to decision. Let’s plot this with top 10 fighters and other notables in each division to see if this theory continues to hold.

```
notable_fighters <- read.csv("~/ADM/notable_fighters.txt")
notable_fighters <- apply(notable_fighters,2,as.character)
nf.df <- ufc.df2[notable_fighters,]
indices <- as.numeric(lapply(notable_fighters, function(n) which(ufc.df$name==n))) %>% na.omit
nf.df <- nf.df[complete.cases(nf.df),]
nrow(nf.df)
```

`## [1] 70`

The fighters whose names are in the notable fighters documents are more well known recent fighters (particularly who I am familiar with so that I can interpret this more accurately). They are all currently or formerly ranked in the top 10 of their weight division. This set includes all division champions and most top 5 with a few others thrown in. Below let’s try reclustering this group on it’s own, and seeing how this compares to how they were clustered with all the others.

```
nf.df$pca1 <- pca1[indices]
nf.df$pca2 <- pca2[indices]
nf.df$name <- ufc.df$name[indices]
weights <- ufc.df[indices,]$weight
nf.df$weight <- weights
gp1 <- ggplot(nf.df,aes(pca1,pca2,color=as.factor(cluster.km)))+
geom_point(size=1)+
geom_text_repel(aes(label=name),size=3)+
guides(color=F)
gp1 + ggtitle("UFC Cluster Analysis and PCA Select Fighter Clustering")
```