On advanced stats in soccer, part 1: Navigating the soccer data desert

Photo: Daniel Gajdamowicz

I love sports. I love playing sports. I love watching sports. I love talking sports. I love reading about sports. I love the emotional highs (and even the emotional lows!) of being a sports fan.

I am a nerd. I love playing with numbers. I love playing games that require constantly analyzing the chances of certain events occurring. I love digging below the surface and doing deep information dives.

When these two sides collide, you get, well, a sports fan who views everything through the lens of the numbers. As you may have seen in my byline, I am normally a basketball writer, focusing heavily on using statistics and advanced statistics to find trends and make predictions. I began writing this article to use data to explain what happened to the Union last season, but found that it’s not such an easy task. 

The Current State of Advanced Statistics in Soccer

The lack of statistics, especially publicly available ones, in soccer is really, really depressing for stat geeks like myself. I fired an email over to Chris Sherman, PSP’s resident-in-stats, asking if there was some hidden site I was missing, some treasure trove of data, and, well, nope. It just doesn’t really exist, at least not publicly.

Soccer is not a sport that lends itself particularly easily to statistics. There are a ton of stats that other statisticians and I are just used to having that simply don’t exist in the soccer world. In baseball, I can find literally everything I want to know about a pitcher or a pitch going years back: speed, spin, movement, results, everything in just a few clicks. Sites like Fangraphs and Baseball Reference provide incredibly detailed searchable and sortable data. In basketball, the league itself provides an absolutely massive amount of data (click on “General” and “Traditional” to see all the different stat types they have), while sites like Basketball Reference and NBAWowy provide other ways to sort and view data.

In soccer… it’s quite frankly pathetic in comparison. We may be getting a Soccer Reference someday, but right now, it’s just a work in progress. There’s a paltry Stats tab, and AmericanSoccerAnalysis is about the only other site with any kind of real data, but they provide “analyzed” data, not raw data, which means it’s difficult to fact-check and fairly useless for anything other than reading their articles. Their main contribution to the soccer stat world right now is their publicly available expected goals (xG) model.

xG assigns a value to every shot, then adds those values together to produce a team’s xG total. The problem is that “expected goals” right now is basically useless.
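To make the “adds those values together” step concrete, here is a minimal sketch in Python. The per-shot values are invented for illustration and do not come from Opta’s or ASA’s models; the function name is hypothetical too.

```python
# Hypothetical per-shot values for one team in one match.
# These numbers are made up for illustration only.
shot_values = [0.76, 0.31, 0.08, 0.05, 0.03, 0.02]


def expected_goals(values):
    """Add the per-shot values together to get a team's xG."""
    return sum(values)


match_xg = expected_goals(shot_values)
print(f"{len(shot_values)} shots, xG = {match_xg:.2f}")
```

Everything interesting (and everything disputed) lives in how each per-shot value is assigned; the summing itself is trivial.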

For one, there’s no actual “expected goals” formula. Every site calculates it differently. Just compare Opta’s xG to ASA’s xG from last season. They may as well be two different stats. And we don’t really know how either is calculated! As of two years ago, you could use a blanket coefficient of .095 (just under 1 shot out of 10 is expected to result in a goal) for European leagues and it was basically just as effective as complicated models. MLS’ league-wide rate has consistently held around 1.1 goals per 10 shots, or a coefficient of about .11 (although because MLS stats are weak, I’m not sure it was quite apples to apples). While the above expected goals models would come to the conclusion that the Union had a significantly worse goal differential than expected last year, the flat method comes to the conclusion that the Union had only a slightly worse goal differential than expected, with the biggest difference coming from xG-allowed.
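The flat method is simple enough to sketch in a few lines of Python. This assumes the MLS figure of “around 1.1” means 1.1 goals per 10 shots (a coefficient of .11), and the season totals below are hypothetical, not the Union’s real numbers.

```python
# Flat "blanket coefficient" xG: every shot is worth the same amount.
EUROPE_RATE = 0.095  # coefficient quoted above for European leagues
MLS_RATE = 0.11      # assumption: "around 1.1" read as 1.1 goals per 10 shots


def flat_xg(shots, rate):
    """Expected goals if every shot carries the same value."""
    return shots * rate


def goal_diff_vs_expected(goals_for, goals_against,
                          shots_for, shots_against, rate):
    """Actual goal differential minus the flat-model expected differential."""
    expected_diff = flat_xg(shots_for, rate) - flat_xg(shots_against, rate)
    return (goals_for - goals_against) - expected_diff


# Hypothetical season totals, for illustration only.
result = goal_diff_vs_expected(goals_for=50, goals_against=55,
                               shots_for=480, shots_against=450,
                               rate=MLS_RATE)
print(f"goal differential vs. expected: {result:+.1f}")
```

A negative result means the team’s goal differential was worse than its shot counts alone would suggest.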

But there’s even another problem: the stat is entirely based on shots alone! As Union fans, you may have noticed that Andre Blake is extremely aggressive about getting to balls before a shot can be fired. Aggressive goalies are going to make a team always underperform xG, because they are both more likely to prevent shots and more likely to give up easy shots if they don’t get to the ball in time. Which is how the goalie of the year actually ends up dead last in xG allowed minus goals allowed (basically how many fewer goals than “expected” were allowed, so a negative number means more goals than expected).

It also means that the statistic simply judges results, not process. If a team gets a point blank shot because an opposing goalie screws up, the xG goes way up, even if the team did absolutely nothing to create the chance. By measuring result instead of process, it’s difficult to use it as a reliable projection.

BUT WAIT, THERE’S MORE! On top of everything else, we’re dealing with fairly small sample sizes, especially in relation to goalies. 400-500 shots per team per season, with xG in the mid two digits. That means that a few bad misses or outrageous goals can majorly skew the data. What this ultimately reveals is that the usefulness of xG boils down to “you can be confident that, if your team is really far away from 1.1 goals scored/allowed per 10 shots, the rate will not keep up”, but that’s about it.
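To put a rough number on that skew, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every shot scores independently with the same probability (a binomial model), which real shots certainly do not.

```python
import math


def goals_spread(shots, rate):
    """Mean and standard deviation of goals under a simple binomial model:
    each shot scores independently with probability `rate`."""
    mean = shots * rate
    sd = math.sqrt(shots * rate * (1 - rate))
    return mean, sd


# 450 shots at the flat rate of 1.1 goals per 10 shots.
mean, sd = goals_spread(shots=450, rate=0.11)
print(f"expected goals: {mean:.1f}, one standard deviation: {sd:.1f}")
```

Under those assumptions a team expected to score about 50 goals can land six or seven goals either side of that on luck alone, which is why only large gaps from the league rate tell you much.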

And that’s basically it for advanced stats. Possession is literally just “passes completed,” which means that if you pass back and forth among the back line for a few minutes each game, you can win the possession battle by a lot! But you’re not doing anything useful. There was a fantastic article about how time of possession really should be calculated, but those numbers really aren’t available.
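Here is a small sketch of why a pass-based possession number is so easy to inflate. It assumes possession is simply each team’s share of completed passes, as described above; actual data providers may define it differently, and the pass counts are hypothetical.

```python
def possession_share(passes_a, passes_b):
    """Each team's share of all completed passes (assumed definition)."""
    total = passes_a + passes_b
    return passes_a / total, passes_b / total


# Hypothetical match: Team A completes 400 "real" passes, Team B 450.
without_padding, _ = possession_share(400, 450)

# Same match, but Team A adds 150 harmless passes along the back line.
with_padding, _ = possession_share(400 + 150, 450)

print(f"Team A possession: {without_padding:.0%} -> {with_padding:.0%}")
```

Team A goes from losing the possession battle to winning it without doing anything more dangerous with the ball.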

Why We Need More Advanced Statistics

This lack of stats manifests itself in a frustrating inability to analyze anything. Let me give a famous (well, in the statistical baseball community) example of how knowledge can be used and applied:

In 2010, a player by the name of Aaron Hill batted .205 for the season. He had never batted below .263 prior to that season, and he hasn’t batted below .230 since. In soccer, when there is a sharp change in stats (say, something like a striker forgetting how to score or suddenly scoring way more), there is no real analysis that can be applied other than watching and guessing. In baseball, there is data to explain such phenomena.

In Hill’s case, his “BABIP” (Batting Average on Balls in Play; the soccer equivalent would be “percentage of shots on target that score”) cratered from a previous low of .288 to .196, the lowest BABIP since at least 1945. That alone explained his drop in average. But you can dig even deeper to see why it cratered! “Hit types” (in soccer, think types of shots like header, PK, tap-in, shot from outside box, etc.) heavily influence the model, and his percentage of contact that resulted in high-BABIP hit types plummeted while contact that resulted in low-BABIP hit types soared. We can clearly see why his average sank.
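For readers who want the mechanics, here is a sketch of both calculations in Python. The BABIP function uses the standard formula, (H − HR) / (AB − K − HR + SF). The soccer function is a hypothetical analogue built on the “percentage of shots on target that score” idea, split by shot type, and the shot log is invented for illustration.

```python
from collections import defaultdict


def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting Average on Balls in Play."""
    return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)


def conversion_by_type(shots):
    """Share of shots on target that score, per shot type.

    `shots` is an iterable of (shot_type, on_target, scored) tuples.
    """
    on_target = defaultdict(int)
    goals = defaultdict(int)
    for shot_type, is_on_target, scored in shots:
        if is_on_target:
            on_target[shot_type] += 1
            goals[shot_type] += int(scored)
    return {t: goals[t] / on_target[t] for t in on_target}


# Hypothetical shot log, for illustration only.
shot_log = [
    ("header", True, False),
    ("header", True, True),
    ("outside box", True, False),
    ("outside box", False, False),  # off target, so not counted
    ("tap-in", True, True),
    ("outside box", True, False),
    ("outside box", True, True),
    ("outside box", True, False),
]

rates = conversion_by_type(shot_log)
for shot_type, rate in sorted(rates.items()):
    print(f"{shot_type}: {rate:.0%} of shots on target scored")
```

With a breakdown like this, a striker’s slump could be traced to a shift toward low-percentage shot types, the same way Hill’s was traced to hit types. The catch is that soccer fans rarely have the shot-level data to build it.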

At a team level in baseball, stats such as “cluster luck” can help explain why teams are over- or underperforming expectations, and provide predictive value because hit sequencing (the order in which hits or outs are made) is random. And there are decades of analyzed data that have determined the expected value of each unique event. Soccer’s xG simply does not come close to the tools available in baseball.

Soccer statistical analysis is stuck somewhere between primitive and non-existent. As a slight aside, it’s worth noting that Earnie Stewart was probably exposed to these concepts when he hired Oakland A’s GM Billy Beane for a short time while at AZ Alkmaar.

Soccer fans have very little data to work with, so we can’t really draw any type of solid conclusion about why a team is struggling or soaring beyond “they have good/bad players” or “they’re playing well/poorly.” I had to build a spreadsheet manually to do even cursory data analysis.

I wanted to take a look at last year to see if there was an easy explanation as to why the Union collapsed down the stretch, and it turns out there’s no way to figure that out with any kind of certainty — but I took my best shot. The accepted reason is that without Nogueira, the team simply could not create in attack and suffered in defense. Was that true or was there some other reason?

Find out in Part Two tomorrow!


  1. Good stuff Adam!

  2. Banterlytics says:

    On possession: it’s an unexplored area, so much work can be done, but the issue is how a data provider would act if someone shared interesting work built on scraped data.

  3. Andy Muenz says:

    One reason that the Union collapsed down the stretch last year was that the schedule caught up with them and was a lot more difficult in the second half.
    They played Chicago during the International break which meant no Blake or Bedoya. They played a strong Montreal team to a draw. They had the three game road trip to Portland, Toronto, and NJ. Finally, two home games against Orlando and NYRB (on a really long unbeaten streak). The only game there we were really “expecting” to win was Orlando. Other than that, the Union was under strength or playing against teams that were just better than they were.

  4. Old Soccer Coach says:

    Good stuff! I would flunk the quiz, but very good stuff.
    To state the obvious, but guys like me need that, baseball is a series of discrete, separate events, 125 of them if Lefty is pitching a complete game.
    So is American football.
    Basketball is a flow punctuated by stops, “feeling” to this humanities guy much less discrete than the other two.
    Soccer is also a punctuated flow, but I want to say it is less so than hoops. I cannot prove that.
    How long does a baseball player run continuously? I heard Paul Brown once say it was eight seconds for football. In soccer and basketball it is longer than that.
    Statistics I have heard described as part of discrete mathematics. Our math department had a course so titled. There was also something called Functions that studies phenomena which are continuous, given Newton’s brilliant finesse of division by zero. Yes, there was Leibniz and the Chinese guy whose name I never remember, too.
    Thank you Adam for teaching me.

    • Adam Schorr says:

      Soccer is most like hockey, but the subject of hockey advanced statistics is a whole ‘nother beast. Soccer and basketball are actually very similar, but it is easier to score in basketball and there are many more opportunities. I will say that I believe that many soccer teams could benefit by examining how basketball teams play. The games are played more similarly than they may appear.
