Mastering (fantasy) football analytics - Part 1: Fuzzy joining datasets with AWS Athena

A cartoon about football, by Tom Gauld.

Introduction

In the late hours of a Wednesday night in my dimly lit apartment, I find myself immersed in the gentle glow of my computer screen, its eerie blue hue casting a captivating spell. Like the rolling credits of an old film reel, rows of data gracefully scroll by. In the dwindling hours before the transfer period of my “Mon Petit Gazon (MPG)” football fantasy league concludes, I sit here, somewhat hunched over, as the rhythmic intervals of the command line terminal cast a fleeting dance of illumination across the room, each passing line bringing me a step closer to the algorithmic key to my decisive victory: I’m on a quest for the ultimate championship-worthy fantasy football roster. Every player’s transfer move, each pass, and every goal occupies my focus. This is more than just a game flashing across the screen; it’s an intricate tapestry of statistics, a trove of hidden insights waiting to be unearthed.

Not long ago, I concluded a transformative voyage that lasted two and a half years – a journey of crafting technology on the AWS platform at the service of my most beloved football club. I’ve learned a lot from these days, and I left with the ambition of bringing all this knowledge to the light of other clubs, fostering innovation on AWS in the sports industry. Some habits of my old days persist, however: I approach sports analytics with a zeal to secure victory, regardless of the stakes. My modus operandi remains unshakable – data is most definitely the key.


A slight digression, but have you ever wondered how much weight data & analytics actually hold in the multi-million euro decisions that govern European football transfers? The reality will probably disappoint you: data in (European) sports is widely fantasized about, and scenes of analysts grinding away at data all night long in the clubs’ backrooms, Moneyball-style, are more “cinematic” than “authentic”. In fact, the role of data in football in 2023 remains in its infancy, and despite their continental acclaim and stature, most European clubs are somewhat analogous to startup companies in terms of technical resources and reach: their exploration of the world of data remains cautious at best, held back by constraints, a lack of institutional conviction, or an inability to scale up their processes.

This brings us back to my initial contention: envision yourself, a data enthusiast, in control of your very own football club and entrusted with assembling a player roster within the confines of a 500M€ budget. Whom would you gamble on? Upon which metrics would you anchor your decisive choices? But more importantly, where would your data journey start?


This puzzle occupies the thoughts of many of the sports club officials I collaborate with on a daily basis, and it is where this blog post comes into the picture.

“Mon Petit Gazon (MPG)” is an online fantasy football platform that has gained remarkable popularity, here in France, over the last five years or so. (I know that fantasy sports have been a staple of the American sporting scene since the 1980s, but we, the French, possess a unique affinity for savoring the journey.) With a new season of fantasy football coming, I want to take the opportunity to delve into some intricate football engineering challenges, build (fantasy) football analytics on a budget, and show examples of how to leverage the full potential of AWS Data & Analytics services to address real industry needs.

Through tangible examples drawn from real-world football scenarios, my intent is to offer guidance to those eager to step into the realm of data analytics. I plan to explore the spectrum of essential stages required to build a robust and successful data strategy:

  1. Crafting insightful data visualizations to unearth the latent value nestled within your data’s depths and generate an immediate return on investment;
  2. Progressing to more advanced data analysis methodologies and tactics;
  3. Architecting purpose-built data frameworks to accurately address industry inquiries on a larger scale;
  4. Advancing further to build sophisticated models that elevate decision-making by applying Machine-Learning-powered automation.

Embarking on this odyssey, we set sail today with a first football tale that explores strategies of data normalization and preparation through the prism of Athena. By crafting meaningful relationships between datasets, we will unlock the dormant value within data to fuel our data-driven strategies. And, as far as I am concerned, laying the first stepping stone toward my fantasy football championship dominance.

Play along!
You can access all the SQL expressions discussed below, through a dedicated GitHub Gist. This repository serves as a valuable resource, enabling you to delve deeper into the intricacies of each SQL expression and apply them to your own projects. Feel free to explore and experiment with these expressions to enhance your data preparation capabilities.

Creating meaningful relationships between datasets

Just as the European football transfer period is coming to an end, the first issue on my horizon is to lay the groundwork for building the ultimate fantasy championship-winning roster.

Imagine a vast player pool, with all participants from the French football championship, each bearing a valuation derived from their recent on-pitch performances; and an initial budget of 500M€. The challenge then consists of striking a delicate balance between evaluating and bidding on undervalued players: underdogs whose game quotations might soar over the season, leading to long-run positive returns; and securing, at the expense of aggressive investments, top-performing players to ensure immediate results on the field, and secure the championship win.

To help you with making these decisions, the game exposes a public data API that isn’t that hard to find for the discerning seeker: With a knowledgeable eye, anyone can access the entire player roster, yielding a rich tapestry of aggregated game statistics and player valuations:

| id | firstname | lastname | position | ultraposition | quotation | clubid | stats |
|---|---|---|---|---|---|---|---|
| mpg_championship_player_607088 | Nathan | Zézé | 2 | 20 | 1 | mpg_championship_club_430 | { … } |
| mpg_championship_player_489230 | Hianga’a | M’Bock | 3 | 30 | 1 | mpg_championship_club_862 | { … } |
| mpg_championship_player_484842 | Hákon | Haraldsson | 3 | 30 | 8 | mpg_championship_club_429 | { … } |
| mpg_championship_player_483867 | Étienne | Youté | 2 | 20 | 7 | mpg_championship_club_141 | { … } |
| mpg_championship_player_511596 | Pape | | 4 | 40 | 1 | mpg_championship_club_141 | { … } |
| mpg_championship_player_223140 | Mathias | Mathias Pereira Lage | 3 | 31 | 6 | mpg_championship_club_862 | { … } |
| mpg_championship_player_492025 | Muhammed | Cham | 3 | 31 | 13 | mpg_championship_club_1983 | { … } |
| mpg_championship_player_220160 | Kylian | Mbappé | 4 | 40 | 32 | mpg_championship_club_149 | { … } |
| mpg_championship_player_59966 | Alexandre | Lacazette | 4 | 40 | 33 | mpg_championship_club_143 | { … } |
| mpg_championship_player_42779 | Kevin | Gameiro | 4 | 40 | 9 | mpg_championship_club_153 | { … } |
An extract of the "mpg_playerpools" dataset (538 player records).

However, the value that a solitary dataset can unveil remains inherently limited, regardless of its vastness or dimensions. Put another way, the strategic advantage you can gain over your competitors through openly accessible data will only take you so far. Instead, the value of making informed data-driven decisions for your organization will primarily reside in your capability to interconnect diverse data sources that converge upon the same data points, all on a grand scale. That’s the original definition of Big Data.

Think of your Sales and Marketing department extracting all e-commerce orders per user and correlating them with that user’s engagements on your social media platforms. Or envision football scouts analyzing the projected market value of a prospect, aligned with their recent on-field performances. “We don’t wear the same jersey but share the same (data-driven) passion.”


Regarding our primary issue, I set myself the objective to enrich “MPG” data with insights harvested from a popular fan-centric data hub, namely Transfermarkt.

Transfermarkt stands as the Wikipedia of European football knowledge, providing a comprehensive platform for many football enthusiasts and industry professionals to access a wide range of data related to football clubs and players. I will abstain from giving too much information on how to get your hands on that data at scale - the answer lies in some gray area - but let’s just say that this is nothing some “Gorgeous Broth” cannot solve for you.

| id | url | first_name | last_name | alias | birth_date | market_value |
|---|---|---|---|---|---|---|
| 923832 | https://www.transfermarkt.de/scoutastic/profil/spieler/923832 | Nathan | Zeze | | 2005-06-18T00:00:00+02:00 | 400000 |
| 684062 | https://www.transfermarkt.de/scoutastic/profil/spieler/684062 | Hianga’a | Mbock | | 1999-12-28T00:00:00+01:00 | 1200000 |
| 652275 | https://www.transfermarkt.de/scoutastic/profil/spieler/652275 | Hákon Arnar | Haraldsson | | 2003-04-10T00:00:00+02:00 | 6000000 |
| 639712 | https://www.transfermarkt.de/scoutastic/profil/spieler/639712 | Étienne | Youté Kinkoué | | 2002-01-14T00:00:00+01:00 | 300000 |
| 565060 | https://www.transfermarkt.de/scoutastic/profil/spieler/565060 | Pape Ibnou | Ba | | 1993-01-05T00:00:00+01:00 | 400000 |
| 424072 | https://www.transfermarkt.de/scoutastic/profil/spieler/424072 | Mathias | Pereira Lage | | 1996-11-30T00:00:00+01:00 | 2000000 |
| 422281 | https://www.transfermarkt.de/scoutastic/profil/spieler/422281 | Muhammed-Cham | Saracevic | Muhammed Cham | 2000-09-26T00:00:00+02:00 | 6000000 |
| 342229 | https://www.transfermarkt.de/scoutastic/profil/spieler/342229 | Kylian | Mbappé | | 1998-12-20T00:00:00+01:00 | 180000000 |
| 93720 | https://www.transfermarkt.de/scoutastic/profil/spieler/93720 | Alexandre | Lacazette | | 1991-05-28T00:00:00+02:00 | 14000000 |
| 27389 | https://www.transfermarkt.de/scoutastic/profil/spieler/27389 | Kévin | Gameiro | | 1987-05-09T00:00:00+02:00 | 1200000 |
An extract of my "transfermarkt_players" dataset (504 player records).

A copy to S3, a couple of Glue Crawler runs, and I was set to start exploring all these insights with Athena. For those not yet acquainted, Athena is a powerful query service offered by AWS that enables the analysis of data stored in Amazon S3 using SQL queries. Built on PrestoDB, it has become a solid contender in AWS’ Data & Analytics suite, even against purpose-built data warehousing solutions such as Redshift.

Athena is, to me, the embodiment of serverless elegance: provisioning data infrastructure and building ETL pipelines become relics of the past; instead, you only pay a modest fee for the queries you invoke, without trading that much in terms of overall performance. Indeed, Athena supports creating and exploring data tables stored in columnar file formats, like Parquet or Avro, allowing you to tweak data partitioning and compression as you see fit, thus optimizing query performance while minimizing running costs. In many cases, I’ve seen Athena rival the responsiveness of dedicated data warehouses, even when grappling with colossal datasets, all at a fraction of the cost; and it is a solution that I would highly encourage anyone to explore for running data analytics on a tight budget.

Exploring fuzzy joining on Athena using the levenshtein_distance() function

The challenge at hand now consists of harmonizing and fusing both these datasets with pinpoint accuracy, a question that always boils down to solving roughly the same underlying issues:

  • How to uniquely identify a relation between two records that refer to the same real-life entity, whether a competition, a team, or a player, at scale;
  • How to deal with mislabeled data, internationalization, and slight changes in how labels are dealt with from one dataset to the other.

This most often demands meticulous data preparation and normalization, a cleansing ritual that assures precise information alignment. Looking back at the data extracts above, any human would have no trouble picking up that the player records from the mpg_playerpools and transfermarkt_players tables, respectively, all refer to the same “real-life” individuals, despite some slight syntactic nuances, hiccups, differences in accentuation, diacritics or spelling.

But as you might imagine, doing so programmatically at scale is far more complex. This puzzle we’re dealing with is known as “fuzzy joining.” In this case, we are dealing with names from different datasets that might not match exactly but would be similar enough for us to assume a relation. Something that, in SQL, would translate to:

select
    table1."id" as table1_id
    , table2."id" as table2_id

from 
    table1

inner join
    table2

on
    some_kind_of_similarity_calculation(table1."name", table2."name") >= some_kind_of_threshold_value
What we'll be implementing in Athena.

A popular technique for comparing strings is computing their Levenshtein distance, a function natively available in Athena. This calculation gauges how different two records are by quantifying the minimum number of single-character changes (such as adding, deleting, or replacing a letter) required to transform one string into the other. The smaller the number of edits needed, the stronger the indication that these strings could reference the same thing. Leveraging this notion, we can craft a query that takes each record from the MPG dataset and pairs it with the record from Transfermarkt that requires the fewest single-character changes, a bit like finding its closest linguistic relative:

with mpg_players as (
    select
        src."id" as player_id
        , src."firstname" || ' ' || src."lastname" as player_name
    from
        "footballanalytics"."mpg_playerpools" as src
  
    where
        src."competition" = 'L1' and src."season" = '2023' 
)

, tmkt_players as (
    select
        src."id" as player_id
        , trim(src."first_name" || ' ' || src."last_name") as player_name

    from
        "footballanalytics"."transfermarkt_players" as src

    union
    
    select
        src."id" as player_id
        , trim(src."alias") as player_name

    from
        "footballanalytics"."transfermarkt_players" as src
    
    where
        src."alias" is not null and trim(src."alias") <> ''
)

, levenshtein_compute as (
    select
         mpg."player_id" as mpg_id
        , mpg."player_name" as mpg_name
        , tmkt."player_id" as tmkt_id
        , tmkt."player_name" as tmkt_name
        , levenshtein_distance( mpg."player_name", tmkt."player_name" ) as levenshtein

    from  
        tmkt_players as tmkt
        , mpg_players as mpg
)

, ranked as (
    select
        src.*
        , row_number() over (partition by src."mpg_id" order by src."levenshtein" asc) as row_rank
  
    from
        levenshtein_compute as src

)

, deduplicated as (
    select
        src.*

    from
        ranked as src

    where
        src."row_rank" = 1

)

select * from deduplicated order by levenshtein desc
A simple fuzzy join in Athena, using the Levenshtein distance

Here’s a detailed breakdown of what the query accomplishes:

  • The player’s first name, last name, and alias are consolidated into a unified player_name field for both the mpg_players and tmkt_players tables.
  • These tables are first merged through a cross-join operation, generating the levenshtein_compute table. The Levenshtein distance is computed in this table for each combination of names.
  • Then we compute a row_number() for each record in the ranked table. This operation is performed within partitions determined by a deduplication key (specifically, the mpg_id), ordered by the calculated Levenshtein distance in ascending order. This serves as a ranking mechanism based on the degree of similarity between both strings.
  • Finally, the resulting data is filtered in the deduplicated table, only retaining the entry with the lowest Levenshtein distance for each mpg_id. In effect, this step eliminates all lower-ranked potential matches for each unique record.
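As an aside, the cross-join / rank / keep-rank-1 pattern described above can be sketched in miniature in plain Python. The MPG ids below are hypothetical placeholders; the Transfermarkt ids are borrowed from the extract shown earlier:

```python
def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_join(mpg: dict, tmkt: dict) -> dict:
    """Cross join both sides, then keep, for each MPG record,
    the Transfermarkt record with the smallest edit distance."""
    return {
        mpg_id: min(tmkt, key=lambda tmkt_id: levenshtein(name, tmkt[tmkt_id]))
        for mpg_id, name in mpg.items()
    }


# Hypothetical MPG ids; Transfermarkt ids taken from the extract above.
mpg = {"mpg_a": "Kylian Mbappé", "mpg_b": "Kevin Gameiro"}
tmkt = {342229: "Kylian Mbappé", 27389: "Kévin Gameiro"}
print(fuzzy_join(mpg, tmkt))  # → {'mpg_a': 342229, 'mpg_b': 27389}
```

The SQL above does exactly this, only at the scale of full tables, with the `min()` expressed as a `row_number()` window plus a `row_rank = 1` filter.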
Query performance
This query performs a cross-join operation, which consists of the cartesian product of the tables you try to join. For larger datasets, you may experience long query times. If so, a common optimization technique consists of grouping/partitioning your data further: In this scenario, for instance, only considering players part of a given championship or team.
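For intuition about the metric itself, here is a plain-Python sketch of the textbook dynamic-programming algorithm, which should behave like Athena's built-in levenshtein_distance():

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = cur
    return prev[-1]


# One apostrophe deleted, one case change: distance 2.
print(levenshtein_distance("Hianga'a M'Bock", "Hianga'a Mbock"))  # → 2
```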

Playing around with Levenshtein distances often yields early and tangible results: In my case, out of a pool of roughly 500 players, applying this trivial recipe returned 418 true positive matches…

| mpg_id | mpg_name | tmkt_id | tmkt_name | levenshtein |
|---|---|---|---|---|
| mpg_championship_player_102826 | Benjamin Benjamin Mendy | 157495 | Benjamin Mendy | 9 |
| mpg_championship_player_476401 | Formose Formose Mendy | 649023 | Formose Mendy | 8 |
| mpg_championship_player_484842 | Hákon Haraldsson | 652275 | Hákon Arnar Haraldsson | 6 |
| mpg_championship_player_489230 | Hianga’a M’Bock | 684062 | Hianga’a Mbock | 2 |
| mpg_championship_player_578153 | Abduqodir Khusanov | 763079 | Abdukodir Khusanov | 1 |
| mpg_championship_player_122775 | Morgan Sanson | 174094 | Morgan Sanson | 0 |
In my scenario, ranking records matches by Levenshtein distance was enough to accurately match ~80% of the total records.

… for only a handful of false positive matches, to deal with:

| mpg_id | mpg_name | tmkt_id | tmkt_name | levenshtein |
|---|---|---|---|---|
| mpg_championship_player_492917 | Ugo Raghouber | 677820 | Hugo Barbet | 8 |
| mpg_championship_player_546174 | Lucas Bonelli | 352392 | Lucas Perrin | 6 |
| mpg_championship_player_499450 | Ethan Mbappé | 342229 | Kylian Mbappé | 4 |
| mpg_championship_player_587783 | Joseph N’Duquidi | 453376 | Joseph Okumu | 7 |
| mpg_championship_player_607126 | Malick Mbaye | 342229 | Kylian Mbappé | 7 |
The devil is in the details.

Not that bad of a start.

Removing accentuations and diacritics in AWS Athena

An elegant, simple, yet profoundly impactful practice that contributes to refining our matching accuracy is field normalization. In this instance, it entails normalizing the casing and removing accentuations, diacritics, or anything else deemed superficial when comparing player names. The following SQL snippet, a reliable companion in my endeavors, achieves precisely this:

select
    str
    , regexp_replace( normalize( lower( str ), NFD ), '\pM', '' ) as normalized_str
from (
    select
        'Hugo Ekitiké' as str
)

Here’s a breakdown of the intricate dance behind this expression:

  • lower( ... ) ensures that both expressions are in the same case for consistent comparison.
  • normalize( ..., NFD ) transforms the string into a specified Unicode normalization form. In this case, NFD. This form decomposes accented characters into their base and separate diacritic characters. For instance, the character “é” would be decomposed into both an “e” and “´” Unicode character.
  • regexp_replace( ..., '\pM') uses a regular expression to remove all occurrences of diacritic characters (matched by the pattern “\pM”) from the original string.

This step will leave you with curated, ready-to-compare expressions:

| str | normalized_str |
|---|---|
| Hugo Ekitiké | hugo ekitike |
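Outside of Athena, the same normalization can be reproduced with Python's standard unicodedata module, a handy equivalent if you prefer to clean names upstream of the query:

```python
import unicodedata


def normalize_name(value: str) -> str:
    r"""Python equivalent of the Athena expression above: lower-case,
    decompose to NFD, then drop combining marks (the `\pM` class)."""
    decomposed = unicodedata.normalize("NFD", value.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(normalize_name("Hugo Ekitiké"))  # → hugo ekitike
```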

Implementing advanced fuzzy join strategies in AWS Athena

Undoubtedly, one of my all-time favorite Python packages is the fuzzywuzzy library, a creation of the team at SeatGeek that they generously open-sourced. This library ingeniously utilizes the Levenshtein distance to propose solutions to some of the most common string comparison challenges; whether it is accommodating varying string lengths, disregarding word order, or handling duplicated tokens, fuzzywuzzy thrives at proposing a more nuanced and precise approach to string comparison.

It is a package I highly recommend to anyone looking to enhance their data warehousing experience. However, there’s a catch when it comes to using fuzzywuzzy with Athena: Unlike its elder sibling, Redshift, Athena lacks the capability to configure external Python Lambda User-Defined Functions (UDFs) or even SQL UDFs, for that matter. While it is a feature I anticipate might someday come to life, it is today confined to only executing Java Lambda functions.


Fortunately, the power of the underlying Presto SQL engine comes into play, and implementing most of fuzzywuzzy’s functionalities natively is surprisingly achievable. In the upcoming sections, I provide SQL code snippets that mirror the string comparison strategies found in fuzzywuzzy. If you’re keen on delving deeper into each implementation, I recommend exploring Adam Cohen’s insightful post delving into the subject. As one of the primary contributors to the Python library, he offers an in-depth explanation, providing comprehensive examples inviting you into their thought process’s intricacies.

Simple ratio

Consider these two data entries:

| mpg_id | mpg_name | tmkt_id | tmkt_name | levenshtein |
|---|---|---|---|---|
| mpg_championship_player_102826 | Benjamin Benjamin Mendy | 157495 | Benjamin Mendy | 9 |
| mpg_championship_player_499450 | Ethan Mbappé | 342229 | Kylian Mbappé | 4 |

These vividly demonstrate the limitations of solely relying on an absolute distance comparison for matching records. Unsurprisingly, some false positive matches (Ethan Mbappé) might compute to a shorter absolute Levenshtein distance than particular true positives (Benjamin Mendy). This situation could become tricky if you aim to set a threshold beyond which you deem two records too dissimilar to match. To tackle this, a simple solution is to measure the edit distance between expressions relative to their maximum length, a simple ratio:

, simple_ratio as (
    select
        mpg."player_id" as mpg_id
        , mpg."player_name" as mpg_name
        , tmkt."player_id" as tmkt_id
        , tmkt."player_name" as tmkt_name
        , 1.0 * ( greatest( length(mpg."player_name"), length(tmkt."player_name") ) - levenshtein_distance( mpg."player_name", tmkt."player_name" ) ) / greatest( length(mpg."player_name"), length(tmkt."player_name") ) as ratio
    
    from  
        tmkt_normalized as tmkt
        , mpg_normalized as mpg
)
Implementation of fuzzywuzzy's simple ratio in Athena.
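For reference, the same ratio can be sketched in Python (with a helper edit-distance function standing in for Athena's levenshtein_distance()):

```python
def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def simple_ratio(a: str, b: str) -> float:
    """Edit distance expressed relative to the longer string:
    1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return (longest - levenshtein(a, b)) / longest


print(round(simple_ratio("ethan mbappe", "kylian mbappe"), 2))           # → 0.69
print(round(simple_ratio("formose formose mendy", "formose mendy"), 2))  # → 0.62
```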

Let’s contrast the ranking calculation of that ratio with the levenshtein_distance() analysis previously established:

| mpg_id | mpg_name | tmkt_id | tmkt_name | ratio |
|---|---|---|---|---|
| mpg_championship_player_122775 | morgan sanson | 174094 | morgan sanson | 1.0 |
| mpg_championship_player_578153 | abduqodir khusanov | 763079 | abdukodir khusanov | 0.94 |
| mpg_championship_player_489230 | hianga’a m’bock | 684062 | hianga’a mbock | 0.93 |
| mpg_championship_player_484842 | hakon haraldsson | 652275 | hakon arnar haraldsson | 0.73 |
| mpg_championship_player_499450 | ethan mbappe | 342229 | kylian mbappe | 0.69 |
| mpg_championship_player_476401 | formose formose mendy | 649023 | formose mendy | 0.62 |
| mpg_championship_player_102826 | benjamin benjamin mendy | 157495 | benjamin mendy | 0.61 |
| mpg_championship_player_587783 | joseph n’duquidi | 453376 | joseph okumu | 0.56 |
| mpg_championship_player_546174 | lucas bonelli | 352392 | lucas perrin | 0.54 |
| mpg_championship_player_607126 | malick mbaye | 342229 | kylian mbappe | 0.54 |
| mpg_championship_player_492917 | ugo raghouber | 677820 | hugo barbet | 0.46 |

Doing so slightly improves our overall capacity to compare record similarity, leading to a higher ranking of true positive matches. It does not, however, improve the algorithm’s resilience to variations in word order, or to the presence of additional, duplicated, or omitted words (e.g., “Ethan Mbappe” vs. “Benjamin Mendy”). These are challenges that the minds at SeatGeek have delved into, offering more sophisticated solutions, as described below.

Partial ratio

The first method the SeatGeek team suggested revolves around gauging the similarity ratio by evaluating the optimal matching substring between both expressions. This tactic becomes particularly compelling when confronted with the challenge of accommodating wide variations in string length. This problem may, for instance, arise when comparing players’ full names against their shorter bib names or nicknames.

, partial_ratio as (
    select
        partial."mpg_id"
        , partial."mpg_name"
        , partial."tmkt_id"
        , partial."tmkt_name"
        , reduce( 
            sequence( 1, partial."_seq"), length(partial."_longest"),
            (curr, idx) -> least(levenshtein_distance(partial."_shortest", substr(partial."_longest", idx, length(partial."_shortest"))), curr),
            s -> 1.0 * (length(partial."_shortest") - s) / length(partial."_shortest")
          ) as ratio

    from (
        select
            mpg."player_id" as mpg_id
            , mpg."player_name" as mpg_name
            , tmkt."player_id" as tmkt_id
            , tmkt."player_name" as tmkt_name
            , abs( length(mpg."player_name") - length(tmkt."player_name") ) + 1 as _seq
            , case 
                when
                    length(mpg."player_name") < length(tmkt."player_name")
                    then mpg."player_name"
                    else tmkt."player_name"
                end as _shortest
            , case 
                when
                    length(mpg."player_name") < length(tmkt."player_name")
                    then tmkt."player_name"
                    else mpg."player_name"
                end as _longest

        from
            tmkt_normalized as tmkt
            , mpg_normalized as mpg
    ) as partial
)
Implementation of fuzzywuzzy's partial ratio in Athena.

To translate this heuristic into a native Presto SQL expression, we begin by crafting a subquery. Within it, we compute the absolute difference in length between the two player names. This step determines the range of substring offsets to account for in the Levenshtein distance computation.

Leveraging Presto’s capabilities of working with Array structures, we iterate over subsets of the longest player name to calculate the similarity distance between the “shortest” record and the various segments of the “longest” one, keeping track and returning the smallest value identified during this process.
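The sliding-window logic may be easier to follow in Python than through the reduce() expression above; here is a sketch of the same heuristic (again with a helper edit-distance function):

```python
def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def partial_ratio(a: str, b: str) -> float:
    """Best simple ratio between the shorter string and every
    substring of the longer one that has the same length."""
    shortest, longest = sorted((a, b), key=len)
    window = len(shortest)
    best = min(
        levenshtein(shortest, longest[i:i + window])
        for i in range(len(longest) - window + 1)
    )
    return (window - best) / window


print(partial_ratio("benjamin benjamin mendy", "benjamin mendy"))  # → 1.0
```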

Let’s examine the repercussions on the row rankings:

| mpg_id | mpg_name | tmkt_id | tmkt_name | ratio |
|---|---|---|---|---|
| mpg_championship_player_102826 | benjamin benjamin mendy | 157495 | benjamin mendy | 1.00 |
| mpg_championship_player_122775 | morgan sanson | 174094 | morgan sanson | 1.00 |
| mpg_championship_player_476401 | formose formose mendy | 649023 | formose mendy | 1.00 |
| mpg_championship_player_578153 | abduqodir khusanov | 763079 | abdukodir khusanov | 0.94 |
| mpg_championship_player_489230 | hianga’a m’bock | 684062 | hianga’a mbock | 0.86 |
| mpg_championship_player_499450 | ethan mbappe | 342229 | kylian mbappe | 0.75 |
| mpg_championship_player_484842 | hakon haraldsson | 652275 | hakon arnar haraldsson | 0.69 |
| mpg_championship_player_587783 | joseph n’duquidi | 453376 | joseph okumu | 0.58 |
| mpg_championship_player_607126 | malick mbaye | 342229 | kylian mbappe | 0.50 |

Progress is evident, yet we’re not at the summit just yet. Confining our comparison to the most fitting subset of both expressions has enhanced the overall robustness of the algorithm. Yet, a flaw remains evident: its sensitivity to word order, as in the case of “Ethan Mbappé” outranking “Hákon Arnar Haraldsson,” despite the substantial similarity of the latter expressions.

Token sort ratio

To circumvent ordering issues, fuzzywuzzy proposes dissecting each expression into an array of discrete words. These tokens are then rearranged in alphabetical order, effectively sidestepping any issues arising from disparities in the original word sequences. An Athena implementation of this approach can be found below:

, token_sort_ratio as (
    select
        tokenized."mpg_id" as mpg_id
        , tokenized."mpg_name" as mpg_name
        , tokenized."tmkt_id" as tmkt_id
        , tokenized."tmkt_name" as tmkt_name
        , 1.0 * (greatest(length(tokenized."_token_sort_1"), length(tokenized."_token_sort_2")) - levenshtein_distance(tokenized."_token_sort_1", tokenized."_token_sort_2")) / greatest(length(tokenized."_token_sort_1"), length(tokenized."_token_sort_2")) as ratio
    from (
        select
            mpg."player_id" as mpg_id
            , mpg."player_name" as mpg_name
            , tmkt."player_id" as tmkt_id
            , tmkt."player_name" as tmkt_name
            , array_join(array_sort(regexp_extract_all(mpg."player_name", '[a-z]+')), ' ') as _token_sort_1
            , array_join(array_sort(regexp_extract_all(tmkt."player_name", '[a-z]+')), ' ') as _token_sort_2
        from 
            tmkt_normalized as tmkt
            , mpg_normalized as mpg
        ) as tokenized
)
Implementation of fuzzywuzzy's token_sort_ratio in Athena.

To execute this strategy within a functional Presto SQL expression, we capitalize on the regexp_extract_all() function to extract all words from the “player_name” columns, across both datasets. We then sort those tokens alphabetically and join them back into a unified string for comparison.
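A Python rendition of the same tokenize-sort-compare logic (with a helper edit-distance function) might look like:

```python
import re


def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def token_sort_ratio(a: str, b: str) -> float:
    """Sort each string's [a-z]+ tokens alphabetically, rejoin them,
    then compute the simple Levenshtein ratio on the results."""
    sa = " ".join(sorted(re.findall(r"[a-z]+", a)))
    sb = " ".join(sorted(re.findall(r"[a-z]+", b)))
    longest = max(len(sa), len(sb))
    return (longest - levenshtein(sa, sb)) / longest


print(token_sort_ratio("lucas perrin", "perrin lucas"))  # → 1.0
```

Note that, as in the SQL version, duplicated tokens survive the sort, which is why “formose formose mendy” still scores poorly against “formose mendy”.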

Running this query on our data subset outputs the following results:

| mpg_id | mpg_name | tmkt_id | tmkt_name | ratio |
|---|---|---|---|---|
| mpg_championship_player_122775 | morgan sanson | 174094 | morgan sanson | 1.00 |
| mpg_championship_player_456300 | lucas perrin | 352392 | lucas perrin | 1.00 |
| mpg_championship_player_578153 | abduqodir khusanov | 763079 | abdukodir khusanov | 0.94 |
| mpg_championship_player_484842 | hakon haraldsson | 652275 | hakon arnar haraldsson | 0.73 |
| mpg_championship_player_499450 | ethan mbappe | 342229 | kylian mbappe | 0.69 |
| mpg_championship_player_476401 | formose formose mendy | 649023 | formose mendy | 0.62 |
| mpg_championship_player_102826 | benjamin benjamin mendy | 157495 | benjamin mendy | 0.61 |
| mpg_championship_player_607126 | malick mbaye | 342229 | kylian mbappe | 0.54 |
| mpg_championship_player_492917 | ugo raghouber | 677820 | hugo barbet | 0.46 |
| mpg_championship_player_489230 | hianga’a m’bock | 684062 | hianga’a mbock | 0.40 |

Token set ratio

Finally, and arguably the most pertinent for our pursuits, an implementation of the token_set_ratio heuristic, a variant of the latter strategy but with an extra twist: Rather than just sorting and assembling words, we omit duplicate words while attaching a higher “weight” to the terms shared between both expressions by ordering those at the forefront of each string value:

, token_set_ratio as (
    select
        tokenized."mpg_id" as mpg_id
        , tokenized."mpg_name" as mpg_name
        , tokenized."tmkt_id" as tmkt_id
        , tokenized."tmkt_name" as tmkt_name
        , reduce( 
            combinations( ARRAY [tokenized."_token_interception"] || trim(tokenized."_token_interception" || ' ' || tokenized."_token_set_1") || trim(tokenized."_token_interception" || ' ' || tokenized."_token_set_2"), 2), 0.0,
            (curr, comb) -> greatest(1.0 * (greatest(length(comb[1]), length(comb[2])) - levenshtein_distance(comb[1], comb[2])) / greatest(length(comb[1]), length(comb[2])), curr),
            s -> s
          ) as ratio
    from (
        select
            mpg."player_id" as mpg_id
            , mpg."player_name" as mpg_name
            , tmkt."player_id" as tmkt_id
            , tmkt."player_name" as tmkt_name
            , array_join(array_sort(
                 array_intersect(
                    regexp_extract_all(mpg."player_name", '[a-z]+'), 
                    regexp_extract_all(tmkt."player_name", '[a-z]+')
                )
            ), ' ') as _token_interception
            , array_join(array_sort(filter(
                regexp_extract_all(mpg."player_name", '[a-z]+'),
                item -> not contains(array_intersect(
                    regexp_extract_all(mpg."player_name", '[a-z]+'), 
                    regexp_extract_all(tmkt."player_name", '[a-z]+')
                ), item)
            )), ' ') as _token_set_1
            , array_join(array_sort(filter(
                regexp_extract_all(tmkt."player_name", '[a-z]+'),
                item -> not contains(array_intersect(
                    regexp_extract_all(mpg."player_name", '[a-z]+'), 
                    regexp_extract_all(tmkt."player_name", '[a-z]+')
                ), item)
            )), ' ') as _token_set_2
        from 
            tmkt_normalized as tmkt
            , mpg_normalized as mpg
        ) as tokenized
)
Implementation of fuzzywuzzy's token_set_ratio in Athena.

Here, we extract the tokens shared by both expressions (array_intersect()) and retain the exclusive, remaining tokens of each set (filter()). We then reconstruct candidate strings following the pattern [COMMON_TOKENS] + [UNIQUE_TOKENS], and compute the similarity ratio over every pair of candidates (combinations()), keeping the best score.
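In Python, an approximate equivalent of this heuristic can be sketched as follows. Note that this version relies on sets, so duplicated non-shared tokens are dropped, a slight departure from the SQL above:

```python
import re
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def token_set_ratio(a: str, b: str) -> float:
    """Best pairwise simple ratio among [SHARED], [SHARED + UNIQUE_A]
    and [SHARED + UNIQUE_B], with tokens sorted alphabetically."""
    ta, tb = set(re.findall(r"[a-z]+", a)), set(re.findall(r"[a-z]+", b))
    shared = " ".join(sorted(ta & tb))
    candidates = [
        shared,
        (shared + " " + " ".join(sorted(ta - tb))).strip(),
        (shared + " " + " ".join(sorted(tb - ta))).strip(),
    ]
    best = 0.0
    for s, t in combinations(candidates, 2):
        longest = max(len(s), len(t))
        if longest:
            best = max(best, (longest - levenshtein(s, t)) / longest)
    return best


print(token_set_ratio("benjamin benjamin mendy", "benjamin mendy"))  # → 1.0
```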

Let’s now examine the influence of this change on the row rankings:

| mpg_id | mpg_name | tmkt_id | tmkt_name | ratio |
|---|---|---|---|---|
| mpg_championship_player_484842 | hakon haraldsson | 652275 | hakon arnar haraldsson | 1.00 |
| mpg_championship_player_102826 | benjamin benjamin mendy | 157495 | benjamin mendy | 1.00 |
| mpg_championship_player_122775 | morgan sanson | 174094 | morgan sanson | 1.00 |
| mpg_championship_player_476401 | formose formose mendy | 649023 | formose mendy | 1.00 |
| mpg_championship_player_578153 | abduqodir khusanov | 763079 | abdukodir khusanov | 0.94 |
| mpg_championship_player_489230 | hianga’a m’bock | 684062 | hianga’a mbock | 0.80 |
| mpg_championship_player_499450 | ethan mbappe | 342229 | kylian mbappe | 0.69 |
| mpg_championship_player_607126 | malick mbaye | 342229 | kylian mbappe | 0.54 |
| mpg_championship_player_546174 | lucas bonelli | 352392 | lucas perrin | 0.54 |
| mpg_championship_player_587783 | joseph n’duquidi | 453376 | joseph okumu | 0.50 |
| mpg_championship_player_492917 | ugo raghouber | 677820 | hugo barbet | 0.46 |

By narrowing our focus to a curated set of tokenized words, and assigning greater significance to common words shared between the expressions, we’ve established an approximate 80% similarity “cut-off” threshold without sacrificing any true positive matches.

Building on my past encounters, coupling this methodology with other deterministic variables—in this case, the player’s birth date, nationality, current club, or the time since their last match—unlocks the gateway to remarkably precise merges. This adaptable technique can find relevance in many real-life scenarios, from scrutinizing email or postal addresses to standardizing free-form text fields within your organizational datasets.

Fuzzy joined football datasets and next steps

Embarking on the captivating journey of implementing fuzzy joining in Athena, we’ve attained quite a feat: a triumphant match of 483 out of 538 players (with most of the remaining few being players transferred out of the championship) from the MPG game player pool with their counterparts in the Transfermarkt universe. In doing so, we have already opened the door to a realm of newfound capabilities, granting us near-exclusive access to the hidden tapestries of each player’s historical performances, injuries, and more. Yet, this small victory only marks the prologue of our grand tale.


As we stand on this threshold, I plan to unveil an upcoming series of articles that will unfold across this fantasy football season. Each piece will plunge us into the craft of data analytics & visualization, starting with the decision-making process for building our ultimate championship-winning roster. As the season evolves, so will our approach, refined with each iteration. I hope this saga will traverse the dimensions of advanced statistical strategies and Machine Learning techniques applied to sports analytics, inching us ever closer to our coveted data-driven supremacy.

For those who find joy in the science of football, I invite you to join me in my quest by subscribing to this RSS feed. And to the passionate sports minds out there, curious about harnessing the power of data in their respective organizations, feel free to connect, as my team and I will be happy to assist you, and welcome your passion and curiosity into our data-driven world.