Troy Hernandez for Pilsen Academy Community Representative

CommunityRep_2016

I’m running for another 2-year term on the Pilsen Academy Local School Council (LSC).  Serving as the community representative and secretary these past 2 years has been a lot of work, but it’s been worth it.

Last week, the LSC voted to offer a contract to our new principal, Leanne Hightower. She will be starting next week!

Leanne comes to us with a JD from U of I, master’s degrees in education and school administration, 5 years of teaching experience, and 4 years of administrative experience. With an almost unanimous decision (9 in favor, 0 against, 1 abstention), the LSC showed that it is very excited for this fresh start.

We worked very hard and very closely with parents, members of CTU, CPS administrators, and the community to bring in this strong candidate.  It hasn’t always been easy, but knowing that I’ve been able to make a difference in the lives of teachers, staff, and students at my neighborhood school has made it worthwhile. My hope is that the kids at Pilsen Academy can get the kind of high-quality public education I got in the suburbs.  Hiring our new principal is the first step in that direction and we’re all excited for the next steps with her leadership.  I hope I will be able to work with her over these next two years.

I’m running with some friends I’ve made from the current LSC: President Dolores Cortez and parent representative Maria “Lupe” Gonzalez, along with parent representative-hopeful Kay Allen and community representative-hopeful Teresa Gonzalez.

image

From left: Me, Kay, Dolores, Lupe, and Teresa


Looking back…

When I was first elected to the LSC two years ago, I thought I was signing up to practice my Spanish over bad coffee at an early morning meeting once a month.  My hope was that, as a Mexican-American with a PhD in statistics, I could serve as a role-model to the neighborhood kids.  I would give a few STEM talks/demos (science, technology, engineering, and mathematics) and I’d have done my duty.  My two years were unfortunately not so easy or pleasant.

As I walked home from the first official LSC meeting, a couple of council members followed me out, pulled me aside, and said that they wanted a new principal.  They said that the current principal had created a culture of fear and was driving good teachers away.  They wanted my help.

I was skeptical and wondered what their angle could be.  The principal seemed nice enough on election night.  The Pilsen Alliance “volunteer” who was helping me out, Vicky Lugo, said he was terrible.  I was apt to believe the opposite after it turned out that the organization was filled with political opportunists.

Then I started to notice some troubling behavior from the principal.  I presumed incompetence before I presumed malice.  As I became acclimated to the process, I started to ask questions.  My honest questions were repeatedly met with violent responses from the principal… like the time I respectfully questioned increasing his unsupervised spending limit from $1,000 to $10,000.

I thought that if he was comfortable yelling at one of the people responsible for approving his contract, I could only imagine what it was like to have to work for him.  That our teacher turnover rate was 50% higher than the rest of the neighborhood’s schools and twice the rate of the state’s schools only affirmed my judgement.

Bare-Knuckle Politics

I kept my politics separate from my service at the school.  I initially made no mention of my election to the LSC here on my blog.  When I was collecting signatures for my aldermanic run, none of the members on the LSC knew about it until it was over.  The principal was not so noble.

He got rid of the teacher representative on the LSC (and his brother), he repeatedly prevented the LSC from conducting its business, and held illegal meetings.  In December, the LSC decided to not renew the principal’s contract by a vote of 7 – 2 – 1.  We thought that with things settled we would be able to focus on moving forward.  Nope.

The principal decided to play games.  He proposed a $60,000 budget transfer with no advance notice to the LSC.  The LSC was confused (translations are frequently lacking) and punted with 6 parents and staff abstaining from the vote.  He then sent a letter home to the parents and teachers accusing the LSC, specifically me, of trying to ruin the school.  This was followed by more chaos, including teachers getting uncharacteristically bad reviews and posted LSC agendas disappearing.

In response, the LSC voted to send a letter to CPS CEO Forrest Claypool.  We requested that Dr. Ali be removed from the school immediately.  This was just over 45 days ago.  Last week, I was informed that Ali had started threatening undocumented parents with calls to immigration.  Those parents complained publicly to the CPS Board last week.  Monday was Dr. Ali’s last day at Pilsen Academy.

In the bricks

While it’s been great to work and serve with the families in my community, I’ve been put off by the amount of politics in a grade school.  It hasn’t just been the principal either.  Local polluters make donations for the school’s floats in parades and members of Pilsen Alliance have used it as a pawn and propaganda piece.  But I guess I shouldn’t be surprised.  This is Chicago.  Politics are in the bricks that were used to build this school over 100 years ago.

The best we can do is to show up and try to minimize the negative effects that politics has on the school.  The current LSC and the new principal are in agreement on that matter.  My hope is that we can maintain this mindset during and after the election.  Then we’ll be able to get to the more important and fun things in the school.

Analyzing 538’s Democratic Primary Analysis

I get into extended discussions about politics on Facebook.  Sorry, not sorry.  Recently, I got into an in-depth analysis of Bernie’s chances going forward.  Given his big wins in Washington, Hawaii, and Alaska, I pointed to this great project on fivethirtyeight.com to see how much he had caught up.

Who’s on Track for the Nomination? describes itself like this:

Tracking a candidate’s progress requires more than straight delegate counts. We’ve estimated how many delegates each candidate would need in each primary contest to win the nomination. See who’s on track and who’s falling behind.

It’s great and you should check it out.  It will serve as the source of data in this post.  In short, Hillary is supposed to be up right now.  She has a bigger lead than she should at this point, but is it enough?

My initial argument was that Bernie has moved up monotonically from 81% of his projected target to 92% since late February and in that respect is doing well.

I (perhaps too frivolously?) brushed aside arguments that he isn’t polling well in New York, given that he was polling in the low teens 3 weeks before the Illinois primary.  He forced a draw here.  It’s now 3 weeks before the New York primary.

I’ve ceded the argument that Bernie is over-performing in caucus states and that those are for the most part gone.

Analogies were made to Bernie being down late in the game.  “Even though he’s gaining ground, he’s not gaining fast enough.”  We are now through 55% of the delegates.  This is equivalent to being in the top of the 6th inning.  So what does Hillary’s bullpen look like relative to her starting pitching?  I think the answer to that is, “Not good.”

Caveat emptor: The problem with all of these quick analyses is that there are 1000’s of variables in the world and we’re likely to find at least a few that look very predictive… just by chance.  Correlation is not causation and all that.  You’ve been warned.


I decided to plot fivethirtyeight’s target delegate counts against the delegates each candidate actually won.  Each data point represents a state (or territory).  The line running through the middle represents a candidate getting exactly as many delegates in a state as they would need to stay on track for the nomination.  Anything over that line means they over-performed in that state.  Anything under the line means they under-performed in that state.  The colors correspond to liberal (blue), conservative (red), and swing states (purple).  We get this plot for Hillary’s performance:

Target538_Clinton

What’s surprising here is how close 538 has been to reality (check out Michigan!) with a few notable exceptions at the top end; for Hillary that’s Florida and Texas.  This suggests to me that whatever model the authors (Aaron Bycoffe and David Wasserman) used to construct these targets was actually pretty good.

The other thing that pops out is that Hillary appears to be over-performing in purple and red states.  She isn’t doing as well in blue states.  To highlight this, I fit very simple linear models separately for the blue and the red/purple states.  Those models are represented by the red and blue lines around the black one:

Target538_Clinton_RB
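
The two fits can be sketched in a few lines of R. This is a minimal sketch rather than the code in the repository linked at the end of this post: the data frame `states`, its column names, and every number in it are made up for illustration, and fitting through the origin is an assumption, chosen so that each slope reads as delegates won per delegate targeted.

```r
# Hypothetical data: one row per state, made-up delegate targets and results.
states <- data.frame(
  State  = c("S1", "S2", "S3", "S4", "S5", "S6"),
  Lean   = c("blue", "blue", "blue", "red", "purple", "red"),
  Target = c(20, 40, 60, 30, 50, 70),
  Actual = c(18, 37, 55, 33, 56, 78),
  stringsAsFactors = FALSE
)

is_blue <- states$Lean == "blue"

# Regression through the origin (assumption): a slope above 1 means
# the candidate is beating the targets in that group of states.
fit_blue  <- lm(Actual ~ 0 + Target, data = states[is_blue, ])
fit_other <- lm(Actual ~ 0 + Target, data = states[!is_blue, ])

slope_blue  <- unname(coef(fit_blue)["Target"])
slope_other <- unname(coef(fit_other)["Target"])

# Draw the scatter, the on-target line, and the two fitted lines.
if (interactive()) {
  plot(states$Target, states$Actual,
       col = ifelse(is_blue, "blue", "red"), pch = 19,
       xlab = "Target delegates", ylab = "Delegates won")
  abline(0, 1)                          # on-target line
  abline(0, slope_blue, col = "blue")   # blue-state fit
  abline(0, slope_other, col = "red")   # red/purple-state fit
}
```

With these invented numbers the blue-state slope comes out below 1 and the red/purple slope above 1, which is the shape of the pattern described for Hillary.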

 

Bernie’s plot shows something similar:

Target538_Sanders_RB

Whether or not it holds up remains to be seen.  It’s a small sample size, but it does pass the sniff test.  More liberal states support Bernie… sounds about right.  If this trend holds up then it raises the question…

Are the remaining primary states closer to Texas (red), Florida/Ohio/Michigan (Purple), or Illinois/Washington (Blue)?  The biggest upcoming states are Wisconsin (purple), New York (blue), Maryland (blue), Pennsylvania (purple), Indiana (red), California (blue), and New Jersey (blue).  I wrote some R code to do all of this analysis (see below), so it was trivial to calculate the exact answer.

This is where things get crazy.  Given the criticisms of the DNC’s handling of the primary process, maybe I shouldn’t have been as surprised.

Blue Shift

57% of the pledged delegates have been voted on/pledged to their respective candidates.  Of those delegates a whopping 83% of them have come from red or purple states!  Only 17% have come from blue states.  Of the remaining 43% of delegates who have yet to be voted on, 66% come from blue states!

The electorate that is about to vote comes from substantially more liberal states.  If the trend of Bernie over-performing in liberal states holds, then this could be good news for Bernie.  I ran the numbers to see if Bernie outperforming his blue state requirements by the estimated 11% would be sufficient to win.

It’s not.  He’s lost too much ground in the red and purple states.  He’d end up with 1945 delegates, about 80 short of the nomination.  He’d have to either do better in the red and purple states, or do 25% better than targets in the blue states.  Neither of which is easy.
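
The arithmetic behind a projection like this can be sketched as follows. This is an illustrative sketch, not the code in the repository linked below: the function names and all of the inputs are hypothetical, and it assumes that every remaining non-blue target is hit exactly while blue-state targets are scaled by a single multiplier.

```r
# Projected final delegate count if blue-state targets are beaten by
# `blue_boost` and all other remaining targets are scaled by `other_boost`.
project_total <- function(won_so_far, remaining_target, lean,
                          blue_boost, other_boost = 1) {
  multiplier <- ifelse(lean == "blue", blue_boost, other_boost)
  won_so_far + sum(remaining_target * multiplier)
}

# Blue-state multiplier needed to reach `goal`, other states held at target.
required_blue_boost <- function(won_so_far, remaining_target, lean, goal) {
  blue  <- sum(remaining_target[lean == "blue"])
  other <- sum(remaining_target[lean != "blue"])
  (goal - won_so_far - other) / blue
}

# Made-up inputs, for illustration only.
remaining <- c(50, 30, 20)
lean      <- c("blue", "purple", "blue")

projected <- project_total(100, remaining, lean, blue_boost = 1.11)
needed    <- required_blue_boost(100, remaining, lean, goal = 215)
```

Plugging the real remaining targets and the delegates won so far into the same two functions gives the kind of "how far short" and "how much better would he need to do" numbers quoted above.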

Or maybe this is all just noise.  Who knows?  Only 5 blue states have voted so far; we have 10 more to go.  There’s a lot that could happen.  Either way, it should continue to be an exciting primary season.


The code and data used to generate the data and plots can be found here:
https://github.com/TroyHernandez/Targets538

Statisticians come in from the cold

Two weeks ago my paper, Descriptive Statistics of the Genome, was accepted to the Journal of Computational Biology.  I think it was the best part of my thesis, so I’m excited to see it finally being published.

Lectures

Last week I gave my first talk/lecture at the monthly luncheon of the Chicago Chapter of the American Statistical Association.  That was my first talk in a couple of years.  Last night I gave almost the same talk at my alma mater.  The title is the same as this blog post.  The thesis is that the popularity of data science has led to the creation of tools that allow statisticians to dominate the data science world with minimal effort.

What is a data scientist?

Instead of pontificating on the frequently discussed topic, here is a panel of smart people. Instead of ruining the surprise, I’ll let you listen to this over your morning coffee.

I closed this section by referencing Zoubin Ghahramani’s point that we need to be mindful of the whole pipeline of data and not just the statistics and machine learning.  This pipeline includes what is commonly known in industry as data engineering and visualization or front-end web design.  His point is that machine learning and statistics are just a small part of the data science pipeline.  Like this:

Zoubin1

The point of my talk(s) was that with the modern tools that have come out of the popularity of data science, machine learning, and R, the pipeline can look more like this:

Zoubin2

All your data science are belong to us!

Required tools

  • BigQuery – a cloud-based SQL-like super-fast database system
  • bigrquery – an R package that allows you to pull data from
    BigQuery into the R environment
  • Shiny – “A web application framework for R. . . No HTML,
    CSS, or JavaScript knowledge required”

As an example I used data from a favorite website of mine, reddit.com.

Because I’m busy, I was able to find a couple of people who had put all of reddit’s data into a public BigQuery dataset (fhoffa@reddit) and had created a couple of queries and ggplot graphs to display the data (Max Woolf).  So my work was already 2/3rds of the way done!  That’s efficiency.  Thank you people!  All I had left to do was turn it into a Shiny App.  And that’s what I did.  You type in a subreddit and/or reddit username and you’ll get a word cloud and the posting time of the most popular posts matching that description in seconds, as soon as you make like Captain Picard and “Engage”:

The code is here.  Apologies, it’s not the prettiest.  You do have to get a free BigQuery account and put your account’s project id into the quotes that point to project in the server.R file (line 7).
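
For readers who want the shape of such an app without opening the repository, here is a minimal sketch of how the three tools fit together. It is not the app from the talk: the table name `my_dataset.reddit_posts` and the column names are placeholders, `build_query()` and `make_app()` are hypothetical helpers, and it shows a plain table rather than a word cloud. It assumes the shiny and bigrquery packages are installed and that you have your own BigQuery project id.

```r
# Build the SQL for one subreddit. The table and column names are
# placeholders (assumptions), not the public reddit dataset itself.
build_query <- function(subreddit, table = "my_dataset.reddit_posts") {
  # Accept only letters, digits, and underscores so that user input
  # cannot change the structure of the SQL statement.
  if (!grepl("^[A-Za-z0-9_]+$", subreddit)) {
    stop("subreddit may contain only letters, digits, and underscores")
  }
  sprintf(
    "SELECT title, score, created_utc FROM %s WHERE subreddit = '%s' ORDER BY score DESC LIMIT 100",
    table, subreddit
  )
}

# Assemble the Shiny app; nothing is queried until "Engage" is clicked.
make_app <- function(project_id) {
  ui <- shiny::fluidPage(
    shiny::textInput("subreddit", "Subreddit", value = "rstats"),
    shiny::actionButton("engage", "Engage"),
    shiny::tableOutput("top_posts")
  )
  server <- function(input, output) {
    posts <- shiny::eventReactive(input$engage, {
      # Run the query in BigQuery, then pull the result into R.
      job <- bigrquery::bq_project_query(project_id, build_query(input$subreddit))
      bigrquery::bq_table_download(job)
    })
    output$top_posts <- shiny::renderTable(posts())
  }
  shiny::shinyApp(ui, server)
}
```

Calling `make_app("your-project-id")` from an interactive R session would launch the app; the query only runs once the button is pressed, which keeps BigQuery usage down while someone is still typing.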

Conclusion

I finally opened up a shiny.io account from RStudio.  That’s where that Shiny output above is coming from.  It integrates nicely with RStudio and GitHub, but (as I found out when I hit publish) not with iframe-ing into this WordPress site. (It looks like I’ll have to up my game to make that happen.)

So there you have it.  111GB (or 215m rows) of information processed in a matter of seconds and displayed to you using only R and web services.  It appears as if all your data science are belong to us.

I’m Suing Civis Analytics

My PhD is in machine learning. Politics is my hobby. Civis Analytics is a politically-connected machine learning tech-startup.  I was very excited to work there.

After an awkward conversation with a friend and colleague, I came to find out I was being paid 20% less than the PhD physicist that I had to train in machine learning. Civis is a machine learning startup. Civis is not SpaceX. When we discussed our salaries, he apologized to me and said I should be senior to him. Were his 9 years of doing academic research in a tangentially-related field 25% more valuable than my 8 years of research in the relevant field?

There was another physicist with some machine learning knowledge and some industry experience. I showed him more advanced material and coached him through complicated algorithms. He advocated for me to an almost embarrassing degree… circling by my desk when the executives were around, heaping praise in a loud voice.  He also apologized to me when I told him my salary.

I had a colleague who graduated with a master’s degree in analytics. After she graduated she worked for a year elsewhere before coming to Civis, just like me. I was explicitly told that I was supposed to train this person. We were both given the same offer and both negotiated up to the same starting salary.

TroyHernandezPassing

You might think this has nothing to do with race or ethnicity. After all, I’m only half-Mexican and I frequently pass for white. This is why I’m occasionally put in awkward positions:

Acquaintance: The damn spics are ruining this country!

Me: What!? I’m Mexican you f$#!# @$$#%^&!

Acquaintance: Oh, uh… Sorry, I wasn’t talking about you.

Me: Who were you talking about? My dad? My grandfather, the WWII vet?

This even happens with people who know I’m Mexican. I am just so Assimilation: Mission Accomplished that they forget.

A slip like this happened 3 months into my time at Civis. The CFO was quoted in a “humorous” email thread where he referred to speaking Spanish as “speaking poor”. This is the Chief Financial Officer; the guy who was cutting my checks.

It’s not a stretch to think the guy who sees Spanish and thinks poor, saw my last name at the top of my resume and thought:

Latinos: Like White People, Only Cheaper

I began to suspect that my 4-month interview process took so long because this Goldman Sachs alum was just practicing “great arbitrage”.

But wait, there’s more

One distasteful comment and a drawn-out interview process weren’t enough for me to raise a fuss. I grew up in the suburbs navigating conversations like the one above.

In true Dunning-Kruger fashion, there were the incompetent coworkers/managers that insisted I was incompetent. I was removed from listservs repeatedly and wasn’t invited to important meetings. I tried to navigate the situations with subtlety. When I wasn’t subtle enough, one of my managers yelled at me, and then followed me into the bathroom and watched me pee before exiting. That was weird.

When one of the female engineers raised the issue of gender balance, I pointed out that I was the only under-represented minority in the data science team and that, even with a PhD in machine learning, it took me 6 months to get into the data science (i.e. machine learning engineering) department and I represent only half of one under-represented minority. That was followed by a Co-founder/VP who, like the Twitter executive, said they’d love to hire under-represented minorities but that they couldn’t afford to train people on the job. It took me a bit to figure out why this bothered me, but then it hit me. She was presuming under-represented minorities needed training, just like her colleagues presumed I was incompetent… WHILE I WAS TEACHING THE WHITE PHYSICISTS!

After this I had a couple of the awkward salary conversations above and finally figured out that I was subject to a 20% discount (in addition to the 20% progressive politics discount).

All of the drama was taking a major toll on my personal life. But Civis wasn’t done with me yet! I was riding in the elevator with my boss casually talking about troubles at the local school council when the CEO’s college roommate decided the trouble was due to me being a “fake Mexican”. My boss laughed. I was again going to let it go, but then the CEO’s college roommate got promoted (read: old boys’ club). I left shortly thereafter.

Tech and Politics

I didn’t file this suit just for the money. There’s little chance that I’ll get enough money to compensate for the stress. When the company I went to after Civis found out about this lawsuit, my manager and the HR department proceeded to gaslight me until I left.  But I’m good at my job, the market is hot, and I’m again gainfully employed.

Part of the reason I filed this suit is to help bring to light just how bad it can get for under-represented minorities in tech and politics. There are under-represented minorities in tech who don’t pass for being white, whose faces won’t let people forget they are different.  If I, a mixed-race Chicagoan (like our President) with the hottest PhD in the tech industry, can’t get a fair shake at the “Obama 2012 Analytics Team”, what chance does a talented Black or Latino candidate with a bachelor’s degree stand at a company that’s not explicitly “progressive”?

Don’t imagine that my experience is unique:

Top universities turn out black and Hispanic computer science and computer engineering graduates at twice the rate that leading technology companies hire them, a USA TODAY analysis shows.

When you dig into the statistics, while black engineers get hired at a lower rate than Hispanics, once black people do get tech jobs they are paid close to what white engineers make. If you look at the plot below, the same is not true for Hispanics.

Civis

If you break the numbers down you see that race and gender are almost independent. Relative to white males, the discounts are:

  • Black – 4%
  • Female – 6%
  • Asian – 8%
  • Hispanic – 16%

Gender and race are additive, so if a black female gets a job, she gets paid (6% + 4% =) 10% less than a white male. At Civis my discount was at least 20%. Civis was beating the industry standard!

“There’s this big narrative in the women’s movement: 78 cents on the dollar. Everyone knows what that means. It’s less talked about when it comes to race,” said Laura Weidman Powers, co-founder and CEO of Code2040, a non-profit that nurtures black and Hispanic tech talent.

Race and racism are talked about so little in tech that, while the latest O’Reilly Data Science Salary Survey asked about gender and found a significant pay gap, they forgot to ask about race or ethnicity entirely. It should be noted that O’Reilly Media founder Tim O’Reilly sits on the board of Civis Analytics (along with the Chairman of Google (Alphabet), Eric Schmidt).

Conclusion

Recruiters don’t see my pale skin or hear my suburban English. When I apply they see HERNANDEZ at the top, and in reading the tea leaves, the statistics show that they round down. The same will likely be true for my nieces and nephews. Again from USA Today:

“At every point in the hiring process hidden bias trickles in,” Klein said. “A drop at the stage of reviewing names on résumés, a few more drops at the stage of different gender and race styles of presentation during interviews and a steadier stream when it comes to who is expected to negotiate their salary and who isn’t.”

It’s even more troubling because there is evidence of bias against women and there is evidence that STEM professionals don’t believe that it exists. The same is likely true for race.

One landmark study found that science faculty at research universities rate applicants with male names as more competent, more hireable, and more deserving of a higher starting salary than female applicants, even when the resumes are otherwise identical.

And then the problem goes meta:

male STEM professors were much less likely to believe the evidence of gender bias against women in their own field. Importantly, the researchers’ analysis showed that the cause of this statistical interaction was that male STEM professors were more likely to judge the research harshly, not that female STEM professors were more likely to view it positively.

These hypocritical biases seem to be very persistent. Not just in tech, but also in politics.

Political operative Michael Gomez Daly worked on two congressional campaigns in 2012 with similar budgets. On one campaign, Daly, who describes himself as “a very light-skinned Hispanic,” was brought in as a field director, primarily for his skills as a Latino operative who could reach out to the Hispanic community. On the second campaign, where they did not know he was Hispanic, “I just came in as ‘Michael Daly,’ instead of ‘that Latino operative,’” he said. “Right off the bat they offered me twice the amount for the same job.”

A progressive PR firm just shut down because the guy who was supposed to be promoting progressive causes was simultaneously committing serial sexual assault.

An associate professor of management at the MIT Sloan School of Management, Emilio Castilla (suddenly skeptical of this opinion?), sums it up nicely in USA Today:

His research has found that managers in organizations that promote meritocracy actually show greater bias in favor of men over equally performing women in rewarding merit-based bonuses and promotions.

“The lesson is not that companies shouldn’t adopt merit-based practices but that the pursuit of meritocracy is more difficult than it first appears. If not designed and implemented carefully, merit-based practices may trigger bias against women and ethnic minorities.”

The case filing is here.

Plata y Plomo in Pilsen

On Wednesday there will be a meeting with the US EPA hosted by the Pilsen Environmental Rights and Reform Organization (PERRO) in the basement of the church at 1850 S Throop.

For the past decade PERRO has been pushing the EPA to investigate the emissions coming from the H Kramer smelting facility. From Wikipedia:

H. Kramer and Company is a brass smelting company located in the Pilsen neighborhood of Chicago, Illinois, United States. The company has come under pressure from local neighborhood residents and the Illinois EPA for lead pollution.

PERRO had early success in getting the company to install pollution controls.  That has helped to prevent the facility from further polluting the neighborhood with the lead (plomo) required to smelt brass.

This meeting is about cleaning up the lead that was emitted by the facility over its 125-year history. That lead was carried by the wind and deposited throughout the neighborhood. The map shows where the lead levels are highest (Res 1) and… not as high (Res 3).

image

The Process

The EPA will first sample the yards in the affected area with the approval of the property owners. The yards with elevated levels of lead will be dug up and replaced with unleaded dirt and sod at no cost to the owners!!! (plata)

As most have been made aware by the recent events in Flint, lead is extremely hazardous to children and pregnant women. In Chicago this problem is compounded by the lead in the paint of old apartments and the lead that can be present in your water when they replace the water mains.

PERRO

I only became a regular member of PERRO two years ago when it started to become clear that many members of Pilsen Alliance weren’t sincere about their professed concerns for the community. That has not been the case with PERRO.

Anyone can jump onto a cause for a year or two, especially when there is corporate or grant money available. But to stick with a cause that’s been largely ignored for over a decade with little to no recognition or money… that requires real concern and sincerity. The neighborhood should thank these members on Wednesday.

Single-celled Trump Says…

SingleCelledTrumpSays

A team at the University of Oregon has found a single mutation that may explain the evolution from single-celled to multi-cellular life.  From the reddit ELI5 (Explain Like I’m 5) comment:

In order to grow your lungs and other important organs in your body, cells have little rods that they point at other cells which then fit together in order to create structure, otherwise known as “Tissue”.  This function is primarily controlled by a process that fits cells together like puzzle pieces using proteins on the exterior of cells that act as the receivers for the rods to fit into.  We suspected that cells didn’t always have these receivers, (or at least, the proteins on the surface as cells didn’t function as receivers) and through our research, we were able to simulate this evolution that created these receiver proteins and thus allowed cells to form into tissues.  We also were able to demonstrate that its a very simple process that likely could have happened naturally to ancient cells, creating the first tissues that are now essential to all complex life.

It’s an interesting piece of forensic biology.  It’s also a reminder that sometimes

cooperation >> competition

And this is where I get up on the soapbox, because this message has unfortunately been the antithesis of the last 35 years of US politics.  Competition has been the answer to everything and anything: Healthcare? Needs more competition.  Banking? Needs more competition.  Education? Needs more competition.

If you step back and think about it, it’s ridiculous.  “All of the complexity and problems in the universe are always best solved by competition!  It’s a universal truth!”  Except that multi-cellular life is a pretty good counter-example.  It’s the kind of thinking that you get from a single semester of economics.  You gloss over the conditions required for a perfect market.  You gloss over the conditions required to attain Pareto optimality, conditions not attained in even the simplest of games… most notably, the Prisoner’s Dilemma.
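
The Prisoner’s Dilemma point can be checked in a few lines of R. This is a sketch with illustrative payoffs (the numbers 3, 0, 5, and 1 are assumptions, picked only to give the game its usual shape): each player does better by defecting no matter what the other does, yet both would be better off if both cooperated.

```r
# Payoffs to player A (rows: A's move, columns: B's move); higher is better.
# The numbers are illustrative assumptions, not taken from any source.
moves <- c("cooperate", "defect")
payoff_A <- matrix(c(3, 0,
                     5, 1),
                   nrow = 2, byrow = TRUE, dimnames = list(moves, moves))

# The game is symmetric, so B's payoff is A's with the roles swapped.
payoff_B <- t(payoff_A)

# Defecting is better for A whatever B does, and likewise for B.
a_defect_dominates <- all(payoff_A["defect", ] > payoff_A["cooperate", ])
b_defect_dominates <- all(payoff_B[, "defect"] > payoff_B[, "cooperate"])

# Compare the competitive outcome with mutual cooperation.
equilibrium <- c(A = payoff_A["defect", "defect"],
                 B = payoff_B["defect", "defect"])
cooperation <- c(A = payoff_A["cooperate", "cooperate"],
                 B = payoff_B["cooperate", "cooperate"])
both_better_off <- all(cooperation > equilibrium)
```

Both dominance checks and `both_better_off` come out `TRUE`: two players who each compete rationally end up at an outcome that is not Pareto optimal, because mutual cooperation would leave both of them better off.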

Of course, simple ideas are more attractive than doing the hard work of understanding a complicated reality.  That explains some of Trump’s success.   Analysis shows that he speaks at a 4th-grade level.  That’s why I love obvious counter-examples like multi-cellular life or Medicare, Social Security, or the fire department.  They are all examples that highlight the fact that we live in a mixed economic reality; part cooperation, part competition.  It’s time to suck it up America.  Multi-cellular life is complicated.  Vote Bernie 2016.

Optimizing Fanduel in R

I want to give a little background on my experience with fantasy football first.  If you’re just looking for the algorithm, skip down to the section titled The Algorithm.

Fantasy Football

I like Chicago Bears football.  Even during rough seasons like this one.  I don’t really care that much about other football teams.  So when my friends first invited me to play fantasy football back in 2006, I bought a $5 magazine with rankings of every player.  I didn’t do well that year.  So the next year I decided to create a drafting algorithm.  At the time I was just getting started in R, so I wrote it in Excel.  I won the championship that year.

I quit playing for a few years when grad school was intense and recently un-quit.  I rewrote the majority of the program in R last year and completed the transition this year.  I’ve won my division both years!  I’ve been tempted to make it into an app, but there’s already a ton of them.  One day I’ll get the time…

Fanduel

Daily fantasy sports (DFS) became huge this year.  It got so big that it’s caught the notice of the attorneys general of states like Washington and New York, where they are looking for losers to get their money back.  Illinois has now jumped on the anti-DFS bandwagon.

A few months ago I looked into Fanduel and confirmed what I thought to be true… it’s an optimization problem of the kind that I studied in my operations research (OR) classes, different from a season draft in a few important ways.  So I put $200 into an account, downloaded the player data from Fanduel and other sources, wrote a program to munge the data, and then applied a good ol’ linear programming optimization to it.

With some conditions, linear programming is useful whenever you have limited resources ($60k for player salaries) and a linear combination of values that you want to maximize (fantasy points) or minimize.  It was made to help us defeat the Nazis, literally.

Linear programming arose as a mathematical model developed during World War II to plan expenditures and returns in order to reduce costs to the army and increase losses to the enemy. It was kept secret until 1947. Postwar, many industries found its use in their daily planning.

I compared my results to those of other online algorithms and they were close, but not exact.  This told me that I was doing it right and that it was worthwhile to do it myself.  After 2 weeks I won 3 of the 6 contests I’d entered and was up to almost $300 in my account.  I had dreams of quitting my day job.  Then I proceeded to lose every contest I entered until I had lost it all.  It turns out that player scoring is highly variable in fantasy football.

Given the Illinois Attorney General’s stance on DFS and the NFL playoffs upon us, I figured I should post this before the teachable moment of linear programming passes.  It’s a rare opportunity to have an algorithm so perfectly suited to 15 minutes of pop culture fame.

The Algorithm

This algorithm optimally allocates your $60k fantasy budget so that you get the most points without going over budget. If you knew in advance how many points each player would score, this algorithm would guarantee you have the best team. Of course, you don’t know that.
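
To make "the most points without going over budget" concrete, here is a toy version of the problem solved by brute force. Note the swap: this is exhaustive enumeration in base R, not linear programming, so it only works for tiny pools. The player names, points, salaries, budget, and roster size are all made up, and position rules are ignored.

```r
# Hypothetical player pool; names, points, and salaries are made up.
pool <- data.frame(
  Name   = c("A", "B", "C", "D", "E", "F"),
  Points = c(20, 18.5, 15, 12, 9, 7),
  Salary = c(9000, 8000, 6000, 5000, 4000, 3000),
  stringsAsFactors = FALSE
)
budget    <- 18000  # toy budget, not FanDuel's $60k
team_size <- 3      # toy roster size, no position rules

# Enumerate every possible team (one per column) and score it.
teams  <- combn(nrow(pool), team_size)
cost   <- apply(teams, 2, function(idx) sum(pool$Salary[idx]))
points <- apply(teams, 2, function(idx) sum(pool$Points[idx]))

# Keep the affordable teams and pick the one with the most points.
affordable <- which(cost <= budget)
best_col   <- affordable[which.max(points[affordable])]
best_team  <- pool[teams[, best_col], ]
```

Here the best affordable team is B, C, and E with 42.5 points, and it leaves out A, the highest scorer. A real slate has far too many combinations to enumerate this way, which is where a solver earns its keep.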

I’m going to use the kind of data you’d get from a DFS site and just use the average player points per game as their expected scores. I’ll leave out fancier models that modify expected player scores utilizing integration with outside data. I previously used a fancier model and it didn’t win me any money. That doesn’t mean you won’t get it to work!  I mean, it probably won’t, but web scraping in R is a topic for another post!

The data is straightforward: one row per player, with columns for player name, position, expected points, and salary. First we read the data in and order it by position (for clarity below).

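# stringsAsFactors = TRUE keeps Position a factor (R >= 4.0 reads strings as character by default),
# which the levels() call further down relies on
dat <- read.csv("DFS.csv", stringsAsFactors = TRUE)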
fd <- dat[order(dat[, "Position"]), ]

It looks like this:

library('knitr')
kable(head(fd), format = "markdown", row.names = FALSE)

| Name | Position | Points | Salary |
|:--|:--|--:|--:|
| Seattle Seahawks | D | 10.8 | 5100 |
| Kansas City Chiefs | D | 10.2 | 5100 |
| Houston Texans | D | 6.8 | 4600 |
| Pittsburgh Steelers | D | 9.4 | 4500 |
| Minnesota Vikings | D | 7.3 | 4500 |
| Green Bay Packers | D | 7.2 | 4500 |

As is usually the case, we’re going to want to change that Position factor and turn it into indicator/dummy/binary variables. Luckily, the dummies package makes that easy.

install.packages('dummies')
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
Position.Mat <- dummy(fd[, "Position"])
colnames(Position.Mat) <- levels(fd[, "Position"])
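If the dummies package is not available, base R's `model.matrix` can build the same kind of indicator matrix. The sketch below is an alternative, not the code used for this post; the `position` vector is a made-up stand-in for the real Position column.

```r
# Toy stand-in for the Position column (hypothetical values, not the real data)
position <- factor(c("D", "K", "QB", "RB", "RB", "TE", "WR"))

# One indicator column per factor level; "- 1" drops the intercept so that
# every level gets its own 0/1 column
pos_mat <- model.matrix(~ position - 1)
colnames(pos_mat) <- levels(position)

print(pos_mat)
```

Each row has a single 1 in the column matching that player's position, which is the shape the constraint matrix needs.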

Additionally, we’ll add a column for the flex-eligible positions. FanDuel has no flex slot, so this column is not used to fill one; I originally wrote the program thinking I’d need it.  Instead, it lets us fix the combined number of RBs, TEs, and WRs in the constraints section below.  If you are a RB, WR, or TE you are an eligible flex player.

Position.Mat <- cbind(Position.Mat, Flex = rowSums(Position.Mat[, c("RB", "TE", "WR")]))

Now that we have the data munged, I’ll be using the lpSolve package to select the optimal players. If you look at the bottom of the help you’ll see this:

install.packages("lpSolve")
library(lpSolve)
?lp
# Set up problem:
# maximize
#   x1 + 9 x2 +   x3
# subject to
#   x1 + 2 x2 + 3 x3  <= 9
# 3 x1 + 2 x2 + 2 x3 <= 15
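
To see the pieces fit together before moving on to players, the small problem from the help file can be solved directly. This is only a sketch of that toy problem; the `toy.*` names are chosen here to avoid clashing with the objects built later in the post.

```r
library(lpSolve)

# Coefficients of the objective: x1 + 9 x2 + x3
toy.obj <- c(1, 9, 1)

# One row per constraint, one column per variable
toy.con <- matrix(c(1, 2, 3,
                    3, 2, 2), nrow = 2, byrow = TRUE)
toy.dir <- c("<=", "<=")
toy.rhs <- c(9, 15)

toy <- lp("max", toy.obj, toy.con, toy.dir, toy.rhs)
toy$objval    # optimal value of the objective
toy$solution  # values of x1, x2, x3 at the optimum
```

The optimum is 40.5, reached at x2 = 4.5 with x1 and x3 at zero. A fractional answer like that is fine for a generic linear program, but not for drafting players.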

For DFS each variable or dimension is a binary variable (a 1 or 0) representing the selection of a player; e.g. if x1 == 1, then we will be drafting the Seattle Seahawks. Else, x1 == 0 and we will not be drafting the Seattle Seahawks.

Connecting this to the example in the help file, the function we want to maximize is expected points; i.e. x1 * 10.8 + x2 * 10.2 + ..., where 10.8 is the expected number of points from the Seahawks and 10.2 is the expected number of points from the Chiefs. If we pick the Seahawks, then x1 == 1 and we would expect 1 * 10.8 + 0 * 10.2 + ... This is called the objective function.

f.obj <- fd[, "Points"]

The component-wise multiplication by each x_i is implicit in this formulation: lp only needs the vector of coefficients.
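
As a quick illustration of what the solver evaluates, the objective for any candidate selection is just a sum of products. The sketch below uses the six defenses from the table above; the selection vector `x` is a hypothetical choice, not solver output.

```r
# Expected points for the six defenses shown in the table above
points <- c(10.8, 10.2, 6.8, 9.4, 7.3, 7.2)

# Hypothetical selection vector: draft only the Seahawks (x1 = 1)
x <- c(1, 0, 0, 0, 0, 0)

# The objective is the sum of points over the selected players
expected_points <- sum(points * x)
expected_points  # 10.8
```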

Next we need to set up the constraints; i.e. the “subject to” part of the help file. Getting our constraints into the format above is easy. We take our salary data, bind it to the position matrix, and transpose it.

f.con <- t(cbind(Salary = fd[, "Salary"], Position.Mat))
colnames(f.con) <- fd$Name
kable(f.con, format = "markdown", row.names = T)
Seattle Seahawks Kansas City Chiefs Houston Texans Pittsburgh Steelers Minnesota Vikings Green Bay Packers Cincinnati Bengals Washington Redskins Steven Hauschka Chris Boswell Mike Nugent Cairo Santos Blair Walsh Dustin Hopkins Nick Novak Mason Crosby Russell Wilson Ben Roethlisberger Aaron Rodgers Kirk Cousins Andy Dalton Alex Smith Brian Hoyer Teddy Bridgewater AJ McCarron Brandon Weeden Landry Jones Chase Daniel Tarvaris Jackson Shaun Hill Robert Griffin III Keith Wenning Colt McCoy Scott Tolzien Adrian Peterson DeAngelo Williams Marshawn Lynch Jeremy Hill Christine Michael Charcandrick West Eddie Lacy James Starks Fitzgerald Toussaint Jordan Todman Alfred Blue Giovani Bernard Jerick McKinnon Alfred Morris Matt Jones Spencer Ware Matt Asiata Bryce Brown Fred Jackson Akeem Hunt Chris Thompson Chris Polk Derrick Coleman Knile Davis Darrel Young Isaiah Pead John Crockett Pierre Thomas Dri Archer John Kuhn Rex Burkhead Jonathan Grimes Jordan Reed Tyler Eifert Travis Kelce Heath Miller Richard Rodgers Kyle Rudolph Luke Willson Ryan Griffin Tyler Kroft Justin Perillo C.J. Fiedorowicz MyCole Pruitt C.J. Uzomah Garrett Graham Demetrius Harris Chase Coffman Cooper Helfet Brian Parker Kennard Backman Jesse James Rhett Ellison Antonio Brown DeAndre Hopkins A.J. Green Doug Baldwin Jeremy Maclin DeSean Jackson Martavis Bryant Randall Cobb Pierre Garcon Tyler Lockett Jermaine Kearse Markus Wheaton Stefon Diggs James Jones Marvin Jones Davante Adams Nate Washington Cecil Shorts Mohamed Sanu Jaelen Strong Albert Wilson Brandon Tate Chris Conley Mike Wallace Jamison Crowder Jeff Janis Jared Abbrederis Rashad Ross Cordarrelle Patterson Charles Johnson Ryan Grant Jarius Wright Jason Avant Junior Hemingway De’Anthony Thomas Kevin Smith Frankie Hammond Adam Thielen Chandler Worthy Darrius Heyward-Bey Jamel Johnson Greg Little Keith Mumphery
Salary 5100 5100 4600 4500 4500 4500 4400 4300 5100 4900 4800 4800 4700 4600 4600 4500 8600 8400 8100 8000 7900 7100 6900 6700 6400 6000 6000 5000 5000 5000 5000 5000 5000 5000 8400 8100 7800 6700 6500 6400 6000 5800 5700 5700 5700 5600 5500 5500 5400 5400 5000 5000 4800 4800 4800 4700 4600 4500 4500 4500 4500 4500 4500 4500 4500 4500 7400 6400 6200 5600 5200 5100 4800 4800 4600 4600 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 9500 8800 8300 7300 7200 7000 6900 6500 6300 6200 6000 5900 5800 5700 5500 5300 5300 5200 5100 5000 4900 4800 4700 4700 4700 4700 4700 4600 4600 4600 4600 4600 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500
D 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
QB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WR 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Flex 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

One constraint we have is that we can’t spend more than $60k. So we add up the x indicator vector multiplied component-wise by the salary vector and it must be less than or equal to $60k. For all of the mathletes out there:

$$\sum_{i} x_i \cdot \text{Salary}_i \le 60000$$

In this case, the direction of our first constraint is going to be less-than-or-equal-to (<=) and the value on the right-hand side will be 60,000. Again, the x_i are implicit:

# Instantiate the direction (character) and right-hand-side (numeric) vectors
f.dir <- character(nrow(f.con))
f.rhs <- numeric(nrow(f.con))

f.dir[1] <- "<="
f.rhs[1] <- 60000

Next, we are required to have 1 and only 1 defense. This requires an = for the direction of the constraint and a 1 for the rhs.

f.dir[2] <- "="
f.rhs[2] <- 1

For the other positions, we are required to have 1 K, 1 QB, at least 2 RB, at least 1 TE, at least 3 WR, and exactly 6 RB/TE/WR in total (to account for the lack of a flex).

f.dir[3:nrow(f.con)] <- c("=", "=", ">=", ">=", ">=", "=")
f.rhs[3:nrow(f.con)] <- c(1, 1, 2, 1, 3, 6)
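
One way to sanity-check constraints like these is to test a candidate lineup against every row by hand. The sketch below does that on a made-up three-player problem; `is_feasible()` is a hypothetical helper written for this illustration and is not part of lpSolve.

```r
# Check every constraint row for a 0/1 selection vector x.
# is_feasible() is a hypothetical helper, not an lpSolve function.
is_feasible <- function(con, dir, rhs, x) {
  lhs <- as.vector(con %*% x)  # left-hand side of each constraint
  ok <- ifelse(dir == "<=", lhs <= rhs,
        ifelse(dir == ">=", lhs >= rhs, lhs == rhs))
  all(ok)
}

# Toy problem (made-up numbers): three players, a salary cap of 10,
# and exactly one QB
con <- rbind(Salary = c(6, 5, 4),
             QB     = c(1, 1, 0))
dir <- c("<=", "=")
rhs <- c(10, 1)

is_feasible(con, dir, rhs, c(1, 0, 1))  # TRUE: salary 10, one QB
is_feasible(con, dir, rhs, c(1, 1, 0))  # FALSE: salary 11, two QBs
```

The same helper should accept f.con, f.dir, and f.rhs along with any 0/1 vector that has one entry per player, assuming all three objects have been filled in as above.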

For the full view of the coefficients, direction, and constraints similar to that in the help file, I’ll print out a data.frame:

kable(data.frame(f.con, f.dir, f.rhs), format = "markdown", row.names = T)
Seattle.Seahawks Kansas.City.Chiefs Houston.Texans Pittsburgh.Steelers Minnesota.Vikings Green.Bay.Packers Cincinnati.Bengals Washington.Redskins Steven.Hauschka Chris.Boswell Mike.Nugent Cairo.Santos Blair.Walsh Dustin.Hopkins Nick.Novak Mason.Crosby Russell.Wilson Ben.Roethlisberger Aaron.Rodgers Kirk.Cousins Andy.Dalton Alex.Smith Brian.Hoyer Teddy.Bridgewater AJ.McCarron Brandon.Weeden Landry.Jones Chase.Daniel Tarvaris.Jackson Shaun.Hill Robert.Griffin.III Keith.Wenning Colt.McCoy Scott.Tolzien Adrian.Peterson DeAngelo.Williams Marshawn.Lynch Jeremy.Hill Christine.Michael Charcandrick.West Eddie.Lacy James.Starks Fitzgerald.Toussaint Jordan.Todman Alfred.Blue Giovani.Bernard Jerick.McKinnon Alfred.Morris Matt.Jones Spencer.Ware Matt.Asiata Bryce.Brown Fred.Jackson Akeem.Hunt Chris.Thompson Chris.Polk Derrick.Coleman Knile.Davis Darrel.Young Isaiah.Pead John.Crockett Pierre.Thomas Dri.Archer John.Kuhn Rex.Burkhead Jonathan.Grimes Jordan.Reed Tyler.Eifert Travis.Kelce Heath.Miller Richard.Rodgers Kyle.Rudolph Luke.Willson Ryan.Griffin Tyler.Kroft Justin.Perillo C.J..Fiedorowicz MyCole.Pruitt C.J..Uzomah Garrett.Graham Demetrius.Harris Chase.Coffman Cooper.Helfet Brian.Parker Kennard.Backman Jesse.James Rhett.Ellison Antonio.Brown DeAndre.Hopkins A.J..Green Doug.Baldwin Jeremy.Maclin DeSean.Jackson Martavis.Bryant Randall.Cobb Pierre.Garcon Tyler.Lockett Jermaine.Kearse Markus.Wheaton Stefon.Diggs James.Jones Marvin.Jones Davante.Adams Nate.Washington Cecil.Shorts Mohamed.Sanu Jaelen.Strong Albert.Wilson Brandon.Tate Chris.Conley Mike.Wallace Jamison.Crowder Jeff.Janis Jared.Abbrederis Rashad.Ross Cordarrelle.Patterson Charles.Johnson Ryan.Grant Jarius.Wright Jason.Avant Junior.Hemingway De.Anthony.Thomas Kevin.Smith Frankie.Hammond Adam.Thielen Chandler.Worthy Darrius.Heyward.Bey Jamel.Johnson Greg.Little Keith.Mumphery f.dir f.rhs
Salary 5100 5100 4600 4500 4500 4500 4400 4300 5100 4900 4800 4800 4700 4600 4600 4500 8600 8400 8100 8000 7900 7100 6900 6700 6400 6000 6000 5000 5000 5000 5000 5000 5000 5000 8400 8100 7800 6700 6500 6400 6000 5800 5700 5700 5700 5600 5500 5500 5400 5400 5000 5000 4800 4800 4800 4700 4600 4500 4500 4500 4500 4500 4500 4500 4500 4500 7400 6400 6200 5600 5200 5100 4800 4800 4600 4600 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 9500 8800 8300 7300 7200 7000 6900 6500 6300 6200 6000 5900 5800 5700 5500 5300 5300 5200 5100 5000 4900 4800 4700 4700 4700 4700 4700 4600 4600 4600 4600 4600 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 <= 60000
D 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 = 1
K 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 = 1
QB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 = 1
RB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >= 2
TE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >= 1
WR 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 >= 3
Flex 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 = 6

Now that we’ve got all of that set up, we use the lp function and pull out our picks!  Notice the `all.bin = TRUE`.  We can’t pick half of a Russell Wilson, so our variables must be binary.  In linear programming generally, the variables can be any non-negative real numbers so that, for example, we can buy half a pallet of… oranges?

opt <- lp("max", f.obj, f.con, f.dir, f.rhs, all.bin = TRUE)
picks <- fd[which(opt$solution == 1), ]
kable(picks, format = "markdown", row.names = FALSE)

| Name | Position | Points | Salary |
|:--|:--|--:|--:|
| Pittsburgh Steelers | D | 9.4 | 4500 |
| Blair Walsh | K | 9.6 | 4700 |
| Russell Wilson | QB | 21.5 | 8600 |
| Adrian Peterson | RB | 15.4 | 8400 |
| Giovani Bernard | RB | 9.8 | 5600 |
| Chase Coffman | TE | 8.6 | 4500 |
| Antonio Brown | WR | 20.0 | 9500 |
| Doug Baldwin | WR | 14.4 | 7300 |
| Martavis Bryant | WR | 13.2 | 6900 |
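
A quick check confirms that this lineup respects the rules. The sketch below retypes the nine picks by hand (the `picks_check` name is chosen just for this illustration) and totals them up.

```r
# The nine picks from the table above, retyped by hand
picks_check <- data.frame(
  Position = c("D", "K", "QB", "RB", "RB", "TE", "WR", "WR", "WR"),
  Points   = c(9.4, 9.6, 21.5, 15.4, 9.8, 8.6, 20.0, 14.4, 13.2),
  Salary   = c(4500, 4700, 8600, 8400, 5600, 4500, 9500, 7300, 6900)
)

total_salary <- sum(picks_check$Salary)          # 60000
total_points <- sum(picks_check$Points)          # 121.9
position_counts <- table(picks_check$Position)   # players per position
```

The lineup spends the full $60,000 for 121.9 expected points, with 2 RB, 1 TE, and 3 WR making up the 6 flex-eligible slots.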

Conclusion

That’s the math that I used to not win at daily fantasy sports. Like I wrote, there are more tweaks that you could use to enhance it. I used some of them and ignored others, but it didn’t really work for me. The point is: The algorithm is broadly applicable and you should think about using it outside of fantasy football. If you want the code that generated the Algorithm part of the post, it’s available here and the data is here (the file extensions should be .rmd and .csv, but WordPress is oddly picky about such things).

Go Blackhawks!

Correction: In the original version of this post I mistakenly added a flex position.  AFAIK FanDuel doesn’t have a flex option, but my code did.  I’ve since adjusted the code by decreasing the flex constraint from 7 to 6; i.e.

f.rhs[3:nrow(f.con)] <- c(1, 1, 2, 1, 3, 6)