I love XKCD. According to its website, the webcomic is about romance, sarcasm, math, and language, but after so many years, Randall Munroe has explored many other topics as well. Some of them more than once.
I wanted to know the structure of this fantastic stick-figure world he created, so in my spare time I scraped all his webcomics from the interblag (or blagosphere), then extracted all the relevant words from each, and finally plotted the result below. In the following graph, each node represents a comic, and two comics share an edge if they contain words in common. …
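The edge rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the plot: the `comics` dictionary and its word sets are hypothetical toy data, and `build_edges` is a name made up for this example.

```python
from itertools import combinations

# Hypothetical toy data: comic id -> set of relevant words.
# These are NOT the real scraped words, only placeholders.
comics = {
    "comic_a": {"python", "gravity", "import"},
    "comic_b": {"sudo", "sandwich"},
    "comic_c": {"python", "regex"},
}


def build_edges(words_by_comic):
    """Return (a, b, shared_words) for every pair of comics sharing a word."""
    edges = []
    # Visit each unordered pair of comics exactly once.
    for a, b in combinations(sorted(words_by_comic), 2):
        shared = words_by_comic[a] & words_by_comic[b]
        if shared:  # an edge exists only if the intersection is non-empty
            edges.append((a, b, shared))
    return edges


edges = build_edges(comics)
for a, b, shared in edges:
    print(a, "--", b, "via", sorted(shared))
```

Comparing every pair is quadratic in the number of comics, which is fine for a few thousand nodes. The resulting edge list could then be exported to a graph tool for layout.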
Many Data Scientists use an automated tool for A/B tests, like Google’s Firebase or Optimize. These tools let you choose which metrics to pay attention to, and they automatically tell you when your test has reached statistical significance.
However, sometimes you need to do everything manually. That’s why Data Scientists get hired, right? Maybe the tool isn’t flexible enough for your needs, or it’s buggy and causes a disruption in user experience.
Anyway, here are detailed instructions on how to add the significance level of your test directly in BigQuery. In 3 simple steps.
First, you need to create a scheduled query in BigQuery that periodically gets the significance level of your test. I make the query run every day. The only things to modify in the following code are how you get the metrics of your A/B test (I get them through SQL queries, as I’m already logging them in BigQuery) and the name of your destination table. …
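To make the idea of a "significance level" concrete, here is a rough sketch in Python rather than BigQuery SQL. It assumes the metric is a conversion rate and uses a two-sided two-proportion z-test. The excerpt does not say which test the scheduled query uses, so both the test and the function name `two_proportion_p_value` are assumptions made for illustration.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up numbers: 100/1000 conversions in A, 150/1000 in B.
p_value = two_proportion_p_value(100, 1000, 150, 1000)
print(f"p-value: {p_value:.5f}")
```

A small p-value suggests the difference between variants is unlikely to be noise. Checking it every day, as a scheduled query would, increases the chance of a false positive unless the threshold is adjusted for repeated looks.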
While working on my Slack bot that knows how to transform business questions into SQL and answer back, I found myself comparing the 2 most used Python libraries for natural language processing: spaCy and NLTK. Here are some differences I found — with examples.
I used this lyrics dataset from Kaggle — I recognize about 5% of the artists present.
We will be using the following helper functions:
First, I compared the running times of tokenizing the lyrics. I repeated the experiment 100 times to get a more reliable estimate.
Notice how this invalidates the analysis here. As spaCy now supports tokenization without analyzing the semantic structure, it’s not slower than NLTK anymore. …
10 days after joining fromAtoB, in September 2019, when I was still living in a hotel, we decided to restructure our Google DataStudio dashboards. The goal was to help people understand what questions could already be answered with a chart. Simple, repeated questions about the data should not be manually answered every time they are asked. In an ideal world, people should be able to answer them themselves by quickly accessing a dashboard.
To make our dashboards more user friendly, I had the idea of developing a Slack bot. A friendly Slack bot that would redirect users to our dashboards. …
The other day I needed to revise Gephi layouts when preparing for a teaching/consultancy service for a friend who does Social Network Analysis. Here is a summary of what I re-learned.
Gephi is an amazing open-source network analysis and (interactive!) visualization software with tons of really useful tools for exploring graph data, calculating statistics, detecting clusters, communities, etc. It requires no coding skills. Even if you have coding skills, please stop using Python’s NetworkX for a bit and try Gephi. It’s worth it.
One of the very nice features Gephi offers is a bunch of different layout algorithms — that is, the way you see the graph live. …
Today I woke up and I realized that even though I love automating things, I kept repeating the following behavior:
I had always wanted to just type jupyter-notebook experiment.ipynb and for a named notebook to appear. That would save me about 10 precious seconds that I could maybe use to obsess even more about Coronavirus numbers!
So here’s a quick hack to never, ever have to rename notebooks again. It works perfectly well on macOS and Linux, and you can do something similar on Windows. …
In a previous post, I described a neat trick to try when you are getting a high training score but a low test score. The idea was that maybe your test set has a different distribution from the training set, and you can actually find out whether that’s the case with a bit of help from (more) Machine Learning.
At the end of the post, I mentioned that the problem can also be tackled with a statistical test: namely, the Kolmogorov-Smirnov test. How does it work?
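As a rough illustration, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical cumulative distribution functions of two samples. The sketch below computes it with the standard library only. The function name `ks_statistic` and the sample values are made up for this example, and it computes only the statistic, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at data
    # points, so it is enough to compare them at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up feature values standing in for a training and a test column.
train_feature = [1, 2, 3, 4]
test_feature = [3, 4, 5, 6]
print(ks_statistic(train_feature, test_feature))
```

A statistic near 0 means the two samples look alike, while a value near 1 means they barely overlap. In practice, a library routine such as `scipy.stats.ks_2samp` also returns a p-value.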
If you are a Data Scientist, this probably happened to you: you got excellent results for your model during the learning process, but when using the test set, or after deploying to production, you get a much lower score: everything just goes wrong.
Am I overfitting? Do I have a bug in the code? Am I suffering from data leakage?
Sometimes, the distribution of data in the test set is very different from the one of the training/validation set. Maybe you are working with time series and the test data belongs to our post April 2020 world 🦠. …
The last couple of years have seen something of a boom in image recognition tools, especially the ones that use GANs (Generative Adversarial Networks, an incredible idea) to achieve their goals.
Here are the new toys we can play with (can you imagine the level of madness of the tools that weren’t released to the public?):
Who hasn’t played with this Russian app? It lets you change your age and hairstyle, and even modify your gender, something that caught quite a lot of attention from the transgender community. It’s available on both Android and iOS. The non-premium version is good enough to play with. Try it out! …
The “loom of abundance” scam (also known as the “flower of abundance” or the “mandala of abundance”) is already well known, and it has now resurfaced using feminist rhetoric.
The idea is very simple: look at the flower above. Each petal represents a person. Imagine that all the Fire petals, the ones farthest from the center, are missing. These are the people who will make the initial investment, which is around 500 dollars; each of the 4 Air members is responsible for recruiting 2 Fire members. …