The Importance of Reading Documentation

Jye Sawtell-Rickson
4 min read · Jul 6, 2018


As a wannabe Data Scientist I use pandas a lot. Not a day goes by where I’m not loading up a DataFrame and analysing it one way or another, but little did I know, I was only living my life at 50%. Just recently I stumbled upon some new pandas features while looking up a reference in the documentation, and boy was I surprised.

I’ve learnt almost everything I know from reading others’ code, competing in competitions and a smattering of courses. While this approach has helped me pick things up fast, it hasn’t made me the most efficient or cleanest coder. My code works, but it can be cumbersome at times, especially since I don’t always use the most ‘pythonic’ structures.

Read on to reach efficiency levels over 9000

Query me Amazed

In particular, I’m regularly doing something like:

transactions[(transactions.booking_date == '2018-01-01') & (transactions.booking_country == 'Singapore') & (transactions.basket_size < 100)].revenue.sum()

This sums the revenue from all transactions booked on the 1st day of 2018, in Singapore, with a basket size under $100; pretty standard stuff. What I learned this week is that instead, it can look like:

transactions.query('booking_date == "2018-01-01" & booking_country == "Singapore" & basket_size < 100').revenue.sum()

That might not look like a whole lot to you, but to me this is so much cleaner, and more intuitive to write too. Rather than creating a Boolean Series for each condition you want, you feed the conditions to pandas as a single string and it does all the work for you. With this, you can string together conditions as you would say them and drop all the repeated DataFrame names. Admittedly, this only works for basic queries, but basic queries make up a significant portion of my day-to-day data analysis and hence can’t be neglected.
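
As a small bonus, .query() can also reference local Python variables by prefixing them with @, which keeps thresholds out of the query string itself. Here’s a minimal sketch (the tiny DataFrame and the max_basket variable are just placeholders for illustration):

import pandas as pd

transactions = pd.DataFrame({
    'booking_country': ['Singapore', 'Japan', 'Singapore'],
    'basket_size': [80, 150, 40],
    'revenue': [10.0, 20.0, 5.0],
})

max_basket = 100
# Same style of filter as above, but the threshold lives in a normal Python variable
transactions.query('booking_country == "Singapore" & basket_size < @max_basket').revenue.sum()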

Why Stop at One?

With this little discovery under my belt I had to go deeper. Another two big wins for me were:

  • .eval(): which allows you to evaluate expressions on columns in a much more succinct form, similar to .query() above. It can also assign new columns in the same line (wow!).

For example, if I want to load some financial data and then calculate our profit, I could do something like this:

df = pd.read_csv('my_data.csv')
df['profit'] = df['revenue'] - df['cost']

But with the power of .eval(), I can transform this into a one-liner:

df = pd.read_csv('my_data.csv') \
.eval('profit = revenue - cost')

  • .assign(): which allows you to add new columns in-line using arbitrary functions. This is similar to .eval() above, but more powerful.

For example, if in the above example we actually wanted to calculate the logarithm of our profit (what? your growth is too slow to plot on a logarithmic chart?), we would do something like:

import numpy as np

df = pd.read_csv('my_data.csv')
df['log_profit'] = (df['revenue'] - df['cost']).apply(np.log)

But we can transform this as follows:

df = pd.read_csv('my_data.csv') \
.assign(log_profit = lambda x: np.log(x.revenue - x.cost))

Most importantly, this one allows you to string together complex operations in a single chain, operations which were previously forced to be broken up in my code.
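
To give a flavour of that chaining, here’s a rough sketch (reusing the hypothetical my_data.csv from above and assuming it also has a booking_country column) that strings .query(), .eval() and .assign() into one pipeline:

import numpy as np
import pandas as pd

df = (pd.read_csv('my_data.csv')                         # load the raw data
        .query('booking_country == "Singapore"')         # keep only the rows we care about
        .eval('profit = revenue - cost')                 # derive a simple column
        .assign(log_profit=lambda x: np.log(x.profit)))  # apply an arbitrary function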

Next Steps

At this point I was hooked and I wanted to read more about reading documentation; turns out there’s a decent amount of information on the web! The pro-tips I was able to distill are:

  1. Read the introduction: See how this point is first? There’s a reason for that. Reading the introduction or summary of a package is a great place to start as it should give you a firm understanding of how the package is meant to be used.
  2. Check the code examples: Reading documentation is good, but for entry-level coders, and even the more seasoned lot, seeing an example of how a new function is used, and how or why one might vary the inputs, can be extremely beneficial.
  3. Don’t stop at one source: Following on from the above point, seeing more is better. Almost all of the commonly used Python libraries will have a blog post written about them (there are just that many people who love Python). Read these, contrast them with the original documentation, even talk to the author! (hint-hint) One piece of documentation can’t cover every possible way to look at the code, and it can’t cater to every audience, so different sources of documentation can be great for getting further perspectives.
  4. [BONUS] Explore the repo: As someone who is still trying to learn all they can, I find that browsing through the code in a repo can also be enlightening. Even when restricted by a style guide, everyone still has their own style of coding, and poking through random parts of a repo hosted on GitHub can teach you new ways of structuring code and, in general, help you to better understand where the author was coming from.

The Point

Reading documentation won’t make you a pro-coder overnight, but it will have some nice knock-on effects.

  • Improve your understanding: At this point, I’m getting repetitive, but it must be said: reading documentation will improve your overall understanding and make you a better coder.
  • Become faster: One interesting and exciting thing about coding is that there are infinitely many ways to do something. Whichever way you’re doing it now probably isn’t optimal (as I found out above). By taking the time to read documentation you’ll be more efficient when it comes to your work.

Conclusion

What I learnt from this experience is that even if you’re a regular user of something, there is still much to learn. In future I’ll make an effort to read through the documentation of my most commonly used libraries to make sure I’m getting the most out of them.

Further Reading

Here are some direct links to key Pandas pages, so you have no excuses:
