Goran Peretin

Jupyter Notebook Setup for Research and Production

October 23, 2024

Over time, I've accumulated over 100 Jupyter notebooks containing various research, strategies, reports, and other trading-related things. Versioning Jupyter notebooks can be a bit of a pain, and there are packages that help with that. Of those, I've tried jupytext but never stuck with it. To prevent a disaster in case of a disk failure, I've set up twice-daily off-box backups of the notebooks folder on the server.

I wanted to improve on that setup and, while I’m at it, solve a few other issues:

Code reuse – There's a common setup that repeats across notebooks, where some behavior is reused in many of them, e.g., data fetching, some DataFrame manipulation functions, and graph generation. It would be great to have that behavior extracted into a single place.
Editor – I've been using Jupyter's browser editor for 4+ years now, and it's time to upgrade to something better—ideally, a proper text editor.
Notebook execution – I have notebooks where I've iterated on something to produce a report, and now I’d like to have them run daily. So the ideal solution would involve a way to run a Jupyter notebook, collect some results, and store them somewhere.
Local and remote – My entire setup is currently running on a Hetzner dedicated server (a future post on this is incoming), and some portions of the new setup will still have to run there because (a) that’s where the data is (and it's not trivial to replicate that amount of data) and (b) batch jobs. However, I would like to move some parts locally, e.g., development and testing of notebooks.

Editor

Python snippet in VS Code interactive mode

I recently discovered VS Code Interactive mode for Python, and it checked off some of the boxes I had for a notebook/development solution. I decided to try it and liked it a lot. The main reason I liked it is that the notebook is valid Python code, not code intermingled with output in a JSON blob like .ipynb. The cell-focused development workflow is still present, though. There are also some nice features like outputs in a separate panel and a solid data explorer. So, even though I don't use VS Code as my main editor, I switched to it for notebook development.

Having notebooks as pure Python files also solved the versioning issue, as well as code reuse. I can now simply extract common code into another Python module and import it where needed.
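As a rough illustration, a notebook in this style might look like the sketch below. The `# %%` comments are the cell markers VS Code recognizes in a plain Python file. Everything else here is hypothetical: the `daily_returns` helper and the sample prices are made up, and in a real repo the helper would live in a shared module and be imported rather than defined inline.

```python
# %% Shared helper. Defined inline so the sketch is self-contained; in
# practice it would be imported from a common module in the research repo.
def daily_returns(prices: list[float]) -> list[float]:
    """Return simple period-over-period returns for a price series."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


# %% Load data (hypothetical hard-coded sample instead of a real data fetch)
prices = [100.0, 110.0, 99.0]

# %% Compute the report
returns = daily_returns(prices)
total_return = prices[-1] / prices[0] - 1

# %% Inspect the result interactively
print(returns, total_return)
```

Because the file is ordinary Python, it diffs cleanly in git and each cell can still be run on its own from the editor.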

Batch Jobs

Next up was figuring out how to run a notebook as a batch job. I'm currently using Dagster as a workflow automation tool for my trading, which includes running anything that resembles a cron job. So, I needed a way to run the .py notebook from a Dagster job. One solution was to wrap the code in the notebook in a function, then import that function and run it. However, I would then lose the notebook-centric development style of working and iterating on individual cells. Wrapping each cell into its own function just sounded too cumbersome – I didn't want to have to make changes to a notebook just because I wanted to run it as a batch job.

The solution ended up being simpler than I thought it would be: since all the code in the notebook is top-level, meaning it’s executed on import, all that was needed was to import the notebook from my batch job runner. importlib to the rescue! Additionally, when developing/testing a notebook, I don’t always want to store or persist the results. Sometimes just viewing a DataFrame or any other output is enough. However, for running batch jobs, I usually want to store the results, which means I needed a way to fetch some outputs from the notebook.

Here's the code snippet that does both of the above:

import importlib


def run_notebook(notebook: str, results: list[str]) -> list:
    """Runs a notebook with a given name.

    Since notebooks contain top-level code which is executed on import,
    this will just import the notebook and return the variables from
    the module namespace.

    Returns a list of variables matching the input parameter `results`.
    Variables can be anything that's defined in the notebook—DataFrames,
    lists, simple types, etc.

    """
    module = importlib.import_module(notebook)
    # Look names up in the module namespace; names the notebook did not
    # define come back as None instead of raising.
    variables = vars(module)
    return [variables.get(r) for r in results]

So, if I wanted to run a notebook called performance_report and fetch the perf_df DataFrame, I would do:

perf_df = run_notebook("performance_report", results=["perf_df"])[0]

The results are then persisted from within the Dagster job. I have that snippet alongside the notebooks in the research git repo, so anything that wants to run any notebook just calls that function. In my case, my Dagster deployment has a dependency on the research git repo, so any notebook in the repo can be executed like this.
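One caveat worth keeping in mind with the import-based approach: Python caches imported modules in sys.modules, so importing the same notebook a second time in the same process returns the cached module without re-running its top-level code. If that matters for your runner (for example, a long-lived process that runs the same notebook more than once), a variant along these lines could force a fresh run. This is only a sketch: `run_notebook_fresh` and the throwaway `demo_report` notebook are hypothetical names used for the demo.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def run_notebook_fresh(notebook: str, results: list[str]) -> list:
    """Like run_notebook, but forces re-execution on every call.

    import_module() returns the cached module from sys.modules if the
    notebook was already imported in this process, so its top-level code
    would not run again. Dropping the cache entry forces a fresh import.
    """
    sys.modules.pop(notebook, None)
    importlib.invalidate_caches()
    module = importlib.import_module(notebook)
    namespace = vars(module)
    return [namespace.get(name) for name in results]


# Hypothetical throwaway notebook, written to a temp directory for the demo.
notebook_dir = Path(tempfile.mkdtemp())
(notebook_dir / "demo_report.py").write_text(
    "# %%\n"
    "values = [1, 2, 3]\n"
    "# %%\n"
    "total = sum(values)\n"
)
sys.path.insert(0, str(notebook_dir))

# Names the notebook does not define come back as None.
total, missing = run_notebook_fresh("demo_report", ["total", "not_defined"])
print(total, missing)
```

Whether this is needed depends on how the jobs are launched; if every run starts in a new process, the plain import already executes the notebook each time.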

Future Improvements

One thing I'd like to try is running VS Code on the server and using the remote development feature to develop on the server (thanks Senko for the suggestion). This would help with notebooks for which I can’t download a sample test dataset to my workstation for development and testing. It should also make it easier to develop on the go since the handoff between machines would be simpler in that case. However, I haven’t found my current setup (commit + push to git + pull on another machine) to cause much friction yet.

Deliberate practice in software engineering

July 28, 2017

Recently I read a book called Peak, by Anders Ericsson, in which the author describes how people become good (world-class good) at what they do. I don't want to spoil it for you, but I'll say that the author believes that talent has nothing (or very little) to do with it. Continued, persistent, focused and targeted training is the key. The author calls it deliberate practice.

There are many examples in the book of what exactly deliberate practice looks like for people mastering chess, a musical instrument or a simpler task such as trying to memorize a lot of numbers.

They all come down to these main principles:

Long term – The effort these people put in often spans decades.
Hard – The practice has to be outside of one's comfort zone. It is hard to learn something new and improve if we keep repeating what we already know. I also think that this is what differentiates people with “10 years of experience” from those with “1 year repeated 10 times”.
Specific – Practice has to be focused on a specific area, for example, playing a certain note, mastering a specific chess strategy or achieving a certain milestone. Progress has to be measurable.
Feedback – Proper feedback is critical. Not only does it help us improve in the right way, but it is often also crucial for the motivation to keep working, which is required over such a long time.

The author provides much more evidence in the book as to what exactly happens during deliberate practice, but here's the brief: deliberate practice helps to develop new mental representations that are held in long-term memory. Expert performance is the ability to see patterns in what seems random or confusing to people with less developed mental representations; therefore, the main goal of deliberate practice is developing new mental representations.

There are two prerequisites for deliberate practice to be most effective:

Reasonably well developed field – Best performers have attained a level of performance that clearly sets them apart from people just entering the field, and we can easily identify those experts.
Teacher – A teacher provides practice activities designed to help a student improve his or her performance.

When it comes to software engineering, we're a bit unlucky, as our field doesn't lend itself nicely to deliberate practice. It is a reasonably new field and we're still trying to figure out what exactly sets experts apart. Additionally, it is sometimes hard to measure progress in software engineering performance. There are areas that are measurable such as algorithmic programming which even has a competitive scene, but those hardly encompass what we as software engineers do. Luckily, I think we're over the “how many lines of code can one write” as a measure of performance.

Furthermore, the author says: “Deliberate practice develops skills that other people have already figured out how to do and for which effective training techniques have been established.” I don't think we've figured out software engineering yet, and we definitely don't have effective training techniques. I also don't think this is necessarily a bad thing; it just means that more is left for us to experiment with. Additionally, our field and the environment in which we work change so fast that it will be hard to establish techniques that will still be valid in a couple of years or decades. The computers we have today are vastly more powerful than those of 20 years ago, the problems we're facing today are very different and require different approaches and skills to solve, and business requirements have evolved as well. While we're on this topic, I don't think our tools have evolved proportionally – we're still using the same tools and methods to develop software that we did 40 or 50 years ago, and it seems like we're convinced a new JavaScript framework will save the day. This is a topic for another post, but Bret Victor has some great content on this.

What can we do

However, we can get close. While we might not know exactly what it means to be an expert or a master in software engineering, we can still employ the ideas of deliberate practice to improve.

Here are some things that worked for me:

Try out different programming languages – If you're using Java or Python at your day job, try Clojure, Scala or Haskell. Pick a different programming paradigm; that will make it easier to achieve your goal of creating mental representations. I hadn't realized the benefit of this until a job change forced me to try another language, and now I'm happy I did. While you're experimenting, don't get bogged down by the syntax, tooling, ecosystem and libraries; these are all incidental to your mission here. Focus on how and why things are done differently. For example, if you're starting with Haskell, ask yourself these questions: How do we deal with state here? How is that different from what I know? What are the benefits and drawbacks of using pure functions, and how does that impact the program structure? Do they have exceptions here, and how do I deal with those? What exactly does lazy mean?
Try doing the same thing a different way – If you usually write the code first, then tests, try doing it the other way around. If you develop core abstractions first and work your way out from there, try developing the API first and working down.
Take on a scary task – One of those for which you would say “no way I know how to do this”. This can be contributing to an open source library you use or documenting a certain part of a system you know nothing about.
Dive deep – Understand how a library, tool or system works. Ask yourself why it works that way and whether you would build it differently. Try to find out the context that the engineers had when they were designing it. Don't just glance over the code; draw diagrams. What are the inputs and outputs? Which data stores are being used? What does the API look like? For which use cases do you think this would work well, and for which would it not?
Find your expert – Find a friend or coworker that you know is better than you in a certain area. Discuss his or her approach to working in that area and best practices, and ask for suggestions on tasks you could do to improve. Request feedback and discuss solutions.

Lastly, remember, by definition deliberate practice is outside of your comfort zone, which means it will and should be hard. If you're struggling, that's good, keep it up.

Client-side Templating with Jade and Node.js

March 13, 2012

At my current job, we heavily use client-side templating, specifically the Jade template engine. We've been using Jade for a couple of months now and are very happy with it. In this post, I will describe how client-side templating works and how to set up a development environment that allows you to work on templates and then view them in the browser quickly.

We use Jade in client-side mode, which means that templates are compiled to JavaScript on the server (using Node.js) and populated with values on the client. If this sounds weird at first, read on; it should get clearer by the end of the post :)

The whole workflow looks like this:

Edit .jade template file
Compile template files to JavaScript
Refresh page in browser to see changes

The key here is detecting that one of the .jade files has changed and recompiling the templates; you definitely don't want to do that by hand.

Compiling templates

Compiling templates in Jade is very simple:

https://gist.github.com/3065949.js?file=sample_jade_compile.js

Now you have func, a JavaScript function that you call with the values you want to render your template with; it returns HTML that you just need to put somewhere on the page. Since func is executed in the client's web browser, we need to save func to a JavaScript file (let's call it templates.js) and serve it like any other static file. So, templates.js will contain all of our Jade templates compiled to JavaScript.

Since templates are usually in .jade files, we need to parse all files and compile them. For that purpose, I wrote this short Node.js script:

https://gist.github.com/3065819.js?file=jade_compile.js

In order to run this script, you need to have Node.js and Jade installed. After you install Node.js, installing Jade is simple:

npm install jade

Now run the script:

node jade_compile.js templates/ output/

If you go to the output folder, you will see a file called templates.js, which contains all your templates. You do not have to know what compiled templates look like, but if you are interested in how exactly Jade compiles templates to JavaScript, you can examine the templates.js file. Inside, you'll find a bunch of JavaScript functions assigned to variables, one variable for each .jade file found in the templates folder. E.g., if you had a file called login.jade, then templates.js would contain a function named login which, when called, returns the HTML for your login form.

This is probably not the best way to do this because you have to link templates.js in your client-side code, and since we know that JavaScript sucks at namespaces, anything else called login would cause a name clash. I deal with this by putting a _tpl suffix on each of our .jade files, so we can access templates in the browser with a simple login_tpl() call. If your application is really large, the best solution would probably be to introduce namespaces into the templates.js file.

Detecting changes

Now that we have a way to compile all our templates, we need some way of detecting changes to .jade files and running the compilation automatically. I use the Watchdog library (a Python package), which comes with a simple watchmedo command:

https://gist.github.com/3066361.js?file=run_watchmedo.sh

Basically, this command tells Watchdog to start monitoring *.jade files in the templates folder and, if anything changes, to run the node jade_compile.js templates/ output/ command.
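To make the idea of change detection concrete, here is a minimal sketch of what such a watcher has to do. It is not Watchdog's API and not the command from the gist above; it is a plain standard-library polling stand-in that compares file modification times between two snapshots. The `snapshot` and `changed_files` helpers and the `login_tpl.jade` sample file are hypothetical names used only for illustration.

```python
import os
import tempfile
from pathlib import Path


def snapshot(folder: Path, pattern: str = "*.jade") -> dict[Path, int]:
    """Map every matching file under folder to its modification time (ns)."""
    return {p: p.stat().st_mtime_ns for p in folder.rglob(pattern)}


def changed_files(before: dict[Path, int], after: dict[Path, int]) -> set[Path]:
    """Return files that were added, removed or modified between snapshots."""
    all_paths = before.keys() | after.keys()
    return {p for p in all_paths if before.get(p) != after.get(p)}


# Demo on a throwaway folder with a hypothetical login_tpl.jade template.
templates = Path(tempfile.mkdtemp())
template = templates / "login_tpl.jade"
template.write_text("form\n  input(name='user')\n")
before = snapshot(templates)

# Simulate an edit by bumping the modification time by one second.
stat = template.stat()
os.utime(template, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
after = snapshot(templates)

if changed_files(before, after):
    # A real watcher would run `node jade_compile.js templates/ output/` here.
    print("templates changed, recompile")
```

A real setup would take snapshots in a loop with a short sleep, or simply leave the job to Watchdog as described above.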

Performance

If you are worried about compiling all your templates every time you change something in any template, don't be. Node.js is very fast at this compilation; you probably can't Alt+Tab to the browser faster than Node compiles your templates. If you have a lot of templates, try this way first, and if it turns out to be too slow for you, then introduce some logic into the change detection process. The first thing that comes to my mind is separating templates into smaller chunks (multiple folders) and treating each of them as described above.

Conclusion

This whole process can seem a bit complicated when you first read it, but in reality it's just a matter of integrating a couple of tools the way you need them to work. It took me a couple of hours to set up, and I haven't touched it for a couple of months now; it just works.