5 Pre-Commit Hooks Every Data Engineer Should Know – Data, Life, and Everything in Between

Whether you’re coding for a living or as a hobby, writing quality code is a skill worth honing.

As humans, we all make mistakes—and that’s okay. Perfection isn’t the goal, but improvement is.

In a professional setting, once you’ve written your code and finished a project, a colleague usually steps in to review your work. But what if you don’t have a peer available to help fine-tune and catch mistakes?

That’s where pre-commit hooks come in.

World-renowned Chef Carmy from the show The Bear.

Think of them like Chef Carmy from The Bear—your go-to guy for checking errors, making adjustments, and giving you feedback before things get sent out the door. No drama, just solid advice for improvement.

Pre-commit hooks aren’t just about catching mistakes; they’re a tool that can help you write better code, learn more, and make your future self (and fellow developers) grateful for your clean, readable work.

Writing great code is an art and pre-commit hooks are one of the tools that can help you get closer to mastering it.


What Are Pre-Commit Hooks?

Before we dive into my 5 favorite pre-commit hooks, we first need to understand what a hook is.

Basically, a pre-commit hook is a program that will automatically run in your repo when you run a git commit command in your terminal.

Depending on the hook, it will usually scan your code for errors or rule violations that the program is designed to catch.
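To make that concrete, here’s a minimal sketch of what a hook can look like under the hood. It’s my own illustration, not code from any published hook: the names find_trailing_whitespace and main are made up, and it assumes the usual arrangement where the hook receives the file names to check as arguments and signals failure with a non-zero exit code.

```python
def find_trailing_whitespace(filenames):
    """Return one message per line that ends in spaces or tabs."""
    problems = []
    for filename in filenames:
        with open(filename, encoding="utf-8") as handle:
            for lineno, line in enumerate(handle, start=1):
                text = line.rstrip("\r\n")  # ignore the newline itself
                if text != text.rstrip():
                    problems.append(f"{filename}:{lineno}: trailing whitespace")
    return problems


def main(filenames):
    """Print every problem and return an exit code (0 = pass, 1 = fail)."""
    problems = find_trailing_whitespace(filenames)
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

Wired up as a script, you’d finish with sys.exit(main(sys.argv[1:])) so that a return value of 1 stops the commit until the problem is fixed.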

These hooks are run by a Python package called pre-commit.

You can learn more on pre-commit’s website.

Setting up the hooks is super easy. In your repo:

Create a file named .pre-commit-config.yaml at the root of your repository.
Run pip install pre-commit in your terminal.
Then run pre-commit install.
Done! Now when you run git commit, the hooks you’ve specified in the yaml file will run in order from top to bottom.

The yaml file is where you’ll configure and list your hooks.

You can browse a list of hooks available here.

And adding a hook will look something like this in that yaml file:

repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
    -   id: trailing-whitespace

The keyword “repos” holds a list of repositories that provide hooks. Each top-level hyphen starts one repository, and each hyphen under “hooks” names one hook to run from that repository.

You can add as many hooks as you want in this file!
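If the nesting is easier to see in Python, here’s a rough sketch of the same config written by hand as dictionaries and lists. The helper hook_run_order is a made-up name for illustration; pre-commit itself reads the yaml file, not this structure.

```python
# Hand-written Python equivalent of the yaml config shown earlier.
config = {
    "repos": [
        {
            "repo": "https://github.com/pre-commit/pre-commit-hooks",
            "rev": "v2.3.0",
            "hooks": [
                {"id": "check-yaml"},
                {"id": "end-of-file-fixer"},
                {"id": "trailing-whitespace"},
            ],
        },
    ]
}


def hook_run_order(config):
    """List hook ids top to bottom, the order in which they run."""
    return [hook["id"] for repo in config["repos"] for hook in repo["hooks"]]


print(hook_run_order(config))
```

Adding a second repository means appending another dictionary to the "repos" list, with its own "hooks" list inside.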

As you find hooks that you like, their documentation will usually give you a snippet in the above format that you can copy and paste into your yaml file. I’ve linked them throughout this blog for each hook.

They’ll also show you how you can configure the hook to include or exclude certain rules. Feel free to browse or Google the documentation for each hook.

isort

The first pre-commit hook I’d like to introduce is isort.

This hook is great for sorting your imports in your Python modules.

Let’s say you had the below imports in a Python file:

Before using isort:

import sys
import os
from collections import deque
import numpy as np
import pandas as pd
import requests
import mymodule

After using isort:

# Standard Python Libraries
import os
import sys
from collections import deque

# 3rd party packages
import numpy as np
import pandas as pd
import requests

# Your module
import mymodule

Notice how neatly the imports are grouped and ordered (I added the comments to label each group; isort doesn’t insert them by default):

Standard Library Imports: These are grouped and ordered alphabetically first.
Third-Party Imports: Libraries such as numpy, pandas, and requests are sorted alphabetically and grouped separately from the standard library imports.
Local Application Imports: Any local imports are placed at the bottom.
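Here’s a small sketch of that grouping logic, just to show the idea. It is not how isort is implemented: group_imports and top_level_module are names I made up, the list of local modules is passed in by hand, and the standard-library lookup assumes Python 3.10+ (with a tiny fallback set for older versions).

```python
import sys

# sys.stdlib_module_names exists on Python 3.10+; the fallback set only
# covers the modules used in this example.
STDLIB = getattr(sys, "stdlib_module_names", {"os", "sys", "collections"})


def top_level_module(line):
    """'import numpy as np' -> 'numpy'; 'from collections import deque' -> 'collections'."""
    return line.split()[1].split(".")[0]


def group_imports(lines, local_modules):
    """Group import lines into stdlib, third-party, and local blocks."""
    sections = {"stdlib": [], "third_party": [], "local": []}
    for line in lines:
        module = top_level_module(line)
        if module in local_modules:
            sections["local"].append(line)
        elif module in STDLIB:
            sections["stdlib"].append(line)
        else:
            sections["third_party"].append(line)

    blocks = []
    for name in ("stdlib", "third_party", "local"):
        # Plain "import x" lines first, then "from x import y", each alphabetical.
        ordered = sorted(
            sections[name],
            key=lambda l: (l.startswith("from "), top_level_module(l).lower()),
        )
        if ordered:
            blocks.append("\n".join(ordered))
    return "\n\n".join(blocks)


messy = [
    "import sys",
    "import os",
    "from collections import deque",
    "import numpy as np",
    "import pandas as pd",
    "import requests",
    "import mymodule",
]
print(group_imports(messy, local_modules={"mymodule"}))
```

The real tool works out which modules are local on its own and handles many more cases (multi-line imports, aliases, profiles), so treat this only as a mental model.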

This makes it so much cleaner and easier to read especially if your project has a lot of dependencies.

This is a great tool because as your project grows in size and complexity, you might be adding tons of imports and loading in a lot of methods from various packages.

isort will take care of parsing and organizing it all at the top of your module.

Test it out! Try importing 10 or even 20 names from one package and see how isort gracefully handles the organization.

black

“I can’t think of any single tool in my entire programming career that has given me a bigger productivity increase by its introduction. I can now do refactorings in about 1% of the keystrokes that it would have taken me previously when we had no way for code to format itself.” (Mike Bayer, author of SQLAlchemy, quoted in black’s README)

The next hook was actually my first introduction to code formatting for Python: black.

This is my favorite code formatter for Python!

What I love about black is it’s opinionated.

It has its own paradigm, rules, and views on what makes beautiful code.

And I’m perfectly ok with ceding control and letting black do what it does best: making your code highly legible.

If your code is easy to read and understand, you’re already doing great. Black takes that a step further by making the formatting consistent everywhere.

Brace yourself, the below Python code might give you a stroke:

def my_function( x,y, z  ,   arg2  ,arg3):
  if x  ==y:
   if   x==z  :
      print("x is equal to z")
  elif   x!=y:
      print( "x is not equal to y" )
  else:
    if x>z:print("x is greater than z")
    else:
      print("x is less than z")
  return    True,    arg2,arg3

Now, let’s look at how black handles this mess:

def my_function(x, y, z, arg2, arg3):
    if x == y:
        if x == z:
            print("x is equal to z")
    elif x != y:
        print("x is not equal to y")
    else:
        if x > z:
            print("x is greater than z")
        else:
            print("x is less than z")
    return True, arg2, arg3

What did black do?

Spacing Around Commas: black puts exactly one space after each comma and none before it (e.g., x, y, z, arg2, arg3 instead of x,y, z  ,   arg2  ,arg3).
Indentation: black normalizes the inconsistent indentation to four spaces per level.
Whitespace Around Operators: Ensures that there is a single space on each side of comparison operators (==, !=, >, etc.).
Line Breaking: Compound statements such as if x>z:print(...) are split so the body sits on its own line, and lines longer than the configured limit are wrapped.
Extra Whitespace: Removes unnecessary whitespace (e.g., before the colon in if   x==z  :, inside the parentheses of print( ... ), and after return).

As you can see, black makes the code cleaner and more consistent, and its output follows PEP 8 conventions.
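One way to convince yourself that changes like these are purely cosmetic is to parse the code before and after and compare the syntax trees. The sketch below does that with the standard library’s ast module on a shortened version of the example. The name same_program is made up for illustration, and this is not a description of how black works internally.

```python
import ast

before = '''
def my_function( x,y, z ):
  if x  ==y:
      print( "equal" )
  return    True,    z
'''

after = '''
def my_function(x, y, z):
    if x == y:
        print("equal")
    return True, z
'''


def same_program(source_a, source_b):
    """True when both sources parse to the same syntax tree (layout ignored)."""
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


print(same_program(before, after))
```

Swap == for != in one of the two strings and the comparison fails, because that is a change in meaning, not in formatting.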

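What’s also amazing with black is how neatly it splits long function signatures with tons of parameters.

The below example is an exaggeration, but you’ll see the point. One caveat: black only breaks a signature into one parameter per line when the parameters no longer fit within the line length (88 characters by default), or when you leave a trailing comma after the last parameter. The signature below is 85 characters, so the result shown assumes something like a trailing comma after param8.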

Here’s a complex function before using black:

def complex_function(param1, param2, param3, param4, param5, param6, param7, param8):
    result = param1 + param2 - param3 * param4 / param5 + param6 - param7 * param8
    if result > 0:
        return True
    else:
        return False

After using black:

def complex_function(
    param1,
    param2,
    param3,
    param4,
    param5,
    param6,
    param7,
    param8,
):
    result = param1 + param2 - param3 * param4 / param5 + param6 - param7 * param8
    if result > 0:
        return True
    else:
        return False

Notice how much easier that is to read where each parameter is separated on its own line giving each one space to breathe.

This makes it easier to absorb and understand what’s going on in complicated functions.
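If you’re curious when the split kicks in, here’s a simplified sketch of the wrapping order as I understand it from black’s documentation: try one line, then the parameters together on one indented line, then one parameter per line with a trailing comma. The function wrap_signature is hypothetical and ignores everything else black does (type hints, defaults, pre-existing trailing commas, and so on).

```python
def wrap_signature(name, params, line_length=88, indent="    "):
    """Return a def line, wrapped only as much as needed to fit line_length."""
    one_line = f"def {name}({', '.join(params)}):"
    if len(one_line) <= line_length:
        return one_line

    # Second attempt: all parameters together on one indented line.
    inner = indent + ", ".join(params)
    if len(inner) <= line_length:
        return f"def {name}(\n{inner}\n):"

    # Last resort: one parameter per line, each followed by a comma.
    body = "".join(f"{indent}{param},\n" for param in params)
    return f"def {name}(\n{body}):"


params = [f"param{i}" for i in range(1, 9)]
print(wrap_signature("complex_function", params))
print(wrap_signature("complex_function", params, line_length=60))
```

With the default limit the 85-character signature stays on one line; with the artificially short limit of 60 it is exploded into one parameter per line.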

ruff

If you want to guard against mistakes in your own code and catch human errors before you even make a single commit in git, look no further.

Ruff is a beastly linter.

It will statically analyze your code for any potential issues.

In a professional setting, your peers will review your code before it reaches production.

This review usually happens in a pull request.

It’s the responsibility of the code reviewer to lend an analytical eye and catch any mistakes in the author’s code.

But what if you can catch these mistakes before that point?

For example, let’s say you have an unused import that you left by accident:

import os

def greet(name):
    print(f"Hello, {name}")

When you run the ruff pre-commit hook you’ll see something like this in your terminal:

$ pre-commit run ruff --all-files
ruff.....................................................................Failed
- example.py:1:8: F401 `os` imported but unused

During a project, it’s easy to stop needing a dependency and forget to remove its import.

This is just one example of the many mistakes ruff can catch for you.

But this is why ruff is great. It can catch many human mistakes and help you produce a higher quality commit.
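To give a feel for what “statically analyze” means, here’s a toy unused-import check built on Python’s ast module. It’s a sketch of the idea only: unused_imports is a made-up name, it ignores things like __all__ and names used only inside strings, and ruff’s real implementation is far more thorough.

```python
import ast


def unused_imports(source):
    """Return (line, name) pairs for imported names that are never referenced."""
    tree = ast.parse(source)

    imported = {}  # bound name -> line number of the import
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import os.path" binds the name "os"
                bound = alias.asname or alias.name.split(".")[0]
                imported[bound] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno

    # Every bare name that appears anywhere in the code.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted((line, name) for name, line in imported.items() if name not in used)


example = 'import os\n\ndef greet(name):\n    print(f"Hello, {name}")\n'
print(unused_imports(example))
```

Notice that the code is never executed. The check works purely from the parsed source, which is why a linter can run on every commit without side effects.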

pydocstyle

Documentation is another key consideration in writing quality code.

There’s a saying that you read code more than you write code.

It’s vital to keep in mind that you’re writing code for the future.

Future you, future maintainers, colleagues, or anyone who has visibility into your work.

Which is why docstrings are so important in the development process.

They help provide context, information, and descriptions of what your code does.

This is why I love this next hook called pydocstyle.

It will automatically check if docstrings are lacking at the module, class, and function level.

$ pre-commit run pydocstyle --all-files
pydocstyle...............................................................Failed
- example.py:1: D100 Missing docstring in public module
- example.py:2: D101 Missing docstring in public class
- example.py:4: D102 Missing docstring in public method
- example.py:7: D103 Missing docstring in public function

What’s great is you can configure the rules of this hook.

For example, if you or your team agree that requiring a docstring at the module level is unnecessary, you can easily disable that rule.
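For the curious, the module, class, and function checks can be approximated in a few lines with the standard library. This is only a sketch under my own assumptions: missing_docstrings is a made-up name, it treats any name starting with an underscore as private, and it doesn’t distinguish methods from functions the way pydocstyle’s rule codes do.

```python
import ast


def missing_docstrings(source):
    """Return (line, description) pairs for public items without a docstring."""
    tree = ast.parse(source)
    problems = []

    if ast.get_docstring(tree) is None:
        problems.append((1, "module"))

    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            is_public = not node.name.startswith("_")
            if is_public and ast.get_docstring(node) is None:
                kind = "class" if isinstance(node, ast.ClassDef) else "function"
                problems.append((node.lineno, f"{kind} {node.name}"))

    # Stable sort by line number keeps the report in reading order.
    return sorted(problems, key=lambda problem: problem[0])


example = (
    "class Loader:\n"
    "    def load(self):\n"
    "        return 1\n"
    "\n"
    "def run():\n"
    '    """Run the job."""\n'
    "    return Loader().load()\n"
)
for line, what in missing_docstrings(example):
    print(f"line {line}: missing docstring in {what}")
```

Here run is not reported because it already has a docstring, while the module, the Loader class, and its load method are.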

mypy

One of the drawbacks with Python is that type hints aren’t enforced at runtime.

Meaning, you can add an inaccurate type hint to a variable, like the one below, and the code will still run:

# This is actually a string
number_holder: int = "5 is a number"

This can be very problematic because a wrong type hint misleads whoever reads the code, making issues harder to debug.
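You can see this for yourself with a quick experiment. The function double below is a made-up example; the point is simply that Python runs both lines without complaining about the hints.

```python
# The hint says int, but Python never checks it at runtime.
number_holder: int = "5 is a number"
actual_type = type(number_holder).__name__
print(actual_type)


def double(value: int) -> int:
    """Hinted for ints, but nothing stops a caller from passing a string."""
    return value * 2


result = double("5")  # repeats the string instead of doing arithmetic
print(repr(result))
```

Nothing crashes here, which is exactly the problem: the bug only shows up later, when some other piece of code expects a number.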

Which is why mypy is here to make sure you’re using types the right way in your variables and functions.

Let’s look at another example and show how mypy will guide you towards fixing your types:

# example.py

def add(a: int, b: int) -> int:
    return a + b

def greet(name: str) -> str:
    return "Hello, " + name

def calculate_area(radius: float) -> float:
    return radius * radius

def incorrect_type_example(x: int) -> str:
    return x  # This should return a string, but it returns an integer

You can see above that the type hints for the first 3 functions are correct. The type hints for the parameters and return types look spot on.

But if you look closely at the last function, it takes an int parameter and returns that int, yet the declared return type is str.

Clearly this is a mistake. Let’s see how mypy will address this:

$ pre-commit run mypy --all-files
mypy.....................................................................Failed
- example.py:13: error: Incompatible return value type (got "int", expected "str")

mypy gives very clear explanations of why a type is improperly used.

What’s also great is it gives you the line number in which the code failed so you can quickly identify the root cause of the problem!
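For contrast, here’s what catching the same mistake at runtime would take. mypy finds it without running anything, whereas this sketch has to actually call the function. The helper returns_declared_type is a name I made up, and it only handles simple return hints like int or str.

```python
from typing import get_type_hints


def add(a: int, b: int) -> int:
    return a + b


def incorrect_type_example(x: int) -> str:
    return x  # hinted as str, but really returns the int it was given


def returns_declared_type(func, *args):
    """Call func and check the result against its declared return hint."""
    expected = get_type_hints(func)["return"]
    return isinstance(func(*args), expected)


print(returns_declared_type(add, 1, 2))
print(returns_declared_type(incorrect_type_example, 3))
```

A runtime check like this only covers the inputs you happen to try, which is a big part of why a static checker in a pre-commit hook is so handy.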

Conclusion

Pre-commit hooks are an essential tool in a Data Engineer’s toolbox to produce high quality code.

High quality code means better legibility, increased clarity, and ultimately, a faster way of understanding the flow of logic.

These hooks make the code review process easier for your colleagues because common human mistakes will have already been caught and corrected.

Even if you don’t do code reviews with peers and you code as a hobby, this helps future-you refer back to your code weeks, months, and years later.

Overall, this provides a more seamless experience during development and helps produce professional repositories during projects.

I believe in writing quality code first.

By writing great code from the beginning, you’re lessening the technical debt for yourself, your stakeholders, and peers.

And you’re working more efficiently by catching mistakes now vs later. Nobody wants to spend extra time to fix preventable issues.

Give it a try in your next project!

Try one hook or several hooks at a time and see what makes sense for your project needs.

And if you have any favorite hooks, I’d love to learn about them in the comments!