How To Become a Data Engineer – Part 1

We are in a new age where AI is booming, and companies are rushing to strike gold by extracting value from this technology.

But for companies to be successful with AI, they need a strong foundation: a reliable source of data to feed their models.

That foundation is built by Data Engineers.

Data Engineers build and provide the fertile soil needed for companies to extract value from their data and grow their assets.

A question on r/dataengineering that gets asked frequently is:

How do I become a Data Engineer?

Before I answer this question, let’s first understand what a Data Engineer is and what value they bring through an example.

A popular example of AI right now is ChatGPT.

You chat with this bot and it magically provides you answers to your pressing questions or issues.

How did this bot come up with those answers? There’s a combination of machine learning, algorithms, and statistics going on under the hood. But how does the model learn human language and run all those calculations? It’s using data.

Where does that data come from? And how does it get sourced?

You guessed it. Your one and only, Data Engineer.

Data engineers are key to powering AI and making sure all the data needs in the world are met. We grab data from different places, clean it up, and make it ready for clients to dig into.

Our job is to make sure the data is solid, so people can pull out the best insights and make smarter decisions.

So What’s The Sauce?

It’s very important to have fundamental knowledge of a job before you do the job.

Before you can become a Data Engineer, you should have familiarity working with data.

For those in school, this doesn’t mean you can’t go straight from undergrad or grad school to a Data Engineer role. You should absolutely strive for that and go for it. Nothing should stop you from applying if you have the experience and skills to back it up.

But not all of us have the opportunity to take shortcuts. And if there’s one thing I’ve learned in life, it’s that your career is rarely linear.

Sometimes your career is a game of Chutes and Ladders. It’s OK to take a lower-paying job or title if it means you can climb into a better one later down the road. Be a long-term player in life, not a short-term player.

Some of us are in transition.

Regardless of your current status in life, you’re reading this blog because you suddenly find yourself wanting to join this new AI movement and work in the tech industry as a Data Engineer. And I’m happy to share my experience so that you know one of the many paths towards getting there.

Also, I want to point out that some of the best Data Engineers I know didn’t have a typical Computer Science degree. They came from all kinds of backgrounds, and I believe having diverse experiences makes you a unique individual who will add incredible value to any Data Engineering team.

Step 1: Get Experience at a Low Level

One of my recommendations is to start at the lowest level: a Data Analyst who does ETL- or ELT-type work.

The reason is that Data Engineering is not an entry-level job.

You will struggle a lot and quickly become overwhelmed if you don’t know the fundamentals of data engineering, which is why pivoting to a Data Analyst job is a great stepping stone.

Giving yourself the foundational experience of Data Engineering through a Data Analyst role is going to be key for advancing your career.

By the time you apply for a Data Engineer job, you’ll already have some experience doing the skeleton or backbone of the actual work, and your skills will be transferable.

And as a Data Analyst, you will hopefully get your hands on the dirtiest, least normalized data. I love this idea because this is exactly how I learned to work with data.

I used to work for a software company that provides a POS (Point of Sale) system for car manufacturers.

I would receive car-parts data as Excel sheets dropped into an SFTP server. I’d have to figure out the primary keys in each dataset in order to perform joins and further manipulation, creating a final dataset to load into a Microsoft SQL Server database.

That database fed the POS system, which would then allow a client to search for their own car part number in the UI.
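That key-hunting step can be sketched with pandas. Everything below is hypothetical sample data (the column names are my own, not from the actual job): a candidate primary key is a column with no nulls and no duplicates, and once you’ve found it you can join on it.

```python
import pandas as pd

# Hypothetical stand-ins for two client Excel sheets.
parts = pd.DataFrame({
    "PartNumber": ["12345", "23456", "34567"],
    "Description": ["Oil Filter", "Air Filter", "Spark Plug"],
})
prices = pd.DataFrame({
    "PartNumber": ["12345", "23456"],
    "Price": [9.99, 14.50],
})

def is_candidate_key(df: pd.DataFrame, col: str) -> bool:
    """A candidate primary key has no nulls and no duplicate values."""
    return bool(df[col].notna().all() and df[col].is_unique)

print(is_candidate_key(parts, "PartNumber"))  # True

# Once the key is known, join the datasets on it.
final = parts.merge(prices, on="PartNumber", how="left")
print(final)
```

A left join keeps every part, even ones the second sheet doesn’t cover — usually what you want when one sheet is the master list.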

Step 2: Get Your Hands on Dirty Data

What was fun about this job is that clients would sometimes send us the wackiest data.

The Excel sheets would have pictures embedded in them, rows would sometimes be merged (forcing you to unmerge them), and formatting was broken because Excel likes to do its own thing when representing certain kinds of numbers or dates.

It was a hot mess to say the least. A lot of cleaning, manual work, and face palming ensued for me and the team.

I didn’t realize it at the time, but this was a blessing in disguise.

Having to work with the dirtiest data taught me how to clean data and turn it into a diamond by the time I served it up to the database.

For example, when loading data into our database, the file format we used was CSV. And because of the nature of CSVs, you can’t have a stray comma mixed into a value unless the field is quoted. Otherwise, the parser interprets your comma as a separator and you end up loading an incorrect value into the wrong column.

Let’s say you had an Excel file full of car parts and one of your values was Oil, filter in a column called Description:

+------------+-------------------+------------+
| PartNumber | Description       | Date       |
+------------+-------------------+------------+
| 12345      | Oil, Filter       | 2025-02-01 |
| 23456      | Air Filter        | 2025-02-02 |
| 34567      | Spark Plug        | 2025-02-03 |
+------------+-------------------+------------+

If you left that comma in the description then your data would get parsed incorrectly.

You would see the data appear as Oil in the Description column and Filter in the Date column, with the real date pushed into a nonexistent fourth column. Then your database would yell at you, because you can’t put a string in a date-typed column:

+------------+-------------------+------------+
| PartNumber | Description       | Date       |
+------------+-------------------+------------+
| 12345      | Oil               | Filter     |
| 23456      | Air Filter        | 2025-02-02 |
| 34567      | Spark Plug        | 2025-02-03 |
+------------+-------------------+------------+

When really, you wanted Oil, Filter stored in the Description column as a single value.
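You can see both behaviors with Python’s built-in csv module — a small sketch, not our actual loader. An unquoted comma splits the row in the wrong place, while standard CSV quoting keeps the comma inside the value:

```python
import csv
import io

# Unquoted: the comma inside "Oil, Filter" is read as a field separator.
bad_row = next(csv.reader(io.StringIO("12345,Oil, Filter,2025-02-01\n")))
print(bad_row)   # four fields instead of three: ['12345', 'Oil', ' Filter', '2025-02-01']

# Quoted: the comma stays inside the value.
good_row = next(csv.reader(io.StringIO('12345,"Oil, Filter",2025-02-01\n')))
print(good_row)  # three fields: ['12345', 'Oil, Filter', '2025-02-01']
```

Stripping the commas out entirely, as we did, is the workaround for when the loading process can’t handle quoted fields.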

Imagine having to do this cleaning task across multiple files with hundreds or thousands of rows and enough columns to make your eyes glaze over.

Now, the oonga-boonga way of doing this is to open Excel and manually replace every comma with an empty string.

But there’s a much better tool to automate all of this. And that’s Python.

Some cleaning tasks were so repetitive that I got tired of it so I would create my own scripts in Python to automate redundant tasks.

I had never programmed in my life, but I knew that if I wanted to do my tasks efficiently and automate the way I worked with data, I was going to have to code at some point.

This was the prime time to teach myself, so I dove in head first.

Step 3: Learn the Fundamentals of Python

I could go on all day about the many things I love about this programming language.

Ultimately, I committed to learning the language and this was the hammer I needed to solve my problem at work.

And if I can dream my solution, I can build it.

After a lot of Googling, debugging, and testing, I wrote a script and turned it into an executable file (.exe). When you ran it, a window would pop up asking for the path to your input file and the name you wanted for the new output file.

Under the hood, I was using a popular package called Pandas to remove commas in values, remove other special characters, rename the file, and output a cleaned Excel sheet.
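Here’s a minimal sketch of that kind of cleaner. The exact character list and column handling are my assumptions, and it works on an in-memory DataFrame; the real script would read and write Excel files with pd.read_excel and to_excel.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

def clean_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip commas and other troublesome characters from every text column."""
    cleaned = df.copy()
    for col in cleaned.columns:
        if is_string_dtype(cleaned[col]):
            cleaned[col] = (
                cleaned[col]
                .astype(str)
                .str.replace(r"[,#@!]", "", regex=True)  # characters to strip: my assumption
                .str.strip()
            )
    return cleaned

# In the real script this would come from pd.read_excel(input_path).
raw = pd.DataFrame({
    "PartNumber": ["12345", "23456"],
    "Description": ["Oil, Filter ", "Air Filter"],
})
print(clean_text_columns(raw))
```

Working on a copy keeps the original DataFrame untouched, which makes it easy to compare before and after while debugging.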

Once I got this working, I distributed it to my team, and we each used it to clean client files, roughly tripling our speed of development.

All while reducing errors and reserving our energy and brain power for further data manipulation.

Here’s my GitHub repo for this project: https://github.com/agbulosk/Clean_File

If you can find any opportunity to automate or semi-automate a redundant task you do at your current job, you’re already on your way to becoming a successful Data Engineer.

You can also apply this same goal in your personal life and create fun side projects.

What are redundant tasks you do all the time in your downtime? Could those tasks be automated or semi-automated?

Also, I highly recommend checking out my favorite resources to learn the basics of Python:

Step 4: Learn the Fundamentals of SQL

When working with data, it’s inevitable that you will end up working with a database.

Databases are a common way of storing and managing data.

I recommend you learn the fundamentals of how a database works so that when you encounter a database in the wild, you will be prepared and know how to handle them!

Depending on the type of database, you may encounter different dialects of SQL.

But don’t stress: the dialects are broadly similar, so once you learn one flavor of SQL you’ve essentially learned them all.

One dialect I recommend you learn, since it translates easily to the others, is T-SQL, Microsoft’s flavor of SQL.

And the resource I found most informative, concise, and valuable for learning T-SQL is this book:

One book as your bible for learning SQL. Pretty nice right?

What I love about this book:

  • Covers fundamentals all the way to advanced concepts.
  • Provides a free sample database you can use in SQL Server Management Studio (SSMS).
  • Includes problem sets after each chapter along with solutions.

To top it off, this book was thorough enough to help me nail any technical assessment or SQL questions during my interviews for Data Engineer roles! The tools and concepts you learn in this book are invaluable.

What’s great is that some concepts you learn in T-SQL will also be transferable to Python and vice versa.

Both languages go hand in hand and pair extremely well.
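For example, the same left join can be written in SQL or in pandas. This sketch uses Python’s built-in sqlite3 module — a different dialect than T-SQL, but the JOIN concept carries over unchanged:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parts  (PartNumber TEXT, Description TEXT);
    CREATE TABLE prices (PartNumber TEXT, Price REAL);
    INSERT INTO parts  VALUES ('12345', 'Oil Filter'), ('23456', 'Air Filter');
    INSERT INTO prices VALUES ('12345', 9.99);
""")

# SQL version: LEFT JOIN keeps every part, even ones with no price yet.
sql_result = pd.read_sql(
    """
    SELECT p.PartNumber, p.Description, pr.Price
    FROM parts AS p
    LEFT JOIN prices AS pr ON p.PartNumber = pr.PartNumber
    """,
    conn,
)

# pandas version of the exact same join.
parts_df = pd.read_sql("SELECT * FROM parts", conn)
prices_df = pd.read_sql("SELECT * FROM prices", conn)
pandas_result = parts_df.merge(prices_df, on="PartNumber", how="left")

print(sql_result)
print(pandas_result)
```

The mental model is identical — name the key, the join type, and which table sits on the left — which is exactly why the two languages pair so well.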

Conclusion

Key Takeaways:

  • There’s no shortcut to being a Data Engineer.
  • Get experience as a Data Analyst so you can build the foundational experience to excel at working as a Data Engineer.
  • By building that foundation, you will show employers your skills and experience are transferable and that you can handle the work of a Data Engineer even if you don’t check every box in the job description.
  • Find dirty data that needs to be cleaned, transformed, and loaded into a storage layer. This is the basic concept of an ETL pipeline, which we’ll dive into more in part 2 of this blog.
  • Learn the two key programming languages that serve as your bread and butter: SQL and Python.
  • Leverage these two languages to automate the way you handle data.

I believe anyone can be a Data Engineer if you are driven enough.

And while it doesn’t hurt to get a Computer Science degree, certification, bootcamp, or master’s degree to gain an edge in the job market for Data Engineer roles, I’m your black-sheep example of someone who didn’t pursue any of that and still landed the job.

I worked in many roles in IT throughout my career and I didn’t have a clear sense of where I was going until 8 years had passed. As a result, I got tired of working jobs I wasn’t passionate about.

Ultimately, when I set my sights on working with data and knew I enjoyed building software to automate tasks that wrangle data, that’s when I discovered my path was clear.

My focus narrowed to one job title: Data Engineer. And that made it easier to figure out the scope of all the experience, tools, and skills a Data Engineer needs to know, because I knew my destination.

Therefore, it made complete sense that my next move was to land a Data Analyst job and from there work my way up to Data Engineer.

There were many times I had imposter syndrome, since I didn’t think I had enough boxes checked to transition to a Data Analyst role.

But all the time I had spent teaching myself paid off: I could tell I was exuding confidence, and I was giving correct answers right off the bat in each interview round.

This was one of those golden moments that taught me a valuable lesson in life. Invest in yourself, your education, and continuously improve a craft you’re passionate about. If you love working with data, hone your gifts and stay insatiably curious about learning everything that centers around your craft.

By the time you get to the interviews, you’ll breeze through all the questions and impress your employers.

Stick around for part 2 of this blog and in the meantime ask yourself a key question: what can I automate right now in my job that involves data?