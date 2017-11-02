How do I prepare for Data Engineer jobs at Amazon, Google, Facebook, or Quora? originally appeared on Quora - the place to gain and share knowledge, empowering people to learn from others and better understand the world.

Answer by Angela Zhang, Data Infra Engineer (previously Product/Infra Eng) at Quora (2014-present), on Quora:

Since the question specifically asks about Data Engineering jobs, I won’t repeat the answers to “How do I prepare for a software engineering job interview?” here, and will only cover Data Engineering-specific preparation.

For general systems and data infrastructure knowledge, the book Designing Data-Intensive Applications is a great resource for understanding the fundamental challenges companies commonly face, along with some well-established designs and concepts for solving them. I would highly recommend starting there.

As a Data Engineer, you are also expected to be familiar with SQL, since it’s the most common language for querying and manipulating data. SQLZOO is a pretty good resource for learning the basics. Bonus points if you learn how to read the results of an EXPLAIN query, so you can write more efficient SQL queries.
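To get a feel for reading query plans, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The table `events`, the index `idx_events_user_id`, and the helper `query_plan` are made up for illustration, and EXPLAIN output differs between databases (MySQL, PostgreSQL, Presto, and others each have their own format), so treat this only as a demonstration of the idea: compare the plan before and after adding an index.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the human-readable steps SQLite reports for running `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The description of each step is the last column of every row.
    return [row[-1] for row in rows]


# Hypothetical table, filled with a little sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, action TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, action) VALUES (?, ?)",
    [(i % 100, "view") for i in range(1000)],
)

sql = "SELECT * FROM events WHERE user_id = 42"

# Without an index, SQLite has to scan every row of the table.
plan_before = query_plan(conn, sql)

# With an index on user_id, it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
plan_after = query_plan(conn, sql)

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan should describe a full table scan and the second a search that uses the index, which is the kind of difference worth learning to spot in EXPLAIN output.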

Once you’ve mastered the concepts in Designing Data-Intensive Applications and how to write SQL queries, some good open source systems to learn about next are:

1) Kafka as a message buffer for logging in near real time.

2) Spark for processing data in batches (commonly once a day) and Spark Streaming for processing data in micro batches (commonly one batch every few seconds to minutes).

There are lots of other data processing tools here (you definitely don’t need to know all of them), like:

MapReduce, Tez, Beam, Flink, Flume for batch processing instead of Spark.

Storm, Samza, Heron, Flink Streaming for stream processing instead of Spark Streaming.

Presto for running interactive SQL queries on different data stores.

3) HDFS for storing data on disk across many machines. A lot of Apache open source data storage systems are based on HDFS, such as HBase and Hive.

4) ZooKeeper for managing configurations and consensus. For example, HBase uses ZooKeeper internally to keep track of which box is the HBase master and which RegionServers are currently alive. For a super in-depth explanation of some use cases of ZooKeeper in HBase, see: Eric Sammer's answer to Why does HBase use Zookeeper but HDFS doesn't?

5) YARN or Mesos for managing resources on a cluster. They abstract away the concept of individual machines, each with its own CPU and memory limits, so you can think of the cluster as a pool of X CPUs and Y GBs of memory.

6) Airflow for scheduling tasks and managing task dependencies (e.g. I only want to run the “generate revenue report” task after I’ve run the “process sales” task for today).
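The task-dependency idea in item 6 can be sketched with nothing but Python's standard library. This is not Airflow's API; it only illustrates the underlying concept of ordering tasks so that each one runs after everything it depends on. The task names other than the two from the example above, and the helper `run_order`, are hypothetical. It assumes Python 3.9 or newer for the graphlib module.

```python
from graphlib import TopologicalSorter

# Map each (hypothetical) task to the set of tasks that must finish first.
dependencies = {
    "process_sales": {"load_raw_logs"},
    "generate_revenue_report": {"process_sales"},
    "email_report": {"generate_revenue_report"},
}


def run_order(deps):
    """Return the tasks in an order that respects every dependency."""
    # static_order() raises graphlib.CycleError if tasks depend on each other.
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
for task in order:
    print("running", task)
```

A real scheduler adds a lot on top of this ordering, such as running tasks on a schedule, retrying failures, and tracking which runs have already completed.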

For the purpose of preparing for the job, it’s not important to know the details of each of the systems above, but it’s important to understand how they fit into the data ecosystem, and how they work at a high level.
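As one example of that high-level understanding, here is a toy sketch of what “micro batches” means in item 2: a stream of events is cut into small fixed-length windows, and each window is processed as a batch. This is plain Python, not Spark Streaming's API. The function `micro_batches` and the sample events are made up, and the sketch groups by a timestamp stored on each event (assumed to arrive in time order) so the result is deterministic, whereas a real system typically cuts batches by arrival time.

```python
from itertools import groupby


def micro_batches(events, interval_seconds):
    """Group time-ordered (timestamp, payload) events into fixed windows.

    Yields (window_start, [payloads]) for every window that has events.
    """
    def window_index(event):
        return event[0] // interval_seconds

    for index, group in groupby(events, key=window_index):
        yield index * interval_seconds, [payload for _, payload in group]


# Hypothetical events: (seconds since start, payload).
events = [(0, "a"), (1, "b"), (4, "c"), (5, "d"), (11, "e")]

batches = list(micro_batches(events, interval_seconds=5))
for start, payloads in batches:
    print("window starting at", start, "->", payloads)
```

With a very large interval (for example one day) the same idea describes daily batch processing; the difference between batch and micro batch is mostly how long you wait before processing what has accumulated.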

Learn a JVM language. A lot of the systems above are written in JVM languages, and though many of them provide non-JVM APIs, their JVM APIs tend to be much more powerful and efficient. In particular, Java and Scala are probably the most widely applicable languages within the JVM family.

Lastly, general front end skills are also useful. As a Data Engineer, you might be asked to build data visualizations and other data tools for your team. Knowing React and maybe D3 can come in handy in these situations.