
Overview
We are seeking a highly skilled Big Data Engineer with strong experience in Lakehouse architecture and distributed data processing to design and build scalable data platforms. The role focuses on developing end-to-end data pipelines for both batch and real-time processing using modern big data technologies such as Spark, Kafka, and Hadoop. The ideal candidate will have hands-on experience implementing Medallion (Bronze/Silver/Gold) architecture and optimizing large-scale data workloads while ensuring data quality, governance, and performance.
Key Responsibilities
- Design, develop, and optimize end-to-end data pipelines using Lakehouse architecture (Bronze/Silver/Gold layers).
- Build and maintain ETL/ELT pipelines using Apache Spark (PySpark/Scala) for large-scale data processing (see the Bronze-to-Silver sketch after this list).
- Develop and manage real-time streaming solutions using Kafka, Spark Structured Streaming, or Flink.
- Build and operate Kafka producers and consumers, ensuring reliable data delivery (exactly-once/at-least-once semantics); see the streaming sketch after this list.
- Work with Hadoop ecosystem components (HDFS, YARN, Hive, HBase/NoSQL) and workflow schedulers (Oozie/Airflow).
- Implement and manage Lakehouse table formats such as Delta Lake, Apache Iceberg, or Apache Hudi.
- Handle schema evolution, data versioning, ACID transactions, and time-travel capabilities.
- Optimize data storage and performance through partitioning, clustering, Z-ordering, compaction, and file-size tuning (see the table-maintenance sketch after this list).
- Implement data quality checks (e.g., Great Expectations, Deequ), observability, and data lineage frameworks (see the quality-gate sketch after this list).
- Tune and optimize performance across Spark, Kafka, and distributed systems.
- Troubleshoot data pipeline issues including failures, data skew, and streaming backpressure.
- Deploy and manage data pipelines using CI/CD tools, containerization (Docker), and orchestration tools (Airflow/Argo/Oozie); see the Airflow DAG sketch after this list.
- Collaborate with data architects and stakeholders to ensure data governance, security, and best practices.
- Contribute to code quality, documentation, and continuous improvement initiatives.
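To make the Lakehouse/Medallion expectations above concrete, here is a minimal PySpark sketch of a Bronze-to-Silver step. It assumes Delta Lake is installed on the cluster; the paths, schema, and column names are illustrative, not part of this role's actual stack.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes the delta-spark package is available on the cluster (illustrative config).
spark = (
    SparkSession.builder.appName("bronze-to-silver")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze: raw events landed as-is (placeholder path).
bronze = spark.read.format("delta").load("/lake/bronze/orders")

# Silver: cleaned, typed, deduplicated records (placeholder columns).
silver = (
    bronze
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("order_date", F.to_date("order_ts"))
    .dropDuplicates(["order_id"])
)

(silver.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("/lake/silver/orders"))
```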
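For the streaming items, a sketch of a Spark Structured Streaming job that consumes a Kafka topic and appends to a Delta sink. It assumes the spark-sql-kafka connector is on the classpath; the broker address, topic, payload schema, and paths are placeholders. Exactly-once delivery in this setup comes from the checkpoint plus a transactional sink.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("orders-stream").getOrCreate()

# Illustrative schema for the Kafka message value.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "orders")                      # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
       .select("e.*")
)

# Checkpointing gives at-least-once delivery; a transactional sink such as
# Delta makes the write effectively exactly-once.
query = (
    parsed.writeStream.format("delta")
    .option("checkpointLocation", "/chk/orders")        # placeholder
    .outputMode("append")
    .start("/lake/bronze/orders_stream")                # placeholder
)
query.awaitTermination()
```

Restarting the query with the same checkpoint location resumes from the last committed offsets rather than reprocessing the topic from scratch.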
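The table-maintenance duties (schema evolution, time travel, compaction/Z-ordering) can be sketched against a Delta table. The OPTIMIZE ... ZORDER BY syntax assumes Delta Lake 2.0+ or an equivalent platform feature; paths and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

path = "/lake/silver/orders"  # placeholder table path

# Schema evolution: allow new columns in the incoming batch to be added.
new_batch = spark.read.format("delta").load("/lake/bronze/orders")  # placeholder source
(new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(path))

# Time travel: read an earlier version of the table for audits or replays.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Inspect the transaction log, then compact and co-locate files for faster
# scans (OPTIMIZE ... ZORDER BY requires Delta Lake 2.0+ or a compatible platform).
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (order_id)")
```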
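For the data quality item, a framework-agnostic sketch of the kind of checks that Great Expectations or Deequ would express declaratively; the table path and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-gate").getOrCreate()

df = spark.read.format("delta").load("/lake/silver/orders")  # placeholder

total = df.count()
checks = {
    # Each rule mirrors an expectation these frameworks would declare
    # (completeness, uniqueness, value ranges).
    "order_id_not_null": df.filter(F.col("order_id").isNull()).count() == 0,
    "order_id_unique": df.select("order_id").distinct().count() == total,
    "amount_non_negative": df.filter(F.col("amount") < 0).count() == 0,
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    # Failing fast keeps bad data out of downstream (Gold) tables.
    raise ValueError(f"Data quality checks failed: {failed}")
```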
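For orchestration, a minimal Airflow 2.x-style DAG that schedules the batch job daily; the DAG id, schedule, and spark-submit command are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_bronze_to_silver",   # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_job = BashOperator(
        task_id="spark_submit_bronze_to_silver",
        # Placeholder command; in practice this might be a SparkSubmitOperator
        # or a KubernetesPodOperator depending on the deployment.
        bash_command="spark-submit /jobs/bronze_to_silver.py",
    )
```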
Job Qualifications and Requirements
- Strong programming experience in Python (PySpark) or Scala.
- Hands-on experience with Apache Spark (RDD, DataFrame, Structured Streaming) in production environments.
- Solid experience with Apache Kafka (topics, partitions, offsets, consumer groups).
- Good understanding of Hadoop ecosystem (HDFS, YARN, Hive/HCatalog).
- Hands-on experience with Lakehouse platforms such as Delta Lake, Apache Iceberg, or Apache Hudi.
- Strong knowledge of Medallion Architecture (Bronze/Silver/Gold) and Lakehouse data modeling.
- Understanding of ACID transactions, schema evolution, and time-travel in data platforms.
- Experience with data formats such as Parquet, Avro, and ORC, as well as compression techniques.
- Strong SQL skills (SparkSQL, Hive, Presto, or Trino).
- Experience with CI/CD pipelines, Git, and orchestration tools (Airflow/Oozie).
- Knowledge of performance tuning for Spark and Kafka (see the tuning sketch after this list).
- Familiarity with monitoring tools (Prometheus, Grafana, Datadog, etc.).
- Experience implementing data quality frameworks and data lineage solutions.
- Strong analytical, troubleshooting, and problem-solving skills.
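To illustrate the tuning knobs mentioned above, a sketch of commonly adjusted Spark settings (adaptive query execution, shuffle parallelism, broadcast joins) plus a Kafka-source rate limit; all values are illustrative and workload-dependent.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tuned-job")
    # Adaptive query execution coalesces shuffle partitions and mitigates skew.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Baseline shuffle parallelism; AQE can coalesce this downward at runtime.
    .config("spark.sql.shuffle.partitions", "400")
    # Broadcast small dimension tables to avoid shuffles (size is illustrative).
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))
    .getOrCreate()
)

# On the Kafka source, cap the records pulled per micro-batch so the stream
# keeps pace and backpressure does not build up in downstream stages.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "orders")                      # placeholder
    .option("maxOffsetsPerTrigger", 100000)
    .load()
)
```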
Need more details?
For further inquiries, kindly send an email to dfe@geco.asia