
Overview
We are seeking a highly skilled Big Data Engineer with strong experience in Lakehouse architecture and distributed data processing to design and build scalable data platforms. The role focuses on developing end-to-end data pipelines for both batch and real-time processing using modern big data technologies such as Spark, Kafka, and Hadoop. The ideal candidate will have hands-on experience implementing Medallion (Bronze/Silver/Gold) architecture and optimizing large-scale data workloads while ensuring data quality, governance, and performance.
Key Responsibilities
- Design, develop, and optimize end-to-end data pipelines using Lakehouse architecture (Bronze/Silver/Gold layers).
- Build and maintain ETL/ELT pipelines using Apache Spark (PySpark/Scala) for large-scale data processing (see the Bronze-to-Silver sketch after this list).
- Develop and manage real-time streaming solutions using Kafka, Spark Structured Streaming, or Flink.
- Build and operate Kafka producers and consumers, ensuring reliable data delivery (exactly-once/at-least-once semantics); see the streaming sketch after this list.
- Work with Hadoop ecosystem components (HDFS, YARN, Hive, HBase/NoSQL) and workflow schedulers (Oozie/Airflow).
- Implement and manage Lakehouse table formats such as Delta Lake, Apache Iceberg, or Apache Hudi.
- Handle schema evolution, data versioning, ACID transactions, and time-travel capabilities.
- Optimize data storage and performance through partitioning, clustering, Z-ordering, compaction, and file-size tuning (see the table-maintenance sketch after this list).
- Implement data quality checks (e.g., Great Expectations, Deequ), observability, and data lineage frameworks (see the quality-gate sketch after this list).
- Tune and optimize performance across Spark, Kafka, and distributed systems.
- Troubleshoot data pipeline issues including failures, data skew, and streaming backpressure.
- Deploy and manage data pipelines using CI/CD tools, containerization (Docker), and orchestration tools (Airflow/Argo/Oozie); see the Airflow DAG sketch after this list.
- Collaborate with data architects and stakeholders to ensure data governance, security, and best practices.
- Contribute to code quality, documentation, and continuous improvement initiatives.
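To make the Lakehouse/Medallion expectations above concrete, here is a minimal PySpark sketch of a Bronze-to-Silver step. It assumes Delta Lake is installed on the cluster; the paths, schema, and column names are illustrative, not part of this role's actual stack.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes the delta-spark package is available on the cluster (illustrative config).
spark = (
    SparkSession.builder.appName("bronze-to-silver")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze: raw events landed as-is (placeholder path).
bronze = spark.read.format("delta").load("/lake/bronze/orders")

# Silver: cleaned, typed, deduplicated records (placeholder columns).
silver = (
    bronze
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("order_date", F.to_date("order_ts"))
    .dropDuplicates(["order_id"])
)

(silver.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("/lake/silver/orders"))
```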
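For the streaming items, a sketch of a Spark Structured Streaming job that consumes a Kafka topic and appends to a Delta sink. It assumes the spark-sql-kafka connector is on the classpath; the broker address, topic, payload schema, and paths are placeholders. Exactly-once delivery in this setup comes from the checkpoint plus a transactional sink.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("orders-stream").getOrCreate()

# Illustrative schema for the Kafka message value.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "orders")                      # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
       .select("e.*")
)

# Checkpointing gives at-least-once delivery; a transactional sink such as
# Delta makes the write effectively exactly-once.
query = (
    parsed.writeStream.format("delta")
    .option("checkpointLocation", "/chk/orders")        # placeholder
    .outputMode("append")
    .start("/lake/bronze/orders_stream")                # placeholder
)
query.awaitTermination()
```

Restarting the query with the same checkpoint location resumes from the last committed offsets rather than reprocessing the topic from scratch.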
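The table-maintenance duties (schema evolution, time travel, compaction/Z-ordering) can be sketched against a Delta table. The OPTIMIZE ... ZORDER BY syntax assumes Delta Lake 2.0+ or an equivalent platform feature; paths and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

path = "/lake/silver/orders"  # placeholder table path

# Schema evolution: allow new columns in the incoming batch to be added.
new_batch = spark.read.format("delta").load("/lake/bronze/orders")  # placeholder source
(new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(path))

# Time travel: read an earlier version of the table for audits or replays.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Inspect the transaction log, then compact and co-locate files for faster
# scans (OPTIMIZE ... ZORDER BY requires Delta Lake 2.0+ or a compatible platform).
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (order_id)")
```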
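For the data quality item, a framework-agnostic sketch of the kind of checks that Great Expectations or Deequ would express declaratively; the table path and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-gate").getOrCreate()

df = spark.read.format("delta").load("/lake/silver/orders")  # placeholder

total = df.count()
checks = {
    # Each rule mirrors an expectation these frameworks would declare
    # (completeness, uniqueness, value ranges).
    "order_id_not_null": df.filter(F.col("order_id").isNull()).count() == 0,
    "order_id_unique": df.select("order_id").distinct().count() == total,
    "amount_non_negative": df.filter(F.col("amount") < 0).count() == 0,
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    # Failing fast keeps bad data out of downstream (Gold) tables.
    raise ValueError(f"Data quality checks failed: {failed}")
```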
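For orchestration, a minimal Airflow 2.x-style DAG that schedules the batch job daily; the DAG id, schedule, and spark-submit command are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_bronze_to_silver",   # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_job = BashOperator(
        task_id="spark_submit_bronze_to_silver",
        # Placeholder command; in practice this might be a SparkSubmitOperator
        # or a KubernetesPodOperator depending on the deployment.
        bash_command="spark-submit /jobs/bronze_to_silver.py",
    )
```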
Job Qualifications and Requirements
- Strong programming experience in Python (PySpark) or Scala.
- Hands-on experience with Apache Spark (RDD, DataFrame, Structured Streaming) in production environments.
- Solid experience with Apache Kafka (topics, partitions, offsets, consumer groups).
- Good understanding of Hadoop ecosystem (HDFS, YARN, Hive/HCatalog).
- Hands-on experience with Lakehouse platforms such as Delta Lake, Apache Iceberg, or Apache Hudi.
- Strong knowledge of Medallion Architecture (Bronze/Silver/Gold) and Lakehouse data modeling.
- Understanding of ACID transactions, schema evolution, and time-travel in data platforms.
- Experience with data formats such as Parquet, Avro, and ORC, as well as compression techniques.
- Strong SQL skills (SparkSQL, Hive, Presto, or Trino).
- Experience with CI/CD pipelines, Git, and orchestration tools (Airflow/Oozie).
- Knowledge of performance tuning for Spark and Kafka (see the tuning sketch after this list).
- Familiarity with monitoring tools (Prometheus, Grafana, Datadog, etc.).
- Experience implementing data quality frameworks and data lineage solutions.
- Strong analytical, troubleshooting, and problem-solving skills.
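To illustrate the tuning knobs mentioned above, a sketch of commonly adjusted Spark settings (adaptive query execution, shuffle parallelism, broadcast joins) plus a Kafka-source rate limit; all values are illustrative and workload-dependent.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tuned-job")
    # Adaptive query execution coalesces shuffle partitions and mitigates skew.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Baseline shuffle parallelism; AQE can coalesce this downward at runtime.
    .config("spark.sql.shuffle.partitions", "400")
    # Broadcast small dimension tables to avoid shuffles (size is illustrative).
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))
    .getOrCreate()
)

# On the Kafka source, cap the records pulled per micro-batch so the stream
# keeps pace and backpressure does not build up in downstream stages.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "orders")                      # placeholder
    .option("maxOffsetsPerTrigger", 100000)
    .load()
)
```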
Need more details?
For further inquiries, kindly send an email to dfe@geco.asia