
Big Data Engineer
- Johannesburg, Gauteng
- Permanent
- Part-time
- Design, build, and maintain robust data pipelines to ingest raw JSON data from source systems into the Hadoop Distributed File System (HDFS).
- Transform and enrich unstructured data into structured formats (e.g., Parquet, ORC) for the Published Layer using tools like PySpark, Hive, or Spark SQL.
- Develop workflows to further process and organize data into Functional Layers optimized for business reporting and analytics.
- Implement data validation, cleansing, schema enforcement, and deduplication as part of the transformation process.
- Collaborate with Data Analysts, BI Developers, and Business Users to understand data requirements and ensure datasets are production-ready.
- Optimize ETL/ELT processes for performance and reliability in a large-scale distributed environment.
- Maintain metadata, lineage, and documentation for transparency and governance.
- Monitor pipeline performance and implement error handling and alerting mechanisms.
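The validation, schema-enforcement, and deduplication duties above can be sketched in miniature. This is an illustrative example only, written in plain stdlib Python standing in for PySpark; the sample records, the `REQUIRED_FIELDS` schema, and the `transform` function are all hypothetical names invented for the sketch:

```python
import json

# Raw-layer records as they might arrive from a source system (hypothetical data).
raw_lines = [
    '{"id": 1, "name": "thabo", "amount": "100.50"}',
    '{"id": 1, "name": "thabo", "amount": "100.50"}',  # duplicate, to be removed
    '{"id": 2, "name": "lerato"}',                     # missing field, rejected
    '{"id": 3, "name": "sipho", "amount": "75.00"}',
]

# Hypothetical schema for the Published Layer.
REQUIRED_FIELDS = {"id", "name", "amount"}

def transform(lines):
    """Validate, enforce a schema, and deduplicate raw JSON records."""
    seen_ids = set()
    published = []
    for line in lines:
        record = json.loads(line)
        # Schema enforcement: reject records missing required fields.
        if not REQUIRED_FIELDS <= record.keys():
            continue
        # Deduplication on the business key.
        if record["id"] in seen_ids:
            continue
        seen_ids.add(record["id"])
        # Cleansing and typing: normalise the name, cast amount to numeric.
        published.append({
            "id": record["id"],
            "name": record["name"].title(),
            "amount": float(record["amount"]),
        })
    return published

print(transform(raw_lines))
```

In the role described above, the same steps would typically run as a PySpark job reading raw JSON from HDFS and writing the validated output to Parquet in the Published Layer, with an orchestrator such as Airflow or Oozie scheduling the run.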
- 3+ years of experience in data engineering or ETL development within a big data environment.
- Strong experience with Hadoop ecosystem tools: HDFS, Hive, Spark, YARN, and Sqoop.
- Proficiency in PySpark, Spark SQL, and HQL (Hive Query Language).
- Experience working with unstructured JSON data and transforming it into structured formats.
- Solid understanding of data lake architectures: Raw, Published, and Functional layers.
- Familiarity with workflow orchestration tools like Airflow, Oozie, or NiFi.
- Experience with schema design, data modeling, and partitioning strategies.
- Comfortable with version control tools (e.g., Git) and CI/CD processes.
- Experience with data cataloging and governance tools (e.g., Apache Atlas, Alation).
- Exposure to cloud-based Hadoop platforms like AWS EMR, Azure HDInsight, or GCP Dataproc.
- Experience with containerization (e.g., Docker) and/or Kubernetes for pipeline deployment.
- Familiarity with data quality frameworks (e.g., Deequ, Great Expectations).
- Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field.
- Relevant certifications (e.g., Cloudera, Databricks, AWS Big Data) are a plus.
- In order to comply with the POPI Act, we require your permission to retain your personal details on our database for future career opportunities. By completing and returning this form, you give PBT your consent to do so.
- If you have not received any feedback after 2 weeks, please consider your application as unsuccessful.