AWS PySpark ETL Pipeline

Summary

Freelancer Client is hiring: AWS PySpark ETL Pipeline.

Location: Remote

I need a cloud-native ETL pipeline built end-to-end on AWS, coded in PySpark and designed for production reliability. The pipeline will ingest data from three sources—databases, APIs, and file systems—then standardise and load it into an analytics-ready destination. Source files arrive in a mix of CSV, JSON, and Parquet, so the job must include automatic format detection, schema inference, and efficient column-wise writes.
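
One way the automatic format detection described above could work is sketched below. This is a minimal illustration under stated assumptions, not the client's specification: the names `detect_format`, `read_any`, and `READER_OPTIONS` are hypothetical, and the sketch assumes the job knows either the file extension or the first few bytes of each file. Parquet files begin with the magic bytes `PAR1`; the JSON and CSV checks are simple heuristics.

```python
from pathlib import PurePosixPath
from typing import Optional

PARQUET_MAGIC = b"PAR1"  # every Parquet file starts with these four bytes

# Hypothetical extension map; extend it to match the real source files.
SUFFIX_TO_FORMAT = {
    ".parquet": "parquet",
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
}

# Reader options per format. CSV needs schema inference switched on;
# Spark infers JSON schemas by default and Parquet carries its own schema.
# Multi-line JSON documents would additionally need multiLine=true.
READER_OPTIONS = {
    "csv": {"header": "true", "inferSchema": "true"},
    "json": {},
    "parquet": {},
}


def detect_format(path: str, head: Optional[bytes] = None) -> str:
    """Return 'parquet', 'json', or 'csv' for a file path.

    The extension is tried first. If it is unknown, the caller may pass
    the first bytes of the file in `head` so the content can be sniffed.
    """
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in SUFFIX_TO_FORMAT:
        return SUFFIX_TO_FORMAT[suffix]
    if head is None:
        raise ValueError(f"cannot detect format of {path!r} without file content")
    if head[:4] == PARQUET_MAGIC:
        return "parquet"
    if head.lstrip()[:1] in (b"{", b"["):
        return "json"
    return "csv"  # fallback heuristic: treat anything else as delimited text


def read_any(spark, path: str, head: Optional[bytes] = None):
    """Load `path` into a DataFrame using the detected format."""
    fmt = detect_format(path, head)
    return spark.read.format(fmt).options(**READER_OPTIONS[fmt]).load(path)
```

Writing the result with `df.write.parquet(...)` would then give the columnar output the description asks for.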

Beyond raw transformation, I want solid engineering practices: parameter-driven jobs, modular Spark code, unit tests, logging, alerting, and retry logic. Leveraging AWS native services such as Glue, EMR, Lambda, and S3 is expected, but I’m open to other AWS components if they shorten development time or lower cost.
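
The retry and logging requirements above could be met in many ways; the sketch below shows one common pattern, exponential backoff around a single pipeline step. The helper name `with_retry`, its parameters, and the logger name `etl` are assumptions made for illustration and do not come from the listing.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("etl")  # hypothetical logger name


def with_retry(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on any exception with exponential backoff.

    The delay doubles after each failure (base_delay, 2x, 4x, ...).
    The last failure is logged as an error and re-raised so that the
    scheduler or an alarm can surface it.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == attempts:
                logger.error("giving up after %d attempts: %s", attempts, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)
    raise AssertionError("unreachable")  # loop always returns or raises
```

A step such as an API extract could then be wrapped as `with_retry(lambda: fetch_page(url), attempts=5)`, where `fetch_page` stands in for whatever extraction function the job defines. In practice one would likely narrow the caught exception types to transient errors only.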

Skills: Python, NoSQL Couch & Mongo, Amazon Web Services, Hadoop, AWS Lambda, ETL, PySpark, Terraform

Budget: $2500–$0 USD


Source: Freelancer Client via Remote / Online. Apply on the source website.

Original

Candidates must have expertise in data engineering.

Deliverables
• PySpark scripts (or Glue jobs) that extract from the three source types and load to S3 or Redshift
• Infrastructure-as-code templates (CloudFormation or Terraform) to spin up all required AWS resources
• README with execution steps, config examples, and troubleshooting notes
• A brief hand-over session to walk through deployment and scheduling

Acceptance criteria
– Successful end-to-end run on my AWS account using sample data I provide
– Data landed in target store, partitioned and compressed, with row counts matching sources
– Logs visible in CloudWatch and errors retried or surfaced clearly
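
The row-count criterion above could be verified with a small reconciliation step after each load. The sketch below is one possible shape for that check; the function name `reconcile_counts` and the dataset names in the usage note are hypothetical.

```python
from typing import Dict, List


def reconcile_counts(
    source_counts: Dict[str, int], target_counts: Dict[str, int]
) -> List[str]:
    """Compare per-dataset row counts and return a list of mismatches.

    An empty list means every source dataset landed in the target with
    the same number of rows.
    """
    problems = []
    for name, expected in sorted(source_counts.items()):
        actual = target_counts.get(name)
        if actual is None:
            problems.append(f"{name}: missing in target (expected {expected})")
        elif actual != expected:
            problems.append(f"{name}: source={expected} target={actual}")
    return problems
```

In a PySpark job the counts would typically come from `DataFrame.count()` on the source and target frames, for example `reconcile_counts({"orders": src_df.count()}, {"orders": tgt_df.count()})`. A non-empty result could be logged as an error and used to fail the run so that the mismatch is surfaced rather than silently ignored.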

If this aligns with your skill-set and timeline, let me know how you’d approach the build and what effort you estimate.

Location & Details

Source: Freelancer
Budget: $2500–$0 USD
Location: Remote
Published: 2026-05-18 05:53:32

About this listing

This remote opportunity was imported from Freelancer and is shown here for discovery. To apply, follow the link to the original posting.
