AWS PySpark ETL Pipeline
Summary
Freelancer Client is hiring: AWS PySpark ETL Pipeline.
Location: Remote
Skills: Python, NoSQL Couch & Mongo, Amazon Web Services, Hadoop, AWS Lambda, ETL, PySpark, Terraform
Budget: $2500 USD
Source: Freelancer Client via Remote / Online. Apply on the source website.
Original
I need a cloud-native ETL pipeline built end-to-end on AWS, coded in PySpark and designed for production reliability. The pipeline will ingest data from three sources—databases, APIs, and file systems—then standardise and load it into an analytics-ready destination. Source files arrive in a mix of CSV, JSON, and Parquet, so the job must include automatic format detection, schema inference, and efficient column-wise writes.
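One way the automatic format detection described above could work is to sniff the first bytes of each file rather than trust its extension. The sketch below is an illustration under stated assumptions, not the client's design: the function names `detect_format` and `read_any` are hypothetical, the JSON check is a simple heuristic (first non-whitespace byte is `{` or `[`), and anything else is treated as CSV with a header row. Parquet files do begin with the magic bytes `PAR1`.

```python
import codecs

PARQUET_MAGIC = b"PAR1"  # Parquet files start (and end) with these four bytes


def detect_format(head: bytes) -> str:
    """Guess a file format from its first bytes (hypothetical helper).

    Returns "parquet", "json", or "csv". CSV is the fallback, so a file
    that is none of the three will still be reported as "csv".
    """
    if not head:
        raise ValueError("cannot detect the format of an empty file")
    if head.startswith(PARQUET_MAGIC):
        return "parquet"
    # Drop an optional UTF-8 byte-order mark, then leading whitespace.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    text_start = head.lstrip()
    if text_start[:1] in (b"{", b"["):
        return "json"
    return "csv"


def read_any(spark, path: str, fmt: str):
    """Load `path` with the Spark reader matching `fmt`.

    `spark` is assumed to be an existing SparkSession. Schema inference is
    requested for CSV; Spark infers JSON schemas by default, and Parquet
    files carry their own schema.
    """
    reader = spark.read.format(fmt)
    if fmt == "csv":
        # Assumption: source CSV files have a header row.
        reader = reader.option("header", True).option("inferSchema", True)
    return reader.load(path)
```

In practice `head` could be the first few bytes of an S3 object fetched with a ranged read, so detection does not require downloading the whole file.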
Beyond raw transformation, I want solid engineering practices: parameter-driven jobs, modular Spark code, unit tests, logging, alerting, and retry logic. Leveraging AWS native services such as Glue, EMR, Lambda, and S3 is expected, but I’m open to other AWS components if they shorten development time or lower cost.
Candidates must have expertise in data engineering.
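The retry logic and logging mentioned above could take many forms. A minimal sketch, assuming exponential backoff is acceptable and that each pipeline step can be wrapped as a zero-argument callable, is shown below. The name `run_with_retry`, the logger name `etl`, and the default of three attempts are illustrative choices, not requirements from the listing.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("etl")  # hypothetical logger name


def run_with_retry(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is logged with its traceback and re-raised so the
    caller, or an alerting hook, sees the error instead of a silent skip.
    `sleep` is injectable so tests do not have to wait.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                logger.exception("task failed after %d attempts", attempts)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d/%d failed; retrying in %.1fs",
                attempt, attempts, delay, exc_info=True,
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

A step would then be invoked as, for example, `run_with_retry(lambda: extract_from_api(url))`, where `extract_from_api` stands in for whatever extraction function the job defines. On Glue or EMR, records written through the standard `logging` module typically end up in CloudWatch once the job's log output is routed there.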
Deliverables
• PySpark scripts (or Glue jobs) that extract from the three source types and load to S3 or Redshift
• Infrastructure-as-code templates (CloudFormation or Terraform) to spin up all required AWS resources
• README with execution steps, config examples, and troubleshooting notes
• A brief hand-over session to walk through deployment and scheduling
Acceptance criteria
– Successful end-to-end run on my AWS account using sample data I provide
– Data landed in target store, partitioned and compressed, with row counts matching sources
– Logs visible in CloudWatch and errors retried or surfaced clearly
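The second criterion above (partitioned, compressed output with matching row counts) can be checked mechanically. The sketch below is one possible approach, not a prescribed one: `write_partitioned` and `find_count_mismatches` are hypothetical helper names, Snappy-compressed Parquet is an assumed choice of format, and the per-dataset counts are assumed to have been collected beforehand, for instance with `DataFrame.count()` on each source and target.

```python
from typing import Dict, List, Tuple


def write_partitioned(df, path: str, partition_col: str) -> None:
    """Write a Spark DataFrame as Snappy-compressed Parquet.

    `df` is assumed to be a pyspark.sql.DataFrame that contains
    `partition_col`; one sub-directory is written per distinct value.
    """
    df.write.mode("overwrite").partitionBy(partition_col).parquet(
        path, compression="snappy"
    )


def find_count_mismatches(
    source_counts: Dict[str, int],
    target_counts: Dict[str, int],
) -> List[Tuple[str, int, int]]:
    """Return (dataset, source_rows, target_rows) for every mismatch.

    A dataset missing on either side is treated as having zero rows there,
    so it is reported rather than silently ignored.
    """
    mismatches = []
    for name in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(name, 0)
        tgt = target_counts.get(name, 0)
        if src != tgt:
            mismatches.append((name, src, tgt))
    return mismatches
```

An empty list from `find_count_mismatches` means every dataset reconciled; a non-empty list can be logged and used to fail the run so the discrepancy is surfaced rather than hidden.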
If this aligns with your skill set and timeline, let me know how you’d approach the build, along with an estimate of the effort involved.
Location & Details
About this listing
This remote opportunity was imported from Freelancer and is shown here for discovery. To apply, follow the link to the original posting.