On syncing data

I’m working on an ETL pipeline, which pulls from a REST API, does some transformations, and upserts it into a database. The REST API occasionally has invalid data (e.g. alphabet characters in the zip code, or misspelled US state names), and also occasionally throws HTTP 500 error pages.

I’m supposed to stall the entire pipeline whenever an error is thrown by the REST API, but the errors occur somewhat often (it’s relatively young software) and may take a few days to fix. Since the pipeline syncs fixed windows of data (one day at a time), there may be gaps if the pipeline is stalled for several days, which necessitates either a backfill task to be run manually, or making the daily job smart enough to run from the start time of the last successful job, rather than from a fixed time span back.

We have a daily task that syncs data updated in the past day, and a separate backfill task that syncs data from an optional start date (the beginning of time by default) to a specified end date, but maybe we should ditch the daily tasks and modify the backfill task to track its latest sync date and run daily. In hindsight, I think we should’ve done that from the beginning.

Leave a Reply