I’m working on an ETL pipeline that pulls from a REST API, applies some transformations, and upserts the results into a database. The REST API occasionally returns invalid data (e.g. alphabetic characters in zip codes, or misspelled US state names), and also occasionally throws HTTP 500 error pages.
I’m supposed to stall the entire pipeline whenever the REST API throws an error, but the errors occur somewhat often (it’s relatively young software) and can take a few days to fix. Since the pipeline syncs fixed windows of data (one day at a time), stalling for several days leaves gaps, which means either running a backfill task manually or making the daily job smart enough to start from the end of the last successful run rather than from a fixed span back.
We have a daily task that syncs data updated in the past day, and a separate backfill task that syncs data from an optional start date (the beginning of time by default) to a specified end date. Maybe we should ditch the daily task and instead modify the backfill task to track its latest sync date and run daily. In hindsight, I think we should’ve done that from the beginning.
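A minimal sketch of that "track the latest sync date" idea, assuming a watermark-style design: the job reads a persisted last-success timestamp, syncs from there up to now, and only advances the timestamp if the run succeeds, so a stalled pipeline never leaves a gap. The names (`run_incremental_sync`, `sync_window`, the state dict) are hypothetical, not the real pipeline's API.

```python
from datetime import datetime, timezone

# "Beginning of time" default, matching the backfill task's optional start date.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def run_incremental_sync(state: dict, sync_window, now=None) -> dict:
    """Sync [last_success, now); advance the watermark only on success.

    state       -- persisted job state, e.g. {"last_success": datetime}
    sync_window -- hypothetical callable (start, end) that does the actual
                   extract/transform/upsert and raises on an API error
    """
    now = now or datetime.now(timezone.utc)
    start = state.get("last_success", EPOCH)
    try:
        sync_window(start, now)   # may raise, e.g. on an HTTP 500
    except Exception:
        return state              # watermark unchanged: next run retries the gap
    return {**state, "last_success": now}
```

On the next scheduled run after a failure, `start` is still the old watermark, so the window automatically widens to cover the missed days and no manual backfill is needed.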