# Backfill Jobs

Running and monitoring backfill jobs.
A backfill job is a single execution of a pipeline that copies Firestore documents to BigQuery. This guide covers how to run jobs, monitor progress, and troubleshoot issues.
## Running a Backfill

### Manual Backfill

To run a backfill manually:

1. Navigate to your pipeline’s detail page
2. Click Run Backfill
3. Confirm to start the job
The job will be queued and start processing within a few moments.
### Scheduled Backfills

Pipelines configured with scheduled triggers automatically run backfills at the specified intervals. Scheduled jobs appear in the Jobs list alongside manual jobs.
## Monitoring Jobs

### Job List
Navigate to Jobs in the dashboard to see all backfill jobs across your pipelines. Each job shows:
- Pipeline name
- Status
- Start time
- Duration (if completed)
- Documents processed
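One way to think about these fields: duration is derived rather than stored, and is only defined once a job finishes. The sketch below models a Jobs-list row as a small dataclass; the field and class names are illustrative, not Fireconduit's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical shape of one row in the Jobs list.
@dataclass
class JobSummary:
    pipeline_name: str
    status: str                       # Pending | Running | Succeeded | Failed | Cancelled
    started_at: datetime
    completed_at: Optional[datetime]  # None while the job is still running
    documents_processed: int

    @property
    def duration(self) -> Optional[timedelta]:
        """Duration is only defined once the job has completed."""
        if self.completed_at is None:
            return None
        return self.completed_at - self.started_at

job = JobSummary(
    pipeline_name="orders-to-bq",
    status="Succeeded",
    started_at=datetime(2024, 1, 1, 12, 0),
    completed_at=datetime(2024, 1, 1, 12, 45),
    documents_processed=120_000,
)
print(job.duration)  # 0:45:00
```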
### Job Detail
Click on a job to see detailed information:
- Dataflow Job ID: The Google Cloud Dataflow job identifier
- Status: Current job state
- Timing: When the job started and completed
- Metrics: Documents and bytes processed
- Errors: Any error messages if the job failed
You can also link directly to the Google Cloud Console to see full Dataflow job details.
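If you want to construct that console link yourself, Dataflow job pages follow a predictable URL pattern built from the region, job ID, and project. The exact pattern below is an assumption based on current Cloud Console URLs, and the values are placeholders:

```python
# Sketch: build the Cloud Console link for a Dataflow job.
# URL pattern is an assumption based on current console URLs.
def dataflow_console_url(project: str, region: str, job_id: str) -> str:
    return (
        "https://console.cloud.google.com/dataflow/jobs/"
        f"{region}/{job_id}?project={project}"
    )

url = dataflow_console_url("my-project", "us-central1", "2024-01-01_12_00_00-123456")
print(url)
```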
## Job Statuses
Jobs progress through several states:
| Status | Description |
|---|---|
| Pending | Job is queued and waiting to start |
| Running | Job is actively processing documents |
| Succeeded | Job completed successfully |
| Failed | Job encountered an error and stopped |
| Cancelled | Job was manually cancelled |
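A useful distinction when monitoring: Succeeded, Failed, and Cancelled are terminal (the job will never change again), while Pending and Running are not. A minimal helper capturing the table above might look like this:

```python
# The five statuses from the table above, split into terminal states
# (the job will not change again) and non-terminal ones.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Cancelled"}
ACTIVE_STATUSES = {"Pending", "Running"}

def is_terminal(status: str) -> bool:
    """True once a job has reached a state it will never leave."""
    if status not in TERMINAL_STATUSES | ACTIVE_STATUSES:
        raise ValueError(f"unknown job status: {status}")
    return status in TERMINAL_STATUSES

print(is_terminal("Running"))    # False
print(is_terminal("Succeeded"))  # True
```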
### Pending
A job enters the pending state when it’s first created. Dataflow takes a short time to provision workers and start the job.
### Running
The job is actively reading from Firestore and writing to BigQuery. During this phase, you’ll see document counts update as processing continues.
### Succeeded
The job completed without errors. All documents from the source collection have been written to the destination table.
### Failed
The job encountered an error and could not complete. Check the error message for details. Common causes include:
- Permission issues with Firestore or BigQuery
- Schema mismatches between document fields and table columns
- Quota exceeded in GCP
- Network or service availability issues
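When triaging failures programmatically, a first pass is often just matching substrings in the error message against the causes above. The patterns below are examples of strings commonly seen in GCP errors, not an exhaustive or official list:

```python
# Illustrative mapping from substrings commonly seen in error messages
# to the failure causes listed above. Patterns are examples only.
CAUSE_PATTERNS = [
    ("permission_denied", "Permission issue with Firestore or BigQuery"),
    ("does not match the schema", "Schema mismatch between fields and columns"),
    ("quota exceeded", "GCP quota exceeded"),
    ("unavailable", "Network or service availability issue"),
]

def classify_error(message: str) -> str:
    lowered = message.lower()
    for pattern, cause in CAUSE_PATTERNS:
        if pattern in lowered:
            return cause
    return "Unknown; inspect the Dataflow worker logs"

print(classify_error("PERMISSION_DENIED: missing bigquery.tables.updateData"))
```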
### Cancelled
The job was manually stopped before completion. Partial data may have been written to BigQuery.
## Cancelling Jobs

To cancel a running job:

1. Navigate to the job’s detail page
2. Click Cancel Job
3. Confirm the cancellation
Cancellation requests are sent to Dataflow and may take a few moments to take effect. The job status will update to “Cancelled” once complete.
Note that cancelling a job does not roll back data already written to BigQuery.
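Because cancellation is asynchronous, automation that needs to know when the job has actually stopped should poll the status until it reaches a terminal state. A minimal sketch, with the status lookup injected as a callable so it stays independent of any particular client:

```python
import time
from typing import Callable

def wait_for_cancellation(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
) -> str:
    """Poll a job's status until it reaches a terminal state or we time out.

    `get_status` is an injected callable (e.g. a dashboard or API lookup);
    it is a placeholder, not a Fireconduit function.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status in {"Cancelled", "Succeeded", "Failed"}:
            return status
        time.sleep(poll_interval_s)
    raise TimeoutError("job did not reach a terminal state in time")

# Simulated statuses: the cancel request takes a couple of polls to land.
statuses = iter(["Running", "Running", "Cancelled"])
print(wait_for_cancellation(lambda: next(statuses), poll_interval_s=0.01))  # Cancelled
```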
## Troubleshooting Failed Jobs

### Check the Error Message
The job detail page shows error messages from Dataflow. Common errors include:
#### Permission Denied
- Verify Fireconduit has access to your Firebase and GCP projects
- Check that the service account has the required BigQuery and Firestore permissions
#### Schema Mismatch
- If writing to an existing table, ensure your pipeline schema matches the table schema
- Check that field types are compatible (e.g., don’t map a Firestore string to a BigQuery INTEGER)
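A pre-flight compatibility check can catch mismatches like the string-to-INTEGER example before a job runs. The mapping below is a rough sketch covering a few common Firestore-to-BigQuery pairings, not the full conversion rules:

```python
# Rough, illustrative Firestore-to-BigQuery type compatibility table.
# This covers a few common pairings only, not the complete rules.
COMPATIBLE = {
    "string":    {"STRING"},
    "integer":   {"INTEGER", "NUMERIC", "FLOAT"},
    "double":    {"FLOAT", "NUMERIC"},
    "boolean":   {"BOOLEAN"},
    "timestamp": {"TIMESTAMP"},
}

def check_field(firestore_type: str, bigquery_type: str) -> bool:
    """True if the Firestore type can plausibly land in the BigQuery column."""
    return bigquery_type in COMPATIBLE.get(firestore_type, set())

print(check_field("string", "INTEGER"))   # False: the mismatch called out above
print(check_field("integer", "NUMERIC"))  # True
```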
#### Quota Exceeded
- Check your GCP quotas for Dataflow, BigQuery, and Compute Engine
- Consider reducing max workers or running during off-peak hours
#### Collection Not Found
- Verify the collection path is correct
- Ensure the Firestore database name is correct
### View Dataflow Logs
For detailed troubleshooting, click the link to view the job in Google Cloud Console. Dataflow provides:
- Worker logs with detailed error traces
- Resource utilization graphs
- Step-by-step pipeline execution details
### Retry Failed Jobs

To retry a failed job, start a new backfill from the pipeline detail page. Each job is independent, so previous failures don’t affect new runs.
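Because each job is independent, automated retries reduce to "start a fresh backfill again, with backoff between attempts". A sketch, where `start_backfill` is a hypothetical callable standing in for however you trigger a backfill:

```python
import time
from typing import Callable

def retry_backfill(
    start_backfill: Callable[[], str],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
) -> str:
    """Start a new backfill, retrying with exponential backoff.

    `start_backfill` is a hypothetical callable that kicks off a job and
    returns its ID; it is a placeholder, not a Fireconduit function.
    """
    for attempt in range(max_attempts):
        try:
            return start_backfill()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")

# Simulate one transient failure followed by success.
calls = {"n": 0}
def fake_start() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient quota error")
    return "job-123"

print(retry_backfill(fake_start, base_delay_s=0.01))  # job-123
```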
## Best Practices

### Start with Small Collections
When testing a new pipeline, run a backfill on a small collection first to verify everything works correctly before processing large datasets.
### Monitor First Runs
Watch your first few backfill jobs closely. Check that:
- Documents are being processed at expected rates
- The data in BigQuery looks correct
- Costs are within expected ranges
### Set Up Alerts
Consider setting up GCP monitoring alerts for:
- Dataflow job failures
- Unusual resource usage
- Budget thresholds
### Avoid Concurrent Jobs
Running multiple backfill jobs on the same pipeline simultaneously can cause issues. Wait for one job to complete before starting another.
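If you script backfill triggers, a simple guard can enforce this: refuse to start a job while another job on the same pipeline is still Pending or Running. `list_jobs` below is a hypothetical lookup returning `(pipeline_name, status)` pairs:

```python
# Sketch of a guard against concurrent backfills on one pipeline.
# `list_jobs` is a hypothetical callable, not a Fireconduit function.
def can_start_backfill(pipeline: str, list_jobs) -> bool:
    active = {"Pending", "Running"}
    return not any(
        name == pipeline and status in active
        for name, status in list_jobs()
    )

jobs = [("orders-to-bq", "Running"), ("users-to-bq", "Succeeded")]
print(can_start_backfill("orders-to-bq", lambda: jobs))  # False
print(can_start_backfill("users-to-bq", lambda: jobs))   # True
```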