Data Extraction at Scale
Choosing the Right Approach
Extracting data in bulk is a common requirement for our clients, whether for reporting, analytics, migrations, or integrations with other systems. The best method for retrieving large volumes of data depends on several factors, including the type of data required, how quickly it needs to be available, and the impact of the extraction on the system for other users.
This article explores three primary approaches to bulk data extraction—APIs, scheduled exports, and database access—outlining their advantages, limitations, and ideal use cases. By the end, you’ll have a clear understanding of which method best suits your needs, helping you optimize performance while ensuring data accuracy.
API Access
Ideal for:
- Fetching specific records or small batches of records
- Filtering records according to commonly used criteria
Bipsync's API is a good place to start for users wishing to explore the data that is available to them. Our API documentation gives some insight into the endpoints that we offer. All data is returned in JSON format.
Users wishing to read data will find the API most suitable for simple use cases: fetching individual records or small batches of records. Paginating through a dataset is appropriate when the collection is relatively small (fewer than 10,000 records) or has been filtered down to a comparable size by applying criteria via request parameters.
For datasets that exceed this size, the limitations of pagination windows mean that a full extraction will likely take a significant amount of time. Users wishing to shorten this time often increase the number of requests that they make to the API, but the system will rate limit requests that it deems too demanding. At this point a different export strategy is worth considering, both to make the extraction more efficient and to reduce strain on the system.
Scheduled Exports
Ideal for:
- Bulk export of data either at regular intervals or as single events
- Filtering records according to commonly used criteria
Bipsync's Archiver is able to compile data extracts into zip archives that can then be delivered to users in several ways: placed into an AWS S3 bucket, uploaded to an SFTP location, and so on.
This approach is the quickest way to access data in the same format in which it is provided via the API when the size of the dataset is substantial. This is because Bipsync handles the extraction and performs it without the overhead of HTTP communication, which means we can commit more resources to the process and reduce the impact on the overall system. We also have the ability to apply a greater range of criteria to the extraction query, because the API parameters only cover common use cases.
The exports can contain multiple datasets. They can be run either as a one-off, ad hoc export or repeatedly on a schedule. The output does not have to be JSON; we also support Word and PDF formats.
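Once an archive has been delivered, reading it is straightforward with standard tooling. The following is a minimal sketch that assumes the JSON datasets sit in the archive as files ending in `.json`; the actual internal layout of an export is not described here, and the member names used (`research.json`, `contacts.json`, `report.pdf`) are invented for the example.

```python
import io
import json
import zipfile
from typing import Any, BinaryIO, Iterator, Tuple, Union


def iter_json_members(archive: Union[str, BinaryIO]) -> Iterator[Tuple[str, Any]]:
    """Yield (member_name, parsed_json) for each .json file inside a zip archive."""
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".json"):
                with zf.open(name) as fh:
                    yield name, json.load(fh)


# Build a small in-memory archive so the sketch runs on its own.
# The member names and contents below are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("research.json", json.dumps([{"id": 1}, {"id": 2}]))
    zf.writestr("contacts.json", json.dumps([{"id": 7}]))
    zf.writestr("report.pdf", b"not json")  # non-JSON members are skipped
buffer.seek(0)

counts = {name: len(data) for name, data in iter_json_members(buffer)}
print(counts)
```

The same function accepts a file path, so it can be pointed at an archive downloaded from an S3 bucket or SFTP location.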
Scheduled Exports can be arranged via one of our CSMs.
Database Access
Ideal for:
- Fetching individual records or medium-sized batches of data in real time
- Filtering records according to simple and complex criteria
Clients that have access to our Reporting and Analytics Platform (RAP) are able to query a read-only MySQL database that contains much of the data available via the API or Scheduled Export methods. The suitability of this method will depend on the nature of the datasets that are required:
- All types of content (research, entities, contacts etc.) are included
- Text and HTML field values are included on demand
- "Grouped" field values are not included
This approach might require more preparation than the other methods, but since it involves direct connection to a database dedicated to this task, it is very fast and has no impact on the rest of the system. Because it is SQL based, it is the most powerful way to filter the content based on any criteria that should apply. By leveraging SQL features like joins, grouping and distinct queries, it's possible to compile datasets in a fraction of the time that would be required to achieve the same result via the API or Scheduled Export strategies.
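To illustrate the kind of query that joins and grouping make possible, the sketch below counts research notes per entity in a single statement. Everything about the schema is an assumption: the `entities` and `research` tables and their columns are invented for the example and do not describe the RAP database. Python's built-in `sqlite3` module stands in for the MySQL connection so that the example is self-contained.

```python
import sqlite3

# sqlite3 is a stand-in for the read-only MySQL database.
# The tables, columns, and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE research (id INTEGER PRIMARY KEY, entity_id INTEGER, created_at TEXT);
    INSERT INTO entities VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO research VALUES
        (1, 1, '2024-01-10'), (2, 1, '2024-02-03'), (3, 2, '2024-02-20');
""")

# One query joins the two tables, filters by date, and aggregates per entity.
QUERY = """
    SELECT e.name, COUNT(r.id) AS note_count, MAX(r.created_at) AS latest
    FROM entities AS e
    JOIN research AS r ON r.entity_id = e.id
    WHERE r.created_at >= ?
    GROUP BY e.id, e.name
    ORDER BY note_count DESC, e.name
"""
rows = conn.execute(QUERY, ("2024-01-01",)).fetchall()
for row in rows:
    print(row)
```

Producing the same summary through the API would mean fetching every record and aggregating on the client. Note that common Python MySQL drivers use `%s` rather than `?` as the parameter placeholder.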
If you don't have access to our Reporting and Analytics Platform and would like to, please get in touch to discuss this with us.
Conclusion
There are many ways to extract data from Bipsync, each with its own advantages and disadvantages, but ultimately any use case that a client may have will be covered by an appropriate strategy. We're always happy to talk to clients to help them decide the best strategy for a given task, and hopefully this document has served to outline some of the possibilities.
