The Internet is becoming more and more important in our day-to-day lives. The amount of data produced on the Internet increased from 0.1 zettabytes in 2013 to 4.4 zettabytes in 2020. Scraping, storing, and processing data at this scale has become a challenge.
Saturn Data built a scalable and cost-effective solution to handle these challenges:
Scalable enough to scrape 1000+ mobile apps
Reasonable crawling speed
Store 10+ Petabytes of data
Ensure data quality and adapt to mobile app changes
Query 10+ Petabytes of data
The specs of scraping
The specs consist of what to scrape and how often to scrape it. Clients have very different requirements, and Spec service and Scheduler are how we meet them. We convert the requirements into structured data. Scheduler starts jobs at the configured cadence. Spec service retrieves the structured specs and returns them to Fetchers.
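As a rough illustration of specs as structured data with a cadence, here is a minimal sketch. The `ScrapeSpec` fields, `is_due`, and `due_specs` are hypothetical names chosen for this example and are not Saturn Data's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ScrapeSpec:
    """Hypothetical structured form of one client requirement."""
    app_id: str          # which mobile app to scrape
    platform: str        # "ios" or "android"
    fields: tuple        # what to scrape, e.g. ("title", "price")
    cadence: timedelta   # how often to scrape


def is_due(spec, last_run, now):
    """A spec is due if it has never run or its cadence has elapsed."""
    return last_run is None or now - last_run >= spec.cadence


def due_specs(specs, last_runs, now):
    """Return the specs a scheduler would start at time `now`."""
    return [s for s in specs if is_due(s, last_runs.get(s.app_id), now)]
```

A scheduler loop could call `due_specs` periodically and start one job per returned spec, while the same `ScrapeSpec` objects are what a spec service would hand to the fetchers.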
Scheduler kicks off Seed Fetch service, which gets the seed URLs for a mobile app. All the URLs are sent to a message queue for asynchronous processing. Fetcher/Renderer reads the URLs and parses out the desired information from iOS or Android based on the specs, then stores the docs in object storage (S3) and the desired information in a database. The fetcher also calls Doc Dedupe service to eliminate duplicate docs and sends next-level unprocessed docs to URL extractor. URL extractor extracts next-level URLs and saves them in the database. Seed Fetch service reads the next-level URLs and repeats the above process.
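The flow above can be sketched in a single process. This is only an illustration under simplifying assumptions: an in-memory `deque` stands in for the message queue, a dict stands in for object storage, and deduplication is done by content hash. The `crawl` function and its `fetch` and `extract_urls` parameters are hypothetical, not the production services.

```python
import hashlib
from collections import deque


def crawl(seed_urls, fetch, extract_urls, max_depth=2):
    """Breadth-first crawl: fetch, dedupe docs, extract next-level URLs."""
    queue = deque((url, 0) for url in seed_urls)   # stand-in for the message queue
    seen_urls = set(seed_urls)
    seen_hashes = set()                            # stand-in for Doc Dedupe state
    stored = {}                                    # stand-in for object storage

    while queue:
        url, depth = queue.popleft()
        doc = fetch(url)
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue                               # duplicate doc: skip it
        seen_hashes.add(digest)
        stored[url] = doc
        if depth >= max_depth:
            continue                               # do not expand past max_depth
        for next_url in extract_urls(doc):
            if next_url not in seen_urls:
                seen_urls.add(next_url)
                queue.append((next_url, depth + 1))
    return stored
```

In a distributed setting each of these in-memory structures would be replaced by a shared service, which is what lets fetchers scale independently of seed generation.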
To allow querying 10+ Petabytes of data, we leverage Apache Spark, a unified engine for large-scale data analytics. Apache Spark has these benefits:
Up to 100x faster than a relational database like MySQL on large analytical queries
Ease of use. Spark SQL is used to query the data and is familiar to most engineers and scientists
Cost-effective because of its in-memory data processing
Validation service validates the results by inspecting data volume, key columns, data aggregations, machine learning models, and so on. Monitors and alarms are in place to ensure reliability. The data is delivered in many formats, such as CSV. We also offer business intelligence reports that provide insights into the data (data mining).
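Two of the checks mentioned above, data volume and key columns, might look like the following minimal sketch. The function name `validate_batch`, the 20% tolerance, and the row-as-dict representation are assumptions made for illustration only.

```python
def validate_batch(rows, key_columns, expected_rows, tolerance=0.2):
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []

    # Volume check: flag batches whose size drifts too far from the expected count.
    if expected_rows > 0:
        drift = abs(len(rows) - expected_rows) / expected_rows
        if drift > tolerance:
            problems.append(
                f"row count {len(rows)} deviates {drift:.0%} from expected {expected_rows}"
            )

    # Key column check: every row must carry a non-empty value for each key column.
    for col in key_columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        if missing:
            problems.append(f"{missing} rows missing key column '{col}'")

    return problems
```

A non-empty result is the kind of signal that could feed the monitors and alarms, for example by blocking delivery of the batch until someone reviews it.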
Saturn Data collects data at 500+ QPS from mobile apps across the world, so handling mobile requests at large scale is key to our business. We optimized Fetcher/Renderer in these ways:
Event-driven microservice. It can handle URL requests at any scale.
Provisioning a machine is fast. A container image pre-configures the machine's environment and code.
An idle machine incurs no cost, so it is cost-effective.
Saturn Data's services are designed for 99.9% availability across multiple data regions. Here is how we achieved it:
Built a custom Http Manager that sends requests to a mobile app at a pre-computed rate so that the target websites are not overburdened. Http Manager also retries failed requests to increase the availability of our services.
Restore scraping from the previously stored state at any time. If the target website is down or unavailable, our service stores the state in the database.
Data replication and fault tolerance. Data is replicated with eventual consistency.
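The rate limiting and retry behavior described for Http Manager can be sketched as follows. This is not the actual Http Manager: the class shape, the fixed-interval pacing, and the choice to retry only on `ConnectionError` are assumptions for illustration, and the `send` callable is a placeholder for a real HTTP client.

```python
import time


class HttpManager:
    """Paces requests at a fixed rate and retries transient failures."""

    def __init__(self, send, rate_per_sec, max_retries=3,
                 clock=time.monotonic, sleep=time.sleep):
        self._send = send                      # callable that performs the request
        self._interval = 1.0 / rate_per_sec    # minimum spacing between requests
        self._max_retries = max_retries
        self._clock = clock                    # injectable for testing
        self._sleep = sleep
        self._next_slot = 0.0                  # earliest time the next request may go out

    def _wait_for_slot(self):
        now = self._clock()
        if now < self._next_slot:
            self._sleep(self._next_slot - now)
            now = self._next_slot
        self._next_slot = now + self._interval

    def request(self, url):
        last_error = None
        # One initial attempt plus up to max_retries retries, all rate limited.
        for _ in range(self._max_retries + 1):
            self._wait_for_slot()
            try:
                return self._send(url)
            except ConnectionError as exc:
                last_error = exc
        raise last_error
```

Because retries pass through the same pacing as first attempts, a struggling target is never hit faster than the pre-computed rate, even while requests are failing.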
Scraping data from mobile apps in a scalable and cost-effective manner is very challenging and requires sophisticated infrastructure.
Saturn Data makes mobile app scraping simple and accessible to everyone. The price starts from $9.99. Contact us today!