  • Gary Shum

Mobile App Scraping as a Service at Saturn Data

Updated: Jun 29

The Internet is becoming more and more important in our day-to-day lives. The amount of data produced on the Internet grew from 0.1 zettabytes in 2013 to 4.4 zettabytes in 2020. Scraping, storing, and processing data at this scale has become a challenge.


Saturn Data built a scalable and cost-effective solution to handle these challenges:

  • Scale to scrape 1000+ mobile apps

  • Reasonable crawling speed

  • Store 10+ Petabytes of data

  • Ensure data quality and adapt to mobile app changes

  • Query 10+ Petabytes of data



Architecture Overview



The specs of scraping

The specs consist of what to scrape and how often to scrape it. Clients have very different requirements, so we convert each requirement into structured data. Two components work from these specs: the Scheduler starts jobs at the configured cadence, and the Spec service retrieves the structured specs and returns them to the Fetchers.
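As a rough illustration of what "structured specs" and a cadence check could look like, here is a minimal sketch. The ScrapeSpec schema, its field names, and the due_specs helper are assumptions made for this example, not Saturn Data's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ScrapeSpec:
    """A client requirement as structured data (hypothetical schema)."""
    app_id: str                          # which mobile app to scrape
    fields: List[str]                    # what to extract from each doc
    cadence: timedelta                   # how often to scrape
    last_run: Optional[datetime] = None  # None means the job has never run


def due_specs(specs: List[ScrapeSpec], now: datetime) -> List[ScrapeSpec]:
    """Return the specs whose cadence has elapsed since their last run."""
    return [
        s for s in specs
        if s.last_run is None or now - s.last_run >= s.cadence
    ]


specs = [
    # Ran 12 hours ago with a daily cadence: not due yet.
    ScrapeSpec("app.daily", ["title", "price"], timedelta(days=1),
               last_run=datetime(2024, 1, 1, 0, 0)),
    # Ran 90 minutes ago with an hourly cadence: due.
    ScrapeSpec("app.hourly", ["rating"], timedelta(hours=1),
               last_run=datetime(2024, 1, 1, 10, 30)),
    # Never ran: due.
    ScrapeSpec("app.new", ["title"], timedelta(days=7)),
]
now = datetime(2024, 1, 1, 12, 0)
due = [s.app_id for s in due_specs(specs, now)]
print(due)
```

In this sketch a scheduler would call due_specs on every tick and start one job per returned spec, and a fetcher would read the fields list to know what to extract.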


Scraping services

The Scheduler kicks off the Seed Fetch service, which gets the seed URLs for a mobile app. All the URLs are sent to a message queue for asynchronous processing. The Fetcher/Renderer reads the URLs and parses out the desired information from the iOS or Android app based on the specs, then stores the docs in object storage (S3) and the desired information in a database. The Fetcher also calls the Doc Dedupe service to eliminate duplicate docs and sends next-level unprocessed docs to the URL extractor. The URL extractor extracts next-level URLs and saves them in the database. The Seed Fetch service then reads the next-level URLs and repeats the above process.
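The loop described above can be sketched in a single process. This is only an illustration under simplifying assumptions: a deque stands in for the message queue, a dict stands in for object storage, duplicates are detected by a SHA-256 hash of the doc body, and the crawl function and the fake PAGES data are hypothetical names invented for the example.

```python
import hashlib
from collections import deque
from typing import Callable, Dict, List, Set, Tuple

# A fetch function returns (doc body, next-level URLs found in the doc).
FetchFn = Callable[[str], Tuple[str, List[str]]]


def crawl(seeds: List[str], fetch: FetchFn, max_docs: int = 1000) -> Dict[str, str]:
    """Breadth-first crawl with URL and doc de-duplication.

    Returns a mapping of content hash -> doc body (stand-in for object storage).
    """
    queue = deque(seeds)               # stand-in for the message queue
    seen_urls: Set[str] = set(seeds)
    store: Dict[str, str] = {}

    while queue and len(store) < max_docs:
        url = queue.popleft()
        body, next_urls = fetch(url)

        # Doc dedupe: skip docs whose content was already stored.
        # Duplicates are not passed on to URL extraction.
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in store:
            continue
        store[digest] = body

        # URL extraction: enqueue only next-level URLs not seen before.
        for nxt in next_urls:
            if nxt not in seen_urls:
                seen_urls.add(nxt)
                queue.append(nxt)
    return store


# Fake app with four screens; "/b" and "/c" serve identical content.
PAGES = {
    "/home": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/c", "/home"]),
    "/b": ("same", []),
    "/c": ("same", []),
}
docs = crawl(["/home"], lambda url: PAGES[url])
print(sorted(docs.values()))
```

Four URLs are fetched but only three docs are stored, because the second copy of the identical content is dropped before storage.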


Query engine

To allow querying 10+ Petabytes of data, we leverage Apache Spark, a unified engine for large-scale data analytics. Apache Spark has these benefits:

  • Up to 100x faster than a relational database such as MySQL

  • Ease of use. Spark SQL is used to query the data and is familiar to most engineers and scientists

  • Cost effective because of its in-memory data processing


Reporting services

The Validation service validates the results by inspecting data volume, key columns, data aggregations, machine learning models, and so on. Monitors and alarms are in place to ensure reliability. The data is delivered in many formats, such as CSV. We also offer business intelligence reports that provide insights into the data (data mining).
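Two of the checks named above, data volume and key columns, are simple enough to sketch, together with CSV delivery. The function names, the 20% volume tolerance, and the sample rows are assumptions for illustration; the aggregation and machine-learning checks are not covered here.

```python
import csv
import io
from typing import Dict, List

Row = Dict[str, str]


def validate_batch(rows: List[Row], key_columns: List[str],
                   expected_rows: int, tolerance: float = 0.2) -> List[str]:
    """Return a list of problems; an empty list means the batch passed."""
    problems: List[str] = []

    # Data volume check: row count within +/- tolerance of the expectation.
    low = expected_rows * (1 - tolerance)
    high = expected_rows * (1 + tolerance)
    if not low <= len(rows) <= high:
        problems.append(f"volume: got {len(rows)} rows, expected about {expected_rows}")

    # Key column check: every key column must be present and non-empty.
    for col in key_columns:
        missing = sum(1 for r in rows if not r.get(col))
        if missing:
            problems.append(f"column {col!r}: {missing} empty values")
    return problems


def to_csv(rows: List[Row], columns: List[str]) -> str:
    """Serialize the selected columns as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    {"app_id": "a1", "title": "Foo", "price": "1.99"},
    {"app_id": "a2", "title": "", "price": "0.99"},
]
problems = validate_batch(rows, key_columns=["app_id", "title"], expected_rows=2)
print(problems)
print(to_csv(rows, ["app_id", "title"]))
```

A non-empty problems list is the kind of signal that could feed the monitors and alarms mentioned above before any data is delivered.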



Scalability

Saturn Data collects data at 500+ QPS from mobile apps across the world, so handling mobile requests at large scale is key to our business. We optimized the Fetcher/Renderer in these ways:

  • Event-driven microservice. It can handle URL requests at any scale.

  • Provisioning a machine is fast. A container image pre-configures the machine's environment and code.

  • An idle machine incurs no cost, which makes the service cost-effective.


Availability

Saturn Data's services are designed for 99.9% availability across multiple data regions. Here is how we achieved it:

  • Built a custom HTTP Manager that sends requests to a mobile app at a pre-computed rate so that it does not overburden the target websites. The HTTP Manager also retries failed requests to increase the availability of our services.

  • Restore scraping from the previously stored state at any time. If the target website is down or unavailable, our service stores the state in the database so that scraping can resume later.

  • Data replication and fault tolerance. Data is replicated with eventual consistency.
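The rate-limiting and retry behaviour of the HTTP Manager can be sketched as follows. This is not the actual implementation: the class shape, the fixed-interval pacing, the exponential backoff, and the injected clock and sleep functions (used so the demo runs without real waiting) are all assumptions for the example.

```python
import time
from typing import Callable, List


class HttpManager:
    """Paces requests at a fixed rate and retries failures with backoff."""

    def __init__(self, send: Callable[[str], str], max_rps: float,
                 max_retries: int = 3, backoff_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._send = send
        self._min_interval = 1.0 / max_rps   # pre-computed gap between requests
        self._max_retries = max_retries
        self._backoff_s = backoff_s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def _pace(self) -> None:
        # Wait until the next request slot, then reserve the following one.
        wait = self._next_allowed - self._clock()
        if wait > 0:
            self._sleep(wait)
        self._next_allowed = self._clock() + self._min_interval

    def request(self, url: str) -> str:
        for attempt in range(self._max_retries + 1):
            self._pace()
            try:
                return self._send(url)
            except ConnectionError:
                if attempt == self._max_retries:
                    raise
                # Exponential backoff: 0.5s, 1s, 2s, ... by default.
                self._sleep(self._backoff_s * 2 ** attempt)
        raise RuntimeError("unreachable")


class FakeClock:
    """Test clock whose sleep() advances time instantly and records waits."""

    def __init__(self) -> None:
        self.now = 0.0
        self.sleeps: List[float] = []

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.sleeps.append(seconds)
        self.now += seconds


calls: List[str] = []


def flaky_send(url: str) -> str:
    """Hypothetical transport that fails on its first two calls."""
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("backend unavailable")
    return f"ok:{url}"


fake = FakeClock()
manager = HttpManager(flaky_send, max_rps=2, clock=fake.time, sleep=fake.sleep)
first = manager.request("/listing/1")   # succeeds on the third attempt
second = manager.request("/listing/2")  # delayed by pacing, succeeds at once
print(first, second, fake.sleeps)
```

In the demo, the first two recorded sleeps come from retry backoff and the third comes from pacing, which holds the second request back until its slot at 2 requests per second.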


Conclusion

Scraping data from mobile apps in a scalable and cost-effective manner is very challenging and requires sophisticated infrastructure.


Saturn Data makes mobile app scraping simple and accessible to everyone. Prices start at $9.99. Contact us today!










