Facebook: Scraping by the Numbers – About Facebook


Over the last month, we’ve been providing information about an internet-wide issue known as scraping. Scraping is the automated collection of data from a website or app. It can be done through authorized means, such as web crawling by a search engine, or through unauthorized means, which involve using automation to collect information in violation of our terms of service. Those who scrape through unauthorized means often try to disguise their activity so that it blends in with ordinary usage.

We’ve previously posted about how scraping works and how we are combating it. In this post, we will provide more details about our efforts to fight unauthorized scraping, and offer a deeper dive into the topic of “phone number enumeration” — a scraping technique that was at the center of recent reports about scraping on our platform.

We believe it’s important to be more transparent about our work to combat different forms of abuse on our platform. That’s why today we also launched our new Transparency Center, which provides a single destination for information about our integrity efforts. We also just published our latest Transparency Report for the second half of 2020, as well as our Community Standards Enforcement Report for the first quarter of this year.

How We Protect Against Data Misuse

Scraping affects a wide variety of companies and industries. Beyond social media platforms like Facebook, LinkedIn and Clubhouse, data scrapers have also collected personal information from home fitness equipment companies like Echelon and health apps like Strava, as well as from industries such as banking, e-commerce and hospitality. Any website or app through which data can be publicly accessed is a potential scraping target.

Facebook is well aware of this risk, and while we can never eliminate it entirely, we have several measures in place to mitigate the risk of scraping on our platform. For example:

  • We built an External Data Misuse team that consists of more than 100 people dedicated to detecting, investigating and blocking patterns of behavior associated with scraping.
  • We impose rate and data limits, which are designed to restrict how much data a single person can obtain through a certain feature, and put other obstacles in place against unauthorized automation. We block billions of suspected scraping actions per day across Facebook and Instagram.
  • We work with researchers to find and secure publicly accessible datasets that contain Facebook user data, whether the data appears to have originated from Facebook or from a Facebook app developer. These datasets are found across a range of hosting providers and online platforms. The malicious actors who trade or sell these datasets often recycle or manipulate them over time, which means that many of them contain duplicate or inaccurate data.
  • If we find scraped datasets containing Facebook data, there are no surefire options for getting them taken down or going after those responsible for them, but we may take a number of actions. 
  • In the past year, we’ve taken over 300 enforcement actions against people who abuse our platform, including sending cease and desist letters, disabling accounts, filing lawsuits and requesting assistance from hosting providers to get scraped datasets taken down. In a recent case, we successfully reached a settlement with the operator of Massroot8, a service that violated our Terms. Along with shutting down the service, we permanently banned the operator and anyone acting on his behalf from Facebook and Instagram.
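
To make the rate and data limits in the list above more concrete, here is a minimal Python sketch of the general idea. It is not Facebook’s implementation: the class name LimitedFeature, its parameters and the thresholds are all hypothetical, chosen only for illustration. It combines a sliding-window rate limit (how often one account may call a feature) with a data limit (how many records that account may receive in total).

```python
import time
from collections import defaultdict, deque


class LimitedFeature:
    """Hypothetical per-account limiter (illustrative only, not a real API)."""

    def __init__(self, max_requests, window_seconds, max_records, clock=time.monotonic):
        self.max_requests = max_requests      # rate limit: requests per window
        self.window_seconds = window_seconds  # length of the sliding window
        self.max_records = max_records        # data limit: total records per account
        self.clock = clock                    # injectable clock, useful for testing
        self._requests = defaultdict(deque)   # account -> timestamps of recent requests
        self._records = defaultdict(int)      # account -> records returned so far

    def allow(self, account_id, records_requested):
        """Return True and record the request if it fits within both limits."""
        now = self.clock()
        recent = self._requests[account_id]
        # Forget requests that have aged out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # too many requests in the current window
        if self._records[account_id] + records_requested > self.max_records:
            return False  # this request would exceed the data budget
        recent.append(now)
        self._records[account_id] += records_requested
        return True


if __name__ == "__main__":
    feature = LimitedFeature(max_requests=2, window_seconds=10, max_records=100)
    print([feature.allow("account-1", 10) for _ in range(3)])
```

In this sketch the data budget never resets; a real system would presumably reset it on some schedule and would pair limits like these with the other obstacles to automation mentioned above.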

Phone Number Enumeration

One particular scraping technique that we have worked hard to combat is known as “phone number enumeration.” This involves using automated tools at scale to retrieve information about people based on their phone numbers. 

Before a set of improvements we made in September 2019, scrapers found ways to abuse various contact discovery features that were designed to help people find and connect with their contacts on Facebook. These features included the contact importer, which people could use to upload contacts from their mobile devices to Facebook and find matching people based on their phone numbers. We believe the scrapers used phone number enumeration to abuse this feature and scrape information. Here’s how phone number enumeration generally works using contact importer functionality. You can also check out this visual depiction of the process to see how we work to combat this technique.

  • With phone number enumeration, scrapers target densely populated areas that have an abundance of mobile phone numbers that are likely to be associated with accounts on Facebook, or other popular platforms. 
  • They choose a phone number format and automatically generate a list of target phone numbers. 
  • These numbers are used to create contact lists on a large number of simulated mobile devices. The scrapers spread their activity across numerous simulated devices to avoid tripping rate or data limits and to try to blend in with ordinary user activity. 
  • The various simulated devices are each used to upload a contact list (each containing a segment of the phone numbers on the scrapers’ list) to the contact importer of the targeted website or app.
  • By design, the contact importer returns information about matching contacts, subject to their privacy settings. The scrapers aggregate this information over time into a separate database.
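
Because the steps above rely on automatically generated number lists, one defensive signal is whether an uploaded contact list looks generated rather than organic. The Python sketch below is a simplified heuristic under stated assumptions, not a description of Facebook’s detection: the function names, the 0.8 threshold and the minimum list size of 20 are all hypothetical. It measures what fraction of the unique numbers in an upload sit directly next to another number when sorted.

```python
import re


def normalize(number):
    """Keep digits only, so '+1 (555) 010-0001' becomes '15550100001'."""
    return re.sub(r"\D", "", number)


def sequential_fraction(numbers, max_gap=1):
    """Fraction of adjacent pairs (after sorting) that differ by at most max_gap."""
    values = sorted({int(d) for d in map(normalize, numbers) if d})
    if len(values) < 2:
        return 0.0
    close = sum(1 for a, b in zip(values, values[1:]) if b - a <= max_gap)
    return close / (len(values) - 1)


def looks_enumerated(numbers, threshold=0.8, min_size=20):
    """Flag large uploads that consist mostly of consecutive numbers.

    threshold and min_size are illustrative assumptions, not tuned values.
    """
    unique = {normalize(n) for n in numbers} - {""}
    return len(unique) >= min_size and sequential_fraction(numbers) >= threshold


if __name__ == "__main__":
    generated = [f"+1555010{n:04d}" for n in range(50)]  # one contiguous block
    print(looks_enumerated(generated))
```

A per-upload check like this is easy to evade by shuffling numbers across many simulated devices, as described above, so it would only be useful as one signal among several, ideally computed across uploads rather than within a single one.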

The changes to the contact importer feature that we described above were focused on combating this technique. Because scrapers are always changing their methods, we regularly review and update our defenses to try to stay ahead of them. We detailed some of our methods, including rate limits, data limits, behavioral detection and other protections, in a previous post.

To be clear, our first line of defense against unauthorized scraping is to make it as hard as we can for people’s data to be collected at scale. We want people to feel comfortable using our services, with confidence that we protect their information, so we work to limit access to our features by scrapers while enabling people to continue using those features in order to connect and share with others.
