A data firm used web scraping to collect the data it needed to create in-depth profiles on millions of people and businesses. Discover what web scraping is and why it can both help and hurt your business.
The data firm LocalBlox is in the business of building and selling profiles of people and companies. The firm uses web scraping to collect data from various websites, combines it with other data (e.g., purchased marketing data), and then stitches the information together to create comprehensive profiles of businesses and individuals.
For example, an individual’s profile might include the person’s name, age, addresses (IP, physical, and email), phone number, job title, current employer, income level, and lifestyle information (e.g., pet owner). A company’s profile might include its name, addresses (IP, physical, and email), phone number, annual sales, year of establishment, industrial classification (NAICS), and number of Facebook Likes. LocalBlox sells these profiles to anyone interested in using them for targeted advertising, political campaigning, or other purposes.
The firm had stored the profiles — and the 48 million data records used to create them — in a storage container (aka bucket) in the Amazon Simple Storage Service (Amazon S3) public storage cloud. Even though this bucket was unlisted, a cyber risk team found it and discovered that it was not protected with a password. As a result, the team was able to access the data, which was in human-readable format. After the team notified LocalBlox about the issue, the firm secured the bucket.
Anyone could have downloaded the bucket's contents while it was unsecured, just as the cyber risk team did. It is unknown whether any hackers took advantage of this situation. Either way, this incident highlights the importance of companies password-protecting any data they store in the cloud. It also calls attention to a common practice that businesses need to be aware of: web scraping.
Web Scraping 101
To collect publicly available content from websites, people use a process called web scraping. Typically, it involves using bots and other automated technologies to extract data from sites. Search engines rely on this kind of automated collection to build the indexes they use to return and rank search results.
Firms like LocalBlox also use web scraping to collect data for marketing, data mining, and other business uses. Based on the types of information found in the profiles, LocalBlox likely scrapes data from businesses' websites, social networks (e.g., LinkedIn, Facebook), and other types of sites.
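To make the extraction step more concrete, here is a minimal Python sketch of what a scraping bot might do once it has a page's HTML. It is an illustration only, not LocalBlox's actual method. The sample page, the ContactScraper class, and the scrape function are all hypothetical, and the email pattern is deliberately simple. A real bot would first download the HTML over the network; this sketch parses a hard-coded string instead.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; a real bot would download this HTML over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Example Pet Supply Co.</h1>
  <p>Email us at <a href="mailto:info@example.com">info@example.com</a>.</p>
  <a href="/about">About us</a>
</body></html>
"""


class ContactScraper(HTMLParser):
    """Collects the page heading, link targets, and email addresses."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.heading = ""
        self.links = []
        self.emails = set()

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.heading += data.strip()
        # Simple pattern for illustration; it will not match every valid address.
        self.emails.update(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", data))


def scrape(html):
    """Parse one page and return the extracted fields as a small record."""
    parser = ContactScraper()
    parser.feed(html)
    return {
        "name": parser.heading,
        "links": parser.links,
        "emails": sorted(parser.emails),
    }


profile = scrape(SAMPLE_HTML)
print(profile)
```

Run against thousands of pages and combined with other data sources, small records like this one are the raw material for the kinds of profiles described earlier.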
Is It Legal?
Web scraping is usually done without the knowledge or consent of the people whose data is being collected. Although this might sound illegal, there are no laws against it in many parts of the world, including the European Union and the United States. However, any company scraping EU citizens’ data needs to comply with the General Data Protection Regulation (GDPR). The requirements include getting citizens’ consent to collect, process, and store their personal data. Plus, companies must provide an easy way for people to withdraw their consent. Since bots scrape large amounts of data automatically, meeting GDPR requirements might prove very difficult.
In the United States, there have been a few court cases dealing with using scraped information for data mining. In a notable 2017 case, a small firm, hiQ Labs, sued LinkedIn, which Microsoft owns, after LinkedIn ordered the company to stop scraping the data publicly posted by LinkedIn users. The judge ruled that LinkedIn must let hiQ Labs scrape this data, a decision that LinkedIn is appealing.
Web Scraping Can Be Both Helpful and Harmful
Web scraping can be beneficial. For example, it lets your business be included in search engine results. However, it has the potential to cause harm as well. Firms might use the data they collect for illegal or unethical purposes, such as stealing copyrighted data or undercutting prices. Cybercriminals might also scrape websites to get information for use in cyberattacks. Thus, it pays to take a few precautions: