How I scraped wallapop
Hehe, you took the bait? I did not scrape them. Scraping their content is probably illegal anyway. So I did not scrape wallapop per se, but rather tried to extend their built-in functionality. Their built-in saved searches can deliver quite inaccurate results, and only once a day. Unfortunately, that latency was too high for the kind of product I was interested in, and the results were poor. Instead of personally pressing F5 every 5 minutes, I guess it is fine to program something that does it for you (legitimate purpose mantra accomplished).
It is obvious that the search query gets URL-encoded, so it looked like a pretty straightforward task. Not that easy, bud. Using Beautiful Soup directly on the query URL returned nothing. After inspecting the page source, I noticed that the page is loaded/rendered on the fly using React/JavaScript (excuse me if I am not technically precise, I mostly ignore this kind of jargon/technology). Then I asked god Google what I could do next (using Python). Some people suggested Selenium, so I tried it (and succeeded).
Selenium is a WebDriver (browser automation tool) with language bindings built on top of it that provide API-like functionality (e.g. selenium-python). So then, let's get to the point.
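To make that concrete: the whole trick is to let a real (headless) browser render the page and only then hand the resulting HTML to Beautiful Soup. Here's a minimal sketch of that round trip (same approach the full script below uses; it assumes chromedriver is on your PATH):
from bs4 import BeautifulSoup
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')  # no visible browser window
browser = webdriver.Chrome(options=option)  # assumes chromedriver is on your PATH
browser.get('https://es.wallapop.com/search?keywords=thermomix%20lidl')
browser.implicitly_wait(10)  # give the javascript some time to render
soup = BeautifulSoup(browser.page_source, 'html.parser')  # now the products are in the HTML
browser.quit()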
bot structure/infinite loop
We'll build a bot that:
- queries one or more URLs,
- refines the search using custom filters (string pattern matching),
- notifies us through our personal Slack if it finds any new items.
I used Slack because it is what I am used to, it has a free tier, and it does not require installing additional dependencies to use webhooks; of course you can use whatever you want (e.g. a Telegram bot, Discord, ...). If you have never done that before, here's Abhishek's video explaining how to set up webhooks with Slack that may come in handy.
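Once the webhook is created, a couple of lines are enough to check that it works (assuming you store the webhook URL in an environment variable, as the script below does):
import json
import os
import requests

webhook = os.environ.get('webhook_slack')  # same env variable name the script below uses
requests.post(webhook, json.dumps({'text': 'hello from the wallapop bot'}))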
You can do a first search (try to refine it using price ranges, dates, etc.) and reuse that URL. In this post we will use the following URL as an example query (split for readability; right after it there's a small snippet for building these query strings programmatically):
'https://es.wallapop.com/search?'
'time_filter=lastWeek&'
'keywords=thermomix%20lidl&'
'min_sale_price=100&'
'max_sale_price=350&'
'latitude=19.9033824&'
'longitude=-75.1018545&'
'filters_source=search_box'
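By the way, since the parameters are just URL-encoded, you can also build these query strings in code instead of copy-pasting them from the browser. A small sketch with urllib.parse (the parameter names are the ones from the URL above):
from urllib.parse import urlencode, quote

params = {
    'time_filter': 'lastWeek',
    'keywords': 'thermomix lidl',
    'min_sale_price': 100,
    'max_sale_price': 350,
    'latitude': 19.9033824,
    'longitude': -75.1018545,
    'filters_source': 'search_box',
}
# quote_via=quote encodes the space as %20 like above; the default '+' also works
url = 'https://es.wallapop.com/search?' + urlencode(params, quote_via=quote)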
Then, using selenium we will:
- (click on accept cookies)
- scroll down the page so products get loaded and we reach the "show more" button,
- loop over the last point until there are no more items,
- retrieve info from the tags using Beautiful Soup,
- store new (filtered) items in a dataframe and trigger a Slack message with the item and the link.
I did it with an endless script built around 4 functions. I'll share the whole script and comment it along the way.
Here's the anonymized code (gist). I am not sharing the entire repo because my code ended up intermingled with other stuff/bots and had even poorer coding practices.
import requests # post to webhook
import json # post to webhook
import os # retrieve environmental var webhook
import pandas as pd # build product table
import numpy as np # to generate randomness
import datetime # timestamps on prints
from bs4 import BeautifulSoup # soup the page contents
from selenium import webdriver # browser driver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options # to suppress the browser
from selenium.webdriver import DesiredCapabilities
import time # sleeps
# fixed settings
SCROLL_PAUSE_TIME = 5
SLEEP_BETWEEN_CRAWL = 300
urls = [(
'https://es.wallapop.com/search?'
'time_filter=lastWeek&'
'keywords=thermomix%20lidl&'
'min_sale_price=100&'
'max_sale_price=350&'
'latitude=19.9033824&'
'longitude=-75.1018545&'
'filters_source=search_box'
)] # add more urls if required.
# filters below are applied to the product title/description
words_included = ['thermomix','robot', 'cocina']
words_excluded = ['cecotec', 'taurus']
def report(textstr):
"""report textstr to webhooked slack channel"""
webhook = os.environ.get('webhook_slack')
data = {
'text': textstr
}
requests.post(webhook, json.dumps(data))
def get_wallapop_fields(tag):
"""scrutinizes product tags and returns a list of its fields"""
data_item_id = int(tag.get('data-item-id'))
price = float(tag.get('data-sell-price'))
title = tag.get('data-title')
children = list(tag.children)
link = 'https://es.wallapop.com'+children[1].get('href')
descr = tag.find('p', class_="product-info-description").get_text()
return [data_item_id, price, title, link, descr]
def pull_wallapop(url, words_included=[], words_excluded=[]):
"""core function of the script,
it will run in the endless loop"""
print(f"{datetime.datetime.now().strftime('%H:%M:%S')}: pulling wallapop")
# some config (else headless driver wont work)
capabilities = DesiredCapabilities.CHROME.copy()
capabilities['acceptSslCerts'] = True
capabilities['acceptInsecureCerts'] = True
option = webdriver.ChromeOptions()
option.add_argument("--no-sandbox")
option.add_argument("--window-size=1920x1080")
option.add_argument("--disable-extensions")
option.add_argument("--proxy-server='direct://'")
option.add_argument("--proxy-bypass-list=*")
option.add_argument("--start-maximized")
option.add_argument('--headless')
option.add_argument('--disable-gpu')
option.add_argument('--disable-dev-shm-usage')
option.add_argument('--ignore-certificate-errors')
# browser = webdriver.Chrome(
# "./notebooks/chromedriver_win32/chromedriver.exe",
# options=option, desired_capabilities=capabilities
# ) # debugging in a windows box
browser = webdriver.Chrome(
"/usr/bin/chromedriver", # path to binary
options=option, # options defined above
desired_capabilities=capabilities # idem
) # raspbian
browser.get(url) # loads page
browser.implicitly_wait(10) # let it load :)
try: # apparently not required with headless
# click on accept cookies
browser.find_element_by_xpath(
'/html/body/div[1]/div/div/div/div/div/div[3]/button[2]'
).click()
except:
print('xpath button not found')
pass
time.sleep(SCROLL_PAUSE_TIME+np.random.random())
# adding some randomness in our timings might help us not being detected as a bot
# Get scroll height
last_height = browser.execute_script("return document.body.scrollHeight")
iter = 0
print('starting scrolldown loop')
while True: # loop til there are no more page to scrolldown/load more products
cheight = browser.execute_script("return document.body.scrollHeight")
browser.execute_script(f"window.scrollTo(0, {cheight-1000});")
# Wait to load page
time.sleep(SCROLL_PAUSE_TIME+np.random.random())
try:
browser.find_element_by_xpath(
'//*[@id="more-products-btn"]/div/button'
).click() # clicks on "more products" button
print((f"{datetime.datetime.now().strftime('%H:%M:%S')}: "
"clicked in more products already"))
time.sleep(SCROLL_PAUSE_TIME+np.random.random())
except:
print((f"{datetime.datetime.now().strftime('%H:%M:S')}: "
"'click more products' not found in iter {iter}"))
browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
pass
# Calculate new scroll height and compare with last scroll height
new_height = browser.execute_script("return document.body.scrollHeight")
if new_height == last_height:
print(f'breaking at iter {iter} in no more scrolldown/height')
break
last_height = new_height
iter += 1
# use beautifulsoup to digest resulting page
soup = BeautifulSoup(browser.page_source,"html.parser")
results = soup.find_all("div", class_="card js-masonry-item card-product product")
browser.quit() # quit chrome/chromium
ll = []
# get list or lists [[product fields]]
for i in range(len(results)):
ll += [get_wallapop_fields(results[i])]
# create a table with it
df = pd.DataFrame(
ll, # our list of lists
columns=['item_id', 'price', 'title', 'link', 'descr']
).set_index('item_id')
df['keep']=False
if len(words_included): # apply filtering in title/description
pattern_in= '|'.join(words_included)
df.loc[df.title.str.contains(pattern_in, case=False), 'keep'] = True
df.loc[df.descr.str.contains(pattern_in, case=False), 'keep'] = True
if len(words_excluded):
pattern_out= '|'.join(words_excluded)
df.loc[df.title.str.contains(pattern_out, case=False), 'keep'] = False
df.loc[df.descr.str.contains(pattern_out, case=False), 'keep'] = False
print(f'returning a crawled + filtered df with shape {df[df.keep].shape}')
return df[df.keep] # just return those with keep flag
def report_(df,indexlist):
"""call report in those new indexes.
index is a list!"""
strout = '' # string to be crafted
for index in indexlist:
price = df.loc[index, 'price']
title = df.loc[index,'title']
link = df.loc[index,'link']
strout += f'found a new item for **{price}**\n**{title}**\nhere {link}\n\n'
report(strout)
# run script
if __name__=='__main__':
try:
try:
df_prev = pd.read_pickle('frame.pkl')
except:
print(('if first run, this is fine, '
'else you should not read this aka something is not working'))
df_prev= pd.DataFrame([])
while True:
continueflag = False
dflist = []
for url in urls:
try:
dflist += [pull_wallapop(
url,
words_included=words_included,
words_excluded=words_excluded)]
except Exception as e:
print((f'exception while pulling wallapop.\n{e}\n'
f'sleeping for {SLEEP_BETWEEN_CRAWL} and trying again'))
#raise e # uncomment to debug
continueflag = True
break
if continueflag:
time.sleep(SLEEP_BETWEEN_CRAWL)
continue
curr_df = pd.concat(dflist)
curr_df = curr_df[~curr_df.index.duplicated(keep='first')]
            if not len(df_prev): # first run, just save
                curr_df.to_pickle('frame.pkl')
                df_prev = curr_df # keep it in memory too, otherwise every loop looks like the first run
else:
old_items = df_prev.index.values
new_items = curr_df.index.values
# check whether there's something new
new_indexes = [x for x in new_items if x not in old_items]
if len(new_indexes): # if so, report those items
print('found all these new item ids',new_indexes)
report_(curr_df,new_indexes)
# update old and store. Clean manually from time to time*
df_prev = pd.concat([df_prev,curr_df])
# if ads get updated, this won't warn
df_prev = df_prev[~df_prev.index.duplicated(keep='last')]
# saving table avoids spamming after restarting the script
df_prev.to_pickle('frame.pkl')
print((f"{datetime.datetime.now().strftime('%H:%M:%S')}: "
f"waiting {SLEEP_BETWEEN_CRAWL}s till next crawl"))
time.sleep(SLEEP_BETWEEN_CRAWL)
except Exception as e:
topost = f'wallapop crawler died with exception:\n{e}'
report(topost)
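Side note: you may have noticed that WebDriverWait and expected_conditions are imported but never used; the script just relies on fixed sleeps. If you want something a bit less wasteful, an explicit wait on the "more products" button could look like this (a sketch I have not battle-tested, reusing the xpath from the script above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # 'browser' is the webdriver.Chrome instance created in pull_wallapop
    # wait up to 10 s for the button to be clickable instead of sleeping a fixed amount
    more_btn = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="more-products-btn"]/div/button'))
    )
    more_btn.click()
except Exception:
    pass  # no button left, probably no more products to load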
You can run it on a headless Raspberry Pi at ~1€/month in power consumption. If that's the case, take a look at berryconda (it saves a lot of time when installing pandas) and at how to install chromedriver on armv7l.
Disclaimer: do not abuse this. I am not encouraging you to do anything illegal. Also, after some time this script stopped working on the Raspberry (it still worked on a Windows box), so perhaps the chromium/webdriver interaction broke or it got blacklisted somehow. This strategy is widely used: when talking with sellers, most of the time I was not the first to contact them (with 5 min latency at most).
You can find similar resources here and here in Spanish.
Cheers and happy shopping :)
EDIT
I just found this repo, which I haven't tried, but it looks way more complete and ready to deploy.
EDIT2
Because this page is getting some views, here's an important addendum about something I did not know back when I wrote this post.
After trying to crawl sites that are more hostile to robots, I learned that when running in headless mode, the browser reports a user-agent containing HeadlessChrome instead of the regular Chrome string. In other words, we are announcing in plain sight that our browser instance is run by a bot. Perhaps this is not what you want.
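If you would rather have the instance announce itself as a regular browser, one option is to override the user-agent through a Chrome argument (any ordinary desktop UA string will do; the one below is just an example):
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')
# pretend to be a regular desktop Chrome instead of HeadlessChrome
option.add_argument(
    '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
)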
Check this repo or these posts (1, 2) for more info (or google it, there are a lot of results).
Cheers!