Scraping wallapop

How I scraped wallapop

Hehe, did you take the bait? I did not scrape them. Scraping their content is probably illegal, so I did not scrape wallapop per se but tried to extend their built-in functionality. Their custom saved searches can deliver rather inaccurate results, and only once a day. Unfortunately, that latency was too high for the kind of product I was interested in, and the results were poor. Instead of pressing F5 every 5 minutes myself, I guess it is fine to program something that does it for you (legitimate-purpose mantra accomplished).

The search query obviously gets URL-encoded, so it looked like a pretty straightforward task. Not that easy, bud. Using Beautiful Soup directly on the query URL returned nothing. After inspecting the page source, I noticed that the page is rendered on the fly using React/JavaScript (excuse me if I am not technically precise; I am not fluent in that jargon/technology). Then I asked god Google what I could do next (using Python). Some people suggested Selenium, so I tried it (and succeeded).

Selenium is a WebDriver-based browser automation tool (roughly, a browser emulator) with bindings built on top of it that provide API-like functionality (e.g. selenium-python). So then, let's get to the point.
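To give a feel for the approach before the full script, here is a minimal sketch, assuming chromedriver is already installed and reachable on your PATH (otherwise pass its path explicitly, as the full script below does):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')           # no visible browser window
browser = webdriver.Chrome(options=option)  # assumes chromedriver is on PATH
browser.get('https://es.wallapop.com/search?keywords=thermomix')
time.sleep(5)                               # crude wait so the JavaScript can render the results
html = browser.page_source                  # rendered HTML, unlike what a plain requests.get returns
browser.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.get_text())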

Bot structure / infinite loop

We'll build a bot that:

  • queries one or more URLs,
  • refines the search using custom filters (string pattern matching),
  • notifies us through our personal Slack if it finds any new item.

I used Slack because it is what I am used to, it has a free tier, and it does not require installing additional dependencies to use webhooks; of course you can use whatever you want (e.g. a Telegram bot, Discord, ...). If you have never done that before, Abhishek's video explaining how to set up Slack webhooks may come in handy.
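To check that the webhook works before wiring it into the bot, here is a quick hedged test; it assumes the webhook URL is stored in the webhook_slack environment variable (the same name the script uses later):

import json
import os
import requests

webhook = os.environ.get('webhook_slack')  # incoming-webhook URL from your Slack app
payload = {'text': 'hello from the wallapop bot'}
resp = requests.post(webhook, json.dumps(payload))
print(resp.status_code, resp.text)  # a working webhook typically answers 200 / "ok"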

You can do a first search on the site (refine it using price ranges, dates, etc.) and reuse the resulting URL. In this post we will use the following URL as an example query (split for readability):

'https://es.wallapop.com/search?'
    'time_filter=lastWeek&'
    'keywords=thermomix%20lidl&'
    'min_sale_price=100&'
    'max_sale_price=350&'
    'latitude=19.9033824&'
    'longitude=-75.1018545&'
    'filters_source=search_box'
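If you'd rather build that query programmatically, here is a small sketch using only the standard library; the parameter names are simply the ones visible in the example URL above:

from urllib.parse import urlencode

params = {
    'time_filter': 'lastWeek',
    'keywords': 'thermomix lidl',  # urlencode takes care of the %20
    'min_sale_price': 100,
    'max_sale_price': 350,
    'latitude': 19.9033824,
    'longitude': -75.1018545,
    'filters_source': 'search_box',
}
url = 'https://es.wallapop.com/search?' + urlencode(params)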

Then, using Selenium, we will:

  • (click on accept cookies),
  • scroll down the page so products get loaded and it reaches the "show more" button,
  • loop over the previous step until there are no more items,
  • retrieve info from the tags using Beautiful Soup,
  • store new (filtered) items in a dataframe and trigger a Slack message with the item and its link.

I did it as an endless script with 4 functions. Originally I planned to briefly comment on those functions and then paste the whole script; instead, I'll just share the whole script with inline comments.

Here's the anonymized code (gist). I am not sharing the entire repo because my code ended up intermingled with other stuff/bots, with even poorer coding practices.

import requests # post to webhook
import json # post to webhook
import os # retrieve environmental var webhook
import pandas as pd # build product table
import numpy as np # to generate randomness
import datetime # timestamps on prints
from bs4 import BeautifulSoup # soup the page contents
from selenium import webdriver # browser driver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options  # to suppress the browser
from selenium.webdriver import DesiredCapabilities
import time # sleeps


# fixed settings
SCROLL_PAUSE_TIME = 5
SLEEP_BETWEEN_CRAWL = 300

urls = [(
    'https://es.wallapop.com/search?'
    'time_filter=lastWeek&'
    'keywords=thermomix%20lidl&'
    'min_sale_price=100&'
    'max_sale_price=350&'
    'latitude=19.9033824&'
    'longitude=-75.1018545&'
    'filters_source=search_box'
)] # add more urls if required.

# filters below are applied to the product title and description
words_included = ['thermomix','robot', 'cocina']
words_excluded = ['cecotec', 'taurus']


def report(textstr):
    """report textstr to webhooked slack channel"""
    webhook = os.environ.get('webhook_slack')
    data = {
        'text': textstr
    }
    requests.post(webhook, json.dumps(data))

def get_wallapop_fields(tag):
    """scrutinizes product tags and returns a list of its fields"""
    data_item_id = int(tag.get('data-item-id'))
    price = float(tag.get('data-sell-price'))
    title = tag.get('data-title')
    children = list(tag.children)
    link = 'https://es.wallapop.com'+children[1].get('href')
    descr = tag.find('p',  class_="product-info-description").get_text()
    return [data_item_id, price, title, link, descr]


def pull_wallapop(url, words_included=[], words_excluded=[]):
    """core function of the script, 
    it will run in the endless loop"""
    print(f"{datetime.datetime.now().strftime('%H:%M:%S')}: pulling wallapop")
    # some config (else the headless driver won't work)
    capabilities = DesiredCapabilities.CHROME.copy()
    capabilities['acceptSslCerts'] = True 
    capabilities['acceptInsecureCerts'] = True
    option = webdriver.ChromeOptions()

    option.add_argument("--no-sandbox")
    option.add_argument("--window-size=1920x1080")
    option.add_argument("--disable-extensions")
    option.add_argument("--proxy-server='direct://'")
    option.add_argument("--proxy-bypass-list=*")
    option.add_argument("--start-maximized")
    option.add_argument('--headless')
    option.add_argument('--disable-gpu')
    option.add_argument('--disable-dev-shm-usage')
    option.add_argument('--ignore-certificate-errors')

    # browser = webdriver.Chrome(
        # "./notebooks/chromedriver_win32/chromedriver.exe", 
        # options=option, desired_capabilities=capabilities
        # ) # debugging in a windows box
    browser = webdriver.Chrome(
        "/usr/bin/chromedriver", # path to binary
        options=option, # options defined above
        desired_capabilities=capabilities # idem
        ) # raspbian
    browser.get(url) # loads page
    browser.implicitly_wait(10) # let it load :)
    try: # apparently not required with headless
        # click on accept cookies
        browser.find_element_by_xpath(
            '/html/body/div[1]/div/div/div/div/div/div[3]/button[2]'
            ).click()
    except Exception:
        print('accept-cookies button not found via xpath')

    time.sleep(SCROLL_PAUSE_TIME+np.random.random())
    # adding some randomness in our timings might help us not being detected as a bot

    # Get scroll height
    last_height = browser.execute_script("return document.body.scrollHeight")
    iter = 0
    print('starting scrolldown loop')
    while True: # loop til there are no more page to scrolldown/load more products
        cheight = browser.execute_script("return document.body.scrollHeight")
        browser.execute_script(f"window.scrollTo(0, {cheight-1000});")
        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME+np.random.random())
        try:
            browser.find_element_by_xpath(
                '//*[@id="more-products-btn"]/div/button'
                ).click() # clicks on "more products" button
            print((f"{datetime.datetime.now().strftime('%H:%M:%S')}: "
            "clicked in more products already"))
            time.sleep(SCROLL_PAUSE_TIME+np.random.random())
        except Exception:
            print((f"{datetime.datetime.now().strftime('%H:%M:%S')}: "
            f"'click more products' not found in iter {iter}"))
            browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        # Calculate new scroll height and compare with last scroll height
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            print(f'breaking at iter {iter} in no more scrolldown/height')
            break 
        last_height = new_height
        iter += 1

    # use beautifulsoup to digest resulting page
    soup = BeautifulSoup(browser.page_source,"html.parser")
    results = soup.find_all("div", class_="card js-masonry-item card-product product")
    browser.quit() # quit chrome/chromium

    ll = []
    # get list or lists [[product fields]]
    for i in range(len(results)):
        ll += [get_wallapop_fields(results[i])]

    # create a table with it
    df = pd.DataFrame(
        ll, # our list of lists 
        columns=['item_id', 'price', 'title', 'link', 'descr']
        ).set_index('item_id')
    df['keep'] = False

    if len(words_included): # apply filtering in title/description
        pattern_in = '|'.join(words_included)
        df.loc[df.title.str.contains(pattern_in, case=False), 'keep'] = True
        df.loc[df.descr.str.contains(pattern_in, case=False), 'keep'] = True
    if len(words_excluded):
        pattern_out = '|'.join(words_excluded)
        df.loc[df.title.str.contains(pattern_out, case=False), 'keep'] = False
        df.loc[df.descr.str.contains(pattern_out, case=False), 'keep'] = False
    print(f'returning a crawled + filtered df with shape {df[df.keep].shape}')
    return df[df.keep] # just return those with keep flag

def report_(df,indexlist):
    """call report in those new indexes. 
    index is a list!"""
    strout = '' # string to be crafted
    for index in indexlist:
        price = df.loc[index, 'price']
        title = df.loc[index,'title']
        link = df.loc[index,'link']
        strout += f'found a new item for *{price}*\n*{title}*\nhere {link}\n\n' # Slack bolds with single asterisks
    report(strout)


# run script
if __name__=='__main__':
    try:
        try:
            df_prev = pd.read_pickle('frame.pkl')
        except Exception:
            print(('if this is the first run, this is fine; '
            'otherwise something is not working'))
            df_prev = pd.DataFrame([])

        while True:
            continueflag = False
            dflist = []
            for url in urls:
                try:
                    dflist += [pull_wallapop(
                        url, 
                        words_included=words_included, 
                        words_excluded=words_excluded)]
                except Exception as e:
                    print((f'exception while pulling wallapop.\n{e}\n'
                    f'sleeping for {SLEEP_BETWEEN_CRAWL} and trying again'))
                    #raise e # uncomment to debug
                    continueflag  = True
                    break
            if continueflag:
                time.sleep(SLEEP_BETWEEN_CRAWL)
                continue

            curr_df = pd.concat(dflist)
            curr_df = curr_df[~curr_df.index.duplicated(keep='first')]

            if not len(df_prev): # first run, just save
                curr_df.to_pickle('frame.pkl')
                df_prev = curr_df # keep it in memory so later iterations can diff against it
            else:
                old_items = df_prev.index.values
                new_items = curr_df.index.values

                # check whether there's something new
                new_indexes = [x for x in new_items if x not in old_items]
                if len(new_indexes): # if so, report those items
                    print('found all these new item ids',new_indexes)
                    report_(curr_df,new_indexes)

                # update old and store. Clean manually from time to time*
                df_prev = pd.concat([df_prev,curr_df]) 
                # if ads get updated, this won't warn
                df_prev = df_prev[~df_prev.index.duplicated(keep='last')] 
                # saving table avoids spamming after restarting the script
                df_prev.to_pickle('frame.pkl') 

            print((f"{datetime.datetime.now().strftime('%H:%M:%S')}: "
            f"waiting {SLEEP_BETWEEN_CRAWL}s till next crawl"))
            time.sleep(SLEEP_BETWEEN_CRAWL)

    except Exception as e:
        topost = f'wallapop crawler died with exception:\n{e}'
        report(topost)

You can run it on a headless Raspberry Pi for roughly 1€/month in power consumption. If that's your plan, take a look at berryconda (it saves a lot of time when installing pandas) and at how to install chromedriver on armv7l.

Disclaimer: do not abuse this. I am not encouraging you to do anything illegal. Also, after some time the script stopped working on the Raspberry Pi (it still worked on a Windows box), so perhaps the chromium/webdriver interaction broke or it got blacklisted somehow. This strategy is widely used: when talking with sellers, most of the time I was not the first to contact them (with a latency of 5 minutes at most).
You can find similar resources here and here in Spanish.

Cheers and happy shopping :)

EDIT
I just found this repo, which I haven't tried, but it looks way more complete and ready to deploy.

EDIT2 Because this page is getting some views, here's an important addendum I did not know back when I wrote this post.
After trying to crawl sites reluctant to bots, I learned that in headless mode the browser reports a user agent containing something like HeadlessChrome (I cannot recall the exact string). In any case, we are announcing in plain sight that our browser instance is run by a bot. Perhaps this is not what you want.
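If that matters to you, one common workaround is to override the user agent via ChromeOptions. A hedged sketch follows; the user-agent string is just an example of a regular desktop Chrome one, not something wallapop requires:

from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')
# replace the headless default with a regular desktop Chrome user agent
option.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
)
browser = webdriver.Chrome(options=option)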

Check this repo or these posts (1, 2) for more info (or google it; there are plenty of results).

Cheers!
