Reading Spanish articles behind paywalls

Small update: there are tools out there that outperform the post below, e.g. this or this, which exploit Google's cached views of those articles.
Such a nice reminder that a proper search can yield something better than simply coding an idea from scratch.

Bypassing dumb article paywalls

Nowadays there's a tendency to pay for other people's effort. Which is nice. However, it is annoying to get lured in by a tweet or link and then not be able to read the content (on a site you are not going to subscribe to just for one article a year). Some paywalls can (could?) be bypassed using incognito mode. Others can't. But there's a further step you can take to bypass them.
I am not OK with massive piracy but, you know, I guess this is what happens when things are poorly implemented. You may consider it edgy, but I think it is fine to apply some evolutionary pressure to our internet ecosystem, like what happened with SQL databases back in the 2010s.

How-to.

The other day, while checking my Twitter feed, I saw @fouroctets tweet about this (the idea is not mine). Unfortunately his account is locked now and I cannot share the tweet. In summary, you can (sometimes) read paywalled articles as plain text in the page source. The original tweet showed the "flaw" in a popular US newspaper. Then I thought the same mistake could be happening elsewhere, so I'll check a few Spanish newspapers using BeautifulSoup.

In [24]:
from bs4 import BeautifulSoup
import requests
import json

# fetch a paywalled opinion piece and parse the raw HTML
url = 'https://elpais.com/opinion/2021-09-01/vacunarse-es-una-obligacion-civica-y-solidaria.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
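
A caveat worth noting: some sites serve different (often trimmed) markup to the default requests user agent. If the fetched page lacks the article body, a browser-like User-Agent header may help (the value below is just a generic placeholder):

In [ ]:
# retry with a browser-like User-Agent if the default one gets trimmed markup
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder UA string
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')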
In [25]:
# page.content  # uncomment for spam

Here we can see that, right after "a pesar de su oposici\xc3\xb3n." (the raw UTF-8 bytes for "oposición"), there is a </p><section class="more_info | border_1 border_top pull_right">... Just skip ahead until target="_blank"> and voilà! The full article text is there (it ends at </p><section id="ctn_closed_article").
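
As a rough sketch of this marker-based approach (the markers are specific to this elpais.com article and may break whenever the site changes its markup), one can slice the raw HTML directly:

In [ ]:
# slice the decoded page between the landmarks described above
html = page.text
start_marker = 'target="_blank">'
end_marker = '<section id="ctn_closed_article"'
start = html.find(start_marker)
end = html.find(end_marker)
if start != -1 and end != -1:
    raw_fragment = html[start + len(start_marker):end]
    # raw_fragment still contains <p> tags; the cells below parse them properly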

In [27]:
# with this we can retrieve the article paragraphs (those without a class)
for item in soup.findAll('p', attrs={'class': ""})[1:]:
    # print(item.contents)  # uncomment to print :)
    # I won't share text that is not mine on this site
    pass
In [74]:
body_content = []
# paragraphs without a class are the article body; the slice drops header/footer ones
for item in soup.findAll('p', attrs={'class': ""})[1:-3]:
    c_content = item.contents
    if not c_content:  # skip empty paragraphs (.contents is a list, never None)
        continue
    # concatenate the paragraph's children (text nodes plus inline tags) into one string
    body_content.append("".join(str(x) for x in c_content))
# print('\n\n'.join(body_content))
In [28]:
# look for </script><script type="application/ld+json">
# there we have **everything**, but hyperlinks are not present in its bodyContent,
# so just retrieve the title, subtitle and images from it
scr = soup.findAll('script', type="application/ld+json")
d = json.loads(scr[1].string)  # the second ld+json block is the actual article object
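
To get a feel for what that JSON-LD object holds (keys follow the schema.org NewsArticle vocabulary, though sites may deviate from it):

In [ ]:
# peek at the schema.org fields used below; exact keys vary by site
print(d.get('@type'))           # e.g. 'NewsArticle'
print(d.get('headline'))        # article title
print(d.get('description'))     # subtitle
print(len(d.get('image', [])))  # list of image objects, each with a 'url' key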
In [ ]:
html_header = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
"""
with open("output.html", "w", encoding='utf-8') as file:
    file.write(
        (
            html_header+
            f"<h1>{d['headline']}</h1><h2>{d['description']}</h2>"+
            '<p>'+('</p><p>'.join(body_content))+'</p>'+ # join paragraphs with p tags
            '\n  '.join([
                f'<img src="{d["image"][i]["url"]}" width="280">' for i in range(len(d['image'])) 
                # append images at the bottom. TODO: since this might be crucial in some articles,
                # it would be nice to rework it at some point so that they get inserted where they should
            ])
        )
    )
# once all of them work, try using some css so it is not that ugly
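
As that last comment says, a few lines of CSS already help. A minimal cosmetic sketch, assuming the </head> tag in html_header above:

In [ ]:
# splice a tiny stylesheet in before </head> so output.html is less ugly
css = "<style>body{max-width:40em;margin:2em auto;font-family:serif;line-height:1.5}</style>"
html_header = html_header.replace("</head>", css + "</head>")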

And that's it. We have an ugly HTML file with the "paywalled" content.
You can find a small utility here to make your life easier (it supports several sites; tested on Ubuntu).
I'd like to thank AFont24 for his help, suggestions and pull requests while preparing the repo.
Cheers!
