Bypassing dumb article paywalls¶
Nowadays there's a tendency to pay for other people's effort, which is nice. However, it is annoying to get lured in by a tweet or a link and then not be able to read the content (on a site you are not going to subscribe to just to read one article a year). Some paywalls can (could?) be bypassed using incognito mode; others can't. But there's a further step you can take to bypass the paywall.
I am not OK with massive piracy but, you know, I guess this is what happens when things are poorly implemented. You may consider it edgy, but I think it is fine to apply some evolutionary pressure to our internet ecosystem, like what happened with SQL databases back in the 2010s.
How-to¶
The other day, while checking my Twitter feed, I saw @fouroctets tweet about this (the idea is not mine). Unfortunately his account is now protected, so I cannot share the tweet. In summary: you can (sometimes) read paywalled articles in plaintext right in the page source. The original tweet showed this "flaw" in a popular US newspaper, and I figured the very same mistake might be happening elsewhere. So let's check a few Spanish newspapers using BeautifulSoup.
from bs4 import BeautifulSoup
import requests
import json
url = 'https://elpais.com/opinion/2021-09-01/vacunarse-es-una-obligacion-civica-y-solidaria.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# page.content  # uncomment to dump the raw HTML (spammy)
Here we can see that, right after "a pesar de su oposici\xc3\xb3n." (those are the raw UTF-8 bytes of "ó", as page.content shows them), there is a </p><section class="more_info | border_1 border_top pull_right">... block. Just skip past it until target="_blank"> and voilà! The full article text is there (it ends at </p><section id="ctn_closed_article">).
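If you want to peek at that span without scrolling through the whole dump, here is a minimal sketch that slices the raw HTML between the two markers quoted above (assuming the page layout hasn't changed; raw, hint, start and end are just illustrative names):

raw = page.content.decode('utf-8')
hint = raw.find('class="more_info')                    # the section quoted above
start = raw.find('target="_blank">', max(hint, 0))     # skip until this marker
end = raw.find('<section id="ctn_closed_article"')     # the article ends here
if start != -1 and end != -1:
    print(raw[start:end][:500])  # peek at the first 500 chars of the hidden text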
# with this we can retrieve the paragraphs
for item in soup.find_all('p', attrs={'class': ''})[1:]:
    # print(item.contents)  # uncomment to print :)
    # I won't share text that is not mine on this site
    pass
body_content = []
for item in soup.find_all('p', attrs={'class': ''})[1:-3]:
    # concatenate the paragraph contents
    c_content = item.contents
    if not c_content:  # .contents is a (possibly empty) list, never None
        continue
    if len(c_content) > 1:
        body_content += [
            "".join([str(x) for x in c_content])
        ]
    else:
        body_content += [str(c_content[0])]
# print('\n\n'.join(body_content))
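# note: BeautifulSoup's Tag.decode_contents() returns a tag's inner HTML as a
# string, so I believe the manual join above could be collapsed into one line:
# body_content = [p.decode_contents() for p in soup.find_all('p', attrs={'class': ''})[1:-3]]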
# look for </script><script type="application/ld+json">
# there we have **everything**, but hyperlinks are not present in its body text,
# so just retrieve the title and subtitle (and the images) from it
scr = soup.find_all('script', type="application/ld+json")
d = json.loads(
    scr[1].string  # the second ld+json block holds the article metadata
)
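If you are curious about what that JSON-LD carries, you can inspect it directly; the key names below are the usual schema.org NewsArticle fields (only headline, description and image are used here):

print(sorted(d.keys()))  # e.g. headline, description, image, datePublished, ...
print(d['headline'])
print(d['description'])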
html_header = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
"""
with open("output.html", "w", encoding='utf-8') as file:
file.write(
(
html_header+
f"<h1>{d['headline']}</h1><h2>{d['description']}</h2>"+
'<p>'+('</p><p>'.join(body_content))+'</p>'+ # join paragraphs with p tags
'\n '.join([
f'<img src="{d["image"][i]["url"]}" width="280">' for i in range(len(d['image']))
# append images at the bottom. TODO: since this might be crucial in some articles,
# it would be nice to rework it at some point so that they get inserted where they should
])
)
)
# once all of them work, try using some css so it is not that ugly
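To check the result you can pop the file straight into a browser; webbrowser is in the standard library:

import os
import webbrowser

# open the generated file with the default browser
webbrowser.open('file://' + os.path.abspath('output.html'))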
And that's it. We have an ugly HTML file with the "paywalled" content.
You can find a small utility here to make your life easier (it supports several sites and has been tested on Ubuntu).
I'd like to thank AFont24 for his help, suggestions and pull requests while preparing the repo.
Cheers!