E-book saga continues: HTML scraping

Thursday, May 13, 2010 06:45 +0200

As you might imagine, I'm "somewhat" busy working on my IPv6 summit presentation. I wrote this rant a while ago but somehow never managed to publish it.

In a comment to my piracy rant, Steve asked how I feel about Safari. In principle, I like anything that brings my books to the readers in a more usable form, and Safari is a perfect idea: virtual bookshelf, searchable books, and temporary access to books you don’t need permanently. The implementation, however, belongs to the previous century; it’s too easy to write a bot that scrapes the text from HTML and eventually collects the whole book.

I was acutely reminded of this problem when an anonymous reader posted a link in a comment to one of my CCNA-level posts. It was very obvious the source of the stolen material was an HTML-based e-book split into pages based on section headings (and Safari could be one of the convenient sources). Doing a few quick Google searches, I was able to find numerous other Cisco Press books available in the same “convenient” format (not to mention that half of the first page hits for many Cisco Press book titles point to rapidshare.net and its siblings). All I can say is: it’s amazing (and I’m so glad) you’re still buying dead-tree-based books.

6 comments:

Anon 13 May 2010 08:51

Hello Ivan, digital books are OK, but they cannot be highlighted and annotated as easily as a paper book. This is something that the iPad and its competitors need to provide; otherwise they will be no good for learning and study materials. I continually annotate my books and highlight them. I also add little drawings and notes in all directions, arrows, etc. I have not seen a digital system provide such features using a completely natural interface like a pen. (Apple should do something finger-based, but at present I doubt it will.)

Stuart 13 May 2010 10:10

I don't like writing in books anyway, as if and when I use them for reference at a later date, all the notes and highlighting will get in the way of what I'm trying to learn. Plus, I don't find this a good way of interpreting information to yourself, like making your own separate notes would.

Therefore to me digital books are just as good as physical (analogue? :P) books

polecat 13 May 2010 14:47

Ivan, printed books are inferior in every possible aspect. As soon as there are reasonable reading devices for tech publications on the market (it seems they are close), all printed materials should go away. So why would anyone cling to them? The only reason I can imagine is that it is easy to charge for them. But actually it is the same story as teachers talking about inattentive students: most of the time the problem is their own incompetence as teachers. Also please note that the current system with printed books implies that readers are charged before they are aware of the quality of the contents, which is a little bit unfair, to say the least.

Anonymous 13 May 2010 15:22

Publishing in PDF format and then locking it down with the FileOpen plugin is pretty formidable. Give it an eval.

Jason 13 May 2010 23:21

I've grown to love e-books (purchased legitimately) - Being able to quickly search by keywords, scroll, highlight, print off specific pages. To each their own though.

That being said, when dad (I) get home at the end of the day, eBooks just aren't the same reading to my kids cuddling around me.

Justin 14 May 2010 07:44

If you have ever tried to read an O'Reilly book or other PDF on an e-reader (other than the iPad), it sucks. E-readers are still only good for text, not diagrams and tables.