flat assembler
Message board for the users of flat assembler.

WinXP: Convert HTML to text

pete



Joined: 20 Apr 2009
Posts: 110
pete
Hello!

Does anyone know a useful command-line tool for Windows XP that converts HTML-formatted text to a simple text file, so I don't need a web browser but can view the site in a text editor?
Is there a built-in function in Windows or the Office products?

Actually, I want to write a batch file that downloads a website and converts it to TXT on a regular basis.
Post 14 Oct 2009, 06:56
Raedwulf



Joined: 13 Jul 2005
Posts: 375
Location: United Kingdom
Raedwulf
You want wget. There's a Windows port of it over at GnuWin32.

_________________
Raedwulf
Post 14 Oct 2009, 10:22
pete



Joined: 20 Apr 2009
Posts: 110
pete
Thanks for the reply, Raedwulf. Well, I got an account on a Linux machine where I can use a text browser, redirect the output to a file, and send it to myself via mail.
Post 14 Oct 2009, 13:37
r22



Joined: 27 Dec 2004
Posts: 805
r22
You could probably write the tool you want pretty quickly in FASM.

Windows has an API for downloading a file from the web, and culling out HTML tags can be done (fairly reliably) by looping through the data a byte at a time, checking for "<" (push, stackvar++) and ">" (pop, stackvar--), and only displaying characters when stackvar is 0.
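
A minimal sketch of the counter idea described above, written in Python rather than FASM to keep it short. This is not r22's code; the function name `strip_tags` is made up for illustration, and the sketch ignores comments, scripts, and entities such as `&amp;`.

```python
def strip_tags(html: str) -> str:
    """Keep only the characters that sit outside <...> pairs."""
    depth = 0   # plays the role of "stackvar" in the post
    out = []
    for ch in html:
        if ch == "<":
            depth += 1              # entering a tag
        elif ch == ">" and depth > 0:
            depth -= 1              # leaving a tag
        elif depth == 0:
            out.append(ch)          # plain text, including a stray '>'
    return "".join(out)
```

Calling `strip_tags("<p>Hello <b>world</b></p>")` gives `"Hello world"`. Because no whitespace is added where block tags were, the text of adjacent paragraphs runs together.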
Post 14 Oct 2009, 14:57
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

You want wget. There's a Windows port of it over at GnuWin32.

Does it actually do that, or will it just save the file as it comes from the server?

r22, I think you'll still get readability problems that way; you need to actually parse the tags a little to add spaces and \n where necessary.
Post 14 Oct 2009, 16:27
Remy Vincent



Joined: 16 Sep 2005
Posts: 155
Location: France
Remy Vincent
LocoDelAssembly wrote:

... Does it actually do that, or will it just save the file as it comes from the server ...


- Downloading with the API fails immediately if the data is written to a disk file.

Also, there is no very fast RAMDRIVE that would make writing to a RAMDRIVE disk file work ...

- Downloading with the API works fine immediately if the data is written to a big RAM buffer. But sorry, I've only got it in Pascal, and as far as I remember, I'm still trying to improve the memory allocation myself to avoid using a TStrings Pascal object!
Post 14 Oct 2009, 16:40
sleepsleep



Joined: 05 Oct 2006
Posts: 8897
sleepsleep
Quote:

Thanks for the reply, Raedwulf. Well, I got an account on a Linux machine where I can use a text browser, redirect the output to a file, and send it to myself via mail.

Is it possible to browse the internet that way? Unless you browse an HTML page that is maybe 100 pages long, how many emails are you going to receive for each surfing session? Sorry, I don't understand the logic behind your idea.
Post 14 Oct 2009, 16:41
pete



Joined: 20 Apr 2009
Posts: 110
pete
@r22: yeah, at first I thought about writing the tool in FASM, but like LocoDelAssembly said, this could get complicated, because I want easily readable text, even when HTML tables are used on the website.

@LocoDelAssembly: no, wget does not do what I really want: converting HTML to plain text. It seems wget can be used to mirror a website or a single page of a website.

@sleepsleep: my goal was to get the contents of a website once a week by mail. The URL of this site is pretty long and ugly, I don't use bookmarks, and I don't want to see the ugly color formatting of the site anymore. I simply need the text! Getting the updates by mail is very convenient for me, since I use a very slim mail program that is way faster than a web browser.

My solution was to configure a cron job on a Linux system which runs a bash script once a week. The bash script simply opens the website in a text browser and redirects the output to the mail program, which sends out a mail. I receive a very well formatted, plain-text mail message.
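
A rough sketch of that pipeline, written in Python instead of bash, so it is not pete's actual script. It assumes `w3m` (the text browser pete names later in the thread) and a `mail` command are installed; the function names, the subject line, and the example URL and address are placeholders.

```python
import subprocess

def dump_command(url):
    # "w3m -dump URL" renders the page as plain text on stdout
    return ["w3m", "-dump", url]

def mail_command(subject, recipient):
    # "mail -s SUBJECT ADDRESS" reads the message body from stdin
    return ["mail", "-s", subject, recipient]

def mail_page(url, recipient, subject="Weekly page dump"):
    """Render the page to text and pipe it into the mail program."""
    page = subprocess.run(dump_command(url), check=True,
                          capture_output=True).stdout
    subprocess.run(mail_command(subject, recipient), input=page, check=True)

# Example call, commented out so that loading the file has no side effects:
# mail_page("http://example.com/", "me@example.com")
```

A weekly cron entry would then run the script, much as pete's cron job runs his bash script.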
Post 15 Oct 2009, 11:35
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
edfed
To make text from HTML:

parse all tags (<br><div><table> etc...) as a null tag (do nothing).
The only text you'll see is what is displayed in the browser, but without any formatting, pure ASCII only.

I made this last year for the [not dead]contest/challenge[/not dead].
Post 15 Oct 2009, 20:36
ManOfSteel



Joined: 02 Feb 2005
Posts: 1154
ManOfSteel
Nothing beats sed or tr.


edfed wrote:
parse all tags (<br><div><table> etc...) as a null tag (do nothing)

All, *but* <br>, which should be replaced by a line feed, or pages will be pretty ugly.
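
One way to sketch the "ignore every tag except <br>" rule is with the `html.parser` module from Python's standard library. This is a swapped-in technique, not the sed/tr approach and not edfed's contest code; the names `TextExtractor` and `html_to_text` are made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text; treat every tag as a no-op except <br>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":              # also reached for the <br/> form
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)      # text between tags

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()                   # flush any buffered trailing text
    return "".join(parser.parts)
```

With this, `html_to_text("one<br>two<div>three</div>")` yields "one", a line feed, then "twothree": the `<div>` is dropped without adding any whitespace, which is the readability problem LocoDelAssembly pointed out.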
Post 15 Oct 2009, 20:46
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
edfed
Yes, sorry, some tags are parsed as text events, like BR, of course.
Post 15 Oct 2009, 21:15
pete



Joined: 20 Apr 2009
Posts: 110
pete
edfed, using w3m is still better than parsing the HTML, because it can generate very nice plain-text tables out of HTML ones.

ManOfSteel, thanks for the hints about sed and tr. I man'ed both (is that correct *nix style?) and will try to remember them when I need them again. Most of my text-manipulation work can be done in vim, though.
Post 16 Oct 2009, 06:24
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid
I use Total Commander's built-in viewer for that.
Post 16 Oct 2009, 11:52
pelaillo
Missing in inaction


Joined: 19 Jun 2003
Posts: 878
Location: Colombia
pelaillo
Lynx does a good job of formatting web pages; even tables are well formed.
Post 16 Oct 2009, 13:23

Copyright © 1999-2020, Tomasz Grysztar.
