Discussion:
Web application - CURL and Cookies?
birchy
2008-11-11 08:03:01 UTC
Permalink
I have a need to develop a web app that can do the following:
1) interact with a web server via http
2) login to the site and maintain a session via cookies
3) parse the html

There is a large selection of libraries that can do this in Java, but I'm
much more familiar with VB6/Gambas. As far as I understand, the solutions
are:
1) gb.net.curl
2) gb.net.curl
3) gb.xml

I have read that curl can handle cookies, but have no idea how to do it in
Gambas - are there any code samples? gb.xml is based on libxml2 and I have
read that libxml2 can also handle HTML... but again, I cannot find any
information on how to do this in Gambas via gb.xml, as the documentation is
very incomplete. Can anyone confirm that Gambas is capable of producing my
web app, or would I be better off using Java?
--
View this message in context: http://www.nabble.com/Web-application---CURL-and-Cookies--tp20435022p20435022.html
Sent from the gambas-user mailing list archive at Nabble.com.
GarulfoUnix
2008-11-11 10:30:48 UTC
Permalink
Post by birchy
1) interact with a web server via http
First, there is the HttpClient class in gb.net.curl, so
interacting with a web server is no problem.
Post by birchy
2) login to the site and maintain a session via cookies
Second, there is also a Session class which allows you
to manage sessions. Take a look at this for more information:
http://www.gambasdoc.org/help/comp/gb.web/session
Post by birchy
3) parse the html
To my knowledge, there is no parser available in Gambas.
Post by birchy
There is a large selection of libraries that can do this in Java, but i'm
much more familiar with VB6/Gambas. As far as i understand, the solutions
1) gb.net.curl
2) gb.net.curl
3) gb.xml
I have read that curl can handle cookies, but have no idea how to do it in
Gambas - are there any code samples? gb.xml is based on libxml2 and i have
read that libxml2 can also handle html...but again, i cannot find any
information on how to do this in Gambas via gb.xml as the documentation is
very incomplete. Can anyone confirm that Gambas is capable of producing my
web app, or would i be better off using Java?
Gambas should be able to do what you want. You should try to find
examples for each component that you need. Do you know the GambasForge
website? It is a site that collects many Gambas programs
with their source code. Here is the link:

http://www.gambasforge.net/cgi-bin/index.gambas

However, don't be surprised if you run into problems with it: the site
is still under development at the moment.

Regards,

GarulfoUnix.
Benoit Minisini
2008-11-11 14:55:45 UTC
Permalink
Post by birchy
1) interact with a web server via http
2) login to the site and maintain a session via cookies
3) parse the html
There is a large selection of libraries that can do this in Java, but i'm
much more familiar with VB6/Gambas. As far as i understand, the solutions
1) gb.net.curl
2) gb.net.curl
3) gb.xml
I have read that curl can handle cookies, but have no idea how to do it in
Gambas - are there any code samples?
No, but by reading the source code, I can say that:

HttpClient.CookieFile allows you to define a file from which cookies will be read.

And HttpClient.UpdateCookies should tell whether that file will be updated
during the HTTP request.
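
[Editor's note: based on the two property names above, a minimal sketch might look like the following. Only CookieFile and UpdateCookies are confirmed by this thread; URL, Async and Get() are the usual HttpClient pattern and should be checked against your Gambas version.]

' Hedged sketch: enable libcurl's cookie engine via gb.net.curl.
' CookieFile and UpdateCookies are the properties described above;
' URL, Async and Get() are assumed from the usual HttpClient pattern.
DIM hClient AS NEW HttpClient

hClient.URL = "http://www.example.com/login"
hClient.CookieFile = "/tmp/myapp.cookies"  ' cookies are read from this file
hClient.UpdateCookies = TRUE               ' and written back after the request
hClient.Async = FALSE                      ' block until the request finishes
hClient.Get()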
Post by birchy
gb.xml is based on libxml2 and i have
read that libxml2 can also handle html...but again, i cannot find any
information on how to do this in Gambas via gb.xml as the documentation is
very incomplete.
Some variants of HTML can be parsed with an XML parser, but most of the time
it does not work, as HTML is not strict enough.

But I just read on the libxml2 website that this library knows how to parse
HTML too. So perhaps a volunteer can be found to implement it in the gb.xml component?
Post by birchy
Can anyone confirm that Gambas is capable of producing my
web app, or would i be better off using Java?
No idea. It depends on what you need exactly.

Regards,
--
Benoit Minisini
Benoit Minisini
2008-11-11 15:01:49 UTC
Permalink
Post by Benoit Minisini
Some variants of HTML can be parsed with an XML parser, but most of the time
it does not work, as HTML is not strict enough.
But I just read on the libxml2 website that this library knows how to parse
HTML too. So perhaps a volunteer can be found to implement it in the gb.xml component?
I spoke too fast: gb.xml already knows how to parse HTML. The XmlDocument
class has a 'HTMLFromString' method for that.

Maybe a better name would be 'FromHTMLString'. Anyway, it is there!

Regards,
--
Benoit Minisini
birchy
2008-11-30 16:19:11 UTC
Permalink
Post by Benoit Minisini
I spoke too fast: gb.xml already knows how to parse HTML. The XmlDocument
class has a 'HTMLFromString' method for that.
Where is the documentation for this? How do I use it?
birchy
2008-12-02 12:43:56 UTC
Permalink
Bump... because I'm really trying to avoid learning Java instead...
Benoit Minisini
2008-12-02 13:50:47 UTC
Permalink
Post by birchy
bump....because i'm really trying to avoid learning java instead...
DIM hXmlDoc AS NEW XmlDocument
DIM sHTML AS String

sHTML = GetHTMLFromSomewhere()

hXmlDoc.HTMLFromString(sHTML)

...
--
Benoit Minisini
birchy
2008-12-02 14:10:15 UTC
Permalink
Post by Benoit Minisini
hXmlDoc.HTMLFromString(sHTML)
Thank you for your reply. I set up a quick test app, and using the HTML of
the http://www.google.co.uk homepage gives me lots of errors when I
call:

hXmlDoc.HTMLFromString(sHTML)

The errors are:
HTML parser error : Tag nobr invalid
age().src='/images/nav_logo3.png'" topmargin=3 marginheight=3><div
id=gbar><nobr

^
HTML parser error : htmlParseEntityRef: expecting ';'
r><nobr>Web "http://images.google.co.uk/imghp?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
r.qs(this) class=gb1 Images "http://maps.google.co.uk/maps?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
ar.qs(this) class=gb1 Maps "http://news.google.co.uk/nwshp?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
bar.qs(this) class=gb1 News "http://www.google.co.uk/prdhp?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
.qs(this) class=gb1 Shopping "http://mail.google.com/mail/?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
ll &#9660;</small> <div id=gbi> "http://video.google.co.uk/?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
qs(this) class=gb2 Video "http://groups.google.co.uk/grphp?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
qs(this) class=gb2 Groups "http://books.google.co.uk/bkshp?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
s(this) class=gb2 Books "http://scholar.google.co.uk/schhp?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
is) class=gb2 Scholar "http://finance.google.co.uk/finance?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
s(this) class=gb2 Finance "http://blogsearch.google.co.uk/?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
class=gb2 <div class=gbd></div></div> "http://uk.youtube.com/?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
) class=gb2 YouTube http://www.google.com/calendar/render?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
wc Calendar "http://picasaweb.google.co.uk/home?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
ck=gbar.qs(this) class=gb2 Photos "http://docs.google.com/?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
class=gb2 Documents http://www.google.co.uk/reader/view/?hl=en&tab

^
HTML parser error : htmlParseEntityRef: expecting ';'
/?hl=en&tab=wy Reader http://sites.google.com/?hl=en&tab

^
HTML parser error : Tag nobr invalid
<div align=right id=guser style= <nobr

^
HTML parser error : htmlParseEntityRef: expecting ';'
r style="font-size:84%;padding:0 0 4px" width=100%><nobr> /url?sa=p&pref

^
HTML parser error : htmlParseEntityRef: expecting ';'
<nobr> /url?sa=p&pref=ig&pval

^
HTML parser error : htmlParseEntityRef: expecting ';'
t-size:84%;padding:0 0 4px <nobr> /url?sa=p&pref=ig&pval=3&q

^
HTML parser error : htmlParseEntityRef: expecting ';'
l?sa=p&pref=ig&pval=3&q=http://www.google.co.uk/ig%3Fhl%3Den%26source%3Diglk&usg

^
HTML parser error : htmlParseEntityRef: expecting ';'
href=
--
View this message in context: http://www.nabble.com/Web-application---CURL-and-Cookies--tp20435022p20792906.html
Sent from the gambas-user mailing list archive at Nabble.com.
Benoit Minisini
2008-12-02 14:21:52 UTC
Permalink
Post by birchy
Post by Benoit Minisini
hXmlDoc.HTMLFromString(sHTML)
Thank you for your reply. I setup a quick test app and using the
http://www.google.co.uk google homepage HTML gives me lots of errors when
hXmlDoc.HTMLFromString(sHTML)
HTML parser error : Tag nobr invalid
age().src='/images/nav_logo3.png'" topmargin=3 marginheight=3><div
id=gbar><nobr
^
HTML parser error : htmlParseEntityRef: expecting ';'
r><nobr>Web "http://images.google.co.uk/imghp?hl=en&tab
...
And so? You found fewer errors than the W3C HTML validator, which returns
sixty errors for the same page. :-)

Maybe you should try some better HTML pages...
--
Benoit Minisini
birchy
2008-12-02 14:33:42 UTC
Permalink
Post by Benoit Minisini
Maybe you should try some better HTML pages...
Or maybe I should step backwards and manually parse my targets with InStr()
and Mid(). OR, learn Java and use something like TagSoup, which can
(allegedly) handle "wild" HTML...
Benoit Minisini
2008-12-02 14:42:03 UTC
Permalink
Post by birchy
Post by Benoit Minisini
Maybe you should try some better HTML pages...
Or maybe i should step backwards and manually parse my targets with InStr()
and Mid(). OR, learn Java and use something like TagSoup which can
(allegedly) handle "wild" html...
If I understand correctly, you want to parse HTML the way a browser does.
This is one of the most complex things to do: there is almost no valid HTML
page on the web (it is difficult to write one), and two browsers will render
the same page differently.

Maybe if you explain precisely what you need I could give you some advice?
--
Benoit Minisini
birchy
2008-12-30 20:31:57 UTC
Permalink
Post by Benoit Minisini
Maybe if you explain precisely what you need I could give you some advice?
Hi, sorry for the delay, I have been busy with other things. I now have a
number of questions, as I have been using a Python binding of CURL called
PyCurl...

1) Does Gambas implement the libcurl library or does it use curl via a
command prompt?

2) I have still not found a satisfactory HTML parsing library because many
of the values I want to extract are within JavaScript tags. Many people
suggest using BeautifulSoup, but I don't know if I can use this within
Gambas? I can parse the document manually using string functions but it can
get quite ugly. I have read that Perl has a Tokenizer library which may be
useful for my purposes. My basic requirement is to be able to extract
specific sections of HTML which I can then parse manually. For instance,
let's say we have a table in our HTML, and it is the first one:

<table width="100%" border=0>
<tr>
<td width="30%"> index.php images/image1.gif </td>
<td> images/image2.jpg </td>
</tr>
</table>

What I would like to do is have a function something like: myString =
GetTable(0), which would return a string containing all of the above text. I
think this is how a tokenizer works, though I'm not sure? Do you have any
suggestions other than using string functions?
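
[Editor's note: the GetTable() idea above can be approximated with nothing but Gambas's built-in string functions. The sketch below is naive — it ignores nested tables and assumes lowercase tags — but shows the shape of the approach.]

' Hedged sketch: return the Nth <table>...</table> block from sHtml.
' Ignores nested tables and assumes lowercase tags.
PUBLIC FUNCTION GetTable(sHtml AS String, iIndex AS Integer) AS String

  DIM iStart AS Integer
  DIM iEnd AS Integer
  DIM i AS Integer

  iStart = 1
  FOR i = 0 TO iIndex
    iStart = InStr(sHtml, "<table", iStart)
    IF iStart = 0 THEN RETURN ""          ' not enough tables in the document
    IF i < iIndex THEN iStart = iStart + 1  ' skip past this match
  NEXT

  iEnd = InStr(sHtml, "</table>", iStart)
  IF iEnd = 0 THEN RETURN ""              ' unterminated table

  RETURN Mid(sHtml, iStart, iEnd - iStart + Len("</table>"))

END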
Rob
2008-12-30 20:55:34 UTC
Permalink
Post by birchy
1) Does Gambas implement the libcurl library or does it use curl via a
command prompt?
Gambas (specifically gb.net.curl) uses libcurl.
Post by birchy
2) I have still not found a satisfactory HTML parsing library because
many of the values i want to extract are within JavaScript tags. Many
people suggest using BeautifulSoup, but i don't know if i can use this
within Gambas?? I can parse the document manually using string functions
If the document you want to parse is XHTML, you could try using gb.xml.
Unfortunately, that seems to be undocumented right now.

If it's a legacy HTML document, libxml does have an HTML parser available,
but it seems that gb.xml doesn't expose that functionality at present so
you'd need to use external function declarations to use it.

Rob
Ron_1st
2008-12-30 21:22:54 UTC
Permalink
Post by Rob
Post by birchy
1) Does Gambas implement the libcurl library or does it use curl via a
command prompt?
Gambas (specifically gb.net.curl) uses libcurl.
Post by birchy
2) I have still not found a satisfactory HTML parsing library because
many of the values i want to extract are within JavaScript tags. Many
people suggest using BeautifulSoup, but i don't know if i can use this
within Gambas?? I can parse the document manually using string functions
If the document you want to parse is XHTML, you could try using gb.xml.
Unfortunately, that seems to be undocumented right now.
If it's a legacy HTML document, libxml does have an HTML parser available,
but it seems that gb.xml doesn't expose that functionality at present so
you'd need to use external function declarations to use it.
Rob
I ran into problems with XML parsing because HTML pages are not
always written as valid XML, even when they pass the W3C check.
Second, the parser needs the XML doctype on the first line, and that
is not present when the page is fetched from the web.
That could be solved by adding it yourself, of course, but then it is
also required that the tags are lowercase.
A global conversion to lowercase can't be done without corrupting the
content, and lowercasing only the tags and attributes already requires
parsing them correctly.
It's a bit of a chicken-and-egg problem.


Good 2009 to all

Best regards,

Ron_1st
birchy
2008-12-30 21:47:17 UTC
Permalink
Post by Rob
Gambas (specifically gb.net.curl) uses libcurl.
Excellent. Regarding cookie handling, gb.net.curl has HttpClient.CookieFile,
which allows you to define a file for cookies. According to the libcurl
tutorial at http://curl.netmirror.org/libcurl/c/libcurl-tutorial.html:
The CURLOPT_COOKIEFILE option also automatically enables the cookie parser
in libcurl. Until the cookie parser is enabled, libcurl will not parse or
understand incoming cookies and they will just be ignored. However, when
the parser is enabled the cookies will be understood and the cookies will
be kept in memory and used properly in subsequent requests when the same
handle is used. Many times this is enough, and you may not have to save
the cookies to disk at all. Note that the file you specify to
CURLOPT_COOKIEFILE doesn't have to exist to enable the parser, so a common
way to just enable the parser and not read any file might be to use a file
name you know doesn't exist.
I assume that gb.net.curl uses this same behaviour? If so, simply setting
ANY value for HttpClient.CookieFile should enable automatic cookie
handling? If this is the case, perhaps an attribute such as
HttpClient.AutomateCookies (boolean) could be added in order to remove the
need for manual cookie handling? It would do the same thing as setting
HttpClient.CookieFile but just make it much easier to understand. Most of
the time, the end user is not actually interested in the cookie data
itself... all he or she wants is to maintain a login session by sending the
correct cookies.

Perhaps a similar automation could be added for handling gzip as well?
Again, it's an unnecessary manual job that could be turned into a one-liner:
HttpClient.AutomateGzip (boolean). All it requires is adding the
'Accept-Encoding: gzip' header to every request and then decompressing the
response if the 'Content-Encoding: gzip' header is present.
Post by Rob
you could try using gb.xml. Unfortunately, that seems to be undocumented
right now.
As discussed previously in this thread
(http://www.nabble.com/Re%3A-Web-application---CURL-and-Cookies--p20793152.html),
gb.xml does expose the HTML parsing element of libxml; however, the
results are extremely poor.

I guess that manual parsing with string functions is the only accurate way
to achieve this.
Rob
2008-12-30 22:20:52 UTC
Permalink
Post by birchy
I guess that manual parsing with string functions is the only accurate
way to achieve this.
I'd probably try to write a class that wrapped gb.pcre to do that, but I
would say that, since I wrote gb.pcre ;)

Rob
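
[Editor's note: such a gb.pcre-based helper might look roughly like the sketch below. The Regexp constructor and the Offset/SubMatches members used here are assumptions about the gb.pcre API, not confirmed signatures; check them against its documentation.]

' Hedged sketch: extract the first href value from sHtml with gb.pcre.
' The Regexp usage below is an assumption about the API, not a
' confirmed signature - check the gb.pcre documentation.
DIM hRegexp AS NEW Regexp(sHtml, "href=\"([^\"]*)\"")

IF hRegexp.Offset >= 0 THEN          ' a match was found
  PRINT hRegexp.SubMatches[1].Text   ' the captured URL
ENDIF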
birchy
2008-12-30 23:06:11 UTC
Permalink
Post by Rob
I'd probably try to write a class that wrapped gb.pcre to do that, but I
would say that, since I wrote gb.pcre ;)
Well, I'd have to learn regex first, but I am already bald, so I have no hair
to pull out. :o)

What do you think about my other suggestions regarding automating cookie and
gzip handling? In this day and age, I'm surprised that we still have to
handle them manually.