PixieWare Software: PixieRobot Manual
Website Design and Data Extract Solutions

By Dave Johnstone



PixieRobot Web Scraping Programmers Manual


Manual Contents

1. Introduction
2. Programming a conversation: in brief
3. Scripting
3.1 ExecuteWWW
3.2 XForm
3.3.1 XCommon
3.3.2 XRun
3.4 OutputToFile
3.5 FileGetText
3.6 LoggedMessage
3.7 LogMessage
3.8 Monitor
3.9 Silent
3.10 The Document Object
3.11 Extracting Pictures
3.11.1 GetImageList
3.11.2 GetImageByNumber
3.11.3 GetImageByName
3.11.4 GetFilenameFromAddress


1. Introduction

PixieRobot's WWW functions are used to extract unstructured data from the WWW, and reformat it into structured data formats such as spreadsheets and databases.

PixieRobot now provides a script interface for driving the Internet Explorer browser object. Typical scripted tasks are:

  • Navigation to web pages: eg
    Call ExecuteWWW("http://www.pixieware.com")
      
  • Filling in of forms and submission of them: eg
    XForm("FirstName").Value = "Queen"
    XForm("LastName").Value = "Cleopatra"
    XForm("Id").Value = "GN93022"
    XForm("ItemForSale").Value = "Snake Charming Whistle"
    XForm("ReservePrice").Value = "30.00"
    Call ExecuteWWW("SUBMIT")
      
  • Analysis of received data in a webpage: eg
    s = WWW.Document.Body.InnerHTML
    iPos = Instr(1, s, "<input name=""HighestOffer""", 1)
    If iPos > 0 Then
      iPos = Instr(iPos, s, "value=""", 1)
      If iPos > 0 Then 
        sValue = Mid(s, iPos+7, Instr(iPos+7, s, """") - iPos - 7)
      End If
    End If
  • Output of extracted data to a local file: eg
    '----- Set up static fields for output
    oSite = "ABC"
    oDate = Date
    oTime = Time

    ' Find home team from web page
    ' (u holds the page HTML; mPos holds the current search position, which must be at least 1)
    mPos = Instr(mPos, u, "BLK", 1)
    mPos = mPos + 4
    nPos = Instr(mPos, u, "<", 1)
    hTeam = mid(u, mPos, nPos - mPos)

    ' Now build a string of output fields and create new record for XLS file.
    o = oSite & chr(9) & oDate & chr(9) & oTime & chr(9) & hTeam & chr(9)
    oFile = oFile & o & vbCrLf

    ' The XLS file extension will cause MS Excel to read the created string (oFile) and
    ' format the spreadsheet into columns and rows using the chr(9) and vbCrLf controls.
    Call OutputToFile(oFile, "ABC.xls")
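
The analysis step above can be wrapped in a small reusable function. The following is only a sketch: the name ExtractInputValue is hypothetical and not part of PixieRobot, and it assumes the page source writes the name attribute directly after "<input" and puts the value in double quotes, as in the example above. If the input has no value attribute, it will pick up the next value="..." found later in the page.

```vbscript
' Hypothetical helper (not a PixieRobot function).
' Returns the value="..." text of the <input> whose name is sName,
' or "" when the field or its value attribute is not found.
Function ExtractInputValue(sHtml, sName)
  Dim iPos, iStart, iEnd
  ExtractInputValue = ""
  ' Locate the input tag (final argument 1 = case-insensitive compare)
  iPos = InStr(1, sHtml, "<input name=""" & sName & """", 1)
  If iPos = 0 Then Exit Function
  ' Locate the value attribute that follows it
  iPos = InStr(iPos, sHtml, "value=""", 1)
  If iPos = 0 Then Exit Function
  iStart = iPos + Len("value=""")     ' first character of the value
  iEnd = InStr(iStart, sHtml, """")   ' position of the closing quote
  If iEnd > 0 Then ExtractInputValue = Mid(sHtml, iStart, iEnd - iStart)
End Function
```

It would then be called from a script as, eg:
sValue = ExtractInputValue(WWW.Document.Body.InnerHTML, "HighestOffer")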


2. Programming a conversation: in brief

Scripts are plain text files written in the VBSCRIPT language.  By default the browsing script is called "onInterval.txt", but you can change its name in the configuration form (Menu:Config ---> WWW), or by editing PixieRobot.ini, eg
ScriptOnInterval=Cleopatra.txt

Your master routine must be named Main, ie it is declared as:
Sub Main

The script statement: Monitor=True causes the conversation to be made visible. 
Monitor=True should always be the first line in a new Sub Main.
Change to Monitor=False when you "go production".

You then script one step at a time, test-running with Menu:WWW --> Immediate  (Hotkey = F7).
At each step the monitor, which is built with objects from Internet Explorer, displays the result. 
Click the "WriteToFile" button to save the current page for further inspection.
For a faster way to work, you can right-click on the displayed page area, then select "View Source" from the popup menu.

The most commonly used of the library functions we have added to VBSCRIPT are:
ExecuteWWW, a function used for navigation and form submission.
XForm, an object consisting of the collection of elements in the first form on the page. It is mostly used for entering data into pages, but can also be used, via the document object model, for scraping desired data out of a target page.
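
Put together, a first script built one step at a time might look like the sketch below. The URL and the field names are hypothetical, and it assumes that the string returned by ExecuteWWW is the text of the page received.

```vbscript
Sub Main
  Monitor = True                  ' show the conversation while developing

  ' Step 1: navigate to the page (hypothetical URL)
  Call ExecuteWWW("http://www.example.com/login")

  ' Step 2: fill in the first form on the page (hypothetical field names)
  XForm("UserName").Value = "cleopatra"
  XForm("Password").Value = "biscuit"

  ' Step 3: submit the form and keep whatever ExecuteWWW returns
  sPage = ExecuteWWW("SUBMIT")
  Call LogMessage("Received " & Len(sPage) & " characters")
End Sub
```

Add one step, press F7, check the result on the monitor, then add the next step.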


3. Scripting

Scripts are plain text files written in the VBSCRIPT language.

Features and points to note about the PixieRobot implementation:

  • You have one master WWW script which is fired on a timer interval.
    Write it as a text file onInterval.txt and write your main routine as a subroutine called Sub Main
    eg
    Sub Main
      Monitor = True
      Call ExecuteWWW("http://www.pixieware.com")
    End Sub
      
  • If you would like to call the script something other than "onInterval.txt", eg "myScript.vbs", then use the Menu:Config interface to set the name of the script to run.

  • The PixieRobot methods and properties are supplied as an implicit object, ie functions such as ExecuteWWW and FileGetText become part of the VBSCRIPT language for this environment.
    If you wish to write in object syntax, that implicit object is called "IW", so both of the following mean the same:
    Call IW.ExecuteWWW("http://www.pixieware.com")
    Call ExecuteWWW("http://www.pixieware.com")
     

  • Some VARs may wish to script using the Scripting.FileSystemObject. An implicit one of these, named FS, is already provided, and you can use all of its methods and properties directly, eg
    Set drv = GetDrive(GetDriveName(drvPath))
    NOTE that you still need to Set and use sub-objects, eg drv above.
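
As a further sketch of using the implicit FS object, the lines below make sure a download folder exists before a script writes into it. FolderExists, CreateFolder and GetFolder are standard FileSystemObject methods; the folder name is hypothetical, and its parent folder (C:\Temp) is assumed to exist already.

```vbscript
sFolder = "C:\Temp\photos"        ' hypothetical folder
If Not FolderExists(sFolder) Then
  Call CreateFolder(sFolder)      ' fails if C:\Temp itself is missing
End If
' Sub-objects returned by FS methods must still be assigned with Set
Set fld = GetFolder(sFolder)
Call LogMessage(fld.Path & " holds " & fld.Files.Count & " files")
```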


General VBSCRIPT note:

  • The For ... Next loop requires that the closing keyword "Next" stands on its own, with no variable after it.  That can be tricky to get used to, especially given the less than helpful error message:
    "1025, Expected end of statement"  eg:
    '
    'distribute multiple attachments
    For i = 1 To AttachNumber
      Call AttachMove(i, sDestination(i))
    Next
    'note this must NOT be "Next i"


IW (Internet WWW Object): Details of the scripting properties and methods, grouped by purpose. All keywords and file/path strings are case-insensitive.

3.1 ExecuteWWW

Call ExecuteWWW( URL, [PostData], [EndOfTx] )
stringData = ExecuteWWW( URL, [PostData], [EndOfTx] )
stringData = ExecuteWWW( "SUBMIT", [PostData] )

Navigate to a page, or submit a form by using "SUBMIT" as the URL argument.

Parameter   Description
Url         String: Address to navigate to, OR the command "SUBMIT".
PostData    String: Optional data to submit in low-level format, like "name=Michael&age=37". Or, the relative number on the page of the form being submitted. The form count is zero-based, so the first form on the page is number 0.
EndOfTx     String: Optional text to recognise as end of transmission.  This is a bandwidth-saving device: when this text arrives in the page download, PixieRobot knows that it has received all that you need and can stop further downloading.  Use distinctive text from near the end of the page.
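
The three call forms might be used as in the sketch below. The URL, the post data and the end-of-transmission text are hypothetical; substitute values from your own target page.

```vbscript
' Navigate, posting low-level form data
s = ExecuteWWW("http://www.example.com/search", "name=Michael&age=37")

' As above, but stop downloading once the marker text has arrived
s = ExecuteWWW("http://www.example.com/search", "name=Michael&age=37", "End of results")

' Submit the second form on the current page (form numbers start at 0)
s = ExecuteWWW("SUBMIT", 1)
```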

3.2 XForm

XForm( FieldNameOrIndex [,FormIndex] ) [(ArrayIndex)].Value = Value
Call XForm( FieldNameOrIndex [,FormIndex] ).Click

Manipulate a form field directly through an abbreviation of the HTML Document Object Model eg:
XForm("Password").Value = "biscuit"

Note, this is equivalent to:
WWW.Document.Forms(0).Elements("Password").Value = "biscuit"

but it is much easier to type.  You may need the longer WWW "low level" version if the page uses unusual on-form-submission methods eg "remote scripting" or "field-by-field".

Parameter          Description
FieldNameOrIndex   String or Integer: Field Name or Index to identify the field to manipulate
FormIndex          Optional Integer: zero-based number of the form on the page; the default is 0, the first form
ArrayIndex         Optional Integer, only needed when multiple fields have the same name
Value              Value for the field, usually string, but can be numeric, or for checkboxes, boolean (True/False)


More XForm Examples:

XForm("chkAutoTrans", 1).Checked = True

There are 2 forms on this page, and the checkbox "chkAutoTrans" that
we want to tick is in the second form. The first form, the default, has a
FormIndex of 0, so the second form needs a FormIndex of 1.

xPrice2 = XForm("Price2").Value
Read the value of field "Price2" into the variable xPrice2.

XForm("optType")(2).Checked = True
There is a group of option radio buttons. They all have the name "optType".
You want to check the 3rd one, which requires an ArrayIndex of 2 because counting starts from 0.

3.3.1 XCommon

A common area used to pass data between a calling VBScript program and a called VBScript.

Examples:
XCommon("DelCountry") = oDelCountry
XCommon("ProviderRef") = "32319970815"

3.3.2 XRun

A VBSCRIPT program may be broken up into separate scripts and PixieRobot provides a method for one script to call another. This is the PixieRobot proprietary function "XRun".

Example:
sRet = XRun("TESTMOCK.vbs")
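
The two features are typically used together, as in the sketch below. It makes several assumptions that are not confirmed above: that the called script also starts at a Sub Main, that XCommon values are read back with the same syntax used to set them, and that sRet holds whatever XRun chooses to return. The two scripts are shown in one listing but live in separate files.

```vbscript
' ----- Calling script (eg onInterval.txt) -----
Sub Main
  XCommon("ProviderRef") = "32319970815"     ' hand data to the called script
  sRet = XRun("TESTMOCK.vbs")
  Call LogMessage("TESTMOCK.vbs returned: " & sRet)
End Sub

' ----- Called script (TESTMOCK.vbs, a separate file) -----
Sub Main
  sRef = XCommon("ProviderRef")              ' pick up the data passed in
  Call LogMessage("Called with ProviderRef " & sRef)
End Sub
```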

3.4 OutputToFile

Call OutputToFile( Data, Destination [, True] )
Write a string of data as a disk file. Very useful for logging the results of scripting when developing scripts.

Parameter     Description
Data          String: the data to write to disk
Destination   String: full path of new file including file name
True          Optional. Appends at the end of an existing file; if omitted, the contents of the file are overwritten.
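
A short sketch of the two modes follows. The path is hypothetical, and the vbCrLf is added by hand on the assumption that OutputToFile writes the string exactly as given.

```vbscript
sLog = "C:\Temp\scrape.log"       ' hypothetical path
' Without the third argument the file is overwritten
Call OutputToFile("Run started " & Now & vbCrLf, sLog)
' With True the text is appended to the existing file
Call OutputToFile("Page 1 done " & Now & vbCrLf, sLog, True)
```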

3.5 FileGetText

stringData = FileGetText(File_with_full_path_name)
Read contents of a file into stringData.  Useful for reading the contents of an attachment to feed into other systems like databases.  eg

stringData = FileGetText(PathAttachIn & AttachFile(Index))

Parameter                  Description
File_with_full_path_name   String: full path of the file to read, including file name
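
For example, a tab-separated file of the kind written in the Introduction could be read back and split into rows and fields as sketched below. The path is hypothetical.

```vbscript
sText = FileGetText("C:\Temp\ABC.xls")     ' hypothetical path
aRows = Split(sText, vbCrLf)               ' one array entry per record
For i = 0 To UBound(aRows)
  If Len(aRows(i)) > 0 Then                ' skip blank lines
    aFields = Split(aRows(i), Chr(9))      ' fields are tab-separated
    Call LogMessage("Row " & i & " has " & (UBound(aFields) + 1) & " fields")
  End If
Next
```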

3.6 LoggedMessage

stringData = LoggedMessage
String property; returns the previous message sent to the logfile and monitor by LogMessage.

3.7 LogMessage

Call LogMessage(Message)
Displays string Message on the PixieRobot Monitor, as well as writing it to its logbook.

3.8 Monitor

The script statement Monitor=True causes the conversation to be made visible. Monitor=True should always be the first line in a new Sub Main, to allow for debugging. Change to Monitor=False when you "go production".

Examples:

Monitor = False
Monitor = True

3.9 Silent

The script statement Silent=True causes "pop-up boxes" to be ignored. Example:
Silent = True


3.10 The Document Object

URL Extraction Using The Document Object

The Document Object represents the HTML document in a given browser window. Use the document object to examine, modify, or add content to an HTML document and to process events within that document.

The URL property sets or retrieves the URL for the current document, e.g.
wFi = Mid(WWW.Document.URL, 1, 25)
places the first 25 characters of the URL, here http://www.abcsports.com/ including the trailing slash, in the variable wFi.

A specific use for the URL property with PixieRobot arises when a web farm is encountered. Web farms are set up to handle large visitor numbers by spreading requests across multiple web servers. Which web server you get is selected at random when a session is first established. If you try entering a fixed server URL (e.g. www4.abcsports.com), it will be ignored and the address of the server actually allocated is returned. From then on, the URLs you request need to match the server allocated to your session.

So for example the following PixieRobot script will extract the server variable for subsequent use.

Sub Main
  ' ABC Sports Web Farm Test
  ' PixieRobot command to make the conversation visible while testing
  Monitor = True
  ' PixieRobot command to ignore pop-up windows while running
  Silent = True
  ' PixieRobot command to navigate to a web page
  s = ExecuteWWW("http://www.abcsports.com")
  ' Obtain the URL and extract the web farm identifier into wFi.
  ' The URL returned is: http://www4.abcsports.com and the 11th character from the left is the web farm identifier
  wFi = Mid(WWW.Document.URL, 11, 1)
  ' Navigate to new web page after combining all elements of the URL
  s = ExecuteWWW("http://www" & wFi & ".abcsports.com/")
End Sub
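
Picking out a single character at a fixed position stops working if the farm identifier ever has two digits (eg www12). A more tolerant approach is to take the whole host name out of the URL. The function below is only a sketch: HostFromUrl is a hypothetical helper, not a PixieRobot function, and it ignores port numbers and user names embedded in the URL.

```vbscript
' Hypothetical helper (not a PixieRobot function).
' Returns the host part of a URL, eg "www4.abcsports.com".
Function HostFromUrl(sUrl)
  Dim iStart, iEnd
  iStart = InStr(1, sUrl, "://")
  If iStart = 0 Then
    iStart = 1                    ' no scheme present
  Else
    iStart = iStart + 3           ' skip past "://"
  End If
  iEnd = InStr(iStart, sUrl, "/") ' first slash after the host
  If iEnd = 0 Then iEnd = Len(sUrl) + 1
  HostFromUrl = Mid(sUrl, iStart, iEnd - iStart)
End Function
```

Inside Sub Main it would replace the Mid call, eg:
sHost = HostFromUrl(WWW.Document.URL)
s = ExecuteWWW("http://" & sHost & "/")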


Other ways of using PixieRobot to navigate web pages include:

  • Setting a Form Element value:
    WWW.Document.Forms(0).Elements("zipcode").value = "10010"
  • Checking if a page is loaded:
    If WWW.document.ReadyState = "complete" Then
  • Setting a Form Element Index value:
    WWW.Document.Forms(1).Elements("cspecialty").selectedIndex = 0
  • Retrieving a Form Element Value:
    a2=WWW.Document.Forms(1).Elements("cspecialty").Value
  • Clicking an Element on a Form:
    Call www.document.forms(1).elements(9).click


3.11 Extracting Pictures

These functions are used to get a list of the image URLs on the page at which the script is currently positioned, to extract the required picture by its index or by its name, and to store it in the specified folder. The following code is an example:

If Instr(1, t, "nophoto", 1) <> 0 Then
   call logMessage ("No Picture details page")
   oPicid = "None"
Else
   iMglist = Split(GetImageList(), Chr(254))
   For i = 0 To Ubound(iMglist)
      ' The next line searches for an image with "auto" in the name
      If Instr(1, iMglist(i), "auto", 1) > 0 Then
         oPicid = iMglist(i)
         on error resume next
         ' The next line downloads the image to the specified folder
         Call GetImageByNumber(i, "C:\Prog Files\PR\djphotos")
         on error goto 0
         Exit For
      End If
   Next
End If

3.11.1 GetImageList

Public Function GetImageList() As String()

GetImageList - This function returns an array containing the URLs of every
image on the document. This array is zero-based. You can use the indices
of this array in the GetImageByNumber function call, or you can put the
URL through the GetFilenameFromAddress function and pass the returned
filename to the GetImageByName function.

3.11.2 GetImageByNumber

Public Function GetImageByNumber(ByVal index As Integer, ByVal directory As String) As String

GetImageByNumber - This function downloads an image based on its index in
the web page.

  • Index: The index of the image
  • Directory: The directory (NOT filename) you wish to download to. The image will retain its own filename.
  • Return value: The function returns the path to the downloaded image. If the function fails, the return value is a zero-length string. The function will not return until the image has been downloaded.

3.11.3 GetImageByName

Public Function GetImageByName(ByVal imgname As String, ByVal directory As String) As String

GetImageByName - This function downloads an image by its filename, name, or id.

  • Imgname: The filename, name or id of an image on the web page. Not all images have names or ids, it depends on the exact HTML code used.
  • Directory: see GetImageByNumber
  • Return value: see GetImageByNumber
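
Because a failed download is reported by a zero-length return value, it is worth checking the result before using it. A minimal sketch, with a hypothetical image name and folder:

```vbscript
sSaved = GetImageByName("logo.gif", "C:\Temp\photos")   ' hypothetical name and folder
If Len(sSaved) = 0 Then
  Call LogMessage("Image download failed")
Else
  Call LogMessage("Image saved as " & sSaved)
End If
```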

3.11.4 GetFilenameFromAddress

Private Function GetFilenameFromAddress(url As String) As String

GetFilenameFromAddress - This function takes a URL and extracts the filename from it. The filename is defined as the segment of the URL past the last slash character (either '/' or '\').
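
Since the function is declared Private, a script that needs the same result may have to do the work itself. The rule described above can be sketched in VBScript as follows. This illustrates the rule only and is not PixieRobot's own code; the name FilenameFromUrl is hypothetical.

```vbscript
' Hypothetical helper illustrating the rule described above.
' Returns the part of sUrl after the last "/" or "\".
' If there is no slash at all, the whole string is returned.
Function FilenameFromUrl(sUrl)
  Dim iSlash, iBack
  iSlash = InStrRev(sUrl, "/")    ' last forward slash, 0 if none
  iBack = InStrRev(sUrl, "\")     ' last backslash, 0 if none
  If iBack > iSlash Then iSlash = iBack
  FilenameFromUrl = Mid(sUrl, iSlash + 1)
End Function
```

For example, FilenameFromUrl("http://www.abcsports.com/images/auto1.jpg") gives "auto1.jpg".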



