I apologize, I probably should have given specifics.
<h2><a class="thm-hglight-text_color" href="/auto/new-2014-nissan-altima-25_s/664277/">New 2014 Nissan Altima 2.5 S </a></h2>
I need the "New 2014 Nissan Altima 2.5 S"
<dd data-price="23247" class="vehicle_price price_tp-msrp price_strike">$23,247</dd>
I need the "$23247"
<dd data-price="19277" class="vehicle_price price_tp-selling ">$19,277</dd>
I need the "19277"
This is buried in about 25 lines of code per vehicle on this page then 80 vehicles per page. I think the FIRST obstacle would be to JUST grab those 3 lines I want. I can't seem to think of the best way to do that. Is there a way to select and delete lines that do/do not include specific words/characters?
If I only got the lines I needed I could use TextToColumns to get everything after a ">". From there I could use the =Left(6) minus 1 to get that price/vehicle I needed.
This can be done fairly easily using an HTMLParser and, if you don't have deal with multiple nestings of the same tags, using Regular Expressions.
(Since I am not familiar with HTML Parsers, except for having read about their capabilities, and since Regular Expressions are implemented in VBA, I will discuss those).
Exactly how best to do this depends on things like the location of the html text, how you want the results laid out, etc.
Further, it seems that you are reading in the HTML "line by line" with each line of HTML going into a separate row. I do not believe you can guarantee that the entire desired text will all reside on a single line. If it does not, that will mess up your idea of filtering lines by content.
For example, given that the
html text is located in A1
each item is listed between <a and </a>
each item is followed by two prices listed between <dd and </dd>
Then the macro below would extract the data and place it adjacent columns (and three rows per entry).
Of course, this may not be ideal.
Depending on the size of the page, the entire HTML text might not fit in one cell. However, you could combine multiple cells into a single string in VBA, which has a much larger size limit, >10^9 if I recall correctly.
You may need the results output in a different manner.
You may need to iterate through a bunch of cells or worksheets
There may be relevant information about the HTML source that you have not mentioned.
To enter this Macro (Sub), <alt-F11> opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.
To use this Macro (Sub), <alt-F8> opens the macro dialog box. Select the macro by name, and <RUN>.
================================
Option Explicit
Sub ExtractAuto()
Dim re As Object, mc As Object, m As Object
Dim i As Long
Dim c As Range
Dim s As String
Set c = Range("A1")
s = c.Text
Set re = CreateObject("vbscript.regexp")
With re
.Global = True
.ignorecase = True
.Pattern = "<a[^>]*>([^<]+)[\s\S]+?<dd[^>]+>([^<]+)[\s\S]+?<dd[^>]+>([^<]+)"
End With
If re.test(s) = True Then
Set mc = re.Execute(s)
For i = 0 To mc.Count - 1
c(1, i + 2) = mc(i).submatches(0)
c(2, i + 2) = mc(i).submatches(1)
c(3, i + 2) = mc(i).submatches(2)
Next i
End If
End Sub
==================================