Extract tables from pdf (to excel), pref. w/ vba
You have a number of ways to dissect a text file and depending on how complex it is might cause you to lean one way or another. I started this and it got a bit out of hand... enjoy.
Based on the sample you've provided and the additional comments, I noted the following. Some of these may work well for simple files but can get unwieldy with bigger more complex files. Furthermore, there may be slightly more efficient methods or tricks to what I have used here but this will definitely get you going an achieve the desired outcome. Hopefully this makes sense in conjunction with the code provided:
- You can use booleans to help you determine what 'section' of the text file you are in. Ie use
InStr
on the current line to determine you are in a Table by looking for the text 'Table' and then once you know you are in the 'Table' section of the file start looking for the 'Assets' section etc - You can use a few methods to determine the number of years (or columns) you have. The
Split
function along with a loop will do the job. - If your files always have constant formatting, even only in certain parts, you can take advantage of this. For example, if you know your file line will always have a dollar sign in front of the them, then you know this will define the column widths and you can use this on subsequent lines of text.
The following code will extract the Assets details from the text file, you can mod it to extract other sections. It should handle multiple rows. Hopefully I've commented it sufficient. Have a look and I'll edit if needs to help out further.
Sub ReadInTextFile()
Dim fs As Scripting.FileSystemObject, fsFile As Scripting.TextStream
Dim sFileName As String, sLine As String, vYears As Variant
Dim iNoColumns As Integer, ii As Integer, iCount As Integer
Dim bIsTable As Boolean, bIsAssets As Boolean, bIsLiabilities As Boolean, bIsNetAssets As Boolean
Set fs = CreateObject("Scripting.FileSystemObject")
sFileName = "G:\Sample.txt"
Set fsFile = fs.OpenTextFile(sFileName, 1, False)
'Loop through the file as you've already done
Do While fsFile.AtEndOfStream <> True
'Determine flag positions in text file
sLine = fsFile.Readline
Debug.Print VBA.Len(sLine)
'Always skip empty lines (including single spaceS)
If VBA.Len(sLine) > 1 Then
'We've found a new table so we can reset the booleans
If VBA.InStr(1, sLine, "Table") > 0 Then
bIsTable = True
bIsAssets = False
bIsNetAssets = False
bIsLiabilities = False
iNoColumns = 0
End If
'Perhaps you want to also have some sort of way to designate that a table has finished. Like so
If VBA.Instr(1, sLine, "Some text that designates the end of the table") Then
bIsTable = False
End If
'If we're in the table section then we want to read in the data
If bIsTable Then
'Check for your different sections. You could make this constant if your text file allowed it.
If VBA.InStr(1, sLine, "Assets") > 0 And VBA.InStr(1, sLine, "Net") = 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = False
If VBA.InStr(1, sLine, "Liabilities") > 0 Then bIsAssets = False: bIsLiabilities = True: bIsNetAssets = False
If VBA.InStr(1, sLine, "Net Assests") > 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = True
'If we haven't triggered any of these booleans then we're at the column headings
If Not bIsAssets And Not bIsLiabilities And Not bIsNetAssets And VBA.InStr(1, sLine, "Table") = 0 Then
'Trim the current line to remove leading and trailing spaces then use the split function to determine the number of years
vYears = VBA.Split(VBA.Trim$(sLine), " ")
For ii = LBound(vYears) To UBound(vYears)
If VBA.Len(vYears(ii)) > 0 Then iNoColumns = iNoColumns + 1
Next ii
'Now we can redefine some variables to hold the information (you'll want to redim after you've collected the info)
ReDim sAssets(1 To iNoColumns + 1, 1 To 100) As String
ReDim iColumns(1 To iNoColumns) As Integer
Else
If bIsAssets Then
'Skip the heading line
If Not VBA.Trim$(sLine) = "Assets" Then
'Increment the counter
iCount = iCount + 1
'If iCount reaches it's limit you'll have to redim preseve you sAssets array (I'll leave this to you)
If iCount > 99 Then
'You'll find other posts on stackoverflow to do this
End If
'This will happen on the first row, it'll happen everytime you
'hit a $ sign but you could code to only do so the first time
If VBA.InStr(1, sLine, "$") > 0 Then
iColumns(1) = VBA.InStr(1, sLine, "$")
For ii = 2 To iNoColumns
'We need to start at the next character across
iColumns(ii) = VBA.InStr(iColumns(ii - 1) + 1, sLine, "$")
Next ii
End If
'The first part (the name) is simply up to the $ sign (trimmed of spaces)
sAssets(1, iCount) = VBA.Trim$(VBA.Mid$(sLine, 1, iColumns(1) - 1))
For ii = 2 To iNoColumns
'Then we can loop around for the rest
sAssets(ii, iCount) = VBA.Trim$(VBA.Mid$(sLine, iColumns(ii) + 1, iColumns(ii) - iColumns(ii - 1)))
Next ii
'Now do the last column
If VBA.Len(sLine) > iColumns(iNoColumns) Then
sAssets(iNoColumns + 1, iCount) = VBA.Trim$(VBA.Right$(sLine, VBA.Len(sLine) - iColumns(iNoColumns)))
End If
Else
'Reset the counter
iCount = 0
End If
End If
End If
End If
End If
Loop
'Clean up
fsFile.Close
Set fsFile = Nothing
Set fs = Nothing
End Sub