ExcelParser
takes a non-deterministic approach to parsing .xlsx
files. It allows the customer to not only submit an .xslx
file which they would prefer, but they can insert, delete, and rearange their content in any with a vast degree of feedom and get a reliably populated database. Here is a page explaining what ExcelParser
needs from an .xlsx
file to populate the database.
This java class is responsible for converting from an excel file (.xlsx
),
to our mongo database. This documentation was prepared to walk through how and why we implemented this class the way we did.
In order to use this parser, you will need to setup Apachi POI with your project. To do this, you will need edit your server level build.gradle
file. First, add this line into your dependencies
collection.
compile 'org.apache.poi:poi-ooxml:3.15'
You will then need to refresh gradle. In IntelliJ IDEA, you can go to the gradle window and in the top left, press the blue refresh button, and BAM YOU WIN.
In our constructor we pass in a boolean, test
.
This boolean will change the excel file to our test spreadsheet and populate the database so our excelParser tests know what the outputs should be.
In our main method, the first thing we do is call extractFromXLSX()
.
This method is the only method we use that uses Apache POI.
The back bone for this method is from an Apache POI example that prints all the content from an xlsx
file.
We heavily modified it to put that data in a 2D string array.
When playing around with this first method, we learned some important things:
- The default size for an
xlsx
from a Microsoft Excel document is 1000 rows, and 25 columns. As a consequence, our array is 1000 rows tall, and 25 columns wide, and each non-filled index isnull
. We spend a lot of time shrinking the size of this array. - General 2D array facts, for a 2D array named
cellValues
:
cellValues.length
gives us the height of the arraycellValues[i].length
gives us the length of a given row,i
.
Because most of our 2D array is null at this point, we horizontally collapse the array to get rid of all columns that are just filled with nulls. We could have collapsed both vertically and horizontally at the same time, but for read and write simplicity, we opted for doing each of these steps individually. There are two steps involved in this proccess; locating the column to collapse the array at, and actually collapsing the array.
In the example xlsx
file below, there are three rows that are grayed out. We designate these three rows (rows 1 through 3) as key rows. When collapsing horizontally, we start at row one at the rightmost part of our 2D array. We check to see if any of the three rows in the column are not null. If they are null, we will shift one column to the left and repeat. We keep doing this process until we reach a cell that is not null.
This method starts where collapseHorizontally()
leaves off. Because there is no built in method to trim arrays, let alone 2D arrays, we built one! It simply makes a new 2D array of a size specified by collapseHorizontally()
, copies the old array into the new one and returns it.
Vertically collapsing the array is much easier than collapsing it horizontally. We still use two steps to do this process.
Our assumption for finding the vertical collapse point is that column 0
is consistently populated for every row we care about. To find the collapse point, we simply iterate on column 0
from the bottom of the array until we find a non-null cell. We select this index as the collapse point.
Once we know our collapse point we use trimArrayVertically()
in a similar fashion to trimArrayHorizontally()
.
We make a new 2D array as tall as collapseVertically()
specifies and copy the old elements into it.
The method simply iterates through our 2D array and replaces all nulls with empty strings. This prevents any null pointer exceptions in the future.
To do this, we use the getKeys()
method. This method accomplishes two things:
- It dynamically makes a
String[]
of keys. - It filters the keys to match terms defined by the standards commite (eg, # becomes
id
) and not break things.
In order to make a String[]
of keys, we iterate, column by column, through our key rows. For every column, we concatenate all the strings from each cell. In the table below, the key for that column would be Common Name
, in the following table, the key would be HB=Hang BasketC=ContainerW=Wall
.
Common Name |
---|
"" |
"" |
HB=Hang Basket |
---|
C=Container |
W=Wall |
HB=Hang BasketC=ContainerW=Wall
is not a great for users or programmers alike.
There are some points in our project where passing this around can break things. For this reason we filter the keys.
We remove spaces, and equal signs. This is also a good oportunity to make our keys match what is specified by the standards committee. We change keys like #
to id
, and Common Name
to commonName
.
This method starts at row 4
(the first row after the key rows). It works by moving from left to right across each row, and at every cell, adding that cell's value as the value of a hashmap with its corresponding key as the key. The hashmap will be added into a document that gets added directly into the database. After this, the method moves to the next row, and repeats until it is at the bottom of the array.
It is also at this stage that we ignore certain rows that the customer would like ignored. On each row we check for:
- The gardenlocation is empty
- In a column called not included, if there is an
x
If either of these are the case, such as in the following picture, we will not include that row in the database.
And that! Is how you turn any XLSX document into a populating database!