Automatic Extraction of Data from Excel Sheet

Excel sheets are very common files in corporate environments. Excel is definitely not a security tool, but it’s not rare to find useful information stored in such files. When this data must be processed for threat hunting or to collect IOCs, it is essential to automate the processing as much as possible. Here is a good example: every day, I receive an email that contains a list of new URLs found on the dark web (Tor network). I process them automatically with a script, but getting them out of the sheet is a painful task: open the Excel sheet, select the URLs, copy, create a text file, paste, and pass the file to the next script. Being a lazy guy, why not automate this?

Luckily, Python has a nice module called openpyxl which is very helpful to read/write XLS[X|M] files. Reading the content of a file can be performed with only a few lines of Python but I decided to write a small tool that could extract specific zones of the sheet based on the arguments. I called the script xlsxtract.py. Its syntax is easy to understand:

# ./xlsxtract.py -h
Usage: xlsxtract.py [options] <file> ...

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -w WORKBOOK, --workbook=WORKBOOK
                        Workbook to extract data from
  -c COLS,... --cols=COLS
                        Read columns (Format: "A", "A-" or "A-B")
  -r ROWS,... --rows=ROWS
                        Read rows (Format: "1", "1-" or "1-10")
  -m MAX, --max=MAX     Process maximum rows
  -p, --prefix          Display cell name
  -s, --stop            Stop processing when empty cell is found

You can specify the cells to dump with the ‘--cols’ and ‘--rows’ parameters. Each accepts a single value, a closed range or an open-ended range: ‘A’, ‘A-C’ or ‘A-’ for columns, and ‘1’, ‘1-100’ or ‘1-’ for rows. Multiple ranges can be separated by commas. A maximum number of cells to report can be specified (the default is 65535). You can also stop processing cells when the first empty one is found. If no ranges are specified, the script dumps all cells starting from A1 (be careful!).
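
To make the range syntax concrete, here is a minimal sketch of how such specifications could be parsed. It is not the code of xlsxtract.py: the names col_to_index and parse_ranges are hypothetical, and the upper bounds used for open-ended ranges (16384 columns, the XLSX limit, and 65535 rows, borrowed from the default mentioned above) are assumptions.

```python
MAX_COL = 16384   # assumption: XLSX column limit (column "XFD")
MAX_ROW = 65535   # assumption: reuse the default maximum mentioned in the text


def col_to_index(letters):
    """Convert a column name such as "A" or "AA" to a 1-based index."""
    if not letters or not letters.isalpha():
        raise ValueError(f"invalid column name: {letters!r}")
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_ranges(spec, to_index, open_end):
    """Turn "A", "A-C", "A-" or "A,F" into a list of (first, last) pairs.

    to_index converts one token to a number (col_to_index for columns,
    int for rows); open_end is the bound used for "A-" style ranges.
    """
    ranges = []
    for part in spec.split(","):
        first, dash, last = part.strip().partition("-")
        start = to_index(first)
        if not dash:
            end = start          # single value: "A" or "1"
        elif last:
            end = to_index(last)  # closed range: "A-C" or "1-100"
        else:
            end = open_end       # open-ended range: "A-" or "1-"
        if end < start:
            raise ValueError(f"invalid range: {part!r}")
        ranges.append((start, end))
    return ranges
```

With this sketch, parse_ranges("A,F", col_to_index, MAX_COL) yields one pair per column, and parse_ranges("3-600", int, MAX_ROW) yields a single pair covering rows 3 to 600.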

Here is a simple sheet with an inventory of servers:

To extract the list of IP addresses, we can use xlsxtract.py with the following syntax:

$ ./xlsxtract.py -c A -r 2-600 -s -w 'Sheet1' test.xlsx
10.0.0.1
10.0.0.2
10.0.0.3
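
For readers who prefer to stay in Python, the same kind of extraction (one column, a range of rows, stopping at the first empty cell) can be sketched with openpyxl directly. This is a rough illustration, not the actual code of xlsxtract.py: the helper names read_column and dump_column are hypothetical, and it assumes openpyxl is installed and that the file and sheet names match the example.

```python
def read_column(sheet, column, first_row, last_row, stop_on_empty=True):
    """Collect the values of one column, e.g. column "A", rows 2 to 600."""
    values = []
    for row in range(first_row, last_row + 1):
        # openpyxl worksheets accept cell names as keys, e.g. sheet["A2"]
        value = sheet[f"{column}{row}"].value
        if value is None:
            if stop_on_empty:
                break        # same idea as the -s / --stop option
            continue         # otherwise just skip the empty cell
        values.append(value)
    return values


def dump_column(path, sheet_name, column, first_row, last_row):
    """Print one value per line, similar to the output shown above."""
    # Imported here so read_column stays usable without the dependency.
    from openpyxl import load_workbook  # third-party: pip install openpyxl

    # data_only=True returns cached formula results instead of formulas
    workbook = load_workbook(path, data_only=True)
    for value in read_column(workbook[sheet_name], column, first_row, last_row):
        print(value)
```

Calling dump_column("test.xlsx", "Sheet1", "A", 2, 600) would then print the non-empty cells of column A, one per line, until the first empty cell.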

Now, if the sheet is more complex and the IP addresses are spread across multiple columns:

Example 2

We use xlsxtract.py like this:

$ ./xlsxtract.py -c A,F -r 3-600 -s -w 'Sheet1' test.xlsx
10.0.0.1
10.0.0.2
10.0.0.3
172.16.0.1
172.16.0.2
172.16.0.3

The script is available in my GitHub repository: xlsxtract.py.
