Just Announced: Indico Data Recognized in Gartner Hype Cycle for Property and Casualty Insurance, 2024
Learn More
  Everest Group IDP
             PEAK MatrixÂź 2022  
Indico Named as Major Contender and Star Performer in Everest Group's PEAK MatrixÂź for Intelligent Document Processing (IDP)
Access the Report

BLOG

Tutorial: Analyze Excel Files with indico

January 17, 2017 | Business, Developers, Tutorials

Back to Blog

The goal of this tutorial is to help you analyze your own Excel data using indico’s machine learning APIs — even if you have little programming experience! At the end of this tutorial you will be able to take the data from your .xlsx (Excel) files, analyze them using any of our text APIs, and save the results in a new CSV.
Note: If you’re looking for a way to use our text analysis APIs without writing any code, check out the indico Toolkit on your account dashboard.

Setup Guide:

  1. If you don’t have your own development environment and haven’t installed indico yet, learn how to set it all up with this tutorial. Make sure you install all the dependencies! If you’ve already done all of that, skip to step two.
  2. Install xlrd by running pip install xlrd in your terminal. xlrd is a python library that helps extract data from Microsoft Excel spreadsheets.

Now it’s time to code, which we’ll be doing in Python v2.7. You can follow along and download the code for free from this GitHub repo.

Step 1: Import Tools

First, let’s import the various packages we’ll need to parse through your Excel file.

import sys, argparse
from operator import attrgetter
from pprint import pprint
import csv
import xlrd
from itertools import izip

Now let’s import the package you’ll be using for the actual analysis – indico! You will need to enter you unique API key within the quotes. If you don’t already have an account, sign up and receive 10,000 free API calls a month. Once you confirm your account and login, you can copy your API key from the top of your dashboard.

import indicoio
indicoio.config.api_key = "ENTER YOUR UNIQUE API KEY"

Step 2: Read & Extract Data

Next, we’ll write some functions to read through and extract the data from the file. The first function, _get_sheet, tells the program which Excel file you’ve inputted for analysis, and which spreadsheet within the file to use. It checks to see if there are any errors with your input, and alerts you if you need to correct them. If everything looks good, it will move on to extracting the contents of the file…

def _get_sheet(book, args):
    # Both sheet and sheet number provided
    if args.sheet and args.sheet_number:
        print ("Warning: Provided both sheet name and sheet number. Ignoring sheet number.")
    if not args.sheet and not args.sheet_number:
        print("Warning: Neither sheet nor sheet number were provided. Using the first sheet by default")
        return book.sheet_by_index(0)
    # Provided sheet does not exist
    if args.sheet and args.sheet not in book.sheet_names():
        print("ERROR: {sheet_name} not found in {filename}".format(
            sheet_name=sheet_name,
            filename=args.filename
        ))
        sys.exit(1)
    # Provided sheet number is not valid
    if args.sheet_number and args.sheet_number > book.nsheets:
        print("ERROR: cannot get sheet {num}. {filename} only has {avail} sheets".format(
            num=args.sheet_number,
            avail=book.nsheets,
            filename=args.filename
        ))
        sys.exit(1)
    return book.sheet_by_name(args.sheet)

…which will happen when we call this function, parse_from_xlsx.


def parse_from_xlsx(args, batch_size=20):
    book = xlrd.open_workbook(args.filename)
    sheet = _get_sheet(book, args)
    data = sheet.col(args.column or 0)
    for idx in xrange(1 if args.contains_header else 0, len(data), batch_size):
        yield map(attrgetter("value"), data[idx: idx + batch_size])

Step 3: Run Analysis and Generate CSV

Now that all the data has been extracted from your Excel file, you can run the indico Sentiment Analysis API and save the results in a new CSV file. If you take a look at the line containing the with open instruction, the new CSV file will be saved as predictions.csv. If you’d like to name your file something else, simply change the predictions<.code> portion of “predictions.csv” (make sure you keep those quotation marks around the file name and the .csv, or the program will crash!

if __name__ == "__main__":
    reload(sys)
    sys.setdefaultencoding('utf-8')
    parser = argparse.ArgumentParser()
    parser.add_argument("filename", type=str, help="path to excel file")
    parser.add_argument("--sheet", type=str, help="sheet name")
    parser.add_argument("--sheet-number", type=int, help="sheet index from 1")
    parser.add_argument("--column", type=int, help="column index from 1")
    parser.add_argument("--contains-header", action="store_true", help="use if columns have headers")
    args = parser.parse_args()
    with open("predictions3.csv", "wb") as f:
        writer = csv.writer(f, dialect="excel")
        for lines in parse_from_xlsx(args):
            not_empty, inputs = zip(*[row for row in enumerate(lines) if row[1].strip()])
            predictions = indicoio.emotion(list(inputs))
            output = [[str(predictions.pop(0))] if idx in not_empty else "" for idx in xrange(len(lines))]
            writer.writerows(izip(inputs, output))
    print "Analysis complete, CSV file generated."

If you wish to use a run a different type of analysis on your data, simply change the indicoio.sentiment portion on the line containing predictions = indicoio.sentiment(list(inputs)). Below are some of our other popular text APIs -- for the full list, take a look at our docs:

  • indicoio.emotion: Detects emotion in text. This function returns a dictionary that maps from 5 emotions (anger, fear, joy, sadness, surprise) to the probability that the author is expressing the respective emotion.
  • indicoio.political: Predicts the political alignment of the author of a piece of text.
  • indicoio.text_tags: Determines topics in text. This function will return a dictionary with 111 key-value pairs. These key-value pairs represent the likelihood that the analyzed text is about each of the 111 possible topics.
  • indicoio.sentiment_hq: Higher accuracy model for sentiment analysis. This model only supports English.
  • indicoio.organizations: Identify references to specific organizations found in a document.
  • indicoio.people: Identify references to specific persons found in a document.
  • indicoio.places: Identify references to specific places found in a document.
  • indicoio.language: Predicts which of 33 different languages the text is written in.

For more information on predictions and how to interpret results, be sure to look at our docs.

Running the Code
To run the code, make sure you have a copy of read_from_xlsx.py and your desired data set. In order for this program to run, all of your data must be in one column. Don’t worry if there are blank rows in the file -- they will be skipped so there’s no need to delete them.
Next, open up your terminal. You will need to tell your computer to go to the folder in which you’ve saved the read_from_xlsx.py file and the Excel file of data you want to analyze. Do this by specifying the exact path to the folder (let’s say it’s called “DataAnalysis”), for example:

cd ~/Documents/DataAnalysis

Now, run the following command (replacing ~/Documents/DataAnalysis/indicoAnalysis.xlsx with your own filepath. If you’re on a Mac, you can simply drag the file into the terminal and it will automatically generate the file location).

python read_from_xlsx.py ~/Documents/DataAnalysis/indicoAnalysis.xlsx

After you’ve filled in the file location, press enter. You will probably see the following:

Warning: Neither sheet nor sheet number were provided. Using the first sheet by default.

This a warning we set up in our code base. It simply tells us that it started analysis on the first sheet, which is the default setting. If no other errors occur, your analysis is complete! Open your new file, predictions.csv, and you’ll see the sentiment scores. The first column will contain your original data, while the second column should contain the result corresponding to each data point.

I hope you found this tutorial helpful, and if you've got a question or some feedback, feel free to send me a message on our website or contact us at contact@indico.io!

[addtoany]

Increase intake capacity. Drive top line revenue growth.

[addtoany]

Resources

Blog

Gain insights from experts in automation, data, machine learning, and digital transformation.

Unstructured Unlocked

Enterprise leaders discuss how to unlock value from unstructured data.

YouTube Channel

Check out our YouTube channel to see clips from our podcast and more.