The goal of this tutorial is to help you analyze your own Excel data using indico’s machine learning APIs — even if you have little programming experience! At the end of this tutorial you will be able to take the data from your .xlsx
(Excel) files, analyze them using any of our text APIs, and save the results in a new CSV.
Note: If you’re looking for a way to use our text analysis APIs without writing any code, check out the indico Toolkit on your account dashboard.
Setup Guide:
- If you don’t have your own development environment and haven’t installed indico yet, learn how to set it all up with this tutorial. Make sure you install all the dependencies! If you’ve already done all of that, skip to step two.
- Install
xlrd
by runningpip install xlrd
in your terminal.xlrd
is a python library that helps extract data from Microsoft Excel spreadsheets.
Now it’s time to code, which we’ll be doing in Python v2.7. You can follow along and download the code for free from this GitHub repo.
Step 1: Import Tools
First, let’s import the various packages we’ll need to parse through your Excel file.
import sys, argparse from operator import attrgetter from pprint import pprint import csv import xlrd from itertools import izip
Now let’s import the package you’ll be using for the actual analysis – indico! You will need to enter you unique API key within the quotes. If you don’t already have an account, sign up and receive 10,000 free API calls a month. Once you confirm your account and login, you can copy your API key from the top of your dashboard.
import indicoio indicoio.config.api_key = "ENTER YOUR UNIQUE API KEY"
Step 2: Read & Extract Data
Next, we’ll write some functions to read through and extract the data from the file. The first function, _get_sheet
, tells the program which Excel file you’ve inputted for analysis, and which spreadsheet within the file to use. It checks to see if there are any errors with your input, and alerts you if you need to correct them. If everything looks good, it will move on to extracting the contents of the file…
def _get_sheet(book, args): # Both sheet and sheet number provided if args.sheet and args.sheet_number: print ("Warning: Provided both sheet name and sheet number. Ignoring sheet number.") if not args.sheet and not args.sheet_number: print("Warning: Neither sheet nor sheet number were provided. Using the first sheet by default") return book.sheet_by_index(0) # Provided sheet does not exist if args.sheet and args.sheet not in book.sheet_names(): print("ERROR: {sheet_name} not found in {filename}".format( sheet_name=sheet_name, filename=args.filename )) sys.exit(1) # Provided sheet number is not valid if args.sheet_number and args.sheet_number > book.nsheets: print("ERROR: cannot get sheet {num}. {filename} only has {avail} sheets".format( num=args.sheet_number, avail=book.nsheets, filename=args.filename )) sys.exit(1) return book.sheet_by_name(args.sheet)
…which will happen when we call this function, parse_from_xlsx
.
def parse_from_xlsx(args, batch_size=20): book = xlrd.open_workbook(args.filename) sheet = _get_sheet(book, args) data = sheet.col(args.column or 0) for idx in xrange(1 if args.contains_header else 0, len(data), batch_size): yield map(attrgetter("value"), data[idx: idx + batch_size])
Step 3: Run Analysis and Generate CSV
Now that all the data has been extracted from your Excel file, you can run the indico Sentiment Analysis API and save the results in a new CSV file. If you take a look at the line containing the with open
instruction, the new CSV file will be saved as predictions.csv
. If you’d like to name your file something else, simply change the predictions<.code> portion of
“predictions.csv”
(make sure you keep those quotation marks around the file name and the .csv
, or the program will crash!
if __name__ == "__main__": reload(sys) sys.setdefaultencoding('utf-8') parser = argparse.ArgumentParser() parser.add_argument("filename", type=str, help="path to excel file") parser.add_argument("--sheet", type=str, help="sheet name") parser.add_argument("--sheet-number", type=int, help="sheet index from 1") parser.add_argument("--column", type=int, help="column index from 1") parser.add_argument("--contains-header", action="store_true", help="use if columns have headers") args = parser.parse_args() with open("predictions3.csv", "wb") as f: writer = csv.writer(f, dialect="excel") for lines in parse_from_xlsx(args): not_empty, inputs = zip(*[row for row in enumerate(lines) if row[1].strip()]) predictions = indicoio.emotion(list(inputs)) output = [[str(predictions.pop(0))] if idx in not_empty else "" for idx in xrange(len(lines))] writer.writerows(izip(inputs, output)) print "Analysis complete, CSV file generated."
If you wish to use a run a different type of analysis on your data, simply change the indicoio.sentiment
portion on the line containing predictions = indicoio.sentiment(list(inputs))
. Below are some of our other popular text APIs -- for the full list, take a look at our docs:
indicoio.emotion
: Detects emotion in text. This function returns a dictionary that maps from 5 emotions (anger, fear, joy, sadness, surprise) to the probability that the author is expressing the respective emotion.indicoio.political
: Predicts the political alignment of the author of a piece of text.indicoio.text_tags
: Determines topics in text. This function will return a dictionary with 111 key-value pairs. These key-value pairs represent the likelihood that the analyzed text is about each of the 111 possible topics.indicoio.sentiment_hq
: Higher accuracy model for sentiment analysis. This model only supports English.indicoio.organizations
: Identify references to specific organizations found in a document.indicoio.people
: Identify references to specific persons found in a document.indicoio.places
: Identify references to specific places found in a document.indicoio.language
: Predicts which of 33 different languages the text is written in.
For more information on predictions and how to interpret results, be sure to look at our docs.
Running the Code
To run the code, make sure you have a copy of read_from_xlsx.py
and your desired data set. In order for this program to run, all of your data must be in one column. Don’t worry if there are blank rows in the file -- they will be skipped so there’s no need to delete them.
Next, open up your terminal. You will need to tell your computer to go to the folder in which you’ve saved the read_from_xlsx.py
file and the Excel file of data you want to analyze. Do this by specifying the exact path to the folder (let’s say it’s called “DataAnalysis”), for example:
cd ~/Documents/DataAnalysis
Now, run the following command (replacing ~/Documents/DataAnalysis/indicoAnalysis.xlsx
with your own filepath. If you’re on a Mac, you can simply drag the file into the terminal and it will automatically generate the file location).
python read_from_xlsx.py ~/Documents/DataAnalysis/indicoAnalysis.xlsx
After you’ve filled in the file location, press enter. You will probably see the following:
Warning: Neither sheet nor sheet number were provided. Using the first sheet by default.
This a warning we set up in our code base. It simply tells us that it started analysis on the first sheet, which is the default setting. If no other errors occur, your analysis is complete! Open your new file, predictions.csv
, and you’ll see the sentiment scores. The first column will contain your original data, while the second column should contain the result corresponding to each data point.
I hope you found this tutorial helpful, and if you've got a question or some feedback, feel free to send me a message on our website or contact us at contact@indico.io!