Understanding TurboTax .tax files

The start of a new year is the start of everyone’s favorite season: tax season! Every year I use TurboTax to file my taxes, and the online version offers two downloads for archiving a copy of the return:

  1. A PDF of all of the forms.
  2. A .tax file that can later be imported into TurboTax.

The .tax file is an undocumented format and can only be imported by TurboTax. It appears to contain the tax return data in a structured format, which would be useful if you want to read select fields of your tax return programmatically.

I prodded at the .tax file and realized it’s a zip file. The unzip command can list the contents.

$ unzip -l TurboTaxReturn.tax2017
Archive:  TurboTaxReturn.tax2017
  Length      Date    Time    Name
---------  ---------- -----   ----
      976  01-05-2024 08:50   manifest.xml
  3114304  01-05-2024 08:50   tax:62edf303-b57a-46ad-b725-47a9188bf46d
---------                     -------
  3115280                     2 files
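
The same listing can be produced from Python with the standard zipfile module, which is handy if the rest of the processing will happen in a script anyway. This is a minimal sketch; the function name list_tax_entries is my own and not part of any TurboTax tooling.

```python
import zipfile


def list_tax_entries(path):
    """Return (name, size) pairs for every entry in a .tax archive."""
    # is_zipfile checks for the zip signature, so other file types fail early.
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} does not look like a zip archive")
    with zipfile.ZipFile(path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```

Calling list_tax_entries('./TurboTaxReturn.tax2017') should return the same two entries that unzip -l shows, manifest.xml and the tax: entry, along with their uncompressed sizes.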

However, the manifest.xml file does not appear to be a valid XML file; its first 16 bytes are binary data rather than text.

$ unzip -p TurboTaxReturn.tax2017 manifest.xml | head -c 16 | xxd
00000000: a1b1 fefb 3718 dd9c 082d 9c86 2300 10fa  ....7....-..#...
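
A quick way to make that observation concrete is to check whether the bytes could plausibly be the start of a plaintext XML document. The sketch below uses a simple heuristic of my own (looks_like_xml is a hypothetical helper) and also notes that both entry sizes from the listing are multiples of 16, which is consistent with, though not proof of, a 16-byte block cipher.

```python
def looks_like_xml(data: bytes) -> bool:
    """Heuristic: plaintext XML starts with '<' after an optional UTF-8 BOM."""
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    return data.lstrip().startswith(b"<")


# First 16 bytes of manifest.xml, copied from the xxd output above.
header = bytes.fromhex("a1b1fefb3718dd9c082d9c86230010fa")
print(looks_like_xml(header))                    # False
print(looks_like_xml(b'<?xml version="1.0"?>'))  # True

# Entry sizes from the unzip listing; both divide evenly by 16.
print(976 % 16, 3114304 % 16)                    # 0 0
```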

After some prodding of the TurboTax desktop application, it appears to be encrypted with AES in CBC mode. Using the powers of deduction, I have determined that the key and IV used to encrypt manifest.xml are:

  Key: 5TGB@YHN7UJM(IK(
  IV:  !QAZ2WSX#EDC4RFV

Using the pycryptodome library, it’s pretty easy to get the contents of the manifest.xml file.

#!/usr/bin/env python3
import zipfile
import xml.dom.minidom

from Crypto.Cipher import AES
from Crypto.Util.Padding import unpad

def main():
    # The .tax file is a zip archive; manifest.xml is stored AES-CBC encrypted.
    with zipfile.ZipFile('./TurboTaxReturn.tax2017') as zf:
        manifest_encrypted = zf.read('manifest.xml')

    # Key and IV for manifest.xml (both 16 bytes, so AES-128).
    key = "5TGB@YHN7UJM(IK(".encode('utf-8')
    iv = "!QAZ2WSX#EDC4RFV".encode('utf-8')
    cipher = AES.new(key, AES.MODE_CBC, iv=iv)

    # Decrypt, then strip the padding (unpad defaults to PKCS#7).
    manifest_decrypted = cipher.decrypt(manifest_encrypted)
    manifest = unpad(manifest_decrypted, AES.block_size)

    parsed_manifest = xml.dom.minidom.parseString(manifest)
    print(parsed_manifest.toprettyxml(indent='  '))


if __name__ == '__main__':
    main()

Running the above script prints:

$ ./decrypt.py
<?xml version="1.0" ?>
<Manifest xmlns="http://schemas.intuit.com/LocalCfpContainer.xsd">
  <documents>
    <document>
      <entityType>/tax/income/individual/taxreturn/taxml</entityType>
      <entityKey>tax:62edf303-b57a-46ad-b725-47a9188bf46d</entityKey>
      <entityVer>2018-03-03T04:58:40.793Z</entityVer>
      <attributes>
        <attribute>
          <key>appId</key>
          <value>Intuit.ctg.tto.platform</value>
        </attribute>
        <attribute>
          <key>year</key>
          <value>2017</value>
        </attribute>
        <attribute>
          <key>deviceId</key>
          <value>TurboTax Online</value>
        </attribute>
        <attribute>
          <key>deviceName</key>
          <value>TurboTax Online</value>
        </attribute>
        <attribute>
          <key>appSku</key>
          <value>8</value>
        </attribute>
        <attribute>
          <key>formSetId</key>
          <value>US1040PER</value>
        </attribute>
        <attribute>
          <key>saveType</key>
          <value>endSession</value>
        </attribute>
        <attribute>
          <key>docName</key>
          <value>Zameer</value>
        </attribute>
        <attribute>
          <key>ceDataType</key>
          <value>TAXML</value>
        </attribute>
      </attributes>
    </document>
  </documents>
</Manifest>
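
Once the manifest is decrypted, the useful parts are the entityKey (which matches the name of the other zip entry) and the key/value attributes. A small sketch of pulling those out with the standard library follows; read_manifest is a hypothetical helper name, and the element names are taken from the output above.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <Manifest> element in the output above.
NS = {"m": "http://schemas.intuit.com/LocalCfpContainer.xsd"}


def read_manifest(xml_bytes):
    """Return one dict per <document>, with its attributes flattened to a dict."""
    root = ET.fromstring(xml_bytes)
    docs = []
    for doc in root.findall("m:documents/m:document", NS):
        attrs = {
            a.findtext("m:key", namespaces=NS): a.findtext("m:value", namespaces=NS)
            for a in doc.findall("m:attributes/m:attribute", NS)
        }
        docs.append({
            "entityKey": doc.findtext("m:entityKey", namespaces=NS),
            "entityType": doc.findtext("m:entityType", namespaces=NS),
            "attributes": attrs,
        })
    return docs
```

For the manifest shown above, this would give a single document whose entityKey is the tax: entry name and whose attributes include year and formSetId.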

The manifest contains a list of all returns in the file and some metadata about each return. Unfortunately, the tax return file is also encrypted, with a different key. Using powers of deduction, the keys to encrypt the tax: file are:

Adjusting the above script to use the above keys on a tax: file results in an XML representation of the tax return.

$ ./decrypt.py | head
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="http://taxml.intuit.com/xml_stylesheets/taxml_default.xslt"?>
<TaxReturns xmlns="http://www.intuit.com/Ctg/Pta/TurboTax/TaxReturns">
  <USIndividualTaxReturn xmlns="http://www.intuit.com/Ctg/Pta/TurboTax/USIndividualTaxReturn" xmlns:po="http://www.intuit.com/Ctg/Pta/TurboTax/PersistentObjects" xmlns:td="http://www.intuit.com/Ctg/Pta/TurboTax/TaxDataType" taxYear="2019" tpsEngineVersion="4.7.2 (Tps-Compatibility-Version: 2018.4)" uuid="63de8c2c-0d75-4f7e-9929-8a19f81d2e32">
    <US1040PER xmlns="http://www.intuit.com/Ctg/Pta/TurboTax/US1040PER" checksum="1890142416" cidFormSet="USIndividual" dataVersion="771297" formsetAttribute="1" implementationVersion="2019.94.0.1" showSmartWorksheets="true" tpsPrefix="S2019" tpsType="formset" uuid="0e24cebc-5be7-400c-ad0f-f658e10918df">

Unfortunately, the schema is not documented, but it should be possible to read certain fields out of the tax return with any standard XML parser.
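
As a starting point, here is a sketch that reads a couple of attributes visible in the output above using Python's built-in ElementTree. The namespace URIs and attribute names come from that output; summarize_return is a hypothetical helper, and since the schema is undocumented, anything deeper in the tree would need to be discovered by exploring the decrypted XML.

```python
import xml.etree.ElementTree as ET

# Each level redeclares the default namespace, so each needs its own prefix here.
NS = {
    "ind": "http://www.intuit.com/Ctg/Pta/TurboTax/USIndividualTaxReturn",
    "f": "http://www.intuit.com/Ctg/Pta/TurboTax/US1040PER",
}


def summarize_return(xml_bytes):
    """Return the tax year and form set version of each return in the document."""
    root = ET.fromstring(xml_bytes)
    summary = []
    for ret in root.findall("ind:USIndividualTaxReturn", NS):
        formset = ret.find("f:US1040PER", NS)
        summary.append({
            "taxYear": ret.get("taxYear"),
            "implementationVersion": (
                formset.get("implementationVersion") if formset is not None else None
            ),
        })
    return summary
```

Iterating over the children of the US1040PER element and printing each tag is a reasonable next step for finding where individual form fields live.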