Understanding TurboTax .tax files
The start of a new year is the start of everyone’s favorite season: tax season! Every year I use TurboTax to file my taxes, and the online version offers two downloads to archive a copy of the return:
- A PDF of all of the forms.
- A .tax file that can later be imported into TurboTax.
The .tax file is an undocumented format and can only be imported by TurboTax. It appears to contain the tax return data in a structured form, which would be useful if you want to read selected fields of your tax return programmatically.
I prodded at the .tax file and realized it’s a zip file. The unzip command can list the contents.
$ unzip -l TurboTaxReturn.tax2017
Archive:  TurboTaxReturn.tax2017
  Length      Date    Time    Name
---------  ---------- -----   ----
      976  01-05-2024 08:50   manifest.xml
  3114304  01-05-2024 08:50   tax:62edf303-b57a-46ad-b725-47a9188bf46d
---------                     -------
  3115280                     2 files
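The same listing can be produced from Python with the standard-library zipfile module, which is a convenient first step for a script. This is only a minimal sketch; the helper name list_members is made up for illustration.

```python
import zipfile


def list_members(path_or_file):
    """Return (name, size_in_bytes) for every entry in the archive."""
    with zipfile.ZipFile(path_or_file) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```

Calling list_members('./TurboTaxReturn.tax2017') should return one pair for manifest.xml and one for the tax: entry, matching the unzip output.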
However, the manifest.xml file does not appear to be valid XML.
$ unzip -p TurboTaxReturn.tax2017 manifest.xml | head -c 16 | xxd
00000000: a1b1 fefb 3718 dd9c 082d 9c86 2300 10fa ....7....-..#...
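Two quick checks support the suspicion that this is ciphertext: the first bytes do not look like XML, and both sizes in the listing above (976 and 3114304) are exact multiples of 16, the AES block size. That is consistent with a padded block cipher, though it is not proof of one. A small sketch of those checks, with hypothetical helper names:

```python
AES_BLOCK_SIZE = 16  # bytes


def looks_like_xml(data: bytes) -> bool:
    """True if the data starts like a plaintext XML document."""
    return data.lstrip().startswith(b"<")


def is_block_aligned(size: int, block_size: int = AES_BLOCK_SIZE) -> bool:
    """True if size is a non-zero whole number of cipher blocks."""
    return size > 0 and size % block_size == 0


# First 16 bytes of manifest.xml, copied from the xxd output above.
header = bytes.fromhex("a1b1fefb3718dd9c082d9c86230010fa")

print(looks_like_xml(header))     # False
print(is_block_aligned(976))      # True: 976 = 61 * 16
print(is_block_aligned(3114304))  # True: 3114304 = 194644 * 16
```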
After some prodding of the TurboTax desktop application, it appears to be encrypted with AES in CBC mode. Using the powers of deduction, I have determined that the keys used to encrypt manifest.xml are:
- For 2017 tax year files the key is 5TGB@YHN7UJM(IK( with an IV of !QAZ2WSX#EDC4RFV
- For 2019 tax year files the key is 7HMT&BGM5KBNFH>< with an IV of #YBU7JLZ*JGL7MAR
Using the pycryptodome library, it’s pretty easy to get the contents of the manifest.xml file.
#!/usr/bin/env python3
import zipfile
import xml.dom.minidom

from Crypto.Cipher import AES
from Crypto.Util.Padding import unpad


def main():
    # The .tax file is a zip archive; manifest.xml inside it is encrypted.
    zf = zipfile.ZipFile('./TurboTaxReturn.tax2017')
    manifestf = zf.open('manifest.xml')
    manifest_encrypted = manifestf.read()

    # AES in CBC mode with the 2017 manifest key and IV.
    cipher = AES.new("5TGB@YHN7UJM(IK(".encode('utf-8'), AES.MODE_CBC, iv="!QAZ2WSX#EDC4RFV".encode('utf-8'))
    manifest_decrypted = cipher.decrypt(manifest_encrypted)
    # Strip the padding that fills the last 16-byte block.
    manifest = unpad(manifest_decrypted, 16)

    parsed_manifest = xml.dom.minidom.parseString(manifest)
    print(parsed_manifest.toprettyxml(indent=' '))


if __name__ == '__main__':
    main()
Running the above results in
$ ./decrypt.py
<?xml version="1.0" ?>
<Manifest xmlns="http://schemas.intuit.com/LocalCfpContainer.xsd">
 <documents>
  <document>
   <entityType>/tax/income/individual/taxreturn/taxml</entityType>
   <entityKey>tax:62edf303-b57a-46ad-b725-47a9188bf46d</entityKey>
   <entityVer>2018-03-03T04:58:40.793Z</entityVer>
   <attributes>
    <attribute>
     <key>appId</key>
     <value>Intuit.ctg.tto.platform</value>
    </attribute>
    <attribute>
     <key>year</key>
     <value>2017</value>
    </attribute>
    <attribute>
     <key>deviceId</key>
     <value>TurboTax Online</value>
    </attribute>
    <attribute>
     <key>deviceName</key>
     <value>TurboTax Online</value>
    </attribute>
    <attribute>
     <key>appSku</key>
     <value>8</value>
    </attribute>
    <attribute>
     <key>formSetId</key>
     <value>US1040PER</value>
    </attribute>
    <attribute>
     <key>saveType</key>
     <value>endSession</value>
    </attribute>
    <attribute>
     <key>docName</key>
     <value>Zameer</value>
    </attribute>
    <attribute>
     <key>ceDataType</key>
     <value>TAXML</value>
    </attribute>
   </attributes>
  </document>
 </documents>
</Manifest>
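The decrypted manifest can be read with the standard library once the default namespace is taken into account. Note that the entityKey value matches the name of the second entry in the zip listing, so it can be used to locate the return inside the archive. The following is a sketch based only on the output above; read_manifest is a hypothetical helper, and other manifests may carry elements not handled here.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <Manifest> element in the output above.
NS = {"m": "http://schemas.intuit.com/LocalCfpContainer.xsd"}


def read_manifest(xml_bytes):
    """Return one dict per <document>: its entityKey plus its attributes."""
    root = ET.fromstring(xml_bytes)
    documents = []
    for doc in root.findall("m:documents/m:document", NS):
        # Each <attribute> holds a <key>/<value> pair; collect them in a dict.
        attributes = {
            attr.findtext("m:key", namespaces=NS): attr.findtext("m:value", namespaces=NS)
            for attr in doc.findall("m:attributes/m:attribute", NS)
        }
        documents.append({
            "entityKey": doc.findtext("m:entityKey", namespaces=NS),
            "attributes": attributes,
        })
    return documents
```

For the manifest shown above, the first entry's attributes would include year set to 2017 and formSetId set to US1040PER.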
The manifest contains a list of all returns in the file and some metadata about each return. Unfortunately, the tax return file is encrypted as well, with a different key. Using the powers of deduction again, the keys used to encrypt the tax: file are:
- For 2017 tax year files the key is 4TGB@YHN7UJM(IK( with an IV of !ASZ2WSX#EDC4RFV
- For 2019 tax year files the key is 8NV^RASJVG*(XSCB with an IV of #BMBVVBD$FSZ6LSZ
Adjusting the above script to use these keys on the tax: file results in an XML representation of the tax return.
$ ./decrypt.py | head
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="http://taxml.intuit.com/xml_stylesheets/taxml_default.xslt"?>
<TaxReturns xmlns="http://www.intuit.com/Ctg/Pta/TurboTax/TaxReturns">
 <USIndividualTaxReturn xmlns="http://www.intuit.com/Ctg/Pta/TurboTax/USIndividualTaxReturn" xmlns:po="http://www.intuit.com/Ctg/Pta/TurboTax/PersistentObjects" xmlns:td="http://www.intuit.com/Ctg/Pta/TurboTax/TaxDataType" taxYear="2019" tpsEngineVersion="4.7.2 (Tps-Compatibility-Version: 2018.4)" uuid="63de8c2c-0d75-4f7e-9929-8a19f81d2e32">
  <US1040PER xmlns="http://www.intuit.com/Ctg/Pta/TurboTax/US1040PER" checksum="1890142416" cidFormSet="USIndividual" dataVersion="771297" formsetAttribute="1" implementationVersion="2019.94.0.1" showSmartWorksheets="true" tpsPrefix="S2019" tpsType="formset" uuid="0e24cebc-5be7-400c-ad0f-f658e10918df">
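One way to avoid editing the script for every file is to keep the key and IV pairs in a table and pick the archive member by kind. The sketch below is an untested generalization of the script above: the names KEYS, member_name, and decrypt_member are invented here, the pairs are copied from the lists in this post, and it assumes the archive holds exactly one tax: entry.

```python
import zipfile

# (key, IV) pairs copied from the lists in this post, keyed by
# (tax year, kind of member). The 2017 manifest pair follows the
# argument order used in the script above.
KEYS = {
    (2017, "manifest"): (b"5TGB@YHN7UJM(IK(", b"!QAZ2WSX#EDC4RFV"),
    (2017, "return"): (b"4TGB@YHN7UJM(IK(", b"!ASZ2WSX#EDC4RFV"),
    (2019, "manifest"): (b"7HMT&BGM5KBNFH><", b"#YBU7JLZ*JGL7MAR"),
    (2019, "return"): (b"8NV^RASJVG*(XSCB", b"#BMBVVBD$FSZ6LSZ"),
}


def member_name(zf, kind):
    """Pick the archive member for kind: 'manifest' or 'return'."""
    if kind == "manifest":
        return "manifest.xml"
    # Assumption: the return is the single entry whose name starts with "tax:".
    names = [n for n in zf.namelist() if n.startswith("tax:")]
    if len(names) != 1:
        raise ValueError(f"expected one tax: entry, found {len(names)}")
    return names[0]


def decrypt_member(path, year, kind):
    """Decrypt one member of a .tax archive and return the plaintext bytes."""
    # Imported here so the helpers above work without pycryptodome installed.
    from Crypto.Cipher import AES
    from Crypto.Util.Padding import unpad

    key, iv = KEYS[(year, kind)]
    with zipfile.ZipFile(path) as zf:
        ciphertext = zf.read(member_name(zf, kind))
    cipher = AES.new(key, AES.MODE_CBC, iv=iv)
    return unpad(cipher.decrypt(ciphertext), AES.block_size)
```

With this in place, decrypt_member('./TurboTaxReturn.tax2017', 2017, 'return') would be the equivalent of the adjusted script. A wrong key or IV will most likely surface as a padding error from unpad or as garbage in the first block.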
Unfortunately, the schema is not documented, but it should be possible to read certain fields out of the tax return with any standard XML parser.
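Since the element names are not documented, a reasonable first step is to explore the decrypted XML rather than guess at paths. The sketch below uses xml.etree.ElementTree and ignores namespaces, which differ between the nested elements in the output above. The helper names are invented for illustration, and any field name passed to find_text has to be discovered from your own file first.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def tag_counts(xml_bytes):
    """Count how often each element name occurs, ignoring namespaces."""
    root = ET.fromstring(xml_bytes)
    return Counter(local_name(el.tag) for el in root.iter())


def find_text(xml_bytes, name):
    """Return the text of every element whose local name equals name."""
    root = ET.fromstring(xml_bytes)
    return [el.text for el in root.iter() if local_name(el.tag) == name]
```

Running tag_counts on the decrypted return gives an inventory of element names to search through, and find_text then pulls the values for a name picked from that inventory.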