Forms 4 Us All (Every one) Rough Details

Please feel free to send corrections, comments, or questions regarding the following.

The major goal of this project is to make life easier. While that sounds like a lofty objective, it would be achieved (for non-students at least) if forms were easy to fill out. More precisely, the goal of the project is to make it easy to manage and access the data that forms require. The two initial targets for this project are fillable PDF forms and web-based (HTML) forms, since a majority of forms on the web are available in those formats.

We will initially store the user's data in an XML data file. Eventually the data may be drawn from a networked solution. Our base schema (XSD) should be large enough that most existing forms can draw a reasonable portion of their information from it. A possible metric for deciding whether a certain field should be included is the number of forms we find that require that piece of information. For information that is specific to a certain form or set of forms, the base schema for that user would be extended with those additional fields. We are currently looking into a hierarchical model for extending the schema. Aliasing might be useful for forms that fall into multiple branches of the hierarchy.
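
As a rough illustration of the inclusion metric described above, the following Python sketch counts how many surveyed forms ask for each field and keeps the fields that reach a cutoff. The function name select_base_fields, the min_fraction cutoff, and the sample survey are all hypothetical; the project has not fixed a threshold or a survey format.

```python
from collections import Counter


def select_base_fields(forms, min_fraction=0.5):
    """Return the fields requested by at least min_fraction of the surveyed forms."""
    counts = Counter()
    for fields in forms.values():
        counts.update(set(fields))  # count each field at most once per form
    threshold = min_fraction * len(forms)
    return sorted(name for name, n in counts.items() if n >= threshold)


# Hypothetical survey: form name -> fields that form asks for.
surveyed = {
    "tax_form": ["name", "address", "ssn", "income"],
    "job_application": ["name", "address", "phone"],
    "library_card": ["name", "address", "phone"],
}

# With a 60% cutoff, fields seen on only one of the three forms are left
# for per-user schema extensions rather than the base schema.
base_fields = select_base_fields(surveyed, min_fraction=0.6)
print(base_fields)
```

Fields that fall below the cutoff would be candidates for the per-user extensions rather than the base schema.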

We also want a mechanism in the data representation for implementing policies on the data, both for the user and for the owner of the form. Initially, this functionality will allow time-to-live attributes to be assigned to extended portions of the schema. The expiration dates should keep the schema from growing too large because of "unique" fields in forms that the user may only fill out once. As the model becomes more robust, we want to incorporate functionality similar to P3P (Platform for Privacy Preferences). A privacy policy would allow the user to specify how his/her data will be shared, possibly giving warnings before auto-filling sensitive data.
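
One way the time-to-live idea could work is sketched below in Python. It assumes, purely for illustration, that each extended portion of the user's data is an <extension> element carrying an "expires" attribute in ISO date format; neither the element name nor the attribute is part of any agreed schema.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical user data file with two form-specific extensions.
USER_DATA = """\
<user>
  <name>Jane Doe</name>
  <extension form="parking-permit" expires="2003-07-15">
    <licensePlate>ABC123</licensePlate>
  </extension>
  <extension form="alumni-survey" expires="2004-01-01">
    <graduationYear>1999</graduationYear>
  </extension>
</user>
"""


def prune_expired(root, today):
    """Remove <extension> children whose 'expires' date is before today."""
    removed = []
    for ext in list(root.findall("extension")):  # copy: we mutate root
        if date.fromisoformat(ext.get("expires")) < today:
            root.remove(ext)
            removed.append(ext.get("form"))
    return removed


root = ET.fromstring(USER_DATA)
removed = prune_expired(root, today=date(2003, 8, 1))
print("expired extensions:", removed)
```

Running the prune step each time the data file is loaded would keep one-off fields from accumulating.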

Once we have a model for the data, how do we manage to fill in the right data in the right fields? A mapping is needed between the schema and the fields in the form. We have decided that embedding this mapping in the form itself is the best way to do this. Conforming to Adobe's Extensible Metadata Platform (XMP), we will embed XML packets containing this translation data into the PDF or HTML forms. The translation data will have a structure similar to our base schema and will contain a subset of the entries in the schema. These entries will hold the names of the corresponding fields in the form. For "unique" fields in the form, the data will be read by the application and used to extend the user's schema. Forms will be identified using an MD5 hash. Behavior when the field translation is ambiguous has yet to be defined.
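
The mapping described above might look something like the following Python sketch. It assumes the translation data mirrors the shape of the user's data, with each leaf holding the name of the form field to fill, and it assumes the MD5 hash is taken over the raw bytes of the form file (the notes above do not say what is hashed). The element names, field names, and helper functions are all made up for the example.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical user data.
USER_DATA = """\
<user>
  <name><first>Jane</first><last>Doe</last></name>
  <phone>555-0100</phone>
</user>
"""

# Hypothetical translation data: same shape as the user data, but each
# leaf holds the name of the corresponding field in the form.
FORM_MAP = """\
<user>
  <name><first>fname_1</first><last>lname_1</last></name>
  <phone>contact_phone</phone>
</user>
"""


def form_id(form_bytes):
    """Identify a form by the MD5 hash of its bytes (an assumption)."""
    return hashlib.md5(form_bytes).hexdigest()


def leaf_paths(elem, prefix=""):
    """Yield (path, text) for every leaf element below elem."""
    for child in elem:
        path = f"{prefix}/{child.tag}"
        if len(child):  # has children: recurse
            yield from leaf_paths(child, path)
        else:
            yield path, (child.text or "").strip()


def fill_values(user_xml, map_xml):
    """Return {form field name: user value} for every mapped entry."""
    user = dict(leaf_paths(ET.fromstring(user_xml)))
    mapping = leaf_paths(ET.fromstring(map_xml))
    return {field: user[path] for path, field in mapping if path in user}


values = fill_values(USER_DATA, FORM_MAP)
print(values)
```

Mapped entries with no counterpart in the user's data are simply skipped here; in the design above they are the "unique" fields that would trigger a schema extension.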

Implementation/Software components:

-Tool to create metadata for forms and embed it as an XML packet
-XML data and schema parser
-XML packet stripper
-PDF and HTML field extractors
-Applet that takes user's schema and form's metadata and extends the schema
-Applet that takes user's data and form's metadata and fills the form
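
As a sketch of one of the components listed above, the XML packet stripper could start out as a simple byte scan. XMP packets are wrapped in <?xpacket begin=...?> and <?xpacket end=...?> processing instructions, so they can be located without parsing the rest of the file. The Python below reads "stripper" as "pull the packet out of the file", and it assumes the packet is stored uncompressed in a byte-oriented encoding such as UTF-8; the function name and the sample bytes are hypothetical.

```python
import re

# Matches one XMP packet: header PI, body, trailer PI.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def extract_packets(data):
    """Return the body of every XMP packet found in a file's raw bytes."""
    return [m.group(1).strip() for m in PACKET_RE.finditer(data)]


# Made-up file contents: a packet surrounded by unrelated bytes.
sample = (
    b"%PDF-1.4 ...other file content... "
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
    b"<x:xmpmeta>translation data goes here</x:xmpmeta>"
    b"<?xpacket end='w'?>"
    b" ...more file content..."
)

packets = extract_packets(sample)
print(packets)
```

The extracted bytes could then be handed to the XML parser component to recover the translation data.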

Additional areas that need to be explored further:
-UI design for data entry and management

Future extensions:
-Server/Network based data storage

Reference material, recommended reading:

Satish's original idea/proposal:
http://www.cs.berkeley.edu/~satishr/open.pdf

XML specifications and XML schema primer:
http://www.w3schools.com/xml/default.asp
http://www.w3.org/TR/REC-xml/ (skim)
http://www.w3.org/TR/xmlschema-0/

RDF specs:
http://www.w3.org/TR/REC-rdf-syntax/ (skim)

XMP specs:
http://partners.adobe.com/asn/developer/xmp/pdf/MetadataFramework.pdf

PDF specs:
http://partners.adobe.com/asn/acrobat/sdk/public/docs/PDFReference.pdf (read Chapters 2, 3, and 8.6... reference other sections if necessary)

P3P specs:
http://www.w3.org/TR/P3P

If you find any good references/tutorials for any of these standards, please let me know and I'll add them to the list.

last modified on 6/30/2003 sjiang@cs